From-Scratch Build · Reinforcement Learning
One environment, many algorithms. This is a workbench for the classic reinforcement-learning methods (value iteration, Q-learning, SARSA, policy gradients), each implemented from scratch and run on the same control tasks so you can see exactly how they differ. I built it to turn a reading list of RL chapters into something I could run.
What it is
Reinforcement learning has a sprawling family of algorithms, and reading about them only gets you so far — they truly click when you run two side by side on the same problem and watch one learn faster, more stably, or not at all. This lab is exactly that: a shared harness where each method plugs into the same environments and reports the same metrics.
The point isn't a single clever agent — it's the comparison. Value-based vs policy-based, on-policy vs off-policy, tabular vs approximate. By holding the environment fixed and swapping only the algorithm, the trade-offs that textbooks describe in prose become curves you can actually see.
The stack
Each is implemented from first principles, not imported from a library.
Value iteration: when the model is known, sweep the Bellman optimality backup over every state until the values converge. This is the baseline every learner is measured against.
Q-learning: learn the optimal action-values directly, even while exploring. The workhorse of tabular RL.
SARSA: learn the value of the policy you're actually following. More cautious than Q-learning, and revealing about the on-policy vs off-policy difference.
Policy gradients: skip values and nudge the policy parameters directly in the direction of higher expected reward.
Shared environment interface: a common interface so any agent can run on any task without touching the algorithm code.
Metrics and logging: reward-over-time plots and stability metrics, logged identically for every method.
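To make the first three entries concrete, here is a minimal sketch, not the lab's actual code. It assumes a tabular setting with NumPy arrays: `P[s, a, s2]` for transition probabilities, `R[s, a]` for expected rewards, and `Q[s, a]` for action-values. Every function name is hypothetical. The thing to notice is that the Q-learning and SARSA targets differ in a single term: the max over next actions versus the action actually taken.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Planning with a known model (assumed layout: P is (S, A, S), R is (S, A)).

    Repeats the Bellman optimality backup until the largest change in V
    drops below tol. Assumes gamma < 1 so the sweep converges.
    """
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * (P @ V)          # (S, A): one-step lookahead for every pair
        V_new = Q.max(axis=1)            # act greedily with respect to the lookahead
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new

def q_learning_target(Q, reward, next_state, gamma, done):
    """Off-policy target: bootstrap from the best action in next_state."""
    bootstrap = 0.0 if done else np.max(Q[next_state])
    return reward + gamma * bootstrap

def sarsa_target(Q, reward, next_state, next_action, gamma, done):
    """On-policy target: bootstrap from the action the agent will actually take."""
    bootstrap = 0.0 if done else Q[next_state, next_action]
    return reward + gamma * bootstrap

def td_update(Q, state, action, target, alpha):
    """Move Q[state, action] a fraction alpha toward the target, in place."""
    Q[state, action] += alpha * (target - Q[state, action])

# Same transition, two different targets.
Q = np.zeros((2, 2))
Q[1] = [1.0, 3.0]
print(q_learning_target(Q, reward=1.0, next_state=1, gamma=0.5, done=False))
print(sarsa_target(Q, reward=1.0, next_state=1, next_action=0, gamma=0.5, done=False))
```

In the usage lines, Q-learning bootstraps from the larger of the two next-state values while SARSA bootstraps from the value of the exploratory action it was told about, so the second target comes out lower. That gap is the source of SARSA's more cautious behaviour near risky states.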
Architecture
Every experiment follows the same controlled procedure so results are actually comparable:
Choose a control environment and freeze its settings for the whole sweep.
Drop in one algorithm through the shared interface — nothing else changes.
Run the agent for a fixed budget of episodes, logging reward and stability throughout.
Swap the agent and rerun under identical conditions and seeds.
Overlay the learning curves to read off speed, stability and final performance.
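The procedure above can be sketched as a small harness. This is an illustration under stated assumptions, not the lab's actual code: `Corridor`, `TabularAgent`, and `run` are hypothetical names, the task is a toy chosen for brevity, and only the two tabular TD methods are shown. The environment and seed stay fixed, and the only thing that changes between runs is the agent.

```python
import numpy as np

class Corridor:
    """Hypothetical toy task: start in cell 0, reach the last cell."""
    def __init__(self, length=5, max_steps=50):
        self.n_states, self.n_actions = length, 2   # actions: 0 = left, 1 = right
        self.max_steps = max_steps

    def reset(self):
        self.pos, self.t = 0, 0
        return self.pos

    def step(self, action):
        self.pos = max(0, self.pos - 1) if action == 0 else self.pos + 1
        self.t += 1
        reached_goal = self.pos == self.n_states - 1
        # Simplification: a timeout is treated as terminal, same as the goal.
        done = reached_goal or self.t >= self.max_steps
        return self.pos, (1.0 if reached_goal else -0.01), done

class TabularAgent:
    """Epsilon-greedy agent: on_policy=True is SARSA, False is Q-learning."""
    def __init__(self, n_states, n_actions, rng, on_policy,
                 alpha=0.5, gamma=0.95, eps=0.1):
        self.Q = np.zeros((n_states, n_actions))
        self.rng, self.on_policy = rng, on_policy
        self.alpha, self.gamma, self.eps = alpha, gamma, eps

    def act(self, state):
        if self.rng.random() < self.eps:
            return int(self.rng.integers(self.Q.shape[1]))
        best = np.flatnonzero(self.Q[state] == self.Q[state].max())
        return int(self.rng.choice(best))            # break ties at random

    def update(self, s, a, r, s2, a2, done):
        if done:
            bootstrap = 0.0
        elif self.on_policy:
            bootstrap = self.Q[s2, a2]               # SARSA: action actually taken
        else:
            bootstrap = self.Q[s2].max()             # Q-learning: greedy action
        self.Q[s, a] += self.alpha * (r + self.gamma * bootstrap - self.Q[s, a])

def run(make_agent, episodes=200, seed=0):
    """Fixed environment, fixed budget, fixed seed; only the agent varies."""
    rng = np.random.default_rng(seed)
    env = Corridor()
    agent = make_agent(env.n_states, env.n_actions, rng)
    returns = []
    for _ in range(episodes):
        s, done, total = env.reset(), False, 0.0
        a = agent.act(s)
        while not done:
            s2, r, done = env.step(a)
            a2 = agent.act(s2)
            agent.update(s, a, r, s2, a2, done)
            s, a, total = s2, a2, total + r
        returns.append(total)
    return returns

def make_q_learning(n_states, n_actions, rng):
    return TabularAgent(n_states, n_actions, rng, on_policy=False)

def make_sarsa(n_states, n_actions, rng):
    return TabularAgent(n_states, n_actions, rng, on_policy=True)

curves = {"q_learning": run(make_q_learning), "sarsa": run(make_sarsa)}
for name, returns in curves.items():
    print(name, "mean return, last 20 episodes:", np.mean(returns[-20:]))
```

Each entry in `curves` is a list of per-episode returns, ready to overlay on one plot. Passing the random generator into the agent rather than relying on global state is what makes a rerun with the same seed reproduce the same curve.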
Reflection