From-Scratch Build · Reinforcement Learning
An underpowered car sits at the bottom of a valley. Its engine isn't strong enough to drive straight up — the only way out is to learn to rock backwards first, then forwards, building momentum. I rebuilt this classic control problem from scratch because it's the cleanest demonstration of an agent discovering a non-obvious strategy on its own.
What it is
Mountain Car is one of the oldest benchmarks in reinforcement learning, and it earns its place: the obvious action — full throttle towards the goal — never works. The agent has to learn the counter-intuitive move of accelerating away from the flag to gather enough momentum to make the climb. Nobody tells it this. It has to find it.
What makes it a perfect rebuild is that the whole problem lives in just two numbers — position and velocity — yet it still forces you to confront the central puzzle of RL: how does an agent assign credit to an early action whose payoff only arrives many steps later?
The stack
A compact RL toolkit, built to expose the learning loop rather than hide it.
The valley dynamics: gravity, a weak engine, and a reward that drips out −1 per step until the flag is reached.
Learn the value of each action in each state by bootstrapping from the agent's own evolving estimates.
The state is continuous, so discretise it into overlapping tiles — a cheap, classic way to approximate a value function.
Mostly exploit the best-known action, but sometimes try something random — without it, the car never discovers the swing.
Decay the learning rate and exploration over time so early chaos settles into a stable, repeatable policy.
Plot the learned value over position × velocity — you can literally see the agent's plan emerge as a landscape.
Architecture
Every episode is the same loop, run thousands of times until the policy converges:
Read the car's current position and velocity from the environment.
Pick an action — push left, coast, or push right — mostly greedy, sometimes exploratory.
Apply the action, let the physics step forward, and collect the reward and new state.
Adjust the value estimate for the action just taken using the temporal-difference error.
Loop until the flag is reached, then start a fresh episode — slowly sharpening the policy.
Reflection