Mountain Car Control — Built From Scratch

What it is

A car that has to go the wrong way first

Mountain Car is one of the oldest benchmarks in reinforcement learning, and it earns its place: the obvious action — full throttle towards the goal — never works. The agent has to learn the counter-intuitive move of accelerating away from the flag to gather enough momentum to make the climb. Nobody tells it this. It has to find it.

What makes it a perfect rebuild is that the whole problem lives in just two numbers — position and velocity — yet it still forces you to confront the central puzzle of RL: how does an agent assign credit to an early action whose payoff only arrives many steps later?

Two state variables, position and velocity, are all it takes to make delayed reward and exploration genuinely hard.

The stack

From environment to learned policy

A compact RL toolkit, built to expose the learning loop rather than hide it.

environment

Physics & reward

The valley dynamics: gravity, a weak engine, and a reward that drips out −1 per step until the flag is reached.
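As a rough illustration, the valley dynamics can be sketched as a single step function. The constants and bounds below are the ones usually quoted for the classic textbook formulation of Mountain Car; they are assumed here rather than taken from this project, and the name `step` is hypothetical.

```python
import math

# Assumed constants from the classic formulation, not from this project.
MIN_POS, MAX_POS = -1.2, 0.5
MAX_SPEED = 0.07
GOAL_POS = 0.5
ENGINE = 0.001    # velocity added per step by the weak engine
GRAVITY = 0.0025  # scale of the slope's pull


def step(position, velocity, action):
    """Advance one time step. action: 0 = push left, 1 = coast, 2 = push right."""
    # Engine push plus the slope's pull, which depends on where the car sits.
    velocity += (action - 1) * ENGINE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    if position <= MIN_POS:
        # Hitting the left wall stops the car dead.
        position, velocity = MIN_POS, 0.0
    done = position >= GOAL_POS
    position = min(position, MAX_POS)
    reward = -1.0  # every step costs one, so shorter episodes score higher
    return position, velocity, reward, done
```

Calling `step(pos, vel, 2)` repeatedly from the valley floor is a quick way to check the premise: with these constants the engine is too weak to carry the car up the right-hand slope in one go.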

algorithm

Q-learning

Learn the value of each action in each state by bootstrapping from the agent's own evolving estimates.
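A minimal sketch of one Q-learning update, assuming a dictionary-backed table keyed by (state, action). The function name `q_update`, the step size `alpha` and the discount `gamma` are illustrative choices, not values from this project.

```python
from collections import defaultdict

ACTIONS = (0, 1, 2)  # push left, coast, push right


def q_update(q, state, action, reward, next_state, done, alpha=0.1, gamma=1.0):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Bootstrapped target: reward plus the best current estimate for the next state.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in ACTIONS)
    td_error = reward + gamma * best_next - q[(state, action)]
    q[(state, action)] += alpha * td_error
    return td_error


q = defaultdict(float)  # unseen state-action pairs start at 0
td = q_update(q, state=(3, 5), action=2, reward=-1.0, next_state=(3, 6), done=False)
```

The target uses the maximum over next actions regardless of which action the agent actually takes next, which is what makes the update off-policy.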

representation

Tile coding

The state is continuous, so discretise it into overlapping tiles — a cheap, classic way to approximate a value function.
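One simple way to build the overlapping tiles is to lay several identical grids over the state space, each shifted by a fraction of a tile. The sketch below assumes uniform diagonal offsets, 8 tilings of 8 x 8 tiles, and the classic position and velocity bounds; the function name `active_tiles` and all of those numbers are hypothetical, and more careful offset schemes exist.

```python
def active_tiles(position, velocity, n_tilings=8, tiles_per_dim=8,
                 pos_range=(-1.2, 0.5), vel_range=(-0.07, 0.07)):
    """Return one active tile index per tiling for a (position, velocity) state."""
    # Rescale each variable so that one unit equals one tile width.
    p = (position - pos_range[0]) / (pos_range[1] - pos_range[0]) * tiles_per_dim
    v = (velocity - vel_range[0]) / (vel_range[1] - vel_range[0]) * tiles_per_dim
    side = tiles_per_dim + 1  # one extra tile per dimension to cover the offsets
    tiles = []
    for t in range(n_tilings):
        offset = t / n_tilings  # each tiling is shifted by a fraction of a tile
        pi = min(int(p + offset), side - 1)
        vi = min(int(v + offset), side - 1)
        # Flatten (tiling, position tile, velocity tile) into a single index.
        tiles.append(t * side * side + pi * side + vi)
    return tiles
```

The value of a state-action pair is then the sum of one weight per active tile, so nearby states share most of their tiles and generalise to each other, while distant states share none.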

exploration

ε-greedy

Mostly exploit the best-known action, but sometimes try something random — without it, the car never discovers the swing.
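A minimal sketch of the selection rule, assuming the action values for the current state arrive as a plain list. The name `epsilon_greedy` and the random tie-breaking are illustrative choices.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: random with probability epsilon, otherwise greedy."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties at random so equal estimates do not always favour action 0.
    return rng.choice([a for a, q in enumerate(q_values) if q == best])
```

With `epsilon=0.0` this is purely greedy; with `epsilon=1.0` it is a uniform random policy.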

tuning

Learning schedule

Decay the learning rate and exploration over time so early chaos settles into a stable, repeatable policy.
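One common shape for such a schedule is exponential decay down to a floor. The sketch below assumes that shape; the helper name `decayed` and every number in it are placeholders rather than the settings used in this project.

```python
def decayed(start, end, decay, episode):
    """Exponentially decay from start toward a floor of end."""
    return max(end, start * decay ** episode)


for episode in (0, 100, 1000):
    epsilon = decayed(1.0, 0.01, 0.995, episode)  # exploration rate
    alpha = decayed(0.5, 0.05, 0.999, episode)    # learning rate
    print(episode, round(epsilon, 3), round(alpha, 3))
```

The floor matters: letting either value reach zero would stop exploration or learning altogether.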

readout

Value surface

Plot the learned value over position × velocity — you can literally see the agent's plan emerge as a landscape.
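The data behind such a plot can be sketched as a grid of V(s) = max over actions of Q(s, a). The function `value_grid` is hypothetical, and `toy_q` is a stand-in with made-up values so the block runs on its own; a trained agent's estimate would take its place.

```python
def value_grid(q_fn, n=20, pos_range=(-1.2, 0.5), vel_range=(-0.07, 0.07), n_actions=3):
    """Sample V(s) = max_a Q(s, a) on an n x n grid (rows: position, columns: velocity)."""
    def linspace(lo, hi):
        return [lo + (hi - lo) * i / (n - 1) for i in range(n)]

    positions, velocities = linspace(*pos_range), linspace(*vel_range)
    grid = [[max(q_fn(p, v, a) for a in range(n_actions)) for v in velocities]
            for p in positions]
    return positions, velocities, grid


# Stand-in Q-function for illustration only, not a learned estimate.
def toy_q(position, velocity, action):
    return -100.0 * (0.5 - position) - 10.0 * abs(action - 1)


positions, velocities, grid = value_grid(toy_q, n=5)
```

The resulting grid can be handed to any plotting library as a heatmap or surface. Negating it gives the cost-to-go, which is the form this landscape is often drawn in.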

Architecture

The learning loop

Every episode is the same loop, run thousands of times until the policy converges:

Observe
Read the car's current position and velocity from the environment.
Choose
Pick an action — push left, coast, or push right — mostly greedy, sometimes exploratory.
Act & sense
Apply the action, let the physics step forward, and collect the reward and new state.
Update
Adjust the value estimate for the action just taken using the temporal-difference error.
Repeat
Loop until the flag is reached, then start a fresh episode — slowly sharpening the policy.
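The five steps above can be sketched end to end as follows. To keep the block short it assumes the classic dynamics constants, a start position drawn from [-0.6, -0.4], fixed `epsilon` and `alpha`, a step cap per episode, and a coarse grid in place of tile coding; all names are hypothetical and none of this is the project's actual code.

```python
import math
import random
from collections import defaultdict

ACTIONS = (0, 1, 2)  # push left, coast, push right


def env_step(pos, vel, action):
    """Classic Mountain Car dynamics (assumed constants)."""
    vel += (action - 1) * 0.001 - 0.0025 * math.cos(3 * pos)
    vel = max(-0.07, min(0.07, vel))
    pos += vel
    if pos <= -1.2:
        pos, vel = -1.2, 0.0
    return pos, vel, -1.0, pos >= 0.5


def discretise(pos, vel, bins=20):
    """Map the continuous state to a coarse grid cell (a stand-in for tile coding)."""
    pi = min(bins - 1, int((pos + 1.2) / 1.7 * bins))
    vi = min(bins - 1, int((vel + 0.07) / 0.14 * bins))
    return pi, vi


def run_episode(q, rng, epsilon=0.1, alpha=0.1, gamma=1.0, max_steps=1000):
    """Run one episode of Q-learning and return the number of steps taken."""
    pos, vel = rng.uniform(-0.6, -0.4), 0.0
    for t in range(1, max_steps + 1):
        s = discretise(pos, vel)                                # Observe
        if rng.random() < epsilon:                              # Choose
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: q[(s, act)])
        pos, vel, reward, done = env_step(pos, vel, a)          # Act & sense
        s2 = discretise(pos, vel)
        best_next = 0.0 if done else max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (reward + gamma * best_next - q[(s, a)])  # Update
        if done:
            break
    return t


q = defaultdict(float)
rng = random.Random(0)
lengths = [run_episode(q, rng) for _ in range(50)]              # Repeat
```

Tracking `lengths` over many more episodes is the simplest progress signal: episodes that hit the step cap mean the flag was not reached, and shorter episodes mean the policy is improving.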

Reflection

What rebuilding it taught me

Delayed reward is the whole game. The hard part isn't acting — it's learning that a step taken now matters because of a flag reached fifty steps later.
Continuous states need representation. You can't tabulate every position and velocity; tile coding is the bridge from infinite states to a learnable table.
Exploration is non-negotiable. A purely greedy agent gets stuck rolling gently in the valley forever; randomness is what finds the swing.
Reward shape dictates behaviour. The −1-per-step penalty quietly teaches urgency — change the reward and the whole strategy changes.
Convergence is a craft. Getting the learning rate and exploration decay right is the difference between a policy that stabilises and one that thrashes.
