From-Scratch Build · Reinforcement Learning

Mountain Car Control

An underpowered car sits at the bottom of a valley. Its engine isn't strong enough to drive straight up — the only way out is to learn to rock backwards first, then forwards, building momentum. I rebuilt this classic control problem from scratch because it's the cleanest demonstration of an agent discovering a non-obvious strategy on its own.

Python · Q-learning · Gymnasium · Tile coding · Value functions · Control

What it is

A car that has to go the wrong way first

Mountain Car is one of the oldest benchmarks in reinforcement learning, and it earns its place: the obvious action — full throttle towards the goal — never works. The agent has to learn the counter-intuitive move of accelerating away from the flag to gather enough momentum to make the climb. Nobody tells it this. It has to find it.

What makes it a perfect rebuild is that the whole problem lives in just two numbers — position and velocity — yet it still forces you to confront the central puzzle of RL: how does an agent assign credit to an early action whose payoff only arrives many steps later?

2
state variables (position and velocity) are all it takes to make delayed reward and exploration genuinely hard.

The stack

From environment to learned policy

A compact RL toolkit, built to expose the learning loop rather than hide it.

environment

Physics & reward

The valley dynamics: gravity, a weak engine, and a reward that drips out −1 per step until the flag is reached.
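As a rough sketch of those dynamics, here is one step of the standard Mountain Car formulation. The constants and the name `step` are assumptions taken from the textbook version of the problem, not necessarily the values used in this rebuild.

```python
import math

# Constants from the standard Mountain Car formulation (assumed here).
MIN_POS, MAX_POS = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POS = 0.5
FORCE = 0.001      # engine push per step
GRAVITY = 0.0025   # strength of the slope's pull


def step(position, velocity, action):
    """Advance one time step. action: 0 = push left, 1 = coast, 2 = push right."""
    # Engine force plus the slope-dependent pull of gravity
    velocity += (action - 1) * FORCE - GRAVITY * math.cos(3 * position)
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POS, min(MAX_POS, position))
    if position == MIN_POS and velocity < 0:
        velocity = 0.0  # the left wall stops the car dead
    reached_flag = position >= GOAL_POS
    reward = -1.0  # every step costs one, so shorter episodes score higher
    return position, velocity, reward, reached_flag
```

With these numbers the engine force (0.001) is smaller than the peak pull of gravity (0.0025), which is why full throttle from the valley floor stalls partway up the slope.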

algorithm

Q-learning

Learn the value of each action in each state by bootstrapping from the agent's own evolving estimates.
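Written out, the standard one-step Q-learning update looks like this. The symbols are the usual textbook ones and are assumptions here: $\alpha$ is the learning rate, $\gamma$ the discount factor, and $\delta_t$ the temporal-difference error.

```latex
\delta_t = r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t),
\qquad
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\, \delta_t
```

The $\max$ over next actions is the bootstrapping step: the target is built from the agent's own current estimate of the next state.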

representation

Tile coding

The state is continuous, so cover it with several overlapping grids (tilings), each slightly offset from the others, and keep one weight per tile: a cheap, classic way to approximate a value function.
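A minimal sketch of the idea, assuming evenly shifted tilings over the standard position and velocity ranges. The function name `active_tiles`, the ranges, and the default of 8 tilings with 8 tiles per dimension are illustrative choices, not details from the original build.

```python
POS_RANGE = (-1.2, 0.6)    # assumed position bounds
VEL_RANGE = (-0.07, 0.07)  # assumed velocity bounds


def active_tiles(position, velocity, n_tilings=8, tiles_per_dim=8):
    """Return one active tile index per tiling for a (position, velocity) state."""
    pos_width = (POS_RANGE[1] - POS_RANGE[0]) / tiles_per_dim
    vel_width = (VEL_RANGE[1] - VEL_RANGE[0]) / tiles_per_dim
    side = tiles_per_dim + 1  # one extra row and column absorb the shift
    indices = []
    for t in range(n_tilings):
        shift = t / n_tilings  # each tiling is shifted by a fraction of a tile
        col = min(side - 1, int((position - POS_RANGE[0]) / pos_width + shift))
        row = min(side - 1, int((velocity - VEL_RANGE[0]) / vel_width + shift))
        # Flatten (tiling, row, col) into a single weight index
        indices.append(t * side * side + row * side + col)
    return indices
```

The value of a state-action pair is then the sum of the weights at the active indices, with one weight vector per action. Nearby states share most of their tiles, so an update to one state generalises to its neighbours.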

exploration

ε-greedy

Mostly exploit the best-known action, but sometimes try something random — without it, the car never discovers the swing.
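In code, ε-greedy selection is only a few lines. This is a sketch with a hypothetical helper name, `epsilon_greedy`, which takes the current action-value estimates as a plain list.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index: random with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    best = max(q_values)
    # Break ties at random so the agent does not always favour action 0
    return rng.choice([a for a, q in enumerate(q_values) if q == best])
```

Calling `epsilon_greedy([-3.0, -1.0, -2.0], epsilon=0.0)` always returns action 1, while `epsilon=1.0` ignores the estimates entirely.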

tuning

Learning schedule

Decay the learning rate and exploration over time so early chaos settles into a stable, repeatable policy.
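One common shape for such a schedule is exponential decay with a floor. The helper `decayed` and the numbers in the usage note are illustrative assumptions, not the tuned values from this project.

```python
def decayed(start, end, decay, episode):
    """Exponentially decay from start towards end, never dropping below end."""
    return max(end, start * decay ** episode)
```

For example, `decayed(1.0, 0.01, 0.995, episode)` could drive ε from fully random at episode 0 down to a 1% floor, and the same helper with different arguments could serve the learning rate.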

readout

Value surface

Plot the learned value over position × velocity — you can literally see the agent's plan emerge as a landscape.
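To get that surface, one approach is to sample the greedy value V(s) = max over actions of Q(s, a) on a regular grid. The sketch below assumes a hypothetical callable `q_fn(position, velocity, action)` that returns the learned estimate, and the standard state bounds.

```python
def value_surface(q_fn, n_pos=40, n_vel=40, n_actions=3):
    """Sample V(s) = max_a Q(s, a) on a regular position x velocity grid."""
    positions = [-1.2 + i * 1.8 / (n_pos - 1) for i in range(n_pos)]
    velocities = [-0.07 + j * 0.14 / (n_vel - 1) for j in range(n_vel)]
    # One row per velocity, one column per position
    surface = [
        [max(q_fn(p, v, a) for a in range(n_actions)) for p in positions]
        for v in velocities
    ]
    return positions, velocities, surface
```

The returned grid can be handed to any plotting library as a heatmap or 3D surface. Many write-ups plot the negated value (the "cost to go") so the hardest states appear as peaks.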

Architecture

The learning loop

Every episode is the same loop, run thousands of times until the policy converges:

  1. Observe

    Read the car's current position and velocity from the environment.

  2. Choose

    Pick an action — push left, coast, or push right — mostly greedy, sometimes exploratory.

  3. Act & sense

    Apply the action, let the physics step forward, and collect the reward and new state.

  4. Update

    Adjust the value estimate for the action just taken using the temporal-difference error.

  5. Repeat

    Loop until the flag is reached or a step limit ends the episode (Gymnasium's MountainCar-v0 stops at 200 steps), then start a fresh episode, slowly sharpening the policy.
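The five steps above can be sketched as one short, self-contained program. To keep it brief this version swaps tile coding for a single coarse grid and uses a fixed ε, so treat it as an outline of the loop rather than the full build. The hyperparameters, the 200-step cap, the start range, and every function name are assumptions.

```python
import math
import random

N_BINS, N_ACTIONS = 20, 3
ALPHA, GAMMA, MAX_STEPS = 0.1, 1.0, 200  # illustrative values, not tuned


def env_step(pos, vel, action):
    """Standard Mountain Car dynamics; action 0/1/2 = left/coast/right."""
    vel += (action - 1) * 0.001 - 0.0025 * math.cos(3 * pos)
    vel = max(-0.07, min(0.07, vel))
    pos = max(-1.2, min(0.6, pos + vel))
    if pos == -1.2 and vel < 0:
        vel = 0.0
    return pos, vel, -1.0, pos >= 0.5


def to_cell(pos, vel):
    """Map the continuous state to a coarse grid cell (a stand-in for tile coding)."""
    i = min(N_BINS - 1, int((pos + 1.2) / 1.8 * N_BINS))
    j = min(N_BINS - 1, int((vel + 0.07) / 0.14 * N_BINS))
    return i, j


def run_episode(q, epsilon, rng):
    """Run one episode of Q-learning and return the total reward."""
    pos, vel = rng.uniform(-0.6, -0.4), 0.0
    total = 0.0
    for _ in range(MAX_STEPS):
        state = to_cell(pos, vel)                               # 1. observe
        values = q.setdefault(state, [0.0] * N_ACTIONS)
        if rng.random() < epsilon:                              # 2. choose
            action = rng.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=values.__getitem__)
        pos, vel, reward, done = env_step(pos, vel, action)     # 3. act & sense
        next_values = q.setdefault(to_cell(pos, vel), [0.0] * N_ACTIONS)
        target = reward if done else reward + GAMMA * max(next_values)
        values[action] += ALPHA * (target - values[action])     # 4. update
        total += reward
        if done:                                                # 5. repeat
            break
    return total


rng = random.Random(0)
q_table = {}
returns = [run_episode(q_table, epsilon=0.1, rng=rng) for _ in range(50)]
```

Initialising every estimate to zero is mildly optimistic here, because every real return is negative. Unvisited actions therefore look attractive, which nudges the agent to explore even when ε is small.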

Reflection

What rebuilding it taught me