An underpowered car can't climb the hill with raw engine power — the agent must
learn to build momentum by rocking back and forth. Train a tabular agent live and
watch its value surface, policy, and strategy emerge.
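As a rough illustration of what training a tabular agent can look like, here is a minimal Q-learning sketch. Everything in it is an assumption rather than the page's actual implementation: the dynamics and bounds follow the classic Mountain Car formulation, and the grid size `BINS`, the learning rate, the discount, the 200-step cap, and the ε decay schedule are hypothetical choices.

```python
import math
import random

# Assumed constants: the classic Mountain Car formulation, not taken from the page.
POS_MIN, POS_MAX = -1.2, 0.6
VEL_MIN, VEL_MAX = -0.07, 0.07
GOAL = 0.5
ACTIONS = (0, 1, 2)  # 0 = push left, 1 = idle, 2 = push right
BINS = 20            # hypothetical grid resolution per state dimension


def step(pos, vel, action):
    """Advance the car one time step; returns (pos, vel, reward, done)."""
    # Engine push is weaker than gravity on the slope, hence "underpowered".
    vel += (action - 1) * 0.001 - 0.0025 * math.cos(3 * pos)
    vel = max(VEL_MIN, min(VEL_MAX, vel))
    pos = max(POS_MIN, min(POS_MAX, pos + vel))
    if pos == POS_MIN and vel < 0:
        vel = 0.0  # the car stops dead at the left wall
    return pos, vel, -1.0, pos >= GOAL


def discretize(pos, vel):
    """Map a continuous state to a (position bin, velocity bin) cell."""
    i = min(BINS - 1, int((pos - POS_MIN) / (POS_MAX - POS_MIN) * BINS))
    j = min(BINS - 1, int((vel - VEL_MIN) / (VEL_MAX - VEL_MIN) * BINS))
    return i, j


def train(episodes=2000, alpha=0.1, gamma=0.99, max_steps=200, seed=0):
    """Epsilon-greedy tabular Q-learning; returns (q_table, per-episode returns)."""
    rng = random.Random(seed)
    q = [[[0.0] * len(ACTIONS) for _ in range(BINS)] for _ in range(BINS)]
    returns = []
    eps = 1.0  # start fully exploratory
    for _ in range(episodes):
        pos, vel = rng.uniform(-0.6, -0.4), 0.0  # start near the valley floor
        total = 0.0
        for _ in range(max_steps):
            i, j = discretize(pos, vel)
            if rng.random() < eps:
                a = rng.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda k: q[i][j][k])
            pos, vel, r, done = step(pos, vel, a)
            ni, nj = discretize(pos, vel)
            # Q-learning target: bootstrap from the best next action unless terminal.
            target = r if done else r + gamma * max(q[ni][nj])
            q[i][j][a] += alpha * (target - q[i][j][a])
            total += r
            if done:
                break
        returns.append(total)
        eps = max(0.01, eps * 0.995)  # hypothetical decay schedule
    return q, returns
```

Calling `q, returns = train()` yields the Q-table and the per-episode reward series; with a reward of -1 per step, a higher return means the car reached the goal in fewer steps.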
Agent
Episode 0 / 0, ε = 1.000
The climb
Learning curve (reward per episode — higher = fewer steps)
Greedy evaluation
Phase portrait (position × velocity — the momentum loops)
What the agent learned — x: position (valley → goal), y: velocity (− bottom, + top)
Value surface V(s) = maxₐ Q(s, a)
Greedy policy (push left / idle / push right)
Visit frequency (log)
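The three panels can all be derived from a Q-table plus per-cell visit counts. The sketch below shows one way to do that, under stated assumptions: the action order in `ACTION_NAMES`, the nested-list layout of `q` and `visits`, and the use of `log1p` to keep unvisited cells finite are illustrative choices, not details confirmed by the page.

```python
import math

# Assumed action order; the page only names the three actions.
ACTION_NAMES = ("push left", "idle", "push right")


def summarize(q, visits):
    """Derive the three panels from a Q-table and visit counts.

    q[i][j] is a list of action values for the cell at position bin i,
    velocity bin j; visits[i][j] is how often that cell was visited.
    """
    # Value surface: V(s) = max over actions of Q(s, a).
    value = [[max(cell) for cell in row] for row in q]
    # Greedy policy: the action with the highest value (ties go to the first).
    policy = [[ACTION_NAMES[cell.index(max(cell))] for cell in row] for row in q]
    # Log-scaled visit frequency; log1p maps a count of 0 to 0.0.
    log_visits = [[math.log1p(n) for n in row] for row in visits]
    return value, policy, log_visits


# Tiny hypothetical 2x2 table, just to exercise the function.
q = [
    [[-5.0, -4.0, -3.0], [-2.0, -2.5, -1.0]],
    [[-1.0, -6.0, -6.0], [0.0, 0.0, 0.0]],
]
visits = [[10, 0], [3, 99]]
value, policy, log_visits = summarize(q, visits)
```

Cells the agent never visited keep their initial values, so their entries in the value and policy panels say nothing about what was learned; the visit-frequency panel is what distinguishes them.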