Course Structure · Syllabus-Driven
AI: Reinforcement Learning
The complete map of the course — from the exploration–exploitation dilemma of bandits to deep value learning, policy gradients, and multi-agent dynamics. Every one of the 30 sessions is laid out below with its objective, the concepts it covers, the core mathematics, and a link to the live demo that lets you watch the algorithm run.
BCSAI · Third year
6.0 ECTS
30 sessions
Semester 2 · 25–26
Compulsory
English
Overview
Reinforcement Learning (RL) is, in the words of the course, a melting pot where the computational
techniques of dynamic programming, heuristic search in problem spaces, and the functional representation power
of neural networks have converged to build a highly scalable discipline — one that lets us efficiently attack a
wide range of problems previously considered practically unapproachable. An agent acts in an
environment; the environment returns a state and a reward; the
agent's job is to learn a policy π(a|s) that maximises the expected cumulative,
discounted return.
Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + … = Σk≥0 γk Rt+k+1
— the discounted return; γ ∈ [0,1] trades off immediate vs. future reward.
How this page maps to the demos. This is the course's structural companion. Wherever a session
introduces a technique that has an interactive visualisation in this repo, you'll find a
demo link — click through to run the algorithm yourself, tweak
parameters, and watch the update rule shown on this page take effect in real time. Nothing is precomputed; every
chart is a real algorithm running in your browser.
Prerequisites & how this course connects. RL sits at the meeting point of several earlier
BCSAI strands, and the course assumes a working grasp of each. From probability & statistics you need
expectations, conditional probability, and variance — every value function is an expectation and most of the hard
engineering is variance control. From linear algebra & multivariate calculus you need vectors, the
gradient, and the chain rule — these power function approximation and every policy-gradient update. From
programming (Python) you need NumPy fluency and comfort writing small simulations; assignments are Jupyter
notebooks or pure scripts. From machine learning / deep learning you need supervised learning, loss
functions, SGD, and the basics of neural nets — Deep RL (DQN, actor-critic) is "deep learning whose targets the
agent generates for itself". Looking forward, RL is the foundation for advanced electives in robotics, autonomous
systems, operations research, and the RLHF/preference-optimisation techniques behind modern language models.
Weekly study-load note. The official estimate is 3–8 h/week (≈150 h over the semester for
6 ECTS). A realistic rhythm: ~1 h re-reading the matching Sutton & Barto chapter before the session,
~1–2 h consolidating notes and running the linked demo after it, and the rest absorbed by the four assignment
pushes (Temporal Differences S9, Actor-Critic S17, Deep RL S21, and the group project). Load is front-light and
back-heavy: expect calm foundational weeks (S1–S8) and intense implementation weeks around the assignments and the
final project. Because over half the workload (53.3%) is exercises/field work, the single highest-return habit is
to code each algorithm yourself rather than only reading it.
Learning Objectives
The main objective is to introduce students to the field of RL and to set up a framework of knowledge that lets
them make informed analysis of the opportunities and challenges for its successful application in business. The
course aims for a sound understanding of RL techniques and their variations (model-free, model-based,
online/offline RL, behavioural cloning, …) and a good intuition for how to apply them to tasks of one's own design.
It encompasses knowledge about:
01
Fundamentals & theory
The fundamentals and theoretical principles of the discipline — agents, environments, value, policy, the Bellman equations, and convergence.
02
Relevant algorithms
The most relevant algorithms used in practice — from tabular Q-value methods to Deep Q-Learning, policy gradients, actor-critic, and curiosity-driven approaches.
03
Engineering process
The engineering process for developing and managing RL projects: framing the problem, choosing a method, training, evaluation, and iteration.
04
Difficulties & challenges
An understanding of the difficulties, obstacles, and challenges along the path — sample efficiency, stability, reward design, and deployment.
The course deliberately combines a theoretical and conceptual approach with a hands-on technical understanding of
the different stages of an RL project, always keeping a clear perspective on the business issues and opportunities
and on the field's future evolution. Implementations concentrate on Python at a basic-to-intermediate
level of complexity; advanced examples may touch on environments such as LUA (robotics) or Matlab (process control)
for showcase purposes only.
Teaching Methodology
IE University's teaching method is collaborative, active, and applied: students participate throughout the process
to build their knowledge and sharpen their skills, while the professor leads and guides toward the learning
objectives. The course is both lecture- and example-based, with frequent in-class group discussions. Knowledge
blocks are organised by integrating the conceptual/theoretical background first, then moving into
practice/tutorial examples, and finally framing assignments of assorted complexity. The course has six main
elements:
Lectures
Theoretical ideas, concepts, and methods, with on-time checks of understanding through questions and feedback.
interactive dialog
Examples / Tutorials / Cases
Cases and examples tied to the theory and preparatory for assignments; what-if interactive analysis is encouraged.
hands-on
Discussions
Group discussions on the relevant concepts of each section; some are announced in advance and require preparation.
presentation skill
Assignments
Practical exercises implementing RL algorithms in Python (Jupyter or pure scripts; occasionally numba/cython for speed) and analysing results.
implement · experiment
Exams
One formal test/exam, plus a series of "test-exam" practice assignments that simulate an open-book exam to build fluency.
test-exam practice
Group work
A small-group final project presentation in the final part of the course.
final project
Effort distribution (150 hours total)
The learning-activity weighting reflects how a student is expected to spend the 150 hours of the 6-ECTS workload.
Lectures16.7%
Discussions6.7%
Exercises / async / field53.3%
Group work10.0%
Individual studying13.3%
| Lectures | 25.0 h |
| Discussions | 10.0 h |
| Exercises in class, asynchronous sessions, field work | 80.0 h |
| Group work | 15.0 h |
| Individual studying | 20.0 h |
| Total | 150.0 h |
GenAI policy. Generative-AI tools may be used for research, ideation, outlining, proofreading,
coding, diagrams, and image generation with appropriate acknowledgement. GenAI may be used for assignments
and exam preparation but not during the final exam. Students are responsible for validating any AI
outputs; using AI must always be acknowledged (a suggested disclosure format is provided in the syllabus).
Assessment & Evaluation
Roughly one third of the grade rewards attendance/participation, one third the
individual assignments, and the final third the group work and presentation plus
the test/exam. The official weighting is:
Final Exam30%
Individual Assignments30%
Participation20%
Group Assignment10%
Group Presentation10%
Note: the ongoing-grading scoring matrix in the syllabus distributes session points across Participation (20%), Individual Assignment (30%), Group Assignment (20%), and Test/Exam (30%) of the continuous-evaluation total; the criteria table above states the headline grade weights.
30%
Final Exam
One formal test/exam. Most questions are drawn from the practice "test-exams" done during the course. GenAI is not allowed during the final exam.
closed · individual
30%
Individual Assignments
Implement RL algorithms in Python and analyse results. Delivered jointly in dynamically-formed small groups (no two students paired twice); collaboration enforced within the group only.
implement · analyse
20%
Participation
Come prepared as to a company meeting. Direct interaction, in-class group discussions, and inquisitive engagement across the 30 sessions.
come prepared
10%
Group Assignment
Final group project work, with groups optimised from student-informed preferences for the final assignment.
final project
10%
Group Presentation
Presentation, discussion, and Q&A of the group project in the closing sessions (29–30).
present · defend
Attendance, passing & re-sit rules
- Attendance. Students are expected to attend all lectures and tutorials — material covered in class is not always in the readings. The 80% attendance rule is binding.
- Test-exam. The Session-28 test-exam is a concept-summary review, not a final exam; it adds to the rest of the items and has no minimum passing grade.
- Four calls. Each student has four chances to pass a course over two consecutive academic years (ordinary + extraordinary June/July re-sits).
- Attendance failure. Students who do not comply with the 80% rule fail both calls for the year and must re-enrol the next academic year — with no re-sit opportunity.
- Re-sit (June/July). A single comprehensive exam, in-person at the enrolled campus (Segovia or Madrid). The final grade depends on this exam only; continuous evaluation is not counted. Pass mark 5, maximum 8.0 ("notable").
- Retakers (3rd call). Must check the assigned professor's syllabus and contact them about specific criteria; maximum grade 10.0.
- Review session. Grade appeals require having attended the post-exam review session.
- Program rule. Failing more than 18 ECTS in the year after the June/July re-sits leads to being asked to leave the program.
Per-component rubric, deliverable format & tips
The criteria below interpret the syllabus's review-criteria language into concrete expectations. The professor reports the exact structure and review criteria for each specific job; treat this as the default standard.
| Component | What's assessed | Deliverable format | Tips to score well |
| Individual Assignments (30%) |
Correct implementation of the target algorithm; sound experimental design; clear analysis of results (learning curves, ablations); quality of written summary. |
Python — Jupyter notebook or pure script (occasionally numba/cython for speed); a short write-up summarising approach, results, and interpretation. Delivered jointly within the dynamically-formed small group. |
Make results reproducible (fix seeds, report variance over runs); plot a learning curve, not just a final number; explain why the curve looks as it does; acknowledge GenAI use explicitly with prompts. |
| Final Exam (30%) |
Conceptual command of the whole course — definitions, when-to-use-which, and reasoning about algorithm behaviour. Most questions are drawn from the practice test-exams. |
One formal, closed test (multiple-choice, true/false, open). No GenAI permitted during the exam. |
Do every practice test-exam and track your improvement; be able to state each core update rule from memory and explain its bias/variance trade-off; rehearse "compare X vs Y" answers. |
| Participation (20%) |
Coming prepared as to a company meeting; quality of contributions to discussions and in-class group interaction; inquisitiveness. |
Continuous, across all 30 live in-person sessions; some discussions announced in advance and require preparation. |
Read the mapped chapter beforehand; ask one specific question per session; meet the binding 80% attendance rule. |
| Group Assignment (10%) |
An end-to-end RL project: problem framing, method choice and justification, implementation, evaluation, and limitations. |
Group work; groups optimised from student-informed preferences. Code plus a project report. |
Pick a problem that genuinely suits sequential decision-making; justify the method against simpler baselines; report honest failure modes. |
| Group Presentation (10%) |
Clarity of communication; ability to defend design choices in Q&A; visual and narrative quality. |
Live presentation + discussion + Q&A in sessions 29–30. |
Lead with the framing and the result; show one compelling demo or plot; pre-empt the obvious "why not a simpler method?" question. |
Ongoing-grading note. The syllabus's session scoring matrix concentrates assignment points on a few sessions — Individual Assignments on Temporal Differences (S9), Actor-Critic (S17) and Deep RL Practice (S21); the Group Assignment on Summary Recap (S22) and Presentations (S29); and the Test/Exam block on Session 28. The headline criteria weights (30/30/20/10/10) are what determine the final grade.
Program — All 30 Sessions
The program is organised here into eight thematic modules that group the 30 sessions by the arc of
the field: foundations → tabular planning & learning → integration & evolution → frameworks → approximation
& deep policy methods → advanced paradigms → frontiers & applications → assessment. Sessions flagged
assignment, group, or
exam mark deliverables. The disclaimer in the syllabus applies: the coverage is
tentative and pace adapts to the group.
Module 1
Sessions 1–4
Foundations: Agents, Feedback & Bandits
What RL is, where it came from, what an intelligent agent is, and the simplest setting where learning-from-reward already bites: the multi-armed bandit. This module builds the vocabulary the rest of the course leans on.
Module learning outcomes. Articulate the agent–environment loop and the difference between evaluative and instructive feedback; explain the historical convergence of control, search, and learning into modern RL; formalise and solve a bandit problem balancing exploration and exploitation.
S01
Introduction to the Course (Syllabus)
Objective: motivate procedural learning and RL, and survey the contents, approaches, and the relevant issues in the field.
Procedural learning & RL motivation — why learning by trial-and-error with delayed, evaluative reward is a distinct paradigm from supervised and unsupervised learning.
Overview & issues — a map of the course and the recurring tensions (exploration vs. exploitation, bias vs. variance, sample efficiency vs. stability) that will return throughout.
Key idea. RL optimises behaviour from a scalar reward signal alone — there are no labelled "correct actions", only consequences to be maximised over time.
S02
RL Introduction
Objective: identify the building blocks of current RL algorithms and the convergent routes that produced them.
Building blocks — policy, reward signal, value function, and (optionally) a model of the environment.
Convergent routes — how Model Predictive Control, Markov chains, dynamic programming, heuristic search, and genetic algorithms all flow into modern RL.
vπ(s) = Eπ[ Gt | St = s ] — the state-value function: expected return from s under policy π.
Key idea. Value is "reward foresight": a state is good not for its immediate reward but for the stream of reward it leads to.
Key takeaway · Modern RL is the confluence of three historical streams — optimal control (Bellman), trial-and-error learning (Thorndike, law of effect), and temporal-difference prediction.
Reading: Sutton & Barto, Ch. 1 §1.7 — Early history of Reinforcement Learning (and §1.1 for the agent-environment intuition).
S03
RL Basic Concepts
Objective: define intelligent agents and their types, and distinguish the feedback regimes RL operates under.
Intelligent agents & types — reflex, model-based, goal-based, utility-based, and learning agents.
Evaluative vs. instructive feedback — evaluative tells you how good the action was (RL); instructive tells you the correct action regardless of what you did (supervised).
Associative vs. non-associative — whether the best action depends on the situation (associative / contextual) or not (the plain bandit).
Key idea. Evaluative feedback is what forces exploration: you can only judge an action relative to others by trying them.
Key takeaway · The four elements — policy, reward signal, value function, and (optional) model — plus the evaluative/associative axes locate any RL setting in a single map.
Reading: Sutton & Barto, Ch. 1 §1.3 — Elements of Reinforcement Learning (policy, reward, value, model).
S04
Multi-Armed Bandits
Objective: introduce (contextual) bandits as the basis for full RL and as immediately useful business tools.
The k-armed bandit — on each step the agent picks one of k arms and receives a stochastic reward; the goal is to maximise total reward over many pulls. Because there are no states and no transitions, the bandit isolates the one ingredient that makes RL hard — evaluative feedback under uncertainty. The agent keeps a running estimate Q(a) of each arm's mean reward and must decide whether to pull the current best or gather more information about the rest.
Exploration strategies — ε-greedy exploits the best arm but with probability ε picks at random; UCB adds an "optimism" bonus that shrinks as an arm is sampled; Thompson sampling keeps a posterior per arm and samples from it; gradient bandits learn action preferences via softmax. Each is a different answer to "how much should uncertainty be worth?".
Contextual bandits — the action value is conditioned on an observed context x, so Q becomes Q(x,a). This is the bridge from bandits to full RL: add state-to-state transitions and a contextual bandit becomes an MDP.
Business "low-hanging fruit" — A/B testing, ad and content selection, recommendation, and dynamic pricing are direct bandit applications where each "pull" is a user impression and the reward is a click or purchase.
Qn+1(a) = Qn(a) + α [ Rn − Qn(a) ] · UCB: at = argmaxa [ Q(a) + c √(ln t / N(a)) ]
Tiny worked example. Two arms. Arm A pays 1 with prob 0.6, arm B with prob 0.5. Start Q(A)=Q(B)=0, α=0.5. Pull A, get reward 1: Q(A) ← 0 + 0.5·(1−0) = 0.5. Pull A again, get 0: Q(A) ← 0.5 + 0.5·(0−0.5) = 0.25. With UCB and few pulls of B, the √(ln t / N(B)) bonus keeps B's index high, forcing the agent to sample it before it can safely commit to A.
Pitfall. A constant step size α makes the estimate track a non-stationary arm (good for changing environments) but never fully converges; the sample-average 1/N(a) converges but adapts slowly. Choosing ε too small starves exploration; too large wastes reward forever.
Key idea. Exploration is not waste — the √(ln t / N(a)) bonus shows that uncertainty itself has option value worth paying for.
Key takeaway · The bandit is RL stripped to its core dilemma; everything later (Q-values, ε-greedy, optimism) is this lesson generalised to states.
▶ Demo: Multi-Armed Bandits
Reading: Sutton & Barto, Ch. 2 — Multi-armed Bandits. Core sections: §2.3 incremental sample-average, §2.6 UCB, §2.7 gradient bandits. The chapter's running 10-armed testbed is exactly what the linked demo recreates.
Module 2
Sessions 5–9
Tabular Worlds: MDPs, Dynamic Programming, Monte Carlo & TD
The theoretical heart of the course. We formalise the environment as a Markov Decision Process, solve it exactly with dynamic programming when the model is known, and then learn from experience alone with Monte Carlo and temporal-difference methods.
Module learning outcomes. Formalise a problem as a finite MDP; write and apply the Bellman expectation and optimality equations; run value iteration and policy iteration; contrast Monte Carlo and TD estimation and explain bootstrapping, on-policy vs. off-policy, and the role of importance sampling.
S05
Graphs, Search, MDPs
Objective: understand the search space and its representation, the dimensions of heuristic search for prediction and control, and introduce Markov Decision Processes.
Search spaces & heuristics — a problem can be drawn as a graph with states as nodes and actions as edges; classical informed search (A*, best-first) uses a heuristic to estimate cost-to-go. RL generalises this: the value function is a learned, self-improving heuristic, and control means acting greedily with respect to it.
Markov Decision Processes — the tuple ⟨S, A, P, R, γ⟩ formalises a sequential decision problem: state set S, action set A, transition kernel P(s′|s,a), reward R, and discount γ. The defining Markov property is that the next state and reward depend only on the current state-action pair, never on the path that led there.
Policies & the two value functions — a policy π(a|s) induces a state-value vπ(s) and an action-value qπ(s,a); the Bellman expectation equation makes each value a one-step lookahead over its own successors — a recursion that is the engine of every method that follows.
vπ(s) = Σa π(a|s) Σs′,r p(s′,r | s,a) [ r + γ vπ(s′) ] — the Bellman expectation equation.
Tiny worked example. A 2-state chain: from S0 action "go" moves to S1 (reward 0), S1 is terminal with reward +10, γ=0.9. Then v(S1)=10 and v(S0)=0 + 0.9·v(S1)=9. The single discount step turns a future +10 into a present value of 9 — that 0.9 factor is the whole reason "near rewards beat far rewards".
Connects to. A contextual bandit is an MDP with one step and no transitions; dynamic programming (S7) solves this equation exactly; TD (S9) samples it; function approximation (S15) replaces the table v(s) with a parameterised v̂(s,w).
Key idea. The Markov property is what makes value functions well-defined and recursive — the present state is a sufficient statistic of history.
Key takeaway · Almost everything in RL is an attempt to compute or approximate the solution of this one recursive equation.
▶ Demo: MDPs & Gridworld
Reading: Sutton & Barto, Ch. 3 — Finite Markov Decision Processes. Anchor sections: §3.1 the agent–environment interface, §3.5 policies and value functions, §3.6 optimal value functions. The "recycling robot" and gridworld examples motivate the formalism.
S06
Review Practice — MDPs
Objective: review and apply MDP concepts on practical cases, including analytical solutions for simple problems.
Practical MDP cases — intuition-building and closed-form solutions for small, tractable MDPs.
Backup diagrams — the standard technique for representing how value flows from successor states back to the current one.
Key idea. A backup diagram is the picture of a Bellman equation — reading it left-to-right is exactly one update.
▶ Demo: MDPs & Gridworld
S07
Dynamic Programming
Objective: solve known MDPs exactly using the iterative value function and Generalized Policy Iteration.
Value iteration — repeatedly apply the Bellman optimality update until v converges to v*.
Policy iteration — alternate full policy evaluation with greedy policy improvement.
Generalized Policy Iteration (GPI) — the unifying view: evaluation and improvement chasing each other to a fixed point.
vk+1(s) = maxa Σs′,r p(s′,r | s,a) [ r + γ vk(s′) ] — the Bellman optimality backup (value iteration).
# Value iteration
init V(s) ← 0 for all s
repeat
Δ ← 0
for each state s:
v ← V(s)
V(s) ← max_a Σ_s′,r p(s′,r|s,a) · [ r + γ·V(s′) ]
Δ ← max(Δ, |v − V(s)|)
until Δ < θ # θ = small convergence threshold
π(s) ← argmax_a Σ_s′,r p(s′,r|s,a)·[ r + γ·V(s′) ] # extract greedy policy
Tiny worked example. Linear gridworld [ T(+1) | A | B ], γ=1, deterministic moves, reward 0 except entering T. Sweep 1: V(A)=max(go-right→V(B)=0)=0, V(B)=max(go-left→V(A)=0, go-right→T=+1)=1. Sweep 2: V(A)=max(→V(B)=1)=1. Values converge in two sweeps because information propagates one cell per sweep — exactly what the demo animates.
Pitfall. DP is exact but assumes a perfect known model p(s′,r|s,a) and visits every state every sweep — it does not scale to large or unknown environments (the "curse of dimensionality"). That limitation is the entire motivation for the sampling methods that follow.
Key idea. DP needs a perfect model and sweeps every state — it is the gold-standard planner that all model-free learners try to approximate from samples.
Key takeaway · Value iteration = repeatedly apply the optimality backup; policy iteration = alternate full evaluation with greedy improvement; GPI is the umbrella over both.
▶ Demo: Dynamic Programming
Reading: Sutton & Barto, Ch. 4 — Dynamic Programming. Key sections: §4.1 policy evaluation, §4.3 policy iteration, §4.4 value iteration, §4.6 Generalized Policy Iteration. Jack's-car-rental and gambler's-problem examples are the canonical DP exercises.
S08
Monte Carlo Methods
Objective: learn values from complete episodes without a model, and master importance sampling.
Monte Carlo estimation — average the actual returns observed after visiting a state; no bootstrapping, learns only from finished episodes.
Importance sampling — reweight returns generated under a behaviour policy to estimate values under a target policy — the engine of off-policy learning.
Merging with DP — how MC and DP combine as two ends of the bias–variance spectrum.
V(St) ← V(St) + α [ Gt − V(St) ] · IS ratio ρ = Πk π(Ak|Sk) / b(Ak|Sk)
Tiny worked example. One episode visits S → reward sequence (+1, 0, +2) until termination, γ=1, so G = 3. With V(S)=0 and α=0.1: V(S) ← 0 + 0.1·(3−0) = 0.3. A second episode returning G=1 gives V(S) ← 0.3 + 0.1·(1−0.3) = 0.37 — the estimate drifts toward the average return, which is exactly what MC computes in the limit.
Pitfall. MC needs complete episodes, so it cannot be used in continuing tasks, and ordinary importance sampling can have infinite variance when the ratio ρ explodes; weighted IS is biased but far more stable in practice.
Connects to. MC and DP are the two extremes — MC samples a full return with no model and no bootstrap; DP bootstraps with a full model. TD (next session) sits exactly in between, and TD(λ) slides continuously along the line connecting them.
Key idea. MC is unbiased but high-variance: it waits for the true return Gt instead of guessing it — the opposite trade-off to TD.
Key takeaway · Average actual returns = unbiased values, at the cost of high variance and an episodic-only restriction; importance sampling is what makes it work off-policy.
▶ Demo: Monte Carlo & TD
Reading: Sutton & Barto, Ch. 5 — Monte Carlo Methods. Anchor sections: §5.1 MC prediction, §5.3 MC control, §5.5 off-policy via importance sampling, §5.6 weighted importance sampling. The blackjack example runs through the whole chapter.
S09
Temporal Differences
assignment
Objective: integrate the prior concepts into the cornerstone of RL for finite-MDP-like problems; work the Bellman optimality equation through TD control.
TD(0) prediction — bootstrap: update toward an estimate (r + γV(s′)) rather than the full return.
SARSA (on-policy) & Q-learning (off-policy) — the two canonical TD control algorithms; Expected SARSA as the lower-variance middle ground.
TD(λ) & n-step bootstrapping — eligibility traces interpolate smoothly between TD(0) and Monte Carlo.
Q(s,a) ← Q(s,a) + α [ r + γ maxa′ Q(s′,a′) − Q(s,a) ] — the Q-learning update (off-policy TD control).
TD(λ): δt = rt+1 + γV(st+1) − V(st); et(s) = γλ et−1(s) + 𝟙[s=st]; V(s) ← V(s) + α δt et(s)
# Q-learning (off-policy TD control)
init Q(s,a) arbitrarily; choose α, ε, γ
for each episode:
s ← initial state
while s not terminal:
a ← ε-greedy(Q, s) # behaviour policy explores
take a, observe r, s′
Q(s,a) ← Q(s,a) + α·[ r + γ·max_a′ Q(s′,a′) − Q(s,a) ]
s ← s′ # target uses greedy max → off-policy
Tiny worked example. State s, action a, reward r=−1, γ=0.9, α=0.5. Currently Q(s,a)=2 and the best next value max Q(s′,·)=10. TD target = −1 + 0.9·10 = 8; error δ = 8 − 2 = 6; update Q(s,a) ← 2 + 0.5·6 = 5. The agent revised its estimate using a guess of the future (10) instead of waiting for the real return — that is bootstrapping.
Pitfall. On the cliff-walking task, Q-learning learns the optimal cliff-edge path but ε-greedy exploration occasionally falls off, so its online reward is worse than cautious SARSA — a classic illustration that "optimal policy" and "best behaviour while learning" differ.
Connects to. SARSA's update is the actor-critic critic in miniature; Q-learning's max becomes the DQN target (S21) once Q is a neural network; n-step and TD(λ) are the same idea that reappears as GAE in modern policy-gradient methods.
Key idea. Bootstrapping trades variance for bias — TD can learn online, step-by-step, from incomplete episodes, which is why it is the workhorse of practical RL.
Key takeaway · TD = sampled experience + bootstrapped target; SARSA learns the policy it follows, Q-learning learns the greedy optimum regardless of how it explores.
▶ Demo: SARSA / Q-Learning (Cliff Walking)
Reading: Sutton & Barto, Ch. 6 — Temporal-Difference Learning (§6.2 TD prediction, §6.4 SARSA, §6.5 Q-learning, §6.6 Expected SARSA); Ch. 7 — n-step Bootstrapping (§7.1–7.2). The cliff-walking example in §6.5 is precisely the demo.
Module 3
Sessions 10–12
Integration, Planning & Evolutionary Computation
Putting the tabular toolkit to work on a real problem, comparing model-based and model-free learning, and stepping sideways into gradient-free optimisation — genetic algorithms, Pareto optimality, and neuro-evolution.
Module learning outcomes. Solve an integrated problem with RL and justify the choice against alternatives; distinguish model-based planning from model-free learning; explain genetic algorithms, multi-objective/Pareto optimisation, and NEAT, and how neuro-evolution relates to mainstream RL.
S10
A First Practice — Integration of Concepts
Objective: apply the studied principles to an initial problem and compare RL against alternative solutions.
Comparative solutions — hardcoded heuristics, genetic algorithms, Hamiltonian paths vs. an RL approach; live charting of each.
Rationale for RL — when and why a learned policy beats a hand-engineered or search-based solution.
Key idea. RL is not always the answer — the discipline is choosing it when the problem has delayed reward, large state spaces, and no good model.
▶ Demo: MC / TD
▶ Demo: DP
S11
Learning and Planning
Objective: compare in depth the characteristics and differences of model-based vs. model-free methods.
Planning with a model — using a learned or given model to simulate experience (e.g. Dyna-style integration of planning, acting, and learning).
Model-free learning — learning directly from real experience without ever building a transition model.
Key idea. A model multiplies the value of every real sample by letting you "imagine" more — at the risk of compounding model error.
Reading: Sutton & Barto, Ch. 8 — Planning & Learning with Tabular Methods.
S12
Evolutionary Computation
Objective: explore genetic algorithms and beyond — non-dominated multi-objective optimisation, the Pareto frontier, and NEAT.
Genetic algorithms — populations of candidate policies evolved by selection, crossover, and mutation; gradient-free black-box optimisation.
Pareto frontier — non-dominated solutions in multi-objective problems where no objective can improve without hurting another.
Neuro-Evolution / NEAT — evolving the weights and topology of neural-network policies, and links to mainstream RL.
x* ∈ Pareto ⇔ ∄ x : (∀i fᵢ(x) ≥ fᵢ(x*)) ∧ (∃j fⱼ(x) > fⱼ(x*)) — non-dominance.
# Generic genetic algorithm for policy search
population ← N random policies
repeat
evaluate fitness(p) = episode return, for each p
parents ← select(population) # e.g. tournament / roulette
children ← crossover(parents) then mutate(children)
population ← elitism(best) + children
until budget exhausted
Tiny worked example (Pareto). Three policies scored on (speed, safety): A=(9,2), B=(5,5), C=(2,9). None dominates another — improving speed costs safety — so all three lie on the Pareto frontier, whereas D=(4,4) is dominated by B and is discarded. A multi-objective GA (e.g. NSGA-II) keeps the whole frontier rather than a single "best".
Key idea. Evolution needs no gradient and no Markov assumption — it optimises whole policies by survival, which can succeed where backprop-through-reward struggles.
Key takeaway · Gradient-free, embarrassingly parallel, and robust to sparse or non-differentiable rewards — the trade-off is sample inefficiency versus gradient-based RL.
▶ Demo: Evolutionary Methods
Module 4
Sessions 13–14
Tooling: RL Frameworks & Intermediate Practice
The practical engineering layer — the libraries practitioners actually use, and a comparative training practice across algorithmic methods.
Module learning outcomes. Compare major RL frameworks and select one for a task; set up a training/evaluation loop and compare algorithmic methods empirically.
S13
RL Frameworks
Objective: survey and compare the main RL frameworks in current use.
Environment & algorithm libraries — OpenAI Gym/Gymnasium (environments), Stable-Baselines (algorithms), Ray/RLlib (scalable distributed training).
Pros, cons & requisites — when each framework is the right choice and what it demands of your problem.
Key idea. A standard environment interface (reset / step → state, reward, done) is what lets one algorithm be dropped onto a thousand problems.
S14
Intermediate Practice
Objective: analyse and compare different algorithmic methods for training these models.
Empirical comparison — running several algorithms on the same task and reading learning curves, sample efficiency, and stability.
Key idea. In RL, the "best" algorithm is task-dependent — comparison on the actual problem beats reputation.
Module 5
Sessions 15–18
Function Approximation & Deep Policy Methods
Where RL goes deep. When state spaces are too large to tabulate, we approximate value and policy functions — with neural networks — and learn the policy directly via policy gradients and actor-critic.
Module learning outcomes. Explain why tabular methods fail to scale and how function approximation generalises; state the policy gradient theorem and implement REINFORCE; describe actor-critic and the advantage function; critique sample solutions and diagnose common training failures.
S15
Approximation Methods
Objective: make the key strategic shift to approximating value, q(s,a), or policy functions — with a focus on ANN approximators.
Why approximate — tabular storage and learning are infeasible for large or continuous state spaces; approximation buys generalisation across similar states.
Feature representations — tile coding, RBFs, polynomial features, and learned neural features.
Eligibility traces with approximation — extending TD(λ) into the function-approximation setting.
w ← w + α [ Ut − v̂(St,w) ] ∇w v̂(St,w) — semi-gradient value-function update.
Tiny worked example. Linear value v̂(s,w)=wᵀx(s) with feature x(s)=[1, s]. Suppose w=[0, 1], state s=3 so v̂=3, and a TD target U=5. Error = 5−3 = 2, gradient ∇wv̂ = x = [1,3]. With α=0.1: w ← [0,1] + 0.1·2·[1,3] = [0.2, 1.6]. One update nudges every state's prediction, not just s=3 — that is generalisation in action.
Pitfall — the deadly triad. Combine bootstrapping + off-policy training + function approximation and the value estimate can diverge to infinity even on tiny problems (Baird's counterexample). The "semi-gradient" name flags that we ignore the target's dependence on w, which is part of why stability is fragile.
Key idea. Generalisation is double-edged: updating one state's value changes others — the "deadly triad" (bootstrapping + off-policy + approximation) can diverge.
Key takeaway · Approximation buys scale and generalisation but forfeits the convergence guarantees of the tabular world — the rest of deep RL is about clawing stability back.
▶ Demo: Function Approximation
Reading: Sutton & Barto, Ch. 9 §9.7 (Nonlinear/ANN approximation), Ch. 10, 11, 12 (Eligibility traces).
S16
Policy Gradients
Objective: learn the theory behind policy-gradient methods and the Policy Gradient Theorem.
Parameterised policies — represent π(a|s,θ) directly and optimise θ to maximise expected return J(θ).
Policy Gradient Theorem — the gradient of return can be written without differentiating the environment dynamics.
REINFORCE — Monte Carlo policy gradient: scale ∇log π by the observed return.
∇θJ(θ) = Eπ[ ∇θ log π(a|s,θ) · Gt ] — the policy gradient theorem (REINFORCE estimator).
# REINFORCE (Monte Carlo policy gradient)
for each episode generated with π(·|·,θ):
observe trajectory s₀,a₀,r₁, … , s_{T−1},a_{T−1},r_T
for t = 0 … T−1:
G_t ← Σ_{k=t}^{T−1} γ^{k−t} · r_{k+1} # return from step t
θ ← θ + α · γ^t · G_t · ∇_θ log π(a_t | s_t, θ)
Tiny worked example. Two actions with softmax preferences; currently π(left)=0.5. An episode where we took left returns G=+4. The update multiplies ∇log π(left) by +4, raising π(left) on the next pass; had the same action returned −4, the identical gradient would be scaled negative and π(left) would fall. The sign and size of the return are the learning signal.
Pitfall. The plain REINFORCE estimator is unbiased but extremely high variance — returns vary wildly between episodes. Subtracting a state-dependent baseline b(s) (typically V(s)) leaves the gradient unbiased but slashes variance; this baseline is exactly the critic introduced next session.
Key idea. Push up the log-probability of actions that led to high return and down those that didn't — no value function strictly required, but variance is high.
Key takeaway · Optimise the policy directly by gradient ascent on expected return — works in continuous action spaces where value-based argmax is intractable, but needs variance reduction to be practical.
▶ Demo: Policy Gradients (REINFORCE)
Reading: Sutton & Barto, Ch. 13 — Policy Gradient Methods. Anchor sections: §13.1 policy parameterisation, §13.2 the policy gradient theorem, §13.3 REINFORCE, §13.4 REINFORCE with baseline.
S17
Actor-Critic
assignment
Objective: define and analyse actor-critic methods, the concept of advantage, and the main algorithms.
Actor + critic — the actor (policy) chooses actions; the critic (value function) evaluates them and provides a low-variance learning signal.
Advantage — A(s,a) = Q(s,a) − V(s): how much better an action is than the state's average; subtracting the baseline V(s) cuts variance without adding bias.
Main algorithms — A2C/A3C and pointers to PPO as the modern, stable successor.
θ ← θ + α ∇θ log π(a|s,θ) · Â, with  = r + γV(s′) − V(s) — A2C advantage update.
# One-step Actor-Critic (A2C core)
init actor θ, critic w
for each step (s, a, r, s′):
δ ← r + γ·V(s′,w) − V(s,w) # TD error = advantage estimate Â
w ← w + α_w · δ · ∇_w V(s,w) # critic: reduce TD error
θ ← θ + α_θ · δ · ∇_θ log π(a|s,θ) # actor: push policy toward good Â
Tiny worked example. Critic says V(s)=2, V(s′)=5; observed r=1, γ=0.9. Advantage  = 1 + 0.9·5 − 2 = 3.5 > 0, so the action was better than the critic expected → the actor increases its probability and the critic raises V(s) toward the target 5.5. A negative  would push the action's probability down instead.
Pitfall. Actor and critic chase a moving target each other defines; if the critic learns too slowly the advantage is noisy, if the actor moves too fast the policy collapses. Stable modern variants (PPO) add a clipped objective to bound how far the policy can shift per update.
Connects to. The critic's δ is exactly the TD error from S9; the actor's update is REINFORCE from S16 with Gt replaced by Â. Actor-critic is literally the marriage of the previous two sessions.
Key idea. The critic turns REINFORCE's noisy full-episode return into a one-step bootstrap estimate — dramatically lower variance, faster learning.
Key takeaway · Actor proposes, critic evaluates; the advantage Â=Q−V is the low-variance, zero-bias signal that makes policy gradients practical at scale.
▶ Demo: Actor-Critic (A2C)
Reading: Sutton & Barto, Ch. 13 §13.5 — Actor-Critic methods (and §13.6 for the continuing-task / average-reward form).
S18
Practice Review
Objective: dissect and discuss sample solutions and explore the main difficulties.
Solution dissection — reading others' implementations to internalise good structure and common pitfalls.
Main difficulties — instability, hyperparameter sensitivity, reward scaling, and debugging silent failures.
Key idea. Most RL "bugs" are not crashes — they are agents that learn the wrong thing; reading curves and reward signals is a core skill.
Module 6
Sessions 19–22
Advanced Paradigms: Problem Framing, Model-Based & Deep RL
Choosing the right solution method for a hard classical problem, then the advanced families — model-based, offline, and inverse RL — culminating in a deep-RL practice on an image-based game and a recap toward Rainbow-style methods.
Module learning outcomes. Frame a hard combinatorial problem across solution families; explain model-based, offline, and inverse RL and their relation to imitation learning; connect deep value learning (DQN) to function approximation and apply it to a pixel-based task.
S19
Problem Approach Framework
Objective: exhaustively analyse a classical problem (the Travelling Salesman Problem) across every line of solution.
Solution families — Christofides' approximation, genetic algorithms, ant-colony optimisation, and simple-to-complex RL.
"When, why, how" — initial framework for deciding which approach fits which problem and constraints.
Key idea. The skill is meta: matching a problem's structure (combinatorial, sequential, stochastic) to the cheapest method that solves it well enough.
S20
Model-Based, Off-line, Inverse RL
Objective: introduce inverse RL and connect it to behavioural cloning, imitation learning, and teacher agents.
Model-based RL — learning a dynamics model and planning within it.
Offline RL — learning a policy purely from a fixed dataset, with no further environment interaction.
Inverse RL — inferring an agent's reward/objectives by observing its behaviour; links to behavioural cloning and imitation learning.
IRL goal: find R such that Eπ*[ Σ γt R(st) ] ≥ Eπ[ Σ γt R(st) ] ∀π — the expert's policy should look optimal under the recovered reward.
Pitfall. Behavioural cloning copies the expert's action mapping but suffers distribution shift — once the agent drifts off the demonstrated states it has never learned to recover, and errors compound. IRL is more robust because it recovers why the expert acted, but it is ill-posed (many rewards explain the same behaviour).
Key idea. Inverse RL flips the problem — instead of "given rewards, find behaviour", it asks "given behaviour, recover the rewards" — the basis for learning from demonstration.
Key takeaway · Model-based plans inside a learned model, offline RL learns from a fixed log with no interaction, and inverse RL/imitation learns the objective itself from demonstrations — three answers to "what if real interaction is scarce or dangerous?".
S21
Deep RL Practice
assignment
Objective: connect RL and ANN approximation by applying deep RL to an image-based game-learning problem and comparing results.
Deep Q-Networks (DQN) — a neural Q-function over raw (e.g. pixel) state, stabilised by an experience replay buffer, a periodically-updated target network, and ε-decay.
From pixels to policy — the trio of tricks that made Atari-from-pixels work.
L(θ) = E[ ( r + γ maxa′ Q(s′,a′; θ⁻) − Q(s,a; θ) )² ] — DQN loss; θ⁻ is the frozen target network.
# Deep Q-Network (DQN) training loop
init Q-net θ, target θ⁻ ← θ, replay buffer D
for each step:
a ← ε-greedy(Q_θ, s); take a, observe r, s′
store (s, a, r, s′) in D
sample minibatch (s,a,r,s′) ~ D # replay → decorrelate
y ← r + γ·max_a′ Q(s′,a′; θ⁻) # target uses frozen θ⁻
θ ← θ − η·∇_θ ( y − Q(s,a; θ) )²
every C steps: θ⁻ ← θ # refresh target net
Tiny worked example. Transition (s,a,r=1,s′), γ=0.99, target net gives max Q(s′,·;θ⁻)=10, online net Q(s,a;θ)=4. Target y = 1 + 0.99·10 = 10.9; squared-error loss = (10.9−4)² = 47.6; the gradient step nudges Q(s,a;θ) up toward 10.9. Crucially y is computed from the frozen θ⁻, so the target doesn't move while we fit it.
Pitfall. The max operator makes vanilla DQN overestimate action values (maximisation bias); Double DQN fixes this by selecting the action with θ but evaluating it with θ⁻. Too-frequent target updates re-introduce the chasing-its-own-tail instability; too-infrequent slows learning.
Key idea. Replay decorrelates samples and the target network stops the regression "chasing its own tail" — the two fixes that tamed the deadly triad enough to learn from pixels.
Key takeaway · DQN = Q-learning with a neural Q-function, made stable by experience replay + a frozen target network; Rainbow (S22) stacks several further fixes on top.
▶ Demo: Deep Q-Networks
▶ Demo: Function Approximation
S22
Summary Recap
group
Objective: recap all concepts and algorithms so far and extend toward new ones.
Consolidation — tying together bandits → MDPs → DP → MC/TD → approximation → policy methods → deep RL.
Rainbow algorithms — introductory extension to combined improvements (Double DQN, Dueling, Prioritised Replay, Distributional, Noisy Nets, n-step).
Key idea. Rainbow shows the modern pattern — strong agents are usually a careful combination of several orthogonal improvements, not one silver bullet.
Module 7
Sessions 23–27
Frontiers, Applications & Multi-Agent RL
The research edge and the real world — recent papers and innovation trends, business applications across industries, multi-agent dynamics, and the landmark breakthroughs and companies that define the field.
Module learning outcomes. Read and discuss recent RL research; identify obstacles to real-world deployment; explain multi-agent RL dynamics; map flagship applications (AlphaGo, AlphaFold-2, fusion control) to the techniques behind them.
S23
Research Papers, Innovation Trends (A)
exam practice
Objective: study and discuss landmark RL papers and summarise new approaches. (Examination practice: be prepared to answer questions about RL.)
New approaches — reward shaping, finite-automata / reward-machine rewards, and hierarchical RL (options, sub-goals).
Paper discussion — reading methodology and critically appraising results.
R′(s,a,s′) = R(s,a,s′) + γΦ(s′) − Φ(s) — potential-based shaping: leaves the optimal policy unchanged while densifying reward.
Key idea. Reward design is often the highest-leverage choice in an RL project — shaping and structured rewards can make an intractable problem learnable.
Pitfall. Naïve bonus rewards invite reward hacking — the agent maximises the proxy, not the goal (e.g. looping to farm a shaping bonus). Potential-based shaping is the one form provably safe from changing the optimum.
Key takeaway · Hierarchical RL, reward machines, and shaping all attack the same bottleneck — sparse, long-horizon reward — by injecting structure the raw MDP lacks.
Reading: Sutton & Barto, Ch. 17 — Frontiers (§17.1 general value functions & auxiliary tasks, §17.4 designing reward signals).
S24
Research Papers, Innovation Trends (B)
Objective: continue the study of key papers and analyse obstacles and challenges in the field and in business application.
Open challenges — sample efficiency, safe exploration, sim-to-real transfer, reward hacking, and reproducibility.
Business obstacles — what stands between a working prototype and a deployed RL system.
Key idea. Most industrial RL failures are not algorithmic — they are about data, simulation fidelity, and aligning the reward with the true business objective.
S25
RL Applications
Objective: explore practical business cases of RL across different industries.
Cross-industry cases — recommendation, dynamic pricing, supply chain, finance, energy, robotics, and operations.
Key idea. RL fits problems framed as sequential decisions under uncertainty with a measurable objective — recognising that framing is the first business skill.
S26
Multi-Agent RL
Objective: enter the multi-agent RL (MARL) sub-field and generalise the basic principles to multiple coexisting learners.
MARL dynamics — cooperation, competition, and mixed settings; the environment becomes non-stationary as other agents learn.
Game-theoretic lens — iterated games (e.g. the Axelrod tournament / iterated prisoner's dilemma): Tit-for-Tat, Always Defect, Grim, Pavlov.
Tiny worked example. Iterated prisoner's dilemma payoffs (mutual cooperate 3/3, mutual defect 1/1, sucker/temptation 0/5). Two Tit-for-Tat agents lock into cooperate–cooperate (3 each per round); a Tit-for-Tat meeting Always-Defect loses the first round then matches defection, settling at 1 each. The demo's Axelrod tournament shows why reciprocal, forgiving strategies dominate over many rounds.
Key idea. With multiple learners the ground shifts under each agent — convergence guarantees from single-agent RL no longer hold, and equilibria replace optima.
Key takeaway · Non-stationarity is the central MARL difficulty; solution concepts borrow from game theory (Nash equilibria) rather than single-agent optimality.
▶ Demo: Multi-Agent RL
S27
Top Cases & Companies
Objective: study the field's leading companies and breakthrough examples.
Breakthroughs — AlphaGo and AlphaZero (games), AlphaFold-2 (protein structure), tokamak fusion-reactor plasma control, and beyond.
Key idea. The headline successes share a recipe — a near-perfect simulator, massive self-play or search, and deep function approximation working together.
Reading: Sutton & Barto, Ch. 16 — Applications and Case Studies.
Module 8
Sessions 28–30
Assessment: Test-Exam & Group Project Presentations
The closing module consolidates and assesses: a concept-review test-exam and the final group project presentations with discussion and Q&A.
Module learning outcomes. Demonstrate consolidated knowledge under exam conditions; present an end-to-end RL project clearly and defend design choices in Q&A.
S28
Test-Exam
test-exam
Objective: a concept-summary review examination — explicitly not a final exam.
Scope — covers the concepts and algorithms across the course; its score adds to the other evaluation items.
No minimum — no minimum passing grade is required for this item.
Key idea. The test-exam exists to make the real final exam familiar — practice the formats (multiple-choice, true/false, open) before they count fully.
S29
Group Projects Presentations (A)
group
Objective: present, discuss, and field Q&A on the group projects (first half).
Presentation & defence — communicating an RL project's framing, method, results, and limitations to peers.
Key idea. Communicating why a method was chosen and what its limits are matters as much as the result itself.
S30
Group Projects Presentations (B), Wrap-Up
group
Objective: remaining group presentations, discussion, Q&A, and course wrap-up.
Wrap-up — synthesis of the course arc and pointers to where the field is heading.
Key idea. The course ends where the field begins — with a well-rounded foundation to keep learning RL independently.
Key Concepts — Glossary
A quick-reference glossary of the ~50 core terms used throughout the course, ordered roughly from foundations to advanced methods. Notation: s state, a action, r reward, γ discount, π policy, α step size, θ/w parameters.
- Agent / Environment
- The learner/decision-maker and everything it interacts with. They exchange actions, states, and rewards in a loop.
- Policy π(a|s)
- A mapping from states to a probability distribution over actions — the agent's behaviour.
- Reward & Return
- Reward r is the immediate scalar signal; return Gt = Σ γkr is the discounted cumulative reward the agent maximises.
- Discount factor γ
- A value in [0,1] weighting future rewards; smaller γ makes the agent myopic, larger γ far-sighted.
- State-value V(s)
- Expected return starting from state s and following policy π thereafter.
- Action-value Q(s,a)
- Expected return after taking action a in state s, then following π — the basis of value-based control.
- Markov property
- The future depends only on the current state, not the history that led there — the present state is a sufficient statistic.
- MDP ⟨S,A,P,R,γ⟩
- Markov Decision Process: the formal model of a sequential decision problem with stochastic transitions and rewards.
- Bellman equations
- Recursive consistency conditions for value functions; the expectation form evaluates π, the optimality form defines v*/q*.
- Dynamic Programming
- Exact solution of a known MDP by iterating Bellman backups — value iteration and policy iteration.
- Generalized Policy Iteration
- The interplay of policy evaluation and policy improvement converging to the optimal policy.
- Exploration vs. exploitation
- The trade-off between trying uncertain actions to gain information and choosing the current best to gain reward.
- ε-greedy / UCB / Thompson
- Strategies for managing exploration in bandits: random slack, optimism under uncertainty, and posterior sampling.
- Monte Carlo methods
- Estimate values by averaging complete observed returns — unbiased but high-variance, episodic only.
- Temporal-Difference (TD)
- Learn from incomplete episodes by bootstrapping toward r + γV(s′); the workhorse of practical RL.
- Bootstrapping
- Updating an estimate toward another estimate rather than a true sampled target — trades variance for bias.
- SARSA vs. Q-learning
- On-policy TD control (learns the policy it follows) vs. off-policy (learns the greedy optimal policy).
- Expected SARSA
- Uses the expected next-action value under the policy, reducing variance versus plain SARSA.
- TD(λ) & eligibility traces
- A spectrum between TD(0) and Monte Carlo; traces credit recently-visited states for new errors.
- On-policy / Off-policy
- Whether the agent learns about the policy it is following, or about a different (e.g. greedy) target policy.
- Importance sampling
- Reweighting returns from a behaviour policy to estimate values under a target policy — enables off-policy learning.
- Function approximation
- Representing V, Q, or π with parameters (e.g. a neural net) so RL scales to large/continuous spaces via generalisation.
- Deadly triad
- The instability that can arise when bootstrapping, off-policy learning, and function approximation are combined.
- Policy Gradient Theorem
- Expresses ∇J(θ) as an expectation of ∇log π · return, enabling direct optimisation of parameterised policies.
- REINFORCE
- Monte Carlo policy gradient: update θ by ∇log π(a|s) scaled by the episode return.
- Baseline & Advantage
- Subtracting V(s) (a baseline) gives A(s,a)=Q−V, reducing gradient variance without bias.
- Actor-Critic / A2C
- An actor (policy) trained with a critic's (value) low-variance advantage estimate.
- DQN
- Deep Q-Network: a neural Q-function stabilised by experience replay and a target network.
- Experience replay
- A buffer of past transitions sampled randomly to decorrelate updates and improve data efficiency.
- Target network
- A periodically-frozen copy of the Q-network used to compute stable regression targets.
- Model-based vs. model-free
- Whether the agent learns/uses a model of dynamics to plan, or learns values/policy directly from experience.
- Inverse RL / Imitation
- Recovering the reward function (IRL) or copying behaviour (behavioural cloning) from observed demonstrations.
- Pareto frontier / NEAT
- The set of non-dominated trade-offs in multi-objective optimisation; NEAT evolves neural-net weights and topology.
- Multi-Agent RL (MARL)
- RL with several simultaneously-learning agents, making the environment non-stationary; analysed via game theory.
- Reward shaping
- Adding auxiliary reward terms to guide learning toward sparse-reward goals without changing the optimal policy (potential-based shaping is provably policy-preserving).
- Trajectory / Episode
- A trajectory is a sequence s₀,a₀,r₁,s₁,…; an episode is a trajectory that ends in a terminal state. Episodic tasks reset; continuing tasks never terminate.
- Horizon
- How far into the future returns are accumulated — finite (fixed length), infinite-discounted (via γ), or average-reward.
- Stationary vs. non-stationary
- Whether the environment's dynamics/rewards change over time; non-stationarity favours constant step sizes and is intrinsic to multi-agent settings.
- Stochastic vs. deterministic policy
- A stochastic policy outputs a distribution over actions (needed for exploration and continuous control); a deterministic one outputs a single action.
- Greedy / ε-greedy action
- Greedy picks argmaxa Q(s,a); ε-greedy does so with probability 1−ε and explores at random with probability ε.
- Softmax / Boltzmann policy
- π(a|s) ∝ exp(preference/τ): a temperature-controlled smooth alternative to ε-greedy, and the policy form used by policy-gradient methods.
- Bellman optimality equation
- v*(s) = maxa Σ p(s′,r|s,a)[r+γv*(s′)] — the fixed-point characterising the best achievable value.
- Value vs. policy iteration
- Two DP schemes: value iteration applies the optimality backup directly; policy iteration alternates full evaluation with greedy improvement.
- Backup diagram
- A small tree picturing how value flows from successor states/actions back to the current node — the visual form of a Bellman update.
- TD error δ
- δ = r + γV(s′) − V(s): the surprise between a bootstrapped target and the current estimate; the learning signal for TD and the critic.
- n-step return
- A target using n real rewards then bootstrapping: interpolates between one-step TD (n=1) and Monte Carlo (n=∞).
- Maximisation bias
- The systematic overestimation introduced by the max operator in Q-learning/DQN; mitigated by Double Q-learning.
- Semi-gradient methods
- Function-approximation updates that ignore the target's dependence on the parameters — efficient but a source of instability.
- Tile coding / RBF features
- Classic fixed feature representations for linear function approximation, partitioning the state space into overlapping regions.
- Generalized Advantage Estimation
- A λ-weighted blend of n-step advantage estimates that trades bias for variance — the modern policy-gradient standard.
- PPO / TRPO
- Trust-region / clipped-objective policy-gradient methods that bound how far the policy may move per update for stable training.
- A3C / A2C
- Asynchronous (A3C) and synchronous (A2C) advantage actor-critic — parallel actors feeding a shared advantage-based update.
- Rainbow
- A DQN agent combining six orthogonal improvements (Double, Dueling, Prioritised Replay, Distributional, Noisy Nets, n-step).
- Offline (batch) RL
- Learning a policy purely from a fixed dataset with no further interaction; the central challenge is distributional shift.
- Behavioural cloning
- Supervised imitation that maps expert states to actions; simple but brittle to compounding distribution shift.
- Hierarchical RL / options
- Temporally-extended sub-policies ("options") that let an agent plan over abstract, multi-step actions toward sub-goals.
- Reward hacking
- An agent maximising a flawed proxy reward in unintended ways instead of the designer's true objective.
- Sample efficiency
- How much environment interaction a method needs to reach a given performance — a key axis distinguishing RL algorithms.
- Sim-to-real transfer
- Carrying a policy trained in simulation over to the real world despite the "reality gap"; aided by domain randomisation.
- Credit assignment
- Attributing a delayed outcome to the earlier actions responsible for it — the core difficulty eligibility traces and advantages address.
Bibliography
Compulsory
Richard S. Sutton & Andrew G. Barto (2018). Reinforcement Learning: An Introduction, 2nd edition. MIT Press / AbeBooks. ISBN 9780262039246 (Digital).
The single required text and the backbone of the program — nearly every session's readings map to its chapters
(Ch. 1 elements & history; Ch. 2 bandits; Ch. 3 finite MDPs; Ch. 4 dynamic programming; Ch. 5 Monte Carlo;
Ch. 6 TD; Ch. 7 n-step; Ch. 8 planning & learning; Ch. 9–12 approximation & eligibility traces; Ch. 13
policy gradients & actor-critic; Ch. 16 applications; Ch. 17 frontiers). The syllabus notes it covers the ground
essential to understanding much of the published work on RL. It can be hard going without a relatively solid
mathematical background — especially the Bellman-equation and Monte-Carlo sections — but it is worth it and is a
must-read for anyone doing graduate research in reinforcement learning.
Policies. The syllabus also refers students to the University's
Code of Conduct (behaviour rules),
Attendance Policy, and
Ethics Code; the Program Director may provide further indications. See
SYLLABUS.pdf for the authoritative text of all rules.
Explore the Demos
Every technique above that has a live visualisation in this repo, in course order:
S04Multi-Armed Bandits
ε-greedy, UCB, Thompson sampling, gradient bandits — pull arms and watch regret accumulate.
S05–06MDPs & Gridworld
States, actions, transitions, rewards — the Bellman equation in action.
S07Dynamic Programming
Value & policy iteration sweeping the Bellman optimality update to v* and π*.
S08–09Monte Carlo & TD
SARSA, Q-Learning, Expected SARSA, MC control — the cliff-walking classic.
S15Function Approximation
Tile-coding, RBFs, polynomial features approximating a value function with SGD.
S16Policy Gradients
Sample episodes, scale log-probabilities by returns, steer the policy directly.
S17Actor-Critic (A2C)
A learned baseline — one network acts, one critiques; watch variance drop.
S21Deep Q-Networks
Neural Q-function, replay buffer, target network, ε-decay — the trio that beat Atari.
S12Evolutionary Methods
A genetic algorithm evolves a population of policies — gradient-free.
S26Multi-Agent RL
The Axelrod tournament: Tit-for-Tat vs. Always Defect, Grim, Pavlov.