AI: Computer Vision

Bachelor in Computer Science & Artificial Intelligence (BCSAI) · IE University · 30-session syllabus, mapped to the interactive playground.

This page is a deep, study-oriented companion to the official syllabus: every module, every session, the core mathematics, and direct links into the 17 live demos. It does not replace the official syllabus PDF — it explains and connects it.

Jump to the full program → Open the interactive demos Key-concepts glossary

Code

AICV-CSAI.3.M.A

Area

Computer Science

Credits

6.0 ECTS

Sessions

30 (live, in-person)

Year / Course

25–26 · Third year

Semester

2nd

Course overview

Vision plays an essential role in human perception, so understanding and automating the tasks that rely on it is one of the most consequential scientific challenges today — with applications across industrial automation, autonomous vehicles, precision agriculture, and e-health. Computer Vision is the engineering field that takes on this challenge: extracting a high-level understanding of the world from digital images and videos.

This course is an introduction to the most relevant topics in Computer Vision and their engineering. It deliberately integrates traditional (geometry, optics, filtering, classical detectors) and modern, machine-learning-driven approaches (CNNs, transformers, generative models). The goal is a solid foundation that lets you build sophisticated applications and manage complex CV projects — from how an image is formed and encoded, through how it is processed and analysed, to the deep models that now define the state of the art.

How to use this page The program below is organised into seven thematic modules spanning the 30 sessions. Each session lists its learning objective, an explanation of every topic, the core CV concept or formula where the material is technical, a worked mini-example, a "pitfall" or "connects to" note, a "key idea" takeaway, annotated readings, and links into the matching live demo. Read it alongside the slides; experiment in the playground; verify your intuition against the math.

Prerequisites & how this course connects

Computer Vision sits at the confluence of three mathematical streams you have already met in the BCSAI curriculum, and it leans on each in turn as the semester progresses. Coming in prepared on these makes the difference between following the derivations and merely watching them.

Linear algebra (matrices, eigenvalues/eigenvectors, SVD, change of basis) — the language of the entire geometry half: homogeneous transforms, the camera matrix $K[R\,|\,t]$, the fundamental/essential matrices, and the second-moment matrix behind Harris corners are all linear-algebra objects. SVD in particular drives DLT homography estimation and the 8-point algorithm.
Calculus & optimisation (gradients, the chain rule, gradient descent) — image gradients $\nabla I$ underlie edges, corners and optical flow; the chain rule is backpropagation; and every deep model in Modules 6–7 is trained by stochastic gradient descent on a loss surface.
Probability & statistics (distributions, expectation, maximum likelihood, KL divergence) — needed for Otsu's between-class variance, RANSAC's inlier model, softmax/cross-entropy, and the generative objectives (VAE's ELBO, the GAN min-max game, the diffusion denoising loss).
Python programming — NumPy array thinking transfers directly to images-as-tensors; OpenCV powers the classical pipelines (Modules 1–5) and PyTorch-style frameworks the learned models (Modules 6–7).

The course also builds on itself in a deliberate arc: the convolution you design by hand in Session 16 becomes the learned kernel of Session 21; the intensity maps applied in-camera in Session 4 are re-derived as point operators in Session 15; the attention introduced for captioning in Session 24 becomes the transformer of Session 29. Where a session reaches forward or back, a "connects to" note flags it so you can build one mental model rather than thirty disconnected ones.

Weekly study load Over 14–15 teaching weeks the 150-hour budget works out to roughly 10 h/week: about 4–5 h in the two live sessions, 4 h on coding exercises and async work, and 1–2 h reviewing notes and the readings below. Front-load the geometry weeks (Module 2 is nine sessions and the densest mathematically), and protect dedicated blocks for the individual and group projects rather than leaving them to the end. A realistic rhythm: re-derive each session's core formula by hand within 48 h of the lecture, then reproduce it in code in the matching demo.

Professor

Dr. Rubén Sánchez García

rsanchezgarcia@faculty.ie.edu

Office hours

On request, by email

Dr. Sánchez-García holds BScs in Biotechnology and Computer Science and an MSc in Biophysics; he completed his PhD (2020) at the CNB-CSIC / Autonomous University of Madrid, was a postdoctoral researcher in Statistics at the University of Oxford, and worked in the Sustaining-Innovation program at Astex Pharmaceuticals (Cambridge, UK). His research develops machine-learning algorithms for structural biology — deep learning for cryo-EM and structure-based drug discovery — with 30+ peer-reviewed publications and well-known tools including BIPSPI and DeepEMhancer.

Learning objectives

The course provides a solid basis for building sophisticated applications and managing complex projects in Computer Vision. Students gain a solid understanding of the essence of digital images and videos — from formation and codification to processing and storage — and learn the most relevant problems in CV today together with the best techniques known to address them.

Pose, discuss and solve the most common Computer Vision problems, choosing appropriate classical or learned techniques.
Fluency in developing Computer Vision applications in Python (OpenCV for classical pipelines, deep-learning frameworks for modern models).
Teamwork skills: communication, coordination and leadership, exercised through the group project.
Conceptual mastery of the full pipeline: image formation and sensing → geometric & intensity transforms → filtering and frequency analysis → features, segmentation and detection → deep architectures, generative models and transformers.

Teaching methodology

Knowledge is acquired mainly through lectures and put into practice individually via coding exercises and an individual project. The estimated 150-hour workload (6 ECTS) breaks down as follows:

Lectures

43.3%

Exercises / async / field work

40.0%

Group work

10.0%

Individual studying

6.7%

Approx. hours: Lectures 65 h · Exercises/async/field 60 h · Group work 15 h · Individual study 10 h · Total 150 h.

AI policy — GenAI not permitted In this course the use of generative AI is not permitted unless the instructor states otherwise, because it would jeopardise acquisition of the course's fundamental knowledge and skills. Using AI-generated content for any assessment is treated as academic misconduct and may lead to failing the assignment or the course.

Assessment & grading

The course is graded on intermediate tests (solving CV problems and communicating progress through intermediate deliverables), an individual and a group project, class participation, and a final exam.

Final Exam

35%

Intermediate tests

30%

Individual work

15%

Group work

15%

Class participation

Final Exam35%

Comprehensive written exam over all 30 sessions, weighted toward problem-solving rather than recall. Minimum passing grade of 3.5/10 on this component is a hard gate — clear it even if your continuous marks are strong, or the course does not pass. Format: derivations and short numeric problems (project a point, compute a convolution output size, score an IoU), plus conceptual short-answer. Tip: practise deriving each module's core matrix/formula on blank paper under time pressure.

Intermediate tests30%

Continuous evaluation: solve CV problems and communicate progress through staged deliverables across the semester. Criteria: correctness of the method, quality of the implementation, and clarity of the written/verbal explanation. Format: short in-class problem sets and milestone hand-ins. Tip: these reward steady weekly work — falling behind compounds because each test assumes the previous module.

Individual work15%

Individual project / coding exercises building CV applications in Python (OpenCV for classical pipelines, a DL framework for learned models). Rubric: does it run and solve the stated task (correctness), is the code clean and reproducible (engineering), and is the result evaluated honestly (analysis)? Deliverable: a runnable repo/notebook plus a short report. Tip: log a baseline first, then improve — reviewers reward measured progress.

Group work15%

Team project exercising communication, coordination and leadership. Rubric: technical soundness of the CV solution, division/integration of work, and quality of the presentation/demo. Deliverable: a working system, a written report, and a live or recorded demo. Tip: agree interfaces early and use version control from day one; integration is where teams lose marks.

Class participation5%

Engagement during the live in-person sessions: asking and answering questions, contributing to discussion, and helping during practical exercises. Tip: participation is cheap to earn and easy to forget — speaking once per session comfortably covers it.

How the components fit together The weighting tells a story: 65% of the grade (final + intermediate tests) measures your individual command of the material, while 30% (individual + group projects) measures your ability to build real CV systems, and 5% rewards live engagement. Because continuous evaluation (intermediate tests + projects) is spread across the semester, the workload is deliberately even — but note the re-sit erases all of it, so the safest strategy is to pass in the ordinary call by keeping pace week to week.

Pass, attendance & re-sit rules Attendance: students who do not meet the 80% attendance rule fail both the ordinary and extraordinary calls for the year and must re-enrol (re-take) the next academic year — no re-sit opportunity.
Calls: four chances total across two consecutive academic years (ordinary + extraordinary/June–July re-sits).
Re-sit: the June/July re-sit is a single comprehensive exam (physical presence in Segovia or Madrid required); the grade depends on that exam only — continuous evaluation is not counted. Pass with ≥ 5; maximum grade 8.0 ("notable"). Retakers (3rd call) may reach a maximum of 10.0 and must confirm specific criteria with the assigned professor.
Review: grade appeals require attending the post-exam review session first. Failing more than 18 ECTS in a year after re-sits leads to being asked to leave the program.

Program — 30 sessions across 7 modules

Every session below is live and in-person. Session titles and topics follow the official syllabus; explanations, formulas, "key ideas", and demo links are study aids added for this companion site.

Module 1 · Sessions 1–4

Foundations & Computer-Vision Engineering

What computer vision is, where it sits among neighbouring fields, and the physical front end of every system: light, optics, sensors, colour, encoding, and the practical engineering of getting images and video streams into a program. This is the "how an image comes to exist" half of the course's image-formation story.

Situate CV relative to image processing, graphics, photogrammetry and machine learning, and recount its historical arc.
Explain illumination, optics (including fisheye/omnidirectional lenses) and sensor behaviour and how each shapes the captured image.
Reason about colour perception, colour spaces and image compression; take first steps in OpenCV.
Configure a real camera (focus, iris, exposure, white balance, gamma) and stream it over the network (RTSP).

SESSION 1 · LIVE IN-PERSON

Course presentation

Objective: define the field, fix the vocabulary, and survey what is and isn't possible today.

Definition & relationship with other disciplines — Computer Vision recovers semantic and geometric information from images and video; it is the inverse of computer graphics, which renders images from a known 3-D model. It overlaps image processing (which transforms images into images), photogrammetry (precise 3-D measurement from photos), pattern recognition and machine learning. Concretely: graphics asks "given this cube and light, what pixels result?"; CV asks the far harder reverse — "given these pixels, what cube and light produced them?".
Main concepts — a grayscale image is a function $I(x,y)$ mapping pixel coordinates to intensity (colour adds channels, video adds time $t$); the canonical pipeline runs acquisition → pre-processing → feature/representation → analysis → interpretation. Recovering the 3-D scene from a 2-D image is ill-posed: infinitely many scenes project to the same picture, so extra information (priors, geometry, learned statistics) is always required.
Historical background — from Larry Roberts' "blocks world" (1963) and David Marr's three levels of analysis (computational / algorithmic / implementational) in the 1980s, through the feature-engineering era (SIFT, HOG) of the 2000s, to the deep-learning inflection point when AlexNet won ImageNet in 2012 and halved the error rate of hand-designed pipelines.
State of the art — robust object detection and segmentation, monocular and multi-view depth, photorealistic generation (diffusion), and large vision-language models that caption, answer questions about and reason over images — the applications that frame the rest of the course.

Worked example: a single pixel of value 137 in an 8-bit image is just $I(x,y)=137$ on the scale $0$ (black) to $255$ (white); a $1920\times1080$ colour frame is therefore a $1080\times1920\times3$ array of ~6.2 million numbers — the raw object every algorithm in this course operates on.

Key idea: vision is an inverse problem — infer the 3-D world and its meaning from a 2-D projection — which is why priors, geometry and learning all matter.

Connects toThe "inverse problem" framing recurs everywhere: geometry (Module 2) supplies the priors of projection, classical features (Module 5) supply hand-designed priors, and deep learning (Modules 6–7) learns the priors from data.

Demo · Site overview Demo · 1 Image formation

Szeliski, Computer Vision: Algorithms and Applications — Ch. 1 (Introduction): read for the field map and the historical timeline.
Forsyth & Ponce, Computer Vision: A Modern Approach — Ch. 1: complementary framing of what CV is and isn't.

Key takeaway: you are learning to undo projection and quantisation — every later technique is a different strategy for adding back the information the camera threw away.

SESSION 2 · LIVE IN-PERSON

Computer Vision Engineering (I)

Objective: understand the optical and illumination physics that determine image quality.

Basic illumination techniques & sources — front lighting reveals surface texture, back lighting produces clean silhouettes for measuring outlines, and structured lighting projects a known pattern to extract 3-D shape. In industrial CV, choosing the right lighting is often cheaper and far more reliable than engineering an algorithm to cope with bad lighting — a controlled scene turns a hard recognition problem into an easy thresholding one.
Camera elements & optics fundamentals — aperture controls light intake and depth of field, focal length sets magnification and field of view, and the thin-lens relation $\tfrac{1}{f}=\tfrac{1}{z_o}+\tfrac{1}{z_i}$ ties object distance $z_o$, image distance $z_i$ and focal length $f$ together: only objects at the conjugate distance focus sharply, everything else blurs into a circle of confusion.
Fisheye & omnidirectional cameras — ultra-wide lenses sacrifice the rectilinear (straight-lines-stay-straight) perspective model for coverage approaching or exceeding 180°, introducing strong radial distortion that bends straight world lines into curves; modelling and undistorting this is essential before any geometric reasoning.
Sensors & example applications — CCD sensors shift charge across the chip for low noise; CMOS sensors digitise per-pixel for speed, low power and lower cost (now dominant). Both follow the photon → charge → voltage → digital number path, and the choice of lens + sensor is dictated by the application: machine vision, automotive, microscopy, astronomy.

Thin lens: 1/f = 1/z_o + 1/z_i · FOV ≈ 2·arctan(sensor_size / 2f)

Worked example: a 50 mm lens focused on an object 2 m away forms its image at $1/z_i = 1/50\text{mm} - 1/2000\text{mm}$, giving $z_i \approx 51.3$ mm behind the lens — barely past $f$, which is why distant scenes focus "at infinity" essentially at the focal plane. Pair that 50 mm lens with a 36 mm-wide sensor and the horizontal FOV is $\approx 2\arctan(36/(2\cdot50)) \approx 40°$.

Key idea: the image you analyse is already the product of physics choices — lighting, lens and sensor — made before a single line of code runs.

PitfallForgetting radial distortion. On wide or cheap lenses, straight lines bow visibly; running calibration (Sessions 12–13) and undistorting before homography or stereo is mandatory, or every geometric estimate inherits the lens error.

Demo · 3 Image sensing

Szeliski — Ch. 2.1–2.3 (image formation, photometric & optical models): the physics behind every formula above.
Livingstone, Vision and Art: The Biology of Seeing — light & the eye, the biological motivation for engineering an artificial vision front end.

Key takeaway: control the optics and lighting and you remove variance the algorithm would otherwise have to learn — good CV engineering starts at the lens, not the loss function.

SESSION 3 · LIVE IN-PERSON

Computer Vision Engineering (II)

Objective: represent colour, encode images efficiently, and start coding with OpenCV.

Color perception — human vision is trichromatic: three cone types (roughly L/M/S, "red/green/blue") sample the spectrum, so any colour collapses to three numbers. This is why three channels suffice — and why metamerism arises: two physically different spectra that excite the cones identically look the same, which is exactly what lets RGB displays fool the eye with just three primaries.
Color spaces — RGB mixes brightness and colour together; HSV separates hue, saturation and value so you can threshold on colour regardless of lighting; YCbCr splits luminance from chrominance (the basis of compression and broadcast); Lab is perceptually uniform so Euclidean distance approximates perceived colour difference. Pick the space that isolates the property your task depends on.
Image compression — natural images are highly redundant (neighbouring pixels correlate). Lossy JPEG exploits this in three steps — split into 8×8 blocks, apply the DCT, then quantize (coarsely round) the high-frequency coefficients the eye barely notices, then entropy-code — while lossless PNG uses prediction + DEFLATE to compress without discarding any data.
First steps with OpenCV — imread/imshow, the surprising BGR channel order (not RGB), uint8 vs float32 dtypes and the overflow/clipping bugs they cause, and basic array-slicing manipulation of the image-as-NumPy-array.

JPEG block: 8×8 pixels → DCT → divide by quantization table Q → round → entropy code

Worked example: to find a red ball, converting to HSV and thresholding hue $\in[0°,10°]\cup[350°,360°]$ with high saturation is robust to brightness changes; the same threshold in RGB would fail the moment a shadow falls across the ball, because R, G and B all drop together.

Key idea: JPEG already is a frequency-domain idea — it throws away high-frequency detail your eye barely notices, exactly the Fourier intuition from Module 3.

PitfallOpenCV loads images as BGR, not RGB. Display or save with the wrong assumption and reds and blues swap; this is the single most common first-week bug, and it silently corrupts any colour-based logic downstream.

Demo · 5 Intensity & histogram Demo · Image input

Szeliski — Ch. 2.3 (colour models) & Ch. 2.3.3 (compression): how the spaces and JPEG are defined.
Forsyth & Ponce — colour chapter: perception-first treatment of why three channels work.

Key takeaway: choosing the right colour space is the cheapest robustness win in classical CV — it moves the property you care about onto its own axis before you write a single threshold.

SESSION 4 · LIVE IN-PERSON

Camera configuration & streaming

Objective: operate a real camera and deliver a live video stream — the engineering glue of a CV system.

Acquisition methods — continuous frame grabbing vs hardware/software triggering (capture exactly when an event occurs, e.g. a part passes a sensor) and multi-camera synchronisation so frames share a timestamp — essential for stereo and motion analysis where mis-timed frames ruin correspondence.
Focus / Iris — focus sets the conjugate distance that appears sharp; the iris (aperture) trades light intake against depth of field. A wide aperture (small f-number) gathers light and isolates a subject with shallow DoF; a narrow aperture keeps a wide depth in focus but demands more light or longer exposure.
Exposure, white balance, gamma — exposure time and gain both brighten the image but gain amplifies noise, so prefer exposure when motion allows; white balance rescales channels to achieve colour constancy under different illuminants (so white looks white); gamma encoding $V_{out}=V_{in}^{\gamma}$ allocates more code values to dark tones, matching the eye's roughly logarithmic response (encode ≈1/2.2, decode ≈2.2).
Setting up an RTSP server — the Real-Time Streaming Protocol packetises and timestamps frames for low-latency networked delivery, letting downstream CV nodes consume a live feed — the engineering glue between a camera and a vision service.

Gamma: V_out = V_in ^ γ (encode ~1/2.2, decode ~2.2) · exposure value ∝ exposure_time × ISO_gain

Worked example: a pixel at linear intensity $0.2$ encoded with $\gamma=1/2.2$ is stored as $0.2^{0.455}\approx0.48$ — nearly mid-grey — so dark regions get far more of the available 256 levels than a linear encoding would give them, reducing visible banding in shadows.

Key idea: exposure, white balance and gamma are point intensity transforms applied in the camera — the same maps you'll implement by hand in Session 15.

Connects toEvery control here is a point operation $s=T(r)$. Gamma reappears verbatim in Session 15's pixelwise toolkit, and white balance is just a per-channel contrast stretch — so this "camera engineering" session is secretly a preview of image processing.

Demo · 5 Intensity (gamma/stretch)

Szeliski — Ch. 2.3.1 (sampling, aliasing, exposure): the signal-processing view of acquisition. OpenCV video I/O docs — practical capture & streaming.

Key takeaway: the camera is a tiny image-processing pipeline; understanding its knobs lets you fix problems at capture time instead of fighting them in software.

Module 2 · Sessions 5–13

Geometry, 3-D Vision & Calibration

The geometric backbone of computer vision: projective spaces, image transformations, the camera projection model, recovering 3-D structure from one or many views, depth sensing, and calibrating real cameras. This is where pixels become geometry.

Work in projective spaces and homogeneous coordinates, and apply the 2-D transformation hierarchy (rigid → affine → projective).
Write the full camera model $x = K\,[R\,|\,t]\,X$ and reason about intrinsics, extrinsics and coordinate frames.
Recover depth and 3-D structure via stereo, structured light, time-of-flight, optical flow, homography and epipolar geometry.
Calibrate a camera with Zhang's method and judge calibration quality by reprojection error.

SESSION 5 · LIVE IN-PERSON

Introduction to projective geometry

Objective: adopt homogeneous coordinates so projection and perspective become linear.

Projective space ℙⁿ — represent points by $(n{+}1)$-vectors defined only up to a non-zero scale, i.e. as rays through the origin in $\mathbb{R}^{n+1}$. This makes points at infinity first-class citizens (a final coordinate of 0), so parallel lines "meet at infinity" and perspective foreshortening becomes algebraically clean.
Projective space ℙ² — the image plane. Beautifully, both points and lines are 3-vectors and they are dual: a point lies on a line iff $\mathbf{x}^{\top}\mathbf{l}=0$, the line through two points is their cross product $\mathbf{l}=\mathbf{x}_1\times\mathbf{x}_2$, and the intersection of two lines is $\mathbf{x}=\mathbf{l}_1\times\mathbf{l}_2$ — one operation does join and meet.
Projective space ℙ³ — the world space of scene points, represented by 4-vectors, that the camera projects down to ℙ² via a single $3\times4$ matrix.

Homogeneous: (x, y) ↔ (x, y, 1) ~ (λx, λy, λ), λ ≠ 0 · point on line: xᵀl = 0 · line through points: l = x₁ × x₂

Worked example: the lines $x=1$ and $x=2$ are $\mathbf{l}_1=(1,0,-1)$ and $\mathbf{l}_2=(1,0,-2)$; their intersection $\mathbf{l}_1\times\mathbf{l}_2=(0,1,0)\to$ a point with third coordinate 0 — a point at infinity in the $y$-direction. That is the projective statement that two vertical lines are parallel.

Key idea: moving to homogeneous coordinates turns the non-linear perspective division into a single matrix multiply — the trick that powers the whole geometry module.

Connects toEverything in Sessions 6–13 is a linear map in these coordinates: image transforms (S6), the camera matrix (S7), homographies and the fundamental matrix (S9–10). Master homogeneous coordinates now and the rest of the module is matrix algebra.

Demo · 2 Camera model

Hartley & Zisserman, Multiple View Geometry — Ch. 1–2 (projective geometry of 2-D and 3-D): the definitive, careful treatment — read Ch. 2 closely for point/line duality.

Key takeaway: a point and a line are the same kind of object in ℙ², and a single cross product joins or meets them — this duality is what makes projective geometry so compact.

SESSION 6 · LIVE IN-PERSON

Image transformations

Objective: classify and apply the 2-D transformation hierarchy on images.

Scaling & Translation — scaling multiplies coordinates by $s$ (uniform) or $s_x,s_y$ (anisotropic); translation adds an offset. Translation is not linear in 2-D coordinates, which is precisely why we lift to homogeneous coordinates — there it becomes the extra column $t$ of a $3\times3$ matrix.
Rotation — an orthonormal matrix $R$ ($R^{\top}R=I$, $\det R=1$) that preserves lengths and angles; rotation + translation is a rigid/Euclidean motion (3 DOF in 2-D), the transform a moving rigid object or camera undergoes.
Affine — a 6-DOF map (a general $2\times2$ linear part + translation) combining rotation, anisotropic scale and shear; it preserves parallelism and ratios of areas but not angles or lengths — a good model for weak-perspective or a distant planar surface.
Perspective (homography) — the full 8-DOF $3\times3$ projective map; it preserves straight lines and the cross-ratio but not parallelism, so railway tracks converge. It exactly models a planar surface viewed from any angle, or any two views related by pure rotation.

Affine: [x'; y'; 1] = [[a,b,t_x],[c,d,t_y],[0,0,1]] · [x; y; 1] · DOF: rigid 3 ⊂ similarity 4 ⊂ affine 6 ⊂ projective 8

Worked example: a 90° rotation has $R=\big[\begin{smallmatrix}0&-1\\1&0\end{smallmatrix}\big]$, so the point $(2,0)$ maps to $(0,2)$. Each correspondence gives 2 equations, so an affine map (6 DOF) needs ≥ 3 point pairs and a homography (8 DOF) needs ≥ 4 — the DOF count literally is the data requirement.

Key idea: each transform is a strictly larger group than the last (rigid ⊂ similarity ⊂ affine ⊂ projective); the DOF count tells you how many point correspondences you need to estimate it.

Connects toThe 4-point homography here is the same $H$ you estimate by DLT in S9–10 and fit with RANSAC for panorama stitching in S20 — image transforms are the foundation those build on.

Demo · 4 Geometric transforms

Szeliski — Ch. 2.1.1 (2-D transforms) & 3.6 (image warping/resampling): the practical mechanics of applying a transform to pixels.
Hartley & Zisserman — Ch. 2: the full transformation hierarchy and what each preserves.

Key takeaway: know each transform's DOF and what it preserves — that pair of facts tells you both which transform models your situation and how many correspondences you must collect to estimate it.

SESSION 7 · LIVE IN-PERSON

Image acquisition model

Objective: write the complete pixel-from-world projection equation.

Lens models: pinhole, thin, thick — the pinhole is the ideal: every ray passes through one point, giving exact perspective with infinite depth of field but no light. The thin-lens model adds focusing and finite aperture; the thick-lens (and full compound-lens) model adds the principal planes and aberrations real optics introduce.
Projection models — perspective (objects shrink with distance, the realistic model) versus orthographic and weak-perspective approximations that drop the depth division and are valid when the scene's depth is small relative to its distance — far simpler algebra, used in early SfM.
Intrinsic & extrinsic parameters — intrinsics $K$ encode focal lengths $f_x,f_y$ (in pixels), the principal point $c_x,c_y$ and skew $s$; extrinsics $[R\,|\,t]$ encode the camera's orientation and position in the world. Together $P=K[R\,|\,t]$ is the $3\times4$ projection matrix.
Coordinate systems involved — a point travels world → camera (apply $[R|t]$) → image plane (perspective division by depth) → pixel grid (apply $K$); knowing which frame you are in is half of solving any geometry problem.

x = K [R | t] X , K = [[f_x, s, c_x],[0, f_y, c_y],[0,0,1]] · pinhole: x = f·X/Z

Worked example: with $f_x=f_y=800$, principal point $(320,240)$, and a point $1\,$m right and $5\,$m deep, $X=(1,0,5)$ in the camera frame: $u = 800\cdot1/5 + 320 = 480$, $v = 800\cdot0/5 + 240 = 240$. The depth-5 divisor is the perspective effect — double the depth and the off-centre shift halves.

Key idea: the camera matrix factorises cleanly into what the camera is (intrinsics $K$) and where it is (extrinsics $[R|t]$) — calibration estimates the first, pose estimation the second.

PitfallConflating "focal length in mm" (an optical property) with $f_x,f_y$ "in pixels" (the intrinsic). The pixel focal length is the mm focal length divided by the pixel pitch — mixing the two breaks every projection calculation.

Demo · 1 Pinhole projection Demo · 2 K[R|t]

Hartley & Zisserman — Ch. 6 (camera models): the full $P=K[R|t]$ decomposition and its degrees of freedom.
Forsyth & Ponce — geometric camera models: a gentler first pass on the same equations.

Key takeaway: $x = K[R\,|\,t]X$ is the single equation the whole geometry module orbits — every later result either estimates a piece of it or exploits the constraint it imposes between views.

SESSION 8 · LIVE IN-PERSON

Three-dimensional vision I & II

Objective: survey the cues and devices that recover depth from images.

Human 3-D & stereoscopic vision — two eyes/cameras see a point at slightly different image positions; this disparity $d$ is inversely proportional to depth, giving $Z = f\,B/d$ for focal length $f$ and baseline $B$. Wider baselines resolve far objects better but shrink the overlapping field of view.
Structured light (projection) — actively project a known pattern (stripes, dots, Gray code) and read its deformation on the surface; because the projected pattern is the correspondence, dense depth is recovered without solving the matching problem — the principle behind the original Kinect.
Time of flight — emit modulated light and measure round-trip time (or phase) per pixel: $Z = c\,\Delta t/2$. Works in textureless scenes where stereo fails, but resolution and multipath are limiting.
Depth from focus / shape from shading & texture — monocular cues: which depth is in focus reveals distance, shading gradients reveal surface orientation (shape-from-shading), and the foreshortening of a known texture reveals slant. Each is weak alone but informative as a prior.
Optical flow & motion — the brightness-constancy assumption (a moving point keeps its intensity) linearises to $I_x u + I_y v + I_t = 0$, one equation in two unknowns $(u,v)$ — the aperture problem. Lucas–Kanade resolves it by assuming constant flow in a small window.

Stereo depth: Z = f·B / d · ToF: Z = c·Δt/2 · Optical flow: I_x·u + I_y·v + I_t = 0

Worked example: a stereo rig with $f=800\,$px and baseline $B=0.1\,$m sees a point at disparity $d=20\,$px, so $Z = 800\cdot0.1/20 = 4\,$m. Note the sensitivity: at $d=2\,$px the same rig reports $40\,$m, so far-away depth is fragile — disparity error blows up quadratically with distance.

Key idea: there are many independent paths to depth — geometry (stereo), active sensing (structured light, ToF) and image cues (focus, shading, motion); robust systems fuse several.

PitfallThe aperture problem: from a single small window you can only measure flow normal to an edge, not along it — which is why optical flow needs a local smoothness or constant-flow assumption to give a full $(u,v)$.

Demo · 8 Gradients (flow building block)

Szeliski — Ch. 11–12 (stereo & 3-D reconstruction) and Ch. 9 (dense motion / optical flow): the geometry and the algorithms.
Hartley & Zisserman — stereo & reconstruction chapters: the projective view of two-view depth.

Key takeaway: depth from a 2-D image is never free — you pay with a second view, active illumination, motion, or a prior; choosing which cost to pay is the core design decision in any 3-D system.

SESSION 9 · LIVE IN-PERSON

Multiple view geometry

Objective: relate two or more views algebraically.

Single & dual-camera models — a single projection matrix $P$ maps the world to one image; a dual (stereo) model adds a second camera with a fixed relative pose $(R,t)$, and the pair's geometry is what lets us triangulate depth.
Homography — a $3\times3$ invertible $H$ relating two images of the same plane, or two views sharing the same centre (pure rotation). Eight DOF, estimable from four point correspondences — the workhorse of rectification, mosaicing and augmented-reality marker tracking.
Epipolar geometry — for a general (non-planar) scene, the fundamental matrix $F$ encodes the two-view geometry: any point $\mathbf{x}$ in image 1 must lie on the epipolar line $F\mathbf{x}$ in image 2, expressed by $\mathbf{x}'^{\top}F\mathbf{x}=0$. $F$ has rank 2 and 7 DOF.
Structure from motion — given many overlapping views, jointly recover all camera poses and a 3-D point cloud: match features, estimate pairwise geometry, triangulate, then globally refine — the engine behind photogrammetry and 3-D scanning apps.

Epipolar constraint: x'ᵀ F x = 0 · Planar map: x' = H x · F is 3×3, rank 2, 7 DOF

Worked example: without epipolar geometry, matching a point means searching the whole second image — $O(N^2)$ pixels. The constraint $\mathbf{x}'^{\top}F\mathbf{x}=0$ pins the match to a single line, cutting the search to $O(N)$; after rectification that line is a horizontal row, so the search is a 1-D scan.

Key idea: the epipolar constraint collapses a 2-D correspondence search to a 1-D line search — the single most useful geometric fact in stereo and SfM.

PitfallA homography only relates two views if the scene is planar or the camera purely rotates. Fit $H$ to a 3-D scene with translation and it will be wrong — that case needs the fundamental matrix $F$, not $H$.

Demo · 4 Projective/homography Demo · 14 Matching (correspondence)

Hartley & Zisserman — Ch. 9–11 (epipolar geometry, the fundamental matrix, the trifocal tensor and SfM): the core reference for this and the next session.

Key takeaway: $H$ for planes/rotation, $F$ for everything else — picking the right two-view model is the first decision in any multi-view pipeline.

SESSION 10 · LIVE IN-PERSON

Multiple view geometry (continued)

Objective: deepen the two-view tools and complete the SfM pipeline.

Single & dual-camera models — revisited as an estimation problem: normalise coordinates (translate/scale so points are centred with unit average distance) for numerical conditioning, and wrap the fit in RANSAC so a few wrong matches don't corrupt the model.
Homography — the Direct Linear Transform stacks each correspondence into two rows of a system $Ah=0$ and solves by SVD (the null-space vector); $\geq4$ matches suffice. Applications: stereo rectification, image mosaicing/panoramas, planar AR.
Epipolar geometry — the normalised 8-point algorithm estimates $F$ linearly; with known intrinsics, $F$ becomes the essential matrix $E=K'^{\top}FK$, which factors as $E=[t]_\times R$ and decomposes (via SVD) into the relative rotation $R$ and translation direction $t$ between the two cameras.
Structure from motion — from relative pose, triangulate 3-D points, then run bundle adjustment: a large non-linear least-squares refinement minimising total reprojection error over all camera and point parameters simultaneously.

Essential: E = [t]_× R · F = K'⁻ᵀ E K⁻¹ · DLT/8-point: solve A h = 0 by SVD

Worked example: the 8-point algorithm needs 8 correspondences because $F$ has 9 entries known only up to scale → 8 unknowns, and each match $\mathbf{x}'^{\top}F\mathbf{x}=0$ gives one linear equation. Hartley's normalisation (centre + scale to $\sqrt2$ average distance) can cut the numerical error by orders of magnitude versus raw pixels.

Key idea: bundle adjustment is the workhorse optimiser of 3-D vision — it minimises total reprojection error over all cameras and points at once.

PitfallSkipping normalisation before the 8-point/DLT solve. Raw pixel coordinates (hundreds in magnitude, ones for the homogeneous term) make $A$ wildly ill-conditioned; always normalise first, then de-normalise the result.

Demo · 14 Template matching

Hartley & Zisserman — Ch. 11 (estimating $F$, the normalised 8-point algorithm) & the bundle-adjustment appendix.
Szeliski — Ch. 11 (SfM pipelines end to end).

Key takeaway: robust geometry = good conditioning (normalise) + robustness to outliers (RANSAC) + a final non-linear polish (bundle adjustment); the linear estimate is only the seed.

SESSION 11 · LIVE IN-PERSON

Depth cameras

Objective: configure and capture from real depth-sensing rigs.

Stereoscopic camera configuration & capture — physically mount two calibrated cameras with a known baseline, then rectify the pair (warp both images so epipolar lines become corresponding horizontal rows). After rectification a block-matching or semi-global-matching algorithm scans each row for the disparity, and $Z=fB/d$ converts it to depth.
Structured-light camera configuration & capture — replace the second camera with a projector: emit a coded pattern (e.g. Gray-code stripes or a speckle), capture the deformed pattern, decode which projector column each pixel saw, and triangulate — effectively a stereo pair where one "camera" is an active emitter, robust on textureless surfaces.

After rectification: epipolar lines are image rows → depth = f·B / disparity(row)

Worked example: rectification reduces the correspondence search from a slanted epipolar line (sub-pixel resampling, 2-D bookkeeping) to "for each pixel, scan left along the same row" — the reason real-time stereo on embedded hardware is feasible at all.

Key idea: a rectified stereo pair turns the epipolar lines horizontal, so depth becomes a simple per-row 1-D disparity search.

PitfallStereo fails on textureless or repetitive regions (blank walls, periodic fences) where matching is ambiguous — exactly the case structured light solves by creating texture with its projected pattern.

Demo · 2 Camera model

Szeliski — Ch. 12 (depth sensing, rectification, active illumination): the practical pipeline for both rigs.

Key takeaway: rectification is the bridge from epipolar theory (S9–10) to a working depth sensor — it is what makes disparity a cheap 1-D lookup.

SESSION 12 · LIVE IN-PERSON

Camera calibration I

Objective: understand how intrinsics and distortion are estimated.

Introduction & tools — calibration recovers the intrinsics $K$ and lens-distortion coefficients so pixel measurements become metric geometry; without it, every projection, stereo depth and pose estimate carries a systematic bias. OpenCV's calibrateCamera implements the standard toolchain.
Zhang's method — show a planar checkerboard (known geometry) at several orientations. Each view yields a homography between the board plane and the image; because a homography of a calibration plane constrains $K$ via the image of the absolute conic, ~3+ views over-determine the 5 intrinsic unknowns, which are then refined non-linearly together with distortion.
Calibration process issues — pattern coverage (corners must fill the frame, including edges/corners where distortion is strongest), avoiding degenerate near-parallel board poses, and choosing the right distortion model (radial $k_1,k_2,k_3$ + tangential) — too many terms over-fits, too few leaves residual bias.

Each planar view → homography H = K [r₁ r₂ t]; ≥3 views over-determine the 5 intrinsics in K

Worked example: $K$ has 5 unknowns ($f_x,f_y,c_x,c_y,s$). Each checkerboard view contributes 2 constraints on $K$ (from the orthonormality of $r_1,r_2$), so 3 views give 6 constraints > 5 — enough to solve, and more views simply improve the least-squares estimate and average out noise.

Key idea: Zhang's flexible method needs only a printed checkerboard waved at the camera — each planar view gives a homography that constrains the unknown intrinsics.

PitfallCapturing all your views from the same distance and angle gives degenerate, near-identical homographies that under-constrain $K$ — vary depth, tilt and position, and push the board into the image corners where distortion lives.

Demo · 2 Intrinsics K

Zhang (2000), "A flexible new technique for camera calibration" — the original method, well worth reading; Szeliski Ch. 6.3 for the textbook treatment.

Key takeaway: calibration is geometry-as-estimation — a known planar target turns the unknown $K$ into a solvable least-squares problem, and view diversity is what makes it well-posed.

SESSION 13 · LIVE IN-PERSON

Camera calibration II

Objective: run an end-to-end calibration and validate it.

Pattern image acquisition — capture 15–30 sharp, well-distributed checkerboard views spanning depth, tilt and frame position; blurry or motion-smeared frames give sub-pixel corner errors that propagate into $K$.
Running the calibration process — detect and sub-pixel-refine the corners, build the correspondence between board points and image points, solve the linear system for an initial $K$ and per-view poses, then jointly refine $K$, distortion and all poses by minimising reprojection error (Levenberg–Marquardt).
Checking reprojection error — re-project the known 3-D corners with the estimated parameters and measure the RMS pixel distance to the detected corners; a good calibration lands below ~0.5 px. Inspect per-view error to spot and drop a bad capture.

Reprojection error = sqrt( (1/N) Σ ‖ x_i − project(K, R, t, X_i) ‖² ) (target ≲ 0.5 px)

Worked example: if reprojection error is 0.3 px the calibration is trustworthy; if one view reports 4 px while the rest report 0.3 px, that frame had mis-detected corners — delete it and re-run rather than accepting the inflated average.

Key idea: reprojection error in sub-pixel units is the honest yardstick — a calibration is only as good as the geometry it reproduces.

PitfallA low average reprojection error can hide one disastrous view. Always check the per-image breakdown; the mean is not robust to a single outlier capture.

Demo · 2 Reprojection intuition

Szeliski — Ch. 6.3 (calibration in practice); OpenCV calibration tutorial for the exact API and corner-refinement steps.

Key takeaway: reprojection error is the one number that certifies a calibration — measure it, read it per-view, and never trust intrinsics you haven't validated this way.

Module 3 · Session 14

Mid-term checkpoint

A consolidation point: the mid-term exam closes the geometry-and-acquisition half of the course before the pivot to image processing and learning.

Demonstrate mastery of image formation, the projective transformation hierarchy, the full camera model, multi-view geometry and calibration.

SESSION 14 · LIVE IN-PERSON

Mid-term exam

Objective: assess Sessions 1–13 — formation, optics, colour, projective geometry, the camera model, multi-view geometry and calibration.

Mid-term exam — solving Computer Vision problems on the foundations and geometry material covered so far: image formation and optics, colour and compression, projective geometry and the transformation hierarchy, the full camera model $K[R|t]$, multi-view geometry ($H$, $F$, $E$, SfM) and calibration.

Study tipRevisit the camera-model and geometric-transform demos and be able to derive each matrix by hand; most exam problems are variations on "given these parameters, where does this point project / how do these views relate?". Build a one-page formula sheet from memory: thin lens, $K[R|t]$, the transform DOF table, $Z=fB/d$, the epipolar constraint and reprojection error.

Likely problem types(1) project a given 3-D point through a given $K[R|t]$; (2) classify/estimate a 2-D transform and state its DOF; (3) compute stereo depth from disparity; (4) reason about why a calibration or correspondence failed. Practising each end-to-end beats re-reading the slides.

Review · 2 Camera model Review · 4 Geometric

Module 4 · Sessions 15–17

Image Processing — Point, Local & Global Operations

Classical image processing organised by spatial scope: pixelwise (point) operations, neighbourhood (local/convolution) operations, and whole-image (global/Fourier) operations. These are the operators every later technique is built from.

Apply point transforms (negatives, log, gamma, contrast stretch, histogram equalization, binarization) and read histograms.
Design and apply linear filters via convolution, exploit separability, and use orientable/band-pass filters.
Analyse images in the frequency domain with the 2-D DFT and interpret filtering as spectrum shaping.

SESSION 15 · LIVE IN-PERSON

Pixelwise operations

Objective: transform each pixel independently via an intensity map $s = T(r)$.

Pixelwise operations & pixel transformations — apply an intensity map $s=T(r)$ identically to every pixel: the negative $s=L-1-r$ (invert), log $s=c\log(1+r)$ (compress bright, expand dark), gamma $s=c\,r^{\gamma}$ (perceptual remap) and contrast stretching (linearly rescale a narrow intensity band to the full range). All are 1-D lookup tables.
Histogram equalization — use the image's own cumulative distribution as the transform: $s=\text{round}((L-1)\cdot\text{cdf}(r))$. This spreads frequent intensities apart and squeezes rare ones together, flattening the histogram and maximising global contrast — automatic, parameter-free contrast enhancement.
Binarization — collapse to a 0/1 image by thresholding $s=\mathbb{1}[r>t]$; the gateway to morphology, connected-component labelling and shape analysis. The whole game is choosing $t$ (Otsu, Session 19, picks it automatically).

s = T(r) · equalize: s = round((L−1)·cdf(r)) · gamma: s = c·rᵞ

Worked example: an image whose pixels all fall in $[100,150]$ looks flat and grey; contrast stretching $s=255\cdot(r-100)/50$ maps that band to the full $[0,255]$, instantly revealing detail — and histogram equalization does the same thing adaptively, driven by where the pixels actually cluster.

Key idea: a point operation depends only on a pixel's own value, so it is fully described by a 1-D lookup curve — and the histogram tells you which curve helps.

PitfallHistogram equalization is global and can over-amplify noise in near-flat regions; when contrast varies across the image, prefer adaptive (CLAHE) equalization that limits amplification per tile.

Demo · 5 Intensity & histogram Demo · 9 Binarization

Szeliski — Ch. 3.1 (point operators) & 3.1.4 (histogram equalization): the canonical curves and the CDF derivation.
Forsyth & Ponce — early image-processing chapter for intuition.

Key takeaway: read the histogram first — it diagnoses the contrast problem and points straight at the 1-D curve that fixes it.

SESSION 16 · LIVE IN-PERSON

Local operations

Objective: process each pixel using its neighbourhood via convolution.

Introduction — a local (neighbourhood) operator computes each output pixel from a window of input pixels. The linear case is convolution: slide a kernel $K$ over the image and take the weighted sum, $(I*K)(x,y)=\sum_{i,j} I(x-i,y-j)K(i,j)$. Boundary handling (zero-pad, reflect, replicate) matters at the edges.
Separable filters — a kernel is separable if $K=\mathbf{u}\mathbf{v}^{\top}$; then 2-D convolution becomes a 1-D horizontal pass followed by a 1-D vertical pass, dropping the cost per pixel from $O(k^2)$ to $O(2k)$. The Gaussian and the box filter are separable — a large speedup for big kernels.
Linear filtering examples — box and Gaussian blur (low-pass, denoising), Sobel (first-derivative, edges), Laplacian (second-derivative, fine detail), and unsharp masking (sharpen by adding back the high-pass residual).
Band-pass & orientable filters; other local operators — Difference-of-Gaussians as a band-pass/blob detector, steerable filters that synthesise any orientation from a basis, and non-linear neighbourhood ops: the median filter (removes salt-and-pepper noise, preserves edges) and the bilateral filter (edge-preserving smoothing).

(I * K)(x,y) = Σ_i Σ_j I(x−i, y−j) · K(i,j) · separable: cost O(k²) → O(2k)

Worked example: convolving a 1-D signal $[10,10,10,50,50]$ with the Sobel-like derivative kernel $[-1,0,1]$ gives a large response (≈40) only at the step between 10 and 50 — that spike is an edge. The vertical Sobel does this column-wise to find horizontal edges.

Key idea: convolution is the atom of classical vision and of CNNs — the only thing that changes later is that the kernel weights are learned rather than designed.

Connects toThis is the most important bridge in the course: the hand-designed kernels here become the learned kernels of the CNN in Session 21, and a single conv layer is exactly this operation with trainable weights. Separability reappears as depthwise-separable convolutions in efficient nets.

Demo · 6 Convolution playground Demo · 8 Sobel/DoG Demo · 11 CNN (learned kernels)

Szeliski — Ch. 3.2 (linear filtering & separability) and 3.3 (median/bilateral and other non-linear filters): work through the separability proof.

Key takeaway: if you truly understand convolution and separability here, CNNs in Module 6 are not a new idea — they are this operation with the kernel weights learned by gradient descent.

SESSION 17 · LIVE IN-PERSON

Global operations

Objective: analyse and filter images in the frequency domain.

Fourier transform & the 2-D (bidimensional) DFT — express the image as a sum of 2-D sinusoids; the DFT $F(u,v)$ gives the amplitude and phase of each spatial frequency $(u,v)$. The 2-D DFT is separable, so it is computed as 1-D FFTs along rows then columns in $O(MN\log MN)$.
DFT interpretation — the magnitude spectrum says how much of each frequency is present (centre = low frequencies = smooth gradients; periphery = high frequencies = edges and texture), while the phase says where that structure sits — and phase carries most of the recognisable image content (swap two images' phases and the picture follows the phase).
Other global transformations — the DCT (real-valued, energy-compacting — the heart of JPEG from Session 3) and wavelets (joint space-frequency, multi-scale — the basis of image pyramids and JPEG-2000).

F(u,v) = Σ_x Σ_y I(x,y) · e^( −j2π (ux/M + vy/N) ) · convolution theorem: I * K ⇔ F·K̂

Worked example: to remove periodic sensor noise (regular stripes), take the DFT — the stripes appear as two bright dots off-centre; zero those frequencies and inverse-transform, and the stripes vanish while the rest of the image is untouched. This surgical removal is impossible in the spatial domain.

Key idea: by the convolution theorem, blurring/sharpening in space is just multiplying the spectrum by a low-/high-pass mask — the same operation seen from the other domain.

Connects toThis closes a loop: JPEG (S3) is a block DCT, Gaussian blur (S16) is low-pass filtering, and a sharp edge is a burst of high frequencies — the spatial and frequency views are two descriptions of one image.

Demo · 7 2-D DFT & filtering

Szeliski — Ch. 3.4 (Fourier transforms & the convolution theorem) and 3.5 (pyramids/wavelets).
Forsyth & Ponce — frequency-domain chapter for the signal-processing grounding.

Key takeaway: the frequency domain turns convolution into multiplication and makes "remove this periodic structure" trivial — reach for it whenever a problem is about scale or periodicity.

Module 5 · Sessions 18–20

Classical Features, Recognition & Matching

Detecting salient structure (corners, edges, blobs), segmenting and recognising objects with classical pipelines, and matching images via templates and feature correspondences + homography — the pre-deep-learning toolkit that still underpins SLAM, panoramas and registration.

Detect corners (Harris & Stephens), edges and curvature-based salient features.
Approach object recognition through edges, thresholding and region-based segmentation.
Match images with template matching and with feature correspondences + homography estimation.

SESSION 18 · LIVE IN-PERSON

Salient feature detection

Objective: find repeatable, distinctive image points.

Introduction & definitions — a good feature is repeatable (found again under viewpoint/lighting change), distinctive (its descriptor is unique enough to match), local (robust to occlusion and clutter) and ideally invariant to rotation, scale and illumination. These properties are what make features matchable across images.
Surface curvature & curvature-based methods — treat intensity as a 2-D surface $z=I(x,y)$; corners are points of high curvature in two directions, edges high curvature in one. Curvature of the level curves gives an early, principled corner measure.
Harris & Stephens method — form the second-moment (structure) matrix $M$ by summing gradient outer products over a window; its two eigenvalues describe how intensity varies locally. Rather than compute eigenvalues, Harris scores cornerness cheaply as $R=\det(M)-k\,\mathrm{tr}(M)^2$ ($k\approx0.04$–0.06): large positive $R$ = corner, large negative = edge, small = flat.
Gradient-based methods — build detectors and descriptors from the image gradient $\nabla I=(I_x,I_y)$: gradient magnitude marks edges, orientation histograms (the idea behind SIFT/HOG) give rotation-tolerant descriptors.

M = Σ_w [[I_x², I_xI_y],[I_xI_y, I_y²]] · R = det(M) − k·tr(M)² · det = λ₁λ₂, tr = λ₁+λ₂

Worked example: on a flat patch both eigenvalues $\lambda_1,\lambda_2\approx0$ so $R\approx0$; on an edge one is large, one ~0, so $\det\approx0$ but $\mathrm{tr}^2$ is large → $R\ll0$; on a corner both are large → $\det$ dominates → $R\gg0$. The single scalar $R$ thus classifies the three cases without ever extracting eigenvalues.

Key idea: a corner is where intensity changes in two directions, i.e. both eigenvalues of the local gradient covariance $M$ are large — Harris just scores that cheaply.

PitfallHarris is rotation-invariant but not scale-invariant — a corner at one zoom is an edge or blob at another. For scale invariance use a scale-space detector (DoG/SIFT), which searches over blur levels too.

Demo · 8 Harris / DoG / Canny

Harris & Stephens (1988) — the original four-page paper, very readable; Szeliski Ch. 7.1 (feature detectors & their invariances).

Key takeaway: the eigenvalues of the local gradient structure matrix diagnose flat / edge / corner — almost every classical detector is a way of reading those two numbers cheaply.

SESSION 19 · LIVE IN-PERSON

Intro to object recognition

Objective: separate objects from background with classical segmentation.

Introduction — fix the vocabulary: classification says what an image contains, detection says what and where (boxes), segmentation labels every pixel. Classical recognition tackles the simplest version — separate a known object from its background.
Edge detection-based techniques — find boundaries (Canny: gradient + non-max suppression + hysteresis thresholding) then close them into object contours; works when objects have crisp, complete edges but struggles with gaps and texture.
Image thresholding — global thresholding splits the histogram at one level; adaptive thresholding varies the level per region for uneven lighting. Otsu's method picks the global level automatically by maximising the between-class variance $\sigma^2_{between}(t)=w_0w_1(\mu_0-\mu_1)^2$ — equivalently minimising within-class variance — assuming a bimodal histogram.
Region-based algorithms — group pixels by similarity: region growing (expand from seeds while a homogeneity test holds), split-and-merge (quadtree partition then merge similar neighbours), and connected-component labelling of a binary mask.

Otsu: t* = argmax_t σ²_between(t) = argmax_t w₀(t)·w₁(t)·(μ₀(t)−μ₁(t))²

Worked example: Otsu sweeps every candidate threshold $t$, splits the histogram into "below/above", computes each side's weight $w$ and mean $\mu$, and keeps the $t$ that maximises $w_0w_1(\mu_0-\mu_1)^2$ — for a histogram with two clear humps it lands neatly in the valley between them, with no parameter to tune.

Key idea: segmentation can be driven by discontinuity (edges) or similarity (regions) — two complementary views of the same partition problem.

PitfallOtsu assumes a bimodal histogram; under uneven illumination one global threshold fails everywhere. Switch to adaptive/local thresholding, or correct the illumination first with a high-pass/background-subtraction step.

Demo · 13 Otsu / region / edge / k-means Demo · 9 Connected components

Otsu (1979) — the original automatic-thresholding paper; Szeliski Ch. 5 (segmentation, active contours, graph cuts); Forsyth & Ponce segmentation chapters.

Key takeaway: classical segmentation gives you a fast, label-free baseline — reach for thresholding/regions before deep models when lighting is controlled and objects are well separated.

SESSION 20 · LIVE IN-PERSON

Image matching techniques

Objective: locate and align content across images.

Template matching — slide a reference patch $T$ over the image and score each location: SSD (sum of squared differences, fast but brightness-sensitive) or normalised cross-correlation (NCC, which subtracts the mean and divides by the norm, making it invariant to linear brightness/contrast changes). The best-scoring location is the match.
Feature matching + homography — detect features, match their descriptors (nearest neighbour with a ratio test), then fit a homography $H$ with RANSAC: repeatedly sample 4 matches, fit $H$, count inliers, keep the best — discarding the inevitable wrong matches. The result aligns or stitches the images.

NCC(u,v) = Σ (I−Ī)(T−T̄) / sqrt( Σ(I−Ī)² · Σ(T−T̄)² ) ∈ [−1, 1]

Worked example: RANSAC's power is statistical — even with 50% wrong matches, the chance that a random sample of 4 is all-correct is $0.5^4=6\%$, so after ~70 iterations you are >99% likely to have hit a clean sample and recovered the true $H$. That is why it tolerates outliers a least-squares fit cannot.

Key idea: template matching is brittle to rotation/scale; feature matching + RANSAC + homography is the robust upgrade that makes panoramas and registration work.

Connects toThis is the payoff of the whole classical track: features (S18) + the homography (S6, S9–10) + RANSAC (S10) combine here into panorama stitching and image registration — the same pipeline that bootstraps SfM and visual SLAM.

Demo · 14 NCC / SSD matching Demo · 4 Homography warp

Szeliski — Ch. 7.1–7.2 (feature matching & the ratio test) and Ch. 8 (image alignment, RANSAC & stitching).

Key takeaway: use NCC when the target's appearance is fixed; switch to features + RANSAC + homography the moment viewpoint, rotation or scale can change.

Module 6 · Sessions 21–24

Deep Learning for Vision — CNNs & Architectures

The deep-learning core: convolutional neural networks from first principles, implementation in modern frameworks, the canonical architecture lineage (AlexNet → VGG → GoogLeNet → ResNet → U-Net), and recurrent/attention models for sequence and captioning tasks.

Explain CNN building blocks — convolution, non-linearity, pooling — and why they suit images.
Implement a CNN from scratch and in a deep-learning framework.
Compare landmark architectures and the ideas they introduced (depth, 1×1 convs, residuals, encoder–decoder).
Describe RNN/GRU/LSTM models and attention-based image captioning.

SESSION 21 · LIVE IN-PERSON

Convolutional neural networks

Objective: motivate and define the CNN.

Introduction — a fully-connected MLP on a $224\times224\times3$ image would need ~150k weights per neuron and ignore that nearby pixels relate. A convolutional layer instead slides a small shared kernel, so it has few parameters and respects spatial structure — the right model for images.
Applications — the same convolutional backbone, with different output heads, powers classification, detection, semantic/instance segmentation, depth, pose and generation — the unifying architecture of modern CV.
Functional perspective — a CNN is a composition $f=f_L\circ\dots\circ f_1$ of differentiable layers; training minimises a loss by gradient descent, with gradients computed by backpropagation (the chain rule applied layer by layer from the loss backward).

conv: y = σ(W * x + b) · output size: out = ⌊(n + 2p − k)/s⌋ + 1

Worked example: a $32\times32$ input, $5\times5$ kernel, padding $p=0$, stride $s=1$ gives output $(32-5)/1+1 = 28$, i.e. a $28\times28$ map. With $p=2$ ("same" padding) it stays $32\times32$. A 3×3 conv on 64 input channels producing 128 output channels has only $3\cdot3\cdot64\cdot128 = 73{,}728$ weights — tiny next to a dense layer.

Key idea: convolution gives CNNs local connectivity, weight sharing and translation equivariance — far fewer parameters than a dense net, and the right inductive bias for images.

Connects toThe conv operation is exactly Session 16's convolution — only now the kernel weights $W$ are learned by backprop instead of designed by hand. Memorise the output-size formula; it is examined and used constantly.

Demo · 10 MLP playground Demo · 11 CNN feature maps

Drummy et al., Dive into Deep Learning — the CNN chapters (convolutions, padding/stride, channels); Szeliski Ch. 5.3–5.4 for the CV framing.

Key takeaway: a CNN trades a dense net's brute force for the inductive biases images actually have — locality, weight sharing, translation equivariance — which is why it generalises from far less data.

SESSION 22 · LIVE IN-PERSON

Convolutional neural networks II

Objective: build a CNN in practice.

Deep learning frameworks — tensors (n-D arrays on CPU/GPU), autograd (the framework records operations and differentiates automatically), and the canonical training loop: forward pass → compute loss → loss.backward() → optimizer.step() → zero gradients. GPUs accelerate the underlying matrix multiplies by 1–2 orders of magnitude.
Implementing a CNN from scratch — stack conv → activation (ReLU) → pooling blocks to extract features, flatten, then fully-connected layers to a class score; pick a loss (cross-entropy for classification), an optimiser (SGD/Adam) and a learning rate, then train and monitor train/validation curves.

max-pool 2×2: y(i,j) = max over 2×2 window · ReLU: σ(z)=max(0,z)

Worked example: a $2\times2$ max-pool with stride 2 on a $28\times28$ map halves it to $14\times14$, keeping the strongest activation in each window — so the feature "an edge is present here-ish" survives a small shift. A typical block then doubles channels (e.g. 32→64), trading the lost spatial resolution for richer feature semantics.

Key idea: pooling and strided convolution shrink spatial resolution while growing channel depth — the network trades "where" for "what" as it goes deeper.

PitfallForgetting to zero gradients each step makes them accumulate across batches and training diverges; and a learning rate too high explodes the loss while too low stalls it. Watch the loss curve from step one.

Demo · 11 Pooling & activations

Dive into Deep Learning — the modern-training chapters and the PyTorch framework tutorials: build and train a small CNN end to end.

Key takeaway: the train loop (forward → loss → backward → step) is the same for every model in Modules 6–7 — internalise it once and only the architecture and loss change afterward.

SESSION 23 · LIVE IN-PERSON

Convolutional neural networks III — advanced architectures

Objective: trace the landmark architectures and the ideas each contributed.

AlexNet (2012) — the breakthrough that started the deep era: 8 layers, ReLU activations (faster, no saturation), dropout regularisation, and training split across two GPUs. Halved the ImageNet error and ended the hand-crafted-feature era.
VGG (2014) — showed that depth drives accuracy: uniform stacks of small $3\times3$ convolutions, where two stacked 3×3s match one 5×5's receptive field with fewer parameters and more non-linearity.
GoogLeNet / Inception (2014) — parallel multi-scale branches per module, and $1\times1$ convolutions to mix channels and cut depth cheaply before expensive 3×3/5×5 convs — far fewer parameters at the same accuracy.
ResNet (2015) — the residual block $y=F(x)+x$: the layer learns a residual on top of the identity, so adding depth never hurts. Enabled 50–152+ layer networks and remains the default backbone.
U-Net (2015) — a symmetric encoder–decoder with skip connections that copy high-resolution encoder features across to the decoder, giving sharp per-pixel output — the standard for segmentation and the structural ancestor of diffusion denoisers.

Residual block: y = F(x, {W_i}) + x · two 3×3 convs ≈ one 5×5 receptive field, fewer params

Worked example: stacking plain layers past ~20 deep raised training error (the degradation problem). The residual fix is subtle: if the optimal map is the identity, a plain block must learn it exactly, but a residual block just drives $F(x)\to0$ — far easier — so gradients flow through the $+x$ shortcut and 100+ layers finally train.

Key idea: the residual shortcut lets gradients flow straight through, so "deeper" stops hurting — ResNet is why 100+ layer networks train at all.

Connects toThis lineage is one idea evolving — go deeper (VGG), go efficient (Inception's 1×1), go trainable-at-depth (ResNet's skip), go dense-output (U-Net's skip). U-Net's encoder–decoder reappears in segmentation (S25) and diffusion (S28).

Demo · 12 LeNet/VGG/ResNet/U-Net/ViT

He et al. (ResNet, 2016) and Simonyan & Zisserman (VGG, 2015) — read both for the depth/residual story; Dive into Deep Learning — the modern-CNN chapter walks through each architecture in code.

Key takeaway: you don't memorise architectures, you memorise the ideas each contributed — depth, 1×1 mixing, residual shortcuts, encoder–decoder skips — and recognise them recombined everywhere after.

SESSION 24 · LIVE IN-PERSON

Recurrent neural networks

Objective: model sequences and connect vision to language.

Introduction — an RNN carries a hidden state forward, $h_t=f(Wx_t+Uh_{t-1}+b)$, reusing the same weights at every step so it handles variable-length sequences (video frames, word sequences). Trained by backpropagation-through-time.
GRU networks — gated recurrent units use update and reset gates to decide how much past state to keep, mitigating vanishing gradients with fewer parameters and gates than the LSTM — often the practical default.
LSTM networks — add an explicit cell state plus input/forget/output gates; the forget gate lets information persist or be erased across many steps, capturing long-range dependencies a plain RNN forgets.
Image captioning with attention — a CNN encodes the image into a grid of feature vectors; an RNN decoder generates words one at a time and, at each step, computes an attention distribution $\alpha=\text{softmax}(\text{score}(h,\text{features}))$ over the grid to look at the relevant region for the next word.

h_t = f(W·x_t + U·h_{t−1} + b) · attention: α = softmax(score(h, features)), context = Σ α·features

Worked example: captioning "a dog catching a frisbee", the decoder's attention $\alpha$ peaks over the dog's head when emitting "dog" and shifts to the disc when emitting "frisbee" — a soft, differentiable pointer over image regions, visualisable as a heatmap per word.

Key idea: attention lets the decoder look back at the most relevant image regions for each word — the precursor of the transformer attention in Session 29.

Pitfall & connects toRNNs process steps sequentially, so they are slow and still strain on very long dependencies. Session 29's self-attention drops recurrence entirely for a parallel, global mechanism — the same attention idea you meet here, scaled up into the transformer.

Demo · 12 Architectures (ViT/attention)

Dive into Deep Learning — the RNN/GRU/LSTM and attention chapters (with the captioning case study); Bishop, Pattern Recognition and ML for the probabilistic sequence-model background.

Key takeaway: gating solved the memory problem and attention solved the "which input matters" problem — together they set up the transformer, which keeps attention and discards recurrence.

Module 7 · Sessions 25–30

Detection, Segmentation, Generative Models & Transformers

The modern frontier: dense prediction (detection, semantic/instance segmentation), training a real detector (YOLO), generative modelling (VAEs, PixelRNN/CNN, GANs, diffusion), and the attention/transformer revolution (ViT, hybrids) that now spans classification, detection and generation.

Distinguish classification, localization, detection, semantic and instance segmentation, and the architectures for each.
Train a YOLO model for real-time object detection.
Explain generative families — VAE, autoregressive (PixelRNN/CNN), GAN, diffusion — and their trade-offs.
Describe self-attention, Vision Transformers and CNN–transformer hybrids, and their CV applications.

SESSION 25 · LIVE IN-PERSON

Detection & segmentation using Deep Learning

Objective: map each dense-prediction task to its network design.

Semantic segmentation — label every pixel with a class via a fully-convolutional encoder–decoder (U-Net/FCN), trained with per-pixel cross-entropy; it tells what is at each pixel but does not separate two instances of the same class.
Classification + localization — one object: a shared backbone feeds a classification head (softmax over classes) and a regression head (4 box coordinates), trained with a combined classification + box-regression loss.
Object detection / instance segmentation — many objects: two-stage (Faster R-CNN: propose regions then classify/refine) or one-stage (SSD, YOLO: dense predictions in one pass); Mask R-CNN adds a per-instance mask head. IoU measures box overlap and non-max suppression removes duplicate boxes for the same object.

IoU = area(∩) / area(∪) ∈ [0,1] · match if IoU ≥ 0.5; NMS keeps highest-score box, drops overlaps

Worked example: a predicted box and the ground-truth box overlap on 30 px² and together cover 90 px², so IoU $=30/90=0.33$ — below the usual 0.5 threshold, this counts as a miss. NMS uses the same metric: among overlapping detections it keeps the top-scoring box and suppresses any other with IoU above a threshold.

Key idea: the four tasks form a ladder — "what" → "what + where (one)" → "what + where (many)" → "what + where + exact pixels" — each adding an output head, not a new paradigm.

Connects toThe shared backbone is a Module 6 CNN (often ResNet); semantic segmentation reuses U-Net (S23) directly. Only the heads and losses change per task — the representation is the same.

Demo · 13 Segmentation Demo · 14 Detection Demo · 12 U-Net

Szeliski — Ch. 6 (recognition & detection); Dive into Deep Learning — the object-detection and semantic-segmentation chapters (anchors, IoU, NMS, FCN).

Key takeaway: dense-prediction tasks differ only by output head and loss on a shared backbone — once you can read IoU and NMS, the detection literature stops being a zoo.

SESSION 26 · LIVE IN-PERSON

Detection & segmentation using Deep Learning II

Objective: train a one-stage detector end-to-end.

Training YOLO for object detection — divide the image into an $S\times S$ grid; each cell predicts a few anchor-based boxes with coordinates, an objectness confidence and class probabilities, all in one forward pass. The multipart loss sums box-localization, objectness and classification terms (with $\lambda_{coord}$ up-weighting localization and down-weighting empty cells); at inference, confidence thresholding + NMS yield the final detections. In practice you fine-tune a pretrained YOLO on your own annotated dataset.

YOLO loss = λ_coord·L_box + L_obj + λ_noobj·L_noobj + L_cls · one forward pass → all boxes

Worked example: a 13×13 grid with 3 anchors and 80 classes predicts $13\cdot13\cdot3=507$ candidate boxes per image in a single pass; most have near-zero objectness and are thresholded away, then NMS collapses the overlapping survivors to a handful of final detections — fast enough for video at 30+ FPS.

Key idea: YOLO reframes detection as a single regression over a grid, trading a little accuracy for real-time speed — "you only look once".

PitfallOne-stage detectors face extreme foreground/background imbalance (the vast majority of grid cells are empty); the $\lambda_{noobj}$ down-weighting (and, in later detectors, focal loss) is what stops the background dominating the gradient.

Demo · 14 Object detection

Redmon et al. (YOLO, 2016) — the original one-look formulation; Ultralytics YOLO docs for the practical training workflow; Dive into Deep Learning for anchors and loss design.

Key takeaway: the speed/accuracy trade-off is the detection design axis — one-stage (YOLO) for real time, two-stage (Faster R-CNN) when precision matters more than latency.

SESSION 27 · LIVE IN-PERSON

Generative models I

Objective: learn distributions over images (latent-variable & autoregressive).

Unsupervised learning — learn structure without labels. An autoencoder compresses an image to a bottleneck code and reconstructs it; PCA is the linear special case (the optimal linear bottleneck), so an autoencoder is "PCA with non-linear encoder/decoder".
Variational Autoencoders (VAE) — the encoder outputs a distribution $q(z|x)$ (mean + variance) rather than a point; training maximises the ELBO = reconstruction term − KL($q\,\|\,$prior), and the reparameterization trick $z=\mu+\sigma\odot\epsilon$ keeps it differentiable. Sampling $z$ from the prior and decoding generates new images.
PixelRNN / PixelCNN — autoregressive models that factor the image likelihood exactly as $p(x)=\prod_i p(x_i\mid x_{<i})$, predicting each pixel conditioned on the ones already generated — sharp and likelihood-exact, but slow because generation is sequential.

VAE: maximize ELBO = E_q[log p(x|z)] − KL( q(z|x) ‖ p(z) ) · reparam: z = μ + σ⊙ε, ε∼N(0,I)

Worked example: the KL term pulls every encoded $q(z|x)$ toward the standard-normal prior, so the latent space has no "holes". Decode the midpoint $0.5z_A+0.5z_B$ of two digits' codes and you get a plausible digit halfway between them — the smoothness a plain autoencoder lacks.

Key idea: a VAE makes the latent space continuous and smooth, so walking through $z$ morphs generated images — exactly what the latent-walk demo shows.

Connects toThree generative philosophies meet here and in S28: VAEs (latent-variable, smooth but blurry), autoregressive (exact likelihood, slow), and — next session — GANs (sharp, implicit) and diffusion (iterative denoising). Each trades sample quality, diversity, speed and training stability differently.

Demo · 16 Autoencoder / PCA Demo · 17 VAE latent walk Demo · 15 Self-supervised

Kingma & Welling (VAE, 2014) — the original ELBO + reparameterization paper; Bishop, Pattern Recognition and ML — latent-variable models & the EM/variational background; Dive into Deep Learning for the code.

Key takeaway: generative modelling is about learning $p(x)$; VAEs do it through a smooth latent bottleneck, autoregressive models through an exact pixel-by-pixel factorisation — different routes to the same goal.

SESSION 28 · LIVE IN-PERSON

Generative models II

Objective: contrast adversarial and diffusion generation.

Generative Adversarial Networks (GANs) — two networks compete: a generator $G$ turns noise $z$ into fake images, a discriminator $D$ tries to tell real from fake, and $G$ is trained to fool $D$. At the min-max equilibrium $G$'s samples are indistinguishable from real data — yielding very sharp images, at the cost of unstable training and mode collapse (the generator emits only a few varieties).
Diffusion models — define a fixed forward process that gradually adds Gaussian noise until an image becomes pure noise, then train a network to reverse one noising step (in practice, to predict the noise $\epsilon$). Generation starts from noise and denoises iteratively to a sample — stable to train, high quality and diverse, the basis of modern text-to-image systems.

GAN: min_G max_D E[log D(x)] + E[log(1 − D(G(z)))] · Diffusion forward: x_t = √(ᾱ_t)·x₀ + √(1−ᾱ_t)·ε · loss: ‖ε − ε_θ(x_t,t)‖²

Worked example: a diffusion model's training target is dead simple — corrupt a clean image to $x_t$ with known noise $\epsilon$, then minimise $\|\epsilon-\epsilon_\theta(x_t,t)\|^2$, i.e. just predict the noise you added. Sampling then peels that noise off step by step from pure static. The denoiser $\epsilon_\theta$ is typically a U-Net (Session 23) — the architecture comes full circle.

Key idea: GANs learn generation implicitly through a critic; diffusion learns it by reversing noise step by step — more stable to train and now state of the art.

PitfallGAN training is a delicate equilibrium: if $D$ wins too fast the generator's gradients vanish, and mode collapse can leave it producing one image. Diffusion sidesteps both with a stable regression loss — at the cost of slow, many-step sampling.

Demo · 17 GAN walk & diffusion noising

Goodfellow et al. (GAN, 2014) — the adversarial game; Ho et al. (DDPM, 2020) — the noise-prediction diffusion formulation; Dive into Deep Learning — GAN chapter for the training loop.

Key takeaway: diffusion won the quality/stability/diversity trade-off that held GANs back — and it does so by reusing the U-Net and the standard regression loss you already know.

SESSION 29 · LIVE IN-PERSON

Transformer architecture

Objective: understand attention and Vision Transformers.

Self-attention & transformers — every token emits a query, key and value; attention weights each token's value by the softmax-normalised similarity of its query to all keys, $\text{softmax}(QK^{\top}/\sqrt{d_k})V$. The $\sqrt{d_k}$ scaling keeps the dot products from saturating the softmax. Multi-head attention runs several in parallel; no recurrence or convolution is involved, so it parallelises fully.
Vision Transformers (ViT) — cut the image into fixed patches (e.g. 16×16), linearly embed each as a token, add positional encodings (since attention is permutation-invariant), prepend a class token, and run a standard transformer encoder. With enough data it matches or beats CNNs.
Hybrid models — combine a CNN feature extractor (strong local inductive bias, data-efficient) with transformer reasoning over those features (global context), e.g. DETR for end-to-end detection without hand-designed anchors/NMS.
Applications — one attention-based backbone now spans classification, detection and generation, increasingly unifying the architectures of the whole field.

Attention(Q,K,V) = softmax( Q Kᵀ / √d_k ) V · cost O(n²) in the number of tokens n

Worked example: a CNN pixel only "sees" distant pixels after stacking many layers (its receptive field grows linearly with depth); a single self-attention layer connects every patch to every other at once. The price is the $QK^{\top}$ matrix being $n\times n$ — quadratic in the token count, which is why patches (not pixels) are used and why long-sequence efficiency is an active research area.

Key idea: self-attention gives a global receptive field in one layer — every patch can attend to every other — which is what convolution achieves only by stacking depth.

Connects toThis is the captioning attention of Session 24 generalised: instead of a decoder attending to image regions, every token attends to every other. ViT's patches are the transformer analogue of a CNN's local windows, and hybrids fuse the two inductive biases.

Demo · 12 Vision Transformer

Vaswani et al. (Attention Is All You Need, 2017) — the transformer; Dosovitskiy et al. (ViT, 2021) — images as patch sequences; Dive into Deep Learning — attention & transformers, with code.

Key takeaway: attention buys a global receptive field at quadratic cost in tokens — the central trade-off that shapes every modern vision-transformer design choice.

SESSION 30 · LIVE IN-PERSON

Synthesis & wrap-up

Objective: integrate the course — from image formation to transformers — and close the project work.

End-to-end view — trace one real pipeline through every stage you have learned: acquire and calibrate (Modules 1–2) → pre-process and filter (Module 4) → detect features or segment (Module 5) → feed a deep model (Modules 6–7) → produce the output. At each stage you now choose deliberately between a classical, interpretable, data-light tool and a learned, powerful, data-hungry one.
Project consolidation & outlook — final-project guidance (scope, baseline-first, honest evaluation) and the open frontiers: foundation and vision-language models, 3-D and video understanding, and efficiency/deployment on constrained hardware.

Worked example: an autonomous-checkout system threads the whole course — calibrated cameras (S12–13) and depth (S8, S11) localise items in 3-D, classical segmentation (S19) or a learned detector (S26 YOLO) finds them, a CNN/ViT (S23, S29) classifies, and the geometry (S7) maps detections back to shelf positions. No single technique suffices; the architecture is the integration.

Key idea: the field is one continuous story — geometry tells you where, processing tells you what's there, and learning tells you what it means; great systems combine all three.

Exam & project tipFor the final, be able to place any technique on the geometry/processing/learning map and justify a classical-vs-learned choice. For the projects, ship a working baseline early, measure it honestly, and document why you chose each component — that reasoning is what the rubrics reward.

Review · All 17 demos About the playground

Szeliski — concluding chapters & applications; revisit each module's readings for the final exam, prioritising the formulas on your one-page sheet.

Key takeaway: the goal was never any single algorithm — it was the judgement to compose geometry, processing and learning into a system that solves a real vision problem.

Key concepts — glossary

Around 50 terms threaded through the course, in roughly pipeline order. Each links conceptually to a session and, where relevant, a live demo.

Image as a function I(x,y): The basic model of a digital image: a function from pixel coordinates to intensity (plus channels for colour, time for video). Everything operates on this array. (S1)
Inverse problem: Recovering the 3-D scene/meaning from a 2-D projection; ill-posed, so priors, geometry or learned statistics are always needed. (S1)
Thin-lens equation: $1/f = 1/z_o + 1/z_i$ relating focal length, object and image distance; only the conjugate distance focuses sharply. (S2)
Aperture & depth of field: The iris opening trades light intake against the depth range that appears sharp; wide aperture = bright but shallow DoF. (S2, S4)
Gamma encoding: $V_{out}=V_{in}^{\gamma}$ allocating more code values to dark tones to match the eye's response (encode ~1/2.2). (S4, S15)
Metamerism: Different spectra that excite the cones identically look the same — why three RGB primaries can fool the eye. (S3)
JPEG / DCT compression: Block discrete cosine transform + coarse quantization of high frequencies + entropy coding; lossy, exploits perceptual redundancy. (S3, S17)
Homogeneous coordinates: Appending a scale coordinate so points are defined up to a non-zero factor; makes translation and perspective linear matrix operations. (S5)
Point–line duality (ℙ²): Points and lines are both 3-vectors; $\mathbf{x}^{\top}\mathbf{l}=0$ for incidence, and the cross product does both join and meet. (S5)
Transformation hierarchy: Nested groups rigid ⊂ similarity ⊂ affine ⊂ projective (3/4/6/8 DOF); the DOF count sets how many correspondences you need. (S6)
Pinhole camera: Ideal projection $x=f X/Z,\,y=fY/Z$; the geometric core of every camera model. (S2, S7)
Intrinsics K: Focal lengths, principal point and skew — the camera's internal geometry estimated by calibration. (S7, S12–13)
Extrinsics [R|t]: The camera's rotation and translation in the world — its pose. (S7)
Homography: A 3×3 projective map between two images of a plane (or pure-rotation views); 8 DOF. (S6, S9–10, S20)
Projection matrix P = K[R|t]: The $3\times4$ map from world points to pixels; factors into intrinsics (what the camera is) and extrinsics (where it is). (S7)
Epipolar geometry: The two-view constraint $x'^{\top}Fx=0$ that reduces correspondence search to a line. (S9–10)
Fundamental / essential matrix: $F$ relates two uncalibrated views; $E=K'^{\top}FK$ for calibrated cameras factors as $[t]_\times R$ into relative pose. (S9–10)
8-point / DLT algorithm: Linear estimation of $F$ or $H$ by stacking correspondences and solving $Ah=0$ via SVD; normalise coordinates first. (S10)
RANSAC: Random sample consensus: repeatedly fit a model to a minimal sample and keep the one with most inliers — robust to outlier matches. (S10, S20)
Bundle adjustment: Joint non-linear least-squares refinement of all camera and 3-D-point parameters minimising total reprojection error. (S10)
Rectification: Warping a stereo pair so epipolar lines become image rows, turning disparity into a 1-D per-row search. (S11)
Disparity: Horizontal shift of a point between rectified stereo views; inversely proportional to depth. (S8, S11)
Structure from motion: Recovering cameras and 3-D points from multiple views, refined by bundle adjustment. (S9–10)
Optical flow: Apparent per-pixel motion field, constrained by brightness constancy $I_xu+I_yv+I_t=0$. (S8)
Camera calibration: Estimating $K$ and lens distortion (Zhang's method), validated by reprojection error. (S12–13)
Reprojection error: RMS pixel distance between observed and reprojected points — the calibration quality metric. (S13)
Color space: A coordinate system for colour (RGB, HSV, Lab, YCbCr) chosen to separate the property of interest. (S3)
Histogram equalization: Remapping intensities by their CDF to maximise contrast. (S15, demo 5)
Convolution: Sliding weighted sum $(I*K)(x,y)=\sum I(x-i,y-j)K(i,j)$; the atom of filtering and CNNs. (S16, demo 6)
Separable filter: A 2-D kernel that factors into two 1-D passes, cutting cost from $O(k^2)$ to $O(2k)$. (S16)
Point operation (LUT): A transform $s=T(r)$ depending only on a pixel's own value — fully described by a 1-D lookup curve (negative, log, gamma, stretch). (S15)
Image gradient: $\nabla I=(I_x,I_y)$; magnitude marks edges, orientation their direction. (S8, S16, S18)
Aperture problem: From a local window only motion normal to an edge is observable; optical flow needs a smoothness/constant-flow assumption to recover full $(u,v)$. (S8)
Convolution theorem: Convolution in space equals multiplication of spectra; blurring/sharpening = applying a low/high-pass mask in the frequency domain. (S17)
Edge detection: Locating intensity discontinuities (Sobel magnitude, Canny with NMS + hysteresis). (S16, S18–19, demo 8)
Fourier transform (2-D DFT): Image as a sum of spatial frequencies; filtering = shaping the spectrum. (S17, demo 7)
Harris corner: Cornerness $R=\det(M)-k\,\mathrm{tr}(M)^2$ from the gradient second-moment matrix. (S18, demo 8)
Otsu thresholding: Automatic binarization maximising between-class variance. (S19, demo 13)
Morphology: Erode/dilate/open/close on binary images via a structuring element. (S15, demo 9)
NCC / template matching: Brightness-invariant patch correlation for locating templates. (S20, demo 14)
Backpropagation: The chain rule applied layer by layer from the loss backward to compute every parameter's gradient; the engine of all deep training. (S21–22)
Conv output-size formula: $\text{out}=\lfloor(n+2p-k)/s\rfloor+1$ for input $n$, kernel $k$, padding $p$, stride $s$; essential for tracking feature-map dimensions. (S21)
ReLU & activation: $\sigma(z)=\max(0,z)$ — the non-linearity that lets stacked layers represent non-linear functions without saturating gradients. (S21–22)
Translation equivariance: Shift the input and a CNN's feature maps shift identically — the inductive bias weight-sharing convolution provides. (S21)
Pooling: Down-sampling (max/avg) that adds translation tolerance and shrinks feature maps. (S22, demo 11)
Softmax & cross-entropy: Softmax turns scores into class probabilities; cross-entropy is the loss that trains classifiers and per-pixel segmenters. (S22, S25)
Inception / 1×1 convolution: Parallel multi-scale branches with 1×1 convs to mix channels and cut cost cheaply (GoogLeNet). (S23, demo 12)
Receptive field: The input region influencing one activation; grows with depth, instant with attention. (S21–23, S29)
Residual connection: $y=F(x)+x$ shortcut that lets very deep networks train (ResNet). (S23, demo 12)
Encoder–decoder (U-Net): Contracting + expanding paths with skip connections for dense (segmentation) output. (S23, S25, demo 12)
Recurrence (RNN/GRU/LSTM): State carried across sequence steps with shared weights; gates (GRU/LSTM) preserve long-range memory against vanishing gradients. (S24)
Attention: A softmax-weighted, differentiable lookup that lets a model focus on the most relevant inputs; first met in captioning. (S24, S29)
Semantic vs instance segmentation: Per-pixel class labels vs per-pixel labels that also separate individual objects (masks per instance). (S25, demo 13)
IoU & NMS: Intersection-over-union overlap metric and non-max suppression to prune duplicate detections. (S25–26, demo 14)
YOLO / one-stage detection: Predicts all boxes + classes in a single grid-regression pass; fast (real-time) at some accuracy cost. (S26, demo 14)
Autoencoder & PCA: A bottleneck that compresses then reconstructs; PCA is the optimal linear case, an autoencoder its non-linear generalisation. (S27, demo 16)
ELBO & reparameterization: The VAE objective (reconstruction − KL) and the trick $z=\mu+\sigma\epsilon$ that keeps sampling differentiable. (S27)
Latent space: Low-dimensional code an autoencoder/VAE maps to; smooth in VAEs so interpolation is meaningful. (S27, demos 16–17)
Autoregressive model: Factorises $p(x)=\prod_i p(x_i\mid x_{<i})$, generating pixels one at a time (PixelRNN/CNN); exact likelihood, slow sampling. (S27)
Mode collapse: A GAN failure where the generator produces only a few varieties of sample, losing diversity. (S28, demo 17)
GAN: Generator vs discriminator min-max game producing sharp samples. (S28, demo 17)
Diffusion model: Generation by learning to reverse a gradual noising process. (S28, demo 17)
Self-attention: $\mathrm{softmax}(QK^{\top}/\sqrt{d_k})V$ relating all positions; the basis of transformers/ViT. (S29, demo 12)
Vision Transformer (ViT): Image as a sequence of patch tokens processed by a transformer encoder. (S29, demo 12)
Patch tokens & positional encoding: ViT splits the image into embedded patches (tokens) and adds positional encodings because attention is otherwise order-agnostic. (S29)
CNN–transformer hybrid: A CNN front end (local, data-efficient) feeding transformer reasoning (global context), e.g. DETR for anchor-free detection. (S29)
Inductive bias: Built-in assumptions a model makes (locality/weight-sharing in CNNs, global relations in transformers) that shape what it learns from limited data. (S21, S29)

Annotated bibliography

Recommended texts from the official syllabus. Annotations indicate where each is most useful in this course.

Hartley, R. I. & Zisserman, A. — Multiple View Geometry in Computer Vision. ISBN 9780521540513 (Digital)
The definitive reference for the geometry module (S5–13): projective spaces, the camera matrix, homographies, epipolar geometry, F/E estimation and SfM.
Forsyth, D. & Ponce, J. — Computer Vision: A Modern Approach. ISBN 0130851981 (Digital)
Balanced classical text covering cameras, colour, filtering, segmentation and recognition — strong support for Modules 1, 4 and 5.
Livingstone, M. — Vision and Art: The Biology of Seeing. ISBN 9780810904064 (Digital)
Accessible grounding in human vision, colour and luminance perception — motivation for the formation/colour material (S2–3).
Szeliski, R. — Computer Vision: Algorithms and Applications. ISBN 9783030343743 (Printed)
The course's broadest backbone, spanning nearly every session from image formation through features, segmentation, recognition and deep learning.
Bishop, C. M. — Pattern Recognition and Machine Learning. ISBN 9780387310732 (Printed)
Rigorous ML foundations — probability, latent-variable models and neural nets — underpinning Modules 6–7 (CNNs, VAEs, sequence models).
Drummy, M. et al. — Dive into Deep Learning: Tools for Engagement. ISBN 9781544361376 (Printed)
Hands-on, framework-based deep-learning reference for the CNN, architecture, RNN, generative and transformer sessions (S21–30).