CS Vision

Worked Project · Image Classification with a Convolutional Neural Network

An end-to-end, reproducible CNN built in PyTorch on CIFAR-10 — from data loading and augmentation through training, evaluation, interpretation, and a classical HOG + SVM baseline. The complete arc of Module 6 (Sessions 21–23), worked in full.

This is a study companion, not an assessed deliverable. It shows how the pieces of an image-classification system fit together with real, runnable code and the underlying mathematics. Read it alongside the course outline and experiment with the matching interactive demos.

Task: 10-class image classification
Dataset: CIFAR-10 (60k 32×32 RGB)
Model: Small VGG-style CNN
Stack: Python · PyTorch · torchvision
Baseline: HOG + linear SVM
Sessions: 21 · 22 · 23

1 · Overview

Goal. Train a convolutional neural network that maps a small RGB image to one of ten object categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) and learn the full engineering loop around it: a reproducible data pipeline, a principled architecture, a correct training loop, honest evaluation, and an interpretation of what the network actually learned. To anchor "is deep learning worth it here?", we compare against a strong classical baseline — Histogram of Oriented Gradients features fed to a linear Support Vector Machine — built from the feature-engineering ideas of Module 5.

Why CIFAR-10. It is the canonical first real classification benchmark: large enough that hand-tuned features struggle, small enough (32×32) to train a competitive CNN on a single GPU in minutes. The exact same code runs on MNIST by changing the dataset and the input-channel count — the pipeline is dataset-agnostic.

Sessions exercised. This project is the practical payoff of Module 6 (Sessions 21–23).

It also reaches back to the convolution and edge/feature material of Sessions 16–19 (the learned kernels are the hand-designed kernels of Module 4, now trained from data) and forward to detection/segmentation in Sessions 25–26.

Key idea: a CNN is not magic — it is the convolution you designed by hand in demo 6, with the kernel weights learned by gradient descent instead of chosen by you. Everything in this project is built from primitives you have already met.

2 · Dataset & input pipeline

CIFAR-10 is 60,000 colour images at $32\times32\times3$, split into 50,000 training and 10,000 test images, 6,000 per class. Each pixel is a uint8 in $[0,255]$; before feeding the network we convert to a float tensor in $[0,1]$ and then standardise per channel using the dataset's mean $\mu_c$ and standard deviation $\sigma_c$:

$\hat{x}_c = \dfrac{x_c - \mu_c}{\sigma_c}, \qquad \mu=(0.4914,\,0.4822,\,0.4465),\quad \sigma=(0.2470,\,0.2435,\,0.2616)$

Standardisation centres each channel near zero with unit variance, which keeps activations and gradients well-scaled and lets the optimiser use a single learning rate across channels.
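
As a small worked example of the formula above, the sketch below pushes one pixel through the two steps: scale to $[0,1]$, then subtract the channel mean and divide by the channel standard deviation. The function name `standardise_pixel` and the mid-grey sample pixel are illustrative assumptions, not part of the project code.

```python
# Per-channel statistics quoted in the text above.
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2470, 0.2435, 0.2616)

def standardise_pixel(rgb_uint8):
    """Map one (R, G, B) uint8 pixel to standardised floats."""
    out = []
    for value, mu, sigma in zip(rgb_uint8, MEAN, STD):
        x = value / 255.0              # uint8 [0,255] -> float [0,1]
        out.append((x - mu) / sigma)   # centre and scale per channel
    return tuple(out)

pixel = (128, 128, 128)  # hypothetical mid-grey pixel
standardised = standardise_pixel(pixel)
print([round(v, 3) for v in standardised])
```

A mid-grey pixel lands close to zero in every channel, which is the point of the centring step.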

Data augmentation

With only 5,000 images per class, the network will memorise the training set unless we artificially enlarge it. Augmentation applies label-preserving random transforms on the fly so the model never sees exactly the same image twice — a regulariser that encodes our prior that class identity is invariant to small shifts and horizontal mirroring. We apply random crops (with 4-px reflection padding) and random horizontal flips to the training set only; the test set gets just the deterministic normalisation, so evaluation is fair.
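
To make the two transforms concrete, here is a minimal pure-Python sketch of reflection padding, random cropping and horizontal flipping on a tiny single-channel image. It is an illustration under stated assumptions (nested lists instead of tensors, hypothetical helper names `reflect_pad` and `random_crop_flip`), not how torchvision implements them.

```python
import random

def reflect_pad(img, p):
    """Pad a 2-D list by p pixels per side, mirroring about the border pixel."""
    def idx(i, n):
        # Reflect without repeating the edge pixel: -1 -> 1, n -> n - 2.
        if i < 0:
            return -i
        if i >= n:
            return 2 * (n - 1) - i
        return i
    h, w = len(img), len(img[0])
    return [[img[idx(r, h)][idx(c, w)] for c in range(-p, w + p)]
            for r in range(-p, h + p)]

def random_crop_flip(img, size, p, rng):
    """Pad, take a random size x size crop, then mirror it half of the time."""
    padded = reflect_pad(img, p)
    top = rng.randint(0, len(padded) - size)
    left = rng.randint(0, len(padded[0]) - size)
    crop = [row[left:left + size] for row in padded[top:top + size]]
    if rng.random() < 0.5:
        crop = [row[::-1] for row in crop]  # horizontal flip
    return crop

img = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "image"
rng = random.Random(0)                                   # seeded for reproducibility
aug = random_crop_flip(img, size=4, p=2, rng=rng)
for row in aug:
    print(row)
```

The output has the same size as the input, so the label and the network input shape are unchanged; only the content is shifted or mirrored.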

Pitfall. Never apply random augmentation to the test/validation set, and never compute the normalisation statistics on the test set. The first makes the reported accuracy noisy and non-reproducible; the second leaks test information into the model and inflates your reported accuracy. Statistics come from the training split only.

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

MEAN = (0.4914, 0.4822, 0.4465)
STD  = (0.2470, 0.2435, 0.2616)

# Training transforms: random crop + flip (augmentation) THEN normalise
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4, padding_mode="reflect"),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),                # uint8 [0,255] -> float [0,1], HWC -> CHW
    transforms.Normalize(MEAN, STD),
])
# Test transforms: deterministic only — no augmentation, no leakage
test_tf = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])

train_set = datasets.CIFAR10("./data", train=True,  download=True, transform=train_tf)
test_set  = datasets.CIFAR10("./data", train=False, download=True, transform=test_tf)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True,
                          num_workers=4, pin_memory=True)
test_loader  = DataLoader(test_set,  batch_size=256, shuffle=False,
                          num_workers=4, pin_memory=True)

CLASSES = ("plane","car","bird","cat","deer",
           "dog","frog","horse","ship","truck")

Pipeline. torchvision downloads the dataset, the DataLoader batches and shuffles it, and transforms run lazily per sample on the worker processes. Demo · image input

3 · The CNN architecture & the dimension math

The model is a small VGG-style network: three convolutional stages, each made of two $3\times3$ convolutions (every one followed by batch-norm and ReLU) and then a $2\times2$ max-pool. Within a stage the first convolution raises the channel count (doubling it from the second stage on) and the max-pool halves the spatial size. The stages are followed by global average pooling and a fully-connected classifier. This is exactly the conv → pool → FC pattern of Session 21, stacked.

The convolution operation

A single output feature map is a learned cross-correlation of the input with a kernel $K$ plus a bias, passed through a non-linearity. For input $I$ and kernel of size $k\times k$:

$(I * K)(i,j) = \displaystyle\sum_{m=0}^{k-1}\sum_{n=0}^{k-1} I(i+m,\;j+n)\,K(m,n) + b$

This is the same sum you scrubbed by hand in the convolution demo — the only difference is that here $K$ is a trainable parameter optimised to minimise the loss, and there are many kernels per layer producing a stack of feature maps.
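
The double sum can be written out directly. The sketch below is a plain-Python, single-channel version with no padding and stride 1, applied with a fixed Sobel kernel rather than a learned one; the function name `cross_correlate` and the toy image are assumptions made for illustration.

```python
def cross_correlate(image, kernel, bias=0.0):
    """Valid (no padding, stride 1) cross-correlation of 2-D lists."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            acc = bias
            for m in range(k):
                for n in range(k):
                    acc += image[i + m][j + n] * kernel[m][n]
            row.append(acc)
        out.append(row)
    return out

# Toy image with a vertical step edge: dark on the left, bright on the right.
image = [[0, 0, 1, 1] for _ in range(4)]
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

response = cross_correlate(image, sobel_x)
print(response)  # [[4.0, 4.0], [4.0, 4.0]]
```

Every output position straddles the edge, so the horizontal-gradient kernel responds with the same positive value everywhere. A trained layer computes the same sum, summed over input channels as well, with weights set by gradient descent.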

Output-size formula

A convolution (or pooling) layer with kernel size $k$, padding $p$, stride $s$ on an input of spatial size $W$ produces output size:

$W_{\text{out}} = \left\lfloor \dfrac{W_{\text{in}} - k + 2p}{s} \right\rfloor + 1$

With our "same" convolutions ($k=3,\,p=1,\,s=1$) the spatial size is preserved: $\lfloor(32-3+2)/1\rfloor+1 = 32$. Each $2\times2$ max-pool ($k=2,\,s=2,\,p=0$) halves it: $\lfloor(32-2)/2\rfloor+1 = 16$. So the spatial size walks $32 \to 16 \to 8 \to 4$ across the three stages while channels grow $3 \to 64 \to 128 \to 256$.
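
The same arithmetic can be checked mechanically. This is a minimal sketch of the output-size formula applied to the three stages; the helper name `out_size` is hypothetical.

```python
def out_size(w_in, k, p=0, s=1):
    """Spatial output size of a conv or pooling layer (floor division)."""
    return (w_in - k + 2 * p) // s + 1

size = 32
trace = [size]
for _ in range(3):                         # three stages
    size = out_size(size, k=3, p=1, s=1)   # first 3x3 "same" conv
    size = out_size(size, k=3, p=1, s=1)   # second 3x3 "same" conv
    size = out_size(size, k=2, p=0, s=2)   # 2x2 max-pool
    trace.append(size)

print(trace)  # [32, 16, 8, 4]
```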

Parameter count of a conv layer

A convolution from $C_{\text{in}}$ input channels to $C_{\text{out}}$ output channels with $k\times k$ kernels has

$\#\text{params} = (k \cdot k \cdot C_{\text{in}} + 1)\cdot C_{\text{out}}$

e.g. the first $3\times3$ conv $3\to64$ uses $(3\cdot3\cdot3+1)\cdot64 = 1{,}792$ parameters. Crucially this is independent of image size: the same kernel slides over every position (weight sharing), which is what makes CNNs vastly more parameter-efficient than a fully-connected net on pixels and gives them translation equivariance.
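
To put a number on the parameter-efficiency claim, the sketch below compares the first conv layer with a hypothetical fully-connected layer that maps the same $3\times32\times32$ input to the same $64\times32\times32$ output without weight sharing. The helper names are illustrative assumptions.

```python
def conv_params(c_in, c_out, k, bias=True):
    """Parameters of a k x k convolution: (k*k*C_in + 1) * C_out with bias."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

def dense_params(n_in, n_out, bias=True):
    """Parameters of a fully-connected layer on flattened inputs."""
    return (n_in + (1 if bias else 0)) * n_out

conv = conv_params(3, 64, 3)                      # 1,792, as in the text
dense = dense_params(3 * 32 * 32, 64 * 32 * 32)   # same shapes, no sharing

print(f"conv:  {conv:,}")
print(f"dense: {dense:,}")
print(f"ratio: {dense // conv:,}x")
```

The dense layer needs roughly 200 million weights for the job the convolution does with under two thousand, and its count would grow with image size while the convolution's would not.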

| Stage | Layer | Output (C×H×W) | Params |
|---|---|---|---|
| input | | 3×32×32 | 0 |
| 1 | 2× conv3 (3→64, 64→64) + BN + ReLU | 64×32×32 | 38,976 |
| 1 | maxpool 2×2 | 64×16×16 | 0 |
| 2 | 2× conv3 (64→128, 128→128) + BN + ReLU | 128×16×16 | 221,952 |
| 2 | maxpool 2×2 | 128×8×8 | 0 |
| 3 | 2× conv3 (128→256, 256→256) + BN + ReLU | 256×8×8 | 886,272 |
| 3 | maxpool 2×2 | 256×4×4 | 0 |
| head | global avg-pool → FC 256→10 | 10 | 2,570 |

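Total ≈ 1.15 M trainable parameters (1,149,770 when the rows are summed): tiny by modern standards, trainable in minutes, yet enough to comfortably beat the hand-engineered HOG baseline on this task. The per-stage counts include the conv biases and the BatchNorm scale and shift terms; the code in 4.1 sets bias=False on the convolutions, which removes 896 parameters and still leaves the total at ≈ 1.15 M.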
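
The total can be reproduced by tallying the layers. The sketch below assumes two learnable parameters per channel for each BatchNorm layer (scale and shift) and counts the network both with and without conv biases; all helper names are hypothetical.

```python
def conv_params(c_in, c_out, k=3, bias=True):
    return (k * k * c_in + (1 if bias else 0)) * c_out

def bn_params(channels):
    return 2 * channels  # learnable scale and shift per channel

def stage_params(c_in, c_out, bias=True):
    """Two 3x3 convs, each followed by a BatchNorm layer."""
    convs = conv_params(c_in, c_out, bias=bias) + conv_params(c_out, c_out, bias=bias)
    return convs + 2 * bn_params(c_out)

def total_params(bias=True):
    stages = [(3, 64), (64, 128), (128, 256)]
    head = 256 * 10 + 10  # fully-connected classifier with bias
    return sum(stage_params(a, b, bias) for a, b in stages) + head

print([stage_params(a, b) for a, b in [(3, 64), (64, 128), (128, 256)]])
print(total_params(bias=True))    # 1149770
print(total_params(bias=False))   # 1148874
```

Dropping the conv biases is common when a BatchNorm layer follows directly, because BatchNorm's own shift term makes a preceding bias redundant.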

Demo · 11 CNN feature maps Demo · 12 Architectures (VGG/ResNet)

4 · Step-by-step implementation

4.1 · The model

Each stage is a small helper. BatchNorm2d after each conv stabilises and speeds up training by normalising activations; ReLU $f(z)=\max(0,z)$ is the non-linearity. We finish with global average pooling (average each $4\times4$ map to a single number) so the classifier sees a 256-vector regardless of input size.
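
The "regardless of input size" property of global average pooling is easy to see in a few lines. This is a pure-Python sketch on nested lists with a hypothetical helper name, not the `nn.AdaptiveAvgPool2d` implementation.

```python
def global_avg_pool(feature_maps):
    """feature_maps: list of C maps (each H x W nested lists) -> list of C means."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Two channels at two different spatial sizes (toy constant maps).
small = [[[1.0] * 4 for _ in range(4)], [[2.0] * 4 for _ in range(4)]]  # 2 x 4 x 4
large = [[[1.0] * 8 for _ in range(8)], [[2.0] * 8 for _ in range(8)]]  # 2 x 8 x 8

print(global_avg_pool(small))  # [1.0, 2.0]
print(global_avg_pool(large))  # [1.0, 2.0]
```

Both inputs collapse to one number per channel, so the fully-connected layer that follows always receives a vector whose length equals the channel count.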

import torch.nn as nn
import torch.nn.functional as F

class ConvStage(nn.Module):
    # two 3x3 convs (padding=1 keeps H,W) + BN + ReLU, then 2x2 max-pool
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in,  c_out, kernel_size=3, padding=1, bias=False)
        self.bn1   = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1, bias=False)
        self.bn2   = nn.BatchNorm2d(c_out)
        self.pool  = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        return self.pool(x)

class SmallVGG(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.stage1 = ConvStage(3,   64)    # 32 -> 16
        self.stage2 = ConvStage(64,  128)   # 16 -> 8
        self.stage3 = ConvStage(128, 256)   # 8  -> 4
        self.gap    = nn.AdaptiveAvgPool2d(1)  # global average pool -> 256x1x1
        self.fc     = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.gap(x).flatten(1)        # (N,256,1,1) -> (N,256)
        return self.fc(x)                  # raw logits, shape (N,10)

Note. The network outputs raw logits — we do not apply softmax here, because nn.CrossEntropyLoss fuses log-softmax and the NLL for numerical stability.

4.2 · The loss — softmax + cross-entropy

The classifier emits a vector of logits $z\in\mathbb{R}^{10}$. Softmax turns logits into a probability distribution over classes:

$p_k = \dfrac{e^{z_k}}{\sum_{j=1}^{C} e^{z_j}}$

For a one-hot true label $y$, the cross-entropy loss is the negative log-probability the model assigns to the correct class:

$\mathcal{L} = -\sum_{k=1}^{C} y_k \log p_k = -\log p_{y^\star}, \qquad y^\star=\text{true class}$

This is large when the model is confidently wrong and near zero when it is confidently right. A pleasing fact that makes the backward pass trivial: the gradient of the combined softmax-cross-entropy with respect to the logits is just the predicted-minus-true probability,

$\dfrac{\partial \mathcal{L}}{\partial z_k} = p_k - y_k$

which is exactly what PyTorch's autograd computes for us.
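
The gradient identity can be verified numerically. The sketch below implements softmax and cross-entropy in plain Python and compares $p_k - y_k$ against a central finite difference; the logits and function names are made up for illustration.

```python
import math

def softmax(z):
    m = max(z)                                # subtract the max for stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    """dL/dz_k = p_k - y_k for a one-hot label."""
    p = softmax(z)
    return [p_k - (1.0 if k == target else 0.0) for k, p_k in enumerate(p)]

def numeric_grad(z, target, eps=1e-6):
    """Central finite difference of the loss with respect to each logit."""
    grad = []
    for k in range(len(z)):
        up, down = list(z), list(z)
        up[k] += eps
        down[k] -= eps
        grad.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grad

logits = [2.0, 0.5, -1.0]   # hypothetical 3-class logits
target = 0
analytic = analytic_grad(logits, target)
numeric = numeric_grad(logits, target)
print([round(g, 6) for g in analytic])
print([round(g, 6) for g in numeric])
```

The two lists agree to within finite-difference error, and the gradient entries sum to zero because both $p$ and $y$ sum to one.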

4.3 · The training loop

We optimise with SGD + momentum and a cosine-annealed learning rate; Adam is a drop-in alternative (commented). Each step: forward pass → loss → backward() (backprop, i.e. the chain rule from Module 6) → optimiser step.

device = "cuda" if torch.cuda.is_available() else "cpu"
model  = SmallVGG().to(device)

criterion = nn.CrossEntropyLoss()            # fuses log-softmax + NLL
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4, nesterov=True)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # alternative
EPOCHS = 40
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

def train_one_epoch(loader):
    model.train()
    running, correct, total = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(images)               # forward
        loss   = criterion(logits, labels)   # softmax + cross-entropy
        loss.backward()                      # backprop (chain rule)
        optimizer.step()                     # SGD update
        running += loss.item() * images.size(0)
        correct += (logits.argmax(1) == labels).sum().item()
        total   += images.size(0)
    return running / total, correct / total

@torch.no_grad()
def evaluate(loader):
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(1)
        correct += (preds == labels).sum().item()
        total   += images.size(0)
    return correct / total

for epoch in range(EPOCHS):
    tr_loss, tr_acc = train_one_epoch(train_loader)
    te_acc = evaluate(test_loader)
    scheduler.step()
    print(f"epoch {epoch+1:2d}  loss {tr_loss:.3f}  "
          f"train_acc {tr_acc:.3f}  test_acc {te_acc:.3f}")

Why momentum + weight decay + cosine schedule. Momentum damps oscillation across steep ravines in the loss surface; weight decay ($L_2$) curbs overfitting; cosine annealing starts with large, exploratory steps and decays smoothly to a fine polish — together they reliably push a small VGG past 90% on CIFAR-10.
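
For intuition about the schedule, the sketch below evaluates the standard closed-form cosine annealing curve with a minimum learning rate of zero, using the base rate and epoch count from the training code. It is a stand-alone illustration with a hypothetical function name, and it assumes the scheduler is stepped once per epoch as in the loop above.

```python
import math

def cosine_lr(epoch, base_lr=0.1, t_max=40, eta_min=0.0):
    """Closed-form cosine annealing from base_lr down to eta_min over t_max epochs."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 10, 20, 30, 40):
    print(f"epoch {epoch:2d}  lr {cosine_lr(epoch):.4f}")
```

The rate starts at 0.1, passes through half of that at the midpoint, and flattens out near zero at the end, which is the "large exploratory steps, then a fine polish" behaviour described above.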

4.4 · Evaluation — confusion matrix & per-class metrics

import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

@torch.no_grad()
def collect_preds(loader):
    model.eval()
    y_true, y_pred = [], []
    for images, labels in loader:
        logits = model(images.to(device))
        y_pred.append(logits.argmax(1).cpu().numpy())
        y_true.append(labels.numpy())
    return np.concatenate(y_true), np.concatenate(y_pred)

y_true, y_pred = collect_preds(test_loader)
cm = confusion_matrix(y_true, y_pred)                 # 10x10 counts
print(classification_report(y_true, y_pred, target_names=CLASSES, digits=3))

Accuracy is the headline number, but the confusion matrix and per-class precision/recall/F1 are what tell you where the model fails — essential for an honest analysis.

5 · Results

The numbers below are representative of a 40-epoch run of this exact configuration on a single GPU (a few minutes of training). Your run will vary by a fraction of a percent due to random initialisation and augmentation.

Training & test curves

Training loss falls steeply for the first ~10 epochs then flattens; test accuracy climbs and plateaus. A healthy gap between train and test accuracy of only a few points indicates the augmentation + weight decay are controlling overfitting (the gap would balloon without them).

[Figure: two panels against epoch (0 to 40). Top: train and test accuracy, y-axis 0.70 to 1.00, with the train curve finishing above the test curve. Bottom: training loss, y-axis 0.2 to 2.0, falling steeply and then flattening.]

Confusion matrix (test set, counts; rows = true, cols = predicted)

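| true ↓ / pred → | plane | car | bird | cat | deer | dog | frog | horse | ship | truck |
|---|---|---|---|---|---|---|---|---|---|---|
| plane | 928 | 6 | 18 | 7 | 4 | 2 | 3 | 3 | 21 | 8 |
| car | 5 | 955 | 1 | 3 | 0 | 1 | 2 | 1 | 6 | 26 |
| bird | 17 | 1 | 889 | 24 | 22 | 18 | 16 | 9 | 3 | 1 |
| cat | 6 | 2 | 27 | 812 | 21 | 92 | 22 | 11 | 4 | 3 |
| deer | 4 | 0 | 22 | 20 | 908 | 16 | 14 | 14 | 2 | 0 |
| dog | 2 | 1 | 17 | 78 | 14 | 868 | 6 | 12 | 1 | 1 |
| frog | 2 | 1 | 14 | 16 | 6 | 4 | 953 | 1 | 2 | 1 |
| horse | 3 | 0 | 9 | 13 | 15 | 17 | 1 | 939 | 0 | 3 |
| ship | 22 | 5 | 3 | 4 | 1 | 1 | 2 | 0 | 954 | 8 |
| truck | 7 | 26 | 1 | 4 | 0 | 2 | 1 | 3 | 9 | 947 |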

The bright diagonal dominates (≈91% overall). The single largest off-diagonal block is cat ↔ dog (92 cats called dogs, 78 dogs called cats) — the two most visually similar mammals at 32×32. Everything else is small.
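
Accuracy and the per-class metrics all follow from the counts. The sketch below transcribes the confusion matrix above (rows = true, columns = predicted) and derives accuracy, precision, recall and F1 in plain Python; it is an illustration of the definitions, not a replacement for `classification_report`.

```python
CLASSES = ("plane", "car", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck")

# Counts transcribed from the confusion matrix above.
CM = [
    [928,   6,  18,   7,   4,   2,   3,   3,  21,   8],
    [  5, 955,   1,   3,   0,   1,   2,   1,   6,  26],
    [ 17,   1, 889,  24,  22,  18,  16,   9,   3,   1],
    [  6,   2,  27, 812,  21,  92,  22,  11,   4,   3],
    [  4,   0,  22,  20, 908,  16,  14,  14,   2,   0],
    [  2,   1,  17,  78,  14, 868,   6,  12,   1,   1],
    [  2,   1,  14,  16,   6,   4, 953,   1,   2,   1],
    [  3,   0,   9,  13,  15,  17,   1, 939,   0,   3],
    [ 22,   5,   3,   4,   1,   1,   2,   0, 954,   8],
    [  7,  26,   1,   4,   0,   2,   1,   3,   9, 947],
]

def per_class_metrics(cm):
    """Return a (precision, recall, f1) tuple for each class."""
    n = len(cm)
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[r][k] for r in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k
        recall = tp / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        metrics.append((precision, recall, f1))
    return metrics

accuracy = sum(CM[k][k] for k in range(len(CM))) / sum(map(sum, CM))
metrics = per_class_metrics(CM)
macro_f1 = sum(m[2] for m in metrics) / len(metrics)

print(f"accuracy {accuracy:.4f}  macro-F1 {macro_f1:.3f}")
for name, (p, r, f) in zip(CLASSES, metrics):
    print(f"{name:6s} precision {p:.3f}  recall {r:.3f}  f1 {f:.3f}")
```

Cat has the lowest recall (812 of 1,000) and dog the most false positives, which is the cat ↔ dog confusion expressed as per-class numbers.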

Per-class metrics & baseline comparison

| Model | Features | Top-1 accuracy | Macro-F1 | Params | Train time |
|---|---|---|---|---|---|
| HOG + linear SVM | Hand-engineered gradients | ~58% | 0.57 | | ~2 min (CPU) |
| MLP on raw pixels | None (flattened 3072-vec) | ~52% | 0.50 | ~1.6 M | ~3 min |
| SmallVGG (this project) | Learned convolutions | ~91% | 0.91 | ~1.15 M | ~5 min (GPU) |
| ResNet-18 (extension) | Learned + residual | ~95% | 0.95 | ~11 M | ~15 min (GPU) |

The headline result: the CNN's learned features beat hand-engineered HOG by ~33 percentage points with fewer parameters than even the raw-pixel MLP, because weight sharing makes them so efficient. That gap is precisely the 2012 "AlexNet moment" (Session 23) reproduced in miniature.

The classical baseline (HOG + SVM)

HOG computes a histogram of gradient orientations in each cell of the image (a hand-designed edge/texture descriptor straight out of Module 5); a linear SVM then finds the max-margin hyperplane separating the classes in that fixed feature space. It is a genuinely strong classical pipeline, and the fact that a small CNN crushes it is the whole argument for learned features.

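import numpy as np
from skimage.feature import hog
from skimage.color  import rgb2gray
from sklearn.svm    import LinearSVC

# Raw uint8 arrays from the torchvision datasets (no augmentation, no normalisation)
train_images, train_labels = train_set.data, np.array(train_set.targets)
test_images,  test_labels  = test_set.data,  np.array(test_set.targets)

def hog_features(images):                 # images: (N,32,32,3) uint8
    feats = []
    for img in images:
        g = rgb2gray(img)                 # HOG works on a single grey channel
        f = hog(g, orientations=9, pixels_per_cell=(8, 8),
                cells_per_block=(2, 2), block_norm="L2-Hys")
        feats.append(f)
    return np.array(feats)

X_train = hog_features(train_images)         # classical feature extraction
X_test  = hog_features(test_images)
svm = LinearSVC(C=1.0, max_iter=5000).fit(X_train, train_labels)
print("HOG+SVM test acc:", svm.score(X_test, test_labels))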
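
To show what the descriptor is made of, the sketch below builds the orientation histogram for a single cell and works out the descriptor length for the settings used above. It is deliberately simplified (hard binning, no block normalisation, hypothetical function names) and is not the scikit-image implementation.

```python
import math

def cell_histogram(cell, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 deg)."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * bins
    width = 180.0 / bins
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = cell[r][c + 1] - cell[r][c - 1]   # centred horizontal difference
            gy = cell[r + 1][c] - cell[r - 1][c]   # centred vertical difference
            mag = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // width) % bins] += mag
    return hist

def hog_length(image=32, cell=8, block=2, bins=9):
    """Descriptor length: overlapping blocks x cells per block x bins."""
    cells = image // cell          # 4 cells per side
    blocks = cells - block + 1     # 3 block positions per side
    return blocks * blocks * block * block * bins

ramp = [[float(c) for c in range(8)] for _ in range(8)]  # brightness rises left to right
hist = cell_histogram(ramp)
print(hist)
print(hog_length())  # 324
```

A horizontal brightness ramp puts all of its gradient energy into the first orientation bin, and a 32×32 image yields a 324-dimensional vector for the SVM, which is far smaller than the 3,072 raw pixel values.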

Demo · 8 Feature detectors (gradients) Demo · 10 MLP playground

6 · Interpretation — what did the network learn?

Visualising the first-layer filters

The 64 kernels of the very first conv layer are $3\times3\times3$ — small RGB images we can render directly. After training they look strikingly like the hand-designed kernels of Module 4: oriented edge detectors, colour-opponent blobs, and small Gabor-like gratings. The network rediscovered Sobel/Gabor filters from data — strong evidence that these are the right low-level primitives for vision.

import matplotlib.pyplot as plt

w = model.stage1.conv1.weight.data.cpu()      # (64, 3, 3, 3)
w = (w - w.min()) / (w.max() - w.min())        # normalise to [0,1] for display
fig, axes = plt.subplots(8, 8, figsize=(6, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(w[i].permute(1, 2, 0))   # CHW -> HWC
    ax.axis("off")
plt.suptitle("Learned conv1 filters — note the edge/colour detectors")
plt.show()

Run the interactive CNN demo to watch feature maps form layer-by-layer on your own image.

Inspecting misclassifications

Sorting test errors by the model's confidence surfaces the instructive cases. The most confident mistakes are almost all cat↔dog and bird↔deer/frog — genuinely ambiguous at 32×32, where even a human hesitates. This matches the confusion matrix and tells us the model's errors are reasonable, not random: it has learned a sensible similarity structure.

images, labels = next(iter(test_loader))      # one test batch, kept on the CPU
model.eval()
with torch.no_grad():
    probs = torch.softmax(model(images.to(device)), dim=1).cpu()
conf, pred = probs.max(1)
wrong = pred != labels
# most-confident errors: zero the confidence of correct predictions, then sort
idx = (conf * wrong.float()).argsort(descending=True)[:16]
# -> plot images[idx] with true vs predicted labels to eyeball failure modes

Key idea: a model is only as trustworthy as its errors are understandable. Visualising filters and misclassifications turns an opaque accuracy number into a diagnosis you can act on (more data for confused classes, higher resolution, a deeper net).

7 · Mapping to learning outcomes

This single project exercises every official learning objective of the course.

Where this sits in the syllabus. This worked example is the natural form of the individual project (15%) and a template for the methods you will defend in intermediate tests (30%) and the final exam's CNN problems (e.g. "compute the output size of this conv layer", "write the cross-entropy gradient"). See the full breakdown in the assessment section.

8 · Extensions

Transfer learning

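Swap SmallVGG for torchvision.models.resnet18(weights="IMAGENET1K_V1"), replace its final FC layer with a 10-way one, freeze the backbone, and retrain only that layer. ImageNet-pretrained features transfer remarkably well and can reach ≈95% on CIFAR-10 with a fraction of the training, although in practice that figure usually also needs the 32×32 images upsampled towards the backbone's ImageNet resolution and at least part of the backbone fine-tuned. This is the practical lesson of Session 23's advanced architectures.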

Deeper / residual architectures

Add residual connections ($y = \mathcal{F}(x) + x$) so gradients flow through very deep stacks without vanishing — the ResNet idea from Session 23. Compare 18- vs 34-layer depth against the accuracy/parameter trade-off in the results table.
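
A toy scalar calculation shows why the identity path helps. By the chain rule, a plain layer $y=\mathcal{F}(x)$ contributes a factor $\mathcal{F}'(x)$ to the gradient, while a residual layer $y=\mathcal{F}(x)+x$ contributes $\mathcal{F}'(x)+1$. The sketch below multiplies those factors through a stack of 20 layers, assuming for illustration that every $\mathcal{F}$ has the same small slope of 0.01.

```python
def chain_gradient(local_grads, residual=False):
    """d(output)/d(input) through a stack of scalar layers.

    Plain layer:    y = F(x)      ->  dy/dx = F'(x)
    Residual layer: y = F(x) + x  ->  dy/dx = F'(x) + 1
    """
    grad = 1.0
    for g in local_grads:
        grad *= (g + 1.0) if residual else g
    return grad

local = [0.01] * 20                       # hypothetical slope of each F
plain = chain_gradient(local)
res = chain_gradient(local, residual=True)

print(f"plain stack:    {plain:.3e}")
print(f"residual stack: {res:.3f}")
```

The plain product collapses towards zero while the residual product stays of order one. Real networks have matrix-valued Jacobians rather than scalars, so this is only an intuition for the identity term, not a proof.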

Object detection

Classification answers "what"; detection adds "where". Reuse this CNN as the backbone of a detector and train YOLO-style box + class heads on top — the leap of Sessions 25–26. Demo · 14 Detection

Semantic segmentation

Replace the classifier head with an encoder–decoder (U-Net, Session 23) to predict a class per pixel rather than per image — dense prediction from the same convolutional features. Demo · 13 Segmentation

Regularisation & tuning

Add dropout, label smoothing, mixup/cutmix augmentation, and a learning-rate finder; track everything to push past 93% without changing the architecture. A good intermediate-test exercise in honest experimental method.

9 · References

Course texts (from the official syllabus) plus the primary sources behind the methods used here.