An end-to-end, reproducible CNN built in PyTorch on CIFAR-10 — from data loading and augmentation through training, evaluation, interpretation, and a classical HOG + SVM baseline. The complete arc of Module 6 (Sessions 21–23), worked in full.
This is a study companion, not an assessed deliverable. It shows how the pieces of an image-classification system fit together with real, runnable code and the underlying mathematics. Read it alongside the course outline and experiment with the matching interactive demos.
Goal. Train a convolutional neural network that maps a small RGB image to one of ten object categories (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) and learn the full engineering loop around it: a reproducible data pipeline, a principled architecture, a correct training loop, honest evaluation, and an interpretation of what the network actually learned. To anchor "is deep learning worth it here?", we compare against a strong classical baseline — Histogram of Oriented Gradients features fed to a linear Support Vector Machine — built from the feature-engineering ideas of Module 5.
Why CIFAR-10. It is the canonical first real classification benchmark: large enough that hand-tuned features struggle, small enough (32×32) to train a competitive CNN on a single GPU in minutes. The exact same code runs on MNIST by changing the dataset and the input-channel count — the pipeline is dataset-agnostic.
Sessions exercised. This project is the practical payoff of Module 6 (Sessions 21–23).
It also reaches back to the convolution and edge/feature material of Sessions 16–19 (the learned kernels are the hand-designed kernels of Module 4, now trained from data) and forward to detection/segmentation in Sessions 25–26.
CIFAR-10 is 60,000 colour images at $32\times32\times3$, 6,000 per class, split into 50,000 training and 10,000 test images. Each pixel is a uint8 in $[0,255]$; before feeding the network we convert to a float tensor in $[0,1]$ and then standardise per channel using the dataset's mean $\mu_c$ and standard deviation $\sigma_c$:

$$\hat{x}_c = \frac{x_c - \mu_c}{\sigma_c}, \qquad c \in \{R, G, B\}$$
Standardisation centres each channel near zero with unit variance, which keeps activations and gradients well-scaled and lets the optimiser use a single learning rate across channels.
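As a quick check of the per-channel standardisation described above, the sketch below processes a single RGB pixel by hand, using the channel statistics quoted in the pipeline code. The helper name `standardise_pixel` is illustrative only and is not part of torchvision.

```python
# Per-channel CIFAR-10 statistics, as quoted in this write-up's pipeline code.
MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2470, 0.2435, 0.2616)

def standardise_pixel(rgb_uint8):
    """Map one (R, G, B) uint8 pixel to standardised floats (hypothetical helper)."""
    out = []
    for value, mu, sigma in zip(rgb_uint8, MEAN, STD):
        x = value / 255.0              # uint8 [0,255] -> float [0,1]
        out.append((x - mu) / sigma)   # centre on the channel mean, scale by its std
    return tuple(out)

black = standardise_pixel((0, 0, 0))
white = standardise_pixel((255, 255, 255))
print(black)   # all three channels negative (below the dataset mean)
print(white)   # all three channels positive (above the dataset mean)
```

A pixel exactly at the channel mean maps to zero, so after standardisation the sign of a value says whether it is darker or brighter than the dataset average for that channel.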
With only 5,000 images per class, the network will memorise the training set unless we artificially enlarge it. Augmentation applies label-preserving random transforms on the fly so the model never sees exactly the same image twice — a regulariser that encodes our prior that class identity is invariant to small shifts and horizontal mirroring. We apply random crops (with 4-px reflection padding) and random horizontal flips to the training set only; the test set gets just the deterministic normalisation, so evaluation is fair.
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

MEAN = (0.4914, 0.4822, 0.4465)
STD = (0.2470, 0.2435, 0.2616)

# Training transforms: random crop + flip (augmentation) THEN normalise
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4, padding_mode="reflect"),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),            # uint8 [0,255] -> float [0,1], HWC -> CHW
    transforms.Normalize(MEAN, STD),
])

# Test transforms: deterministic only (no augmentation, no leakage)
test_tf = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD),
])

train_set = datasets.CIFAR10("./data", train=True, download=True, transform=train_tf)
test_set = datasets.CIFAR10("./data", train=False, download=True, transform=test_tf)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True,
                          num_workers=4, pin_memory=True)
test_loader = DataLoader(test_set, batch_size=256, shuffle=False,
                         num_workers=4, pin_memory=True)

CLASSES = ("plane", "car", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck")
Pipeline. torchvision downloads the dataset, the DataLoader batches and shuffles it, and transforms run lazily per sample on the worker processes. (Related demo: image input.)
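To see what the two training-time augmentations do to the pixels, here is a NumPy-only sketch of a reflect-padded random crop followed by a random horizontal flip. It mirrors the idea behind `RandomCrop(32, padding=4, padding_mode="reflect")` and `RandomHorizontalFlip()` but is not torchvision's implementation; the function name `augment` and the seeded generator are assumptions made for illustration.

```python
import numpy as np

def augment(img, rng, pad=4, p_flip=0.5):
    """Reflect-pad, take a random crop of the original size, then maybe mirror.

    img: (H, W, C) array. Hypothetical helper, not torchvision's code.
    """
    h, w, _ = img.shape
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)     # vertical crop offset in [0, 2*pad]
    left = rng.integers(0, 2 * pad + 1)    # horizontal crop offset in [0, 2*pad]
    crop = padded[top:top + h, left:left + w]
    if rng.random() < p_flip:
        crop = crop[:, ::-1]               # horizontal mirror
    return crop

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
out = augment(img, rng)
print(out.shape, out.dtype)
```

The output keeps the input's shape and dtype, so the transform can be applied freshly on every epoch without changing anything downstream; only the content shifts by up to four pixels and may be mirrored.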
The model is a small VGG-style network: three convolutional stages followed by global average pooling and a fully-connected classifier. Each stage is two $3\times3$ convolutions, each followed by batch-norm and ReLU, then a $2\times2$ max-pool that halves the spatial size; the channel count doubles from one stage to the next ($64 \to 128 \to 256$). This is exactly the conv → pool → FC pattern of Session 21, stacked.
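A single output feature map is a learned cross-correlation of the input with a kernel $K$ plus a bias, passed through a non-linearity. For input $I$ and kernel of size $k\times k$:

$$S(i,j) = f\!\left(\sum_{m=0}^{k-1}\sum_{n=0}^{k-1} I(i+m,\,j+n)\,K(m,n) + b\right)$$

where $b$ is the bias and $f$ the non-linearity; for a multi-channel input the sum also runs over the input channels.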
This is the same sum you scrubbed by hand in the convolution demo — the only difference is that here $K$ is a trainable parameter optimised to minimise the loss, and there are many kernels per layer producing a stack of feature maps.
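The sum can be written out in a few lines of plain Python. The sketch below is a single-channel, stride-1, no-padding version with a ReLU on top, applied to a toy image containing a vertical edge. The function name `cross_correlate` and the choice of a Sobel-x kernel are illustrative assumptions; in the network the kernel values would be learned rather than fixed.

```python
def cross_correlate(image, kernel, bias=0.0):
    """Valid (no padding, stride 1) cross-correlation followed by ReLU."""
    k = len(kernel)
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            s = bias
            for m in range(k):
                for n in range(k):
                    s += image[i + m][j + n] * kernel[m][n]
            row.append(max(0.0, s))        # ReLU non-linearity
        out.append(row)
    return out

# Toy 4x5 image: dark on the left, bright on the right (a vertical edge).
image = [[0, 0, 1, 1, 1]] * 4
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]
feature_map = cross_correlate(image, sobel_x)
print(feature_map)   # -> [[4.0, 4.0, 0.0], [4.0, 4.0, 0.0]]
```

The response is large where the window straddles the edge and zero over the flat bright region, which is the behaviour a trained first-layer edge kernel would be expected to show.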
A convolution (or pooling) layer with kernel size $k$, padding $p$, stride $s$ on an input of spatial size $W$ produces output size:

$$W_{\text{out}} = \left\lfloor \frac{W - k + 2p}{s} \right\rfloor + 1$$
With our "same" convolutions ($k=3,\,p=1,\,s=1$) the spatial size is preserved: $\lfloor(32-3+2)/1\rfloor+1 = 32$. Each $2\times2$ max-pool ($k=2,\,s=2,\,p=0$) halves it: $\lfloor(32-2)/2\rfloor+1 = 16$. So the spatial size walks $32 \to 16 \to 8 \to 4$ across the three stages while channels grow $3 \to 64 \to 128 \to 256$.
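The size walk can be checked mechanically. The sketch below assumes the standard output-size rule, floor((W - k + 2p) / s) + 1, which reproduces both worked values above; `conv_out_size` is a hypothetical helper name.

```python
def conv_out_size(w, k, p=0, s=1):
    """Spatial output size of a conv or pooling layer: floor((W - k + 2p) / s) + 1."""
    return (w - k + 2 * p) // s + 1

size = 32
trace = [size]
for _ in range(3):                            # three stages
    size = conv_out_size(size, k=3, p=1)      # first 3x3 "same" conv
    size = conv_out_size(size, k=3, p=1)      # second 3x3 "same" conv
    size = conv_out_size(size, k=2, s=2)      # 2x2 max-pool, stride 2
    trace.append(size)
print(trace)   # -> [32, 16, 8, 4]
```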
A convolution from $C_{\text{in}}$ input channels to $C_{\text{out}}$ output channels with $k\times k$ kernels has

$$(k \cdot k \cdot C_{\text{in}} + 1)\cdot C_{\text{out}}$$

parameters ($k \cdot k \cdot C_{\text{in}}$ weights plus one bias per output channel); e.g. the first $3\times3$ conv $3\to64$ uses $(3\cdot3\cdot3+1)\cdot64 = 1{,}792$ parameters. Crucially this is independent of image size: the same kernel slides over every position (weight sharing), which is what makes CNNs vastly more parameter-efficient than a fully-connected net on pixels and gives them translation equivariance.
| Stage | Layer | Output (C×H×W) | Params |
|---|---|---|---|
| input | — | 3×32×32 | 0 |
| 1 | 2× conv3 (3→64, 64→64) + BN + ReLU | 64×32×32 | 38,976 |
| 1 | maxpool 2×2 | 64×16×16 | 0 |
| 2 | 2× conv3 (64→128, 128→128) + BN + ReLU | 128×16×16 | 221,952 |
| 2 | maxpool 2×2 | 128×8×8 | 0 |
| 3 | 2× conv3 (128→256, 256→256) + BN + ReLU | 256×8×8 | 886,272 |
| 3 | maxpool 2×2 | 256×4×4 | 0 |
| head | global avg-pool → FC 256→10 | 10 | 2,570 |
Total ≈ 1.15 M trainable parameters: tiny by modern standards, trainable in minutes, yet enough to beat the hand-engineered HOG + SVM baseline on this task by a wide margin.
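The table's figures can be reproduced with a little arithmetic. The sketch below assumes each stage row counts two convolutions (weights plus one bias per output channel) and two BatchNorm layers (one learnable scale and one shift per channel), which is the reading that matches the numbers shown. The helper names are hypothetical. Note that the model code in this write-up builds its convolutions with `bias=False`, so the instantiated network should have 896 fewer parameters than the table total.

```python
def conv_params(c_in, c_out, k=3, bias=True):
    """Weights (k*k*c_in per output channel) plus an optional bias per output channel."""
    return (k * k * c_in + (1 if bias else 0)) * c_out

def bn_params(c):
    return 2 * c                       # learnable scale and shift per channel

def stage_params(c_in, c_out, bias=True):
    return (conv_params(c_in, c_out, bias=bias) + bn_params(c_out)
            + conv_params(c_out, c_out, bias=bias) + bn_params(c_out))

def total_params(bias=True):
    stages = [(3, 64), (64, 128), (128, 256)]
    head = 256 * 10 + 10               # FC 256 -> 10 with bias
    return sum(stage_params(a, b, bias) for a, b in stages) + head

print(stage_params(3, 64))             # -> 38976
print(total_params(bias=True))         # -> 1149770
print(total_params(bias=False))        # -> 1148874
```

Either way the total rounds to 1.15 M; dropping the conv biases is a common choice when BatchNorm follows, since its shift term plays the same role.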
Related demos: 11 CNN feature maps; 12 Architectures (VGG/ResNet).

Each stage is a small helper. BatchNorm2d after each conv stabilises and speeds up training by normalising activations; ReLU $f(z)=\max(0,z)$ is the non-linearity. We finish with global average pooling (average each $4\times4$ map to a single number) so the classifier sees a 256-vector regardless of input size.
import torch.nn as nn
import torch.nn.functional as F

class ConvStage(nn.Module):
    # two 3x3 convs (padding=1 keeps H,W) + BN + ReLU, then 2x2 max-pool
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(c_out)
        self.conv2 = nn.Conv2d(c_out, c_out, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(c_out)
        self.pool = nn.MaxPool2d(2, 2)

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        return self.pool(x)

class SmallVGG(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.stage1 = ConvStage(3, 64)        # 32 -> 16
        self.stage2 = ConvStage(64, 128)      # 16 -> 8
        self.stage3 = ConvStage(128, 256)     # 8 -> 4
        self.gap = nn.AdaptiveAvgPool2d(1)    # global average pool -> 256x1x1
        self.fc = nn.Linear(256, num_classes)

    def forward(self, x):
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.gap(x).flatten(1)            # (N,256,1,1) -> (N,256)
        return self.fc(x)                     # raw logits, shape (N,10)
Note. The network outputs raw logits — we do not apply softmax here, because nn.CrossEntropyLoss fuses log-softmax and the NLL for numerical stability.
The classifier emits a vector of logits $z\in\mathbb{R}^{10}$. Softmax turns logits into a probability distribution over classes:

$$p_i = \frac{e^{z_i}}{\sum_{j=1}^{10} e^{z_j}}, \qquad i = 1,\dots,10$$
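For a one-hot true label $y$, the cross-entropy loss is the negative log-probability the model assigns to the correct class:

$$\mathcal{L} = -\sum_{i=1}^{10} y_i \log p_i = -\log p_t$$

where $p_i$ is the softmax probability of class $i$ and $t$ is the index of the true class.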
This is large when the model is confidently wrong and near zero when it is confidently right. A pleasing fact that makes the backward pass trivial: the gradient of the combined softmax-cross-entropy with respect to the logits is just the predicted-minus-true probability,

$$\frac{\partial \mathcal{L}}{\partial z_i} = p_i - y_i,$$

where $p_i$ is the softmax probability of class $i$ and $y_i$ the corresponding one-hot label entry. This is exactly what PyTorch's autograd computes for us.
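The predicted-minus-true gradient is easy to confirm numerically. The sketch below uses only the standard library: it computes softmax and cross-entropy for a toy three-class logit vector (the values are arbitrary, chosen for illustration) and compares the analytic gradient with a central finite difference.

```python
import math

def softmax(z):
    m = max(z)                                   # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    """Negative log-probability of the true class."""
    return -math.log(softmax(z)[target])

logits = [2.0, 0.5, -1.0]                        # toy 3-class example
target = 0
p = softmax(logits)
analytic = [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

eps = 1e-6
numeric = []
for i in range(len(logits)):
    up = list(logits)
    down = list(logits)
    up[i] += eps
    down[i] -= eps
    numeric.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))

print(analytic)
print(numeric)
```

The two lists agree to several decimal places. The entry for the true class is negative (raising that logit lowers the loss) and the others are positive.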
We optimise with SGD + momentum and a cosine-annealed learning rate; Adam is a drop-in alternative (commented). Each step: forward pass → loss → backward() (backprop, i.e. the chain rule from Module 6) → optimiser step.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SmallVGG().to(device)
criterion = nn.CrossEntropyLoss()                 # fuses log-softmax + NLL
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4, nesterov=True)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # alternative
EPOCHS = 40
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=EPOCHS)

def train_one_epoch(loader):
    model.train()
    running, correct, total = 0.0, 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        logits = model(images)                    # forward
        loss = criterion(logits, labels)          # softmax + cross-entropy
        loss.backward()                           # backprop (chain rule)
        optimizer.step()                          # SGD update
        running += loss.item() * images.size(0)
        correct += (logits.argmax(1) == labels).sum().item()
        total += images.size(0)
    return running / total, correct / total

@torch.no_grad()
def evaluate(loader):
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        preds = model(images).argmax(1)
        correct += (preds == labels).sum().item()
        total += images.size(0)
    return correct / total

for epoch in range(EPOCHS):
    tr_loss, tr_acc = train_one_epoch(train_loader)
    te_acc = evaluate(test_loader)
    scheduler.step()
    print(f"epoch {epoch+1:2d} loss {tr_loss:.3f} "
          f"train_acc {tr_acc:.3f} test_acc {te_acc:.3f}")
Why momentum + weight decay + cosine schedule. Momentum damps oscillation across steep ravines in the loss surface; weight decay ($L_2$) curbs overfitting; cosine annealing starts with large, exploratory steps and decays smoothly to a fine polish — together they reliably push a small VGG past 90% on CIFAR-10.
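To make the schedule concrete, the sketch below evaluates the usual closed-form cosine annealing expression with the values used in the training code (base rate 0.1, `T_max` of 40, minimum 0). It is an illustration of the curve, not PyTorch's internal code; `cosine_lr` is a hypothetical helper, and it is assumed the scheduler is stepped once per epoch as in the loop above.

```python
import math

def cosine_lr(epoch, base_lr=0.1, t_max=40, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in (0, 10, 20, 30, 40):
    print(epoch, round(cosine_lr(epoch), 4))
```

The rate stays near 0.1 early on, passes through 0.05 at the halfway point, and flattens out towards zero at the end. Because the scheduler is stepped after each epoch, the last training epoch (index 39) still runs at a small positive rate.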
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

@torch.no_grad()
def collect_preds(loader):
    model.eval()
    y_true, y_pred = [], []
    for images, labels in loader:
        logits = model(images.to(device))
        y_pred.append(logits.argmax(1).cpu().numpy())
        y_true.append(labels.numpy())
    return np.concatenate(y_true), np.concatenate(y_pred)

y_true, y_pred = collect_preds(test_loader)
cm = confusion_matrix(y_true, y_pred)             # 10x10 counts
print(classification_report(y_true, y_pred, target_names=CLASSES, digits=3))
Accuracy is the headline number, but the confusion matrix and per-class precision/recall/F1 are what tell you where the model fails — essential for an honest analysis.
The numbers below are representative of a 40-epoch run of this exact configuration on a single GPU (a few minutes of training). Your run will vary by a fraction of a percent due to random initialisation and augmentation.
Training loss falls steeply for the first ~10 epochs then flattens; test accuracy climbs and plateaus. A gap of only a few points between train and test accuracy indicates that augmentation and weight decay are controlling overfitting (the gap would balloon without them).
| true ↓ / pred → | plane | car | bird | cat | deer | dog | frog | horse | ship | truck |
|---|---|---|---|---|---|---|---|---|---|---|
| plane | 928 | 6 | 18 | 7 | 4 | 2 | 3 | 3 | 21 | 8 |
| car | 5 | 955 | 1 | 3 | 0 | 1 | 2 | 1 | 6 | 26 |
| bird | 17 | 1 | 889 | 24 | 22 | 18 | 16 | 9 | 3 | 1 |
| cat | 6 | 2 | 27 | 812 | 21 | 92 | 22 | 11 | 4 | 3 |
| deer | 4 | 0 | 22 | 20 | 908 | 16 | 14 | 14 | 2 | 0 |
| dog | 2 | 1 | 17 | 78 | 14 | 868 | 6 | 12 | 1 | 1 |
| frog | 2 | 1 | 14 | 16 | 6 | 4 | 953 | 1 | 2 | 1 |
| horse | 3 | 0 | 9 | 13 | 15 | 17 | 1 | 939 | 0 | 3 |
| ship | 22 | 5 | 3 | 4 | 1 | 1 | 2 | 0 | 954 | 8 |
| truck | 7 | 26 | 1 | 4 | 0 | 2 | 1 | 3 | 9 | 947 |
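The bright diagonal dominates (≈91.5% overall: 9,153 of 10,000 test images). The single largest off-diagonal block is cat ↔ dog (92 cats called dogs, 78 dogs called cats), the two most visually similar mammals at 32×32. No other cell exceeds 27; the next largest confusions are car ↔ truck (26 each way) and bird ↔ cat (24 and 27).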
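The headline accuracy and the per-class figures can all be derived from the matrix itself. The sketch below copies the counts from the table and computes accuracy, precision, recall and macro-F1 in plain Python; it is a hand-rolled stand-in for what `classification_report` prints, not a replacement for it.

```python
CLASSES = ("plane", "car", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck")

# Rows = true class, columns = predicted class (copied from the table above).
cm = [
    [928,   6,  18,   7,   4,   2,   3,   3,  21,   8],
    [  5, 955,   1,   3,   0,   1,   2,   1,   6,  26],
    [ 17,   1, 889,  24,  22,  18,  16,   9,   3,   1],
    [  6,   2,  27, 812,  21,  92,  22,  11,   4,   3],
    [  4,   0,  22,  20, 908,  16,  14,  14,   2,   0],
    [  2,   1,  17,  78,  14, 868,   6,  12,   1,   1],
    [  2,   1,  14,  16,   6,   4, 953,   1,   2,   1],
    [  3,   0,   9,  13,  15,  17,   1, 939,   0,   3],
    [ 22,   5,   3,   4,   1,   1,   2,   0, 954,   8],
    [  7,  26,   1,   4,   0,   2,   1,   3,   9, 947],
]

n = len(cm)
total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(n)) / total
recall = [cm[i][i] / sum(cm[i]) for i in range(n)]                         # diagonal / row sum
precision = [cm[i][i] / sum(cm[r][i] for r in range(n)) for i in range(n)]  # diagonal / column sum
f1 = [2 * p * r / (p + r) for p, r in zip(precision, recall)]
macro_f1 = sum(f1) / n                                                     # unweighted class mean

print(f"accuracy {accuracy:.4f}  macro-F1 {macro_f1:.3f}")
for name, p, r in zip(CLASSES, precision, recall):
    print(f"{name:>5}  precision {p:.3f}  recall {r:.3f}")
```

Cat has the lowest recall (812 of 1,000), which is the cat ↔ dog confusion showing up in the per-class numbers.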
| Model | Features | Top-1 accuracy | Macro-F1 | Params | Train time |
|---|---|---|---|---|---|
| HOG + linear SVM | Hand-engineered gradients | ~58% | 0.57 | — | ~2 min (CPU) |
| MLP on raw pixels | None (flattened 3072-vec) | ~52% | 0.50 | ~1.6 M | ~3 min |
| SmallVGG (this project) | Learned convolutions | ~91% | 0.91 | ~1.15 M | ~5 min (GPU) |
| ResNet-18 (extension) | Learned + residual | ~95% | 0.95 | ~11 M | ~15 min (GPU) |
The headline result: the CNN's learned features beat hand-engineered HOG by ~33 percentage points with fewer parameters than even the raw-pixel MLP, because weight sharing makes them so efficient. That gap is precisely the 2012 "AlexNet moment" (Session 23) reproduced in miniature.
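import numpy as np
from skimage.feature import hog
from skimage.color import rgb2gray
from sklearn.svm import LinearSVC

# Raw arrays from the torchvision datasets loaded in the data pipeline:
# .data is (N,32,32,3) uint8 and .targets is a list of integer labels.
train_images, train_labels = train_set.data, np.array(train_set.targets)
test_images, test_labels = test_set.data, np.array(test_set.targets)

def hog_features(images):                         # images: (N,32,32,3) uint8
    feats = []
    for img in images:
        g = rgb2gray(img)                         # HOG works on a single channel
        f = hog(g, orientations=9, pixels_per_cell=(8, 8),
                cells_per_block=(2, 2), block_norm="L2-Hys")
        feats.append(f)
    return np.array(feats)

X_train = hog_features(train_images)              # classical feature extraction
X_test = hog_features(test_images)
svm = LinearSVC(C=1.0, max_iter=5000).fit(X_train, train_labels)
print("HOG+SVM test acc:", svm.score(X_test, test_labels))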
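It helps to know how small the hand-engineered representation is. The sketch below computes the descriptor length implied by the HOG settings used above, assuming blocks slide one cell at a time (the usual HOG layout); `hog_length` is a hypothetical helper and does not call scikit-image.

```python
def hog_length(image_size=32, pixels_per_cell=8, cells_per_block=2, orientations=9):
    """Length of a HOG descriptor for a square image, with blocks sliding one cell at a time."""
    cells = image_size // pixels_per_cell             # cells per side
    blocks = cells - cells_per_block + 1              # block positions per side
    return blocks * blocks * cells_per_block ** 2 * orientations

print(hog_length())   # -> 324
```

Under these assumptions the linear SVM sees a 324-dimensional vector per image, compared with the 3,072 raw values of a 32×32 RGB image, and every one of those 324 numbers comes from a fixed recipe rather than from training.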
Related demos: 8 Feature detectors (gradients); 10 MLP playground.
The 64 kernels of the very first conv layer are $3\times3\times3$ — small RGB images we can render directly. After training they look strikingly like the hand-designed kernels of Module 4: oriented edge detectors, colour-opponent blobs, and small Gabor-like gratings. The network rediscovered Sobel/Gabor filters from data — strong evidence that these are the right low-level primitives for vision.
import matplotlib.pyplot as plt

w = model.stage1.conv1.weight.data.cpu()          # (64, 3, 3, 3)
w = (w - w.min()) / (w.max() - w.min())           # normalise to [0,1] for display
fig, axes = plt.subplots(8, 8, figsize=(6, 6))
for i, ax in enumerate(axes.flat):
    ax.imshow(w[i].permute(1, 2, 0))              # CHW -> HWC
    ax.axis("off")
plt.suptitle("Learned conv1 filters: note the edge/colour detectors")
plt.show()
Run the interactive CNN demo to watch feature maps form layer-by-layer on your own image.
Sorting test errors by the model's confidence surfaces the instructive cases. The most confident mistakes are almost all cat↔dog and bird↔deer/frog — genuinely ambiguous at 32×32, where even a human hesitates. This matches the confusion matrix and tells us the model's errors are reasonable, not random: it has learned a sensible similarity structure.
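model.eval()
images, labels = next(iter(test_loader))          # one test batch; labels stay on the CPU
with torch.no_grad():
    probs = torch.softmax(model(images.to(device)), dim=1)
conf, pred = probs.max(1)
wrong = pred.cpu() != labels
# most-confident errors: high confidence AND wrong
# (correct predictions score 0, so they only appear if the batch has fewer than 16 errors)
idx = (conf.cpu() * wrong.float()).argsort(descending=True)[:16]
# -> plot images[idx] with true vs predicted labels to eyeball failure modes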
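The ranking logic is independent of PyTorch. The sketch below shows it on plain lists with made-up confidences and labels (all values are illustrative). It filters to the misclassified samples first and then sorts, which avoids the edge case of the masking trick above, where correct predictions score zero and can pad out the top-k if there are too few errors. `most_confident_errors` is a hypothetical helper.

```python
def most_confident_errors(conf, pred, true, k=16):
    """Indices of misclassified samples, highest confidence first (hypothetical helper)."""
    wrong = [i for i in range(len(pred)) if pred[i] != true[i]]
    return sorted(wrong, key=lambda i: conf[i], reverse=True)[:k]

conf = [0.99, 0.60, 0.95, 0.80, 0.55]   # model confidence in its own prediction
pred = [3, 5, 5, 2, 0]                  # predicted class indices
true = [3, 3, 3, 4, 0]                  # ground-truth class indices
print(most_confident_errors(conf, pred, true, k=2))   # -> [2, 3]
```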
This single project exercises every official learning objective of the course.
Swap SmallVGG for a torchvision.models.resnet18(weights="IMAGENET1K_V1"), freeze the backbone, and retrain only the final FC layer. ImageNet-pretrained features transfer remarkably well and reach ≈95% on CIFAR-10 with a fraction of the training — the practical lesson of Session 23's advanced architectures.
Add residual connections ($y = \mathcal{F}(x) + x$) so gradients flow through very deep stacks without vanishing — the ResNet idea from Session 23. Compare 18- vs 34-layer depth against the accuracy/parameter trade-off in the results table.
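A toy scalar calculation shows why the skip connection helps gradients survive depth. The sketch below assumes every layer is the linear map $\mathcal{F}(x) = w\,x$ with the same small weight, which is a deliberate oversimplification of a real residual block: without the skip, the end-to-end derivative is $w^{L}$; with it, each layer contributes $w + 1$ instead. The function name is hypothetical.

```python
def end_to_end_gradient(w, depth, residual):
    """d(output)/d(input) for a chain of identical scalar layers F(x) = w * x."""
    per_layer = w + 1.0 if residual else w     # d/dx [F(x) + x] = w + 1
    return per_layer ** depth

plain = end_to_end_gradient(0.1, depth=20, residual=False)
skip = end_to_end_gradient(0.1, depth=20, residual=True)
print(plain, skip)
```

With a per-layer weight of 0.1 and 20 layers, the plain chain's gradient is vanishingly small while the residual chain's stays of order one. The same identity term can also let gradients grow, which is one reason real ResNets pair residual blocks with normalisation.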
Classification answers "what"; detection adds "where". Reuse this CNN as the backbone of a detector and train YOLO-style box + class heads on top, which is the leap of Sessions 25–26. (Related demo: 14 Detection.)
Replace the classifier head with an encoder–decoder (U-Net, Session 23) to predict a class per pixel rather than per image: dense prediction from the same convolutional features. (Related demo: 13 Segmentation.)
Add dropout, label smoothing, mixup/cutmix augmentation, and a learning-rate finder; track everything to push past 93% without changing the architecture. A good intermediate-test exercise in honest experimental method.
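Two of these regularisers come down to simple arithmetic on the targets. The sketch below shows label smoothing (mixing the one-hot target with a uniform distribution) and mixup (a convex combination of two examples and their labels) on plain lists. The function names, the smoothing value 0.1, and the fixed mixing weight are illustrative assumptions; in practice the mixup weight is usually drawn at random for each batch.

```python
def smooth_labels(target, num_classes=10, eps=0.1):
    """One-hot target mixed with a uniform distribution (label smoothing)."""
    return [(1 - eps) * (1.0 if i == target else 0.0) + eps / num_classes
            for i in range(num_classes)]

def mixup_pair(x_a, y_a, x_b, y_b, lam):
    """Convex combination of two inputs and their label vectors (mixup)."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y

y_cat = smooth_labels(3)                 # class index 3 = cat
y_dog = smooth_labels(5)                 # class index 5 = dog
x_mix, y_mix = mixup_pair([0.2, 0.8], y_cat, [0.6, 0.4], y_dog, lam=0.7)
print(y_cat)
print(x_mix, y_mix)
```

Both targets remain valid probability distributions, so they can be fed to a cross-entropy loss that accepts soft labels without further changes.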