From-Scratch Build · Computer Vision

Chihuahua or Muffin? 🐶🧁

A small convolutional neural network for a classic fine-grained vision problem: telling apart two near-identical round-brown-blob classes. Trained end to end on CPU with PyTorch, it reaches 99.4% accuracy on a held-out test split, against a 50% chance baseline.

Python · PyTorch · CNN · Image classification · numpy + OpenCV · pytest

Honest note about the data

This is a synthetic dataset, not the meme photos

The famous "chihuahua vs muffin" grid is a set of real internet photos, and that set is hard to obtain cleanly, so this project does not use it. Instead it ships a procedurally rendered dataset, drawn with numpy and OpenCV, that recreates the difficulty the meme is famous for: both classes are a round tan/brown blob carrying small dark dots, so colour alone gives nothing away and the network must learn structure.

What each image actually is. A 64×64 render. A chihuahua is a brown head-blob with two triangular ears, two symmetric eyes and a nose. A muffin is a brown domed top with a fluted rim and scattered blueberry specks: no ears, no symmetric face. Position, scale, rotation, colour, dot count, lighting and noise are randomised per image, and the two classes have near-identical mean brightness (≈192.7 vs ≈193.4), so the only reliable signal is shape and layout.
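To make the idea of "same blob, different layout" concrete, here is a heavily simplified renderer. It is a sketch, not the project's data/generate.py: it uses numpy only, omits the ears, fluted rim, rotation and lighting, and every name (`disc_mask`, `render`) and numeric range in it is an assumption chosen for illustration.

```python
import numpy as np

SIZE = 64  # image side length, as stated in the text


def disc_mask(cx, cy, r, size=SIZE):
    """Boolean mask of the pixels inside a circle centred at (cx, cy)."""
    yy, xx = np.mgrid[0:size, 0:size]
    return (xx - cx) ** 2 + (yy - cy) ** 2 <= r ** 2


def render(label, seed):
    """Render one simplified 64x64 RGB sample; label is 'chihuahua' or 'muffin'."""
    rng = np.random.default_rng(seed)
    img = np.full((SIZE, SIZE, 3), 235, dtype=np.uint8)  # light background

    # Shared by both classes: a tan blob with random position, scale and colour.
    cx, cy = rng.integers(26, 39, size=2)
    r = int(rng.integers(16, 22))
    tan = np.array([170, 120, 70]) + rng.integers(-15, 16, size=3)
    img[disc_mask(cx, cy, r)] = tan

    dark = (40, 25, 20)
    if label == "chihuahua":
        # Two eyes mirrored about the blob's vertical axis, plus a nose below.
        dx, dy = r // 2, r // 4
        dots = [(cx - dx, cy - dy), (cx + dx, cy - dy), (cx, cy + dy)]
    elif label == "muffin":
        # Specks scattered with no symmetry; the count is randomised.
        n = int(rng.integers(3, 7))
        offsets = rng.integers(-(r // 2), r // 2 + 1, size=(n, 2))
        dots = [(cx + ox, cy + oy) for ox, oy in offsets]
    else:
        raise ValueError(f"unknown label: {label!r}")

    for x, y in dots:
        img[disc_mask(x, y, 2)] = dark

    # Mild pixel noise, clipped back into the valid byte range.
    noisy = img.astype(np.int16) + rng.integers(-8, 9, size=img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


chihuahua = render("chihuahua", seed=0)
muffin = render("muffin", seed=1)
print(chihuahua.shape, chihuahua.dtype)
```

Because both classes share the blob colour and the dot colour, a classifier cannot separate them by colour statistics; only the arrangement of the dots differs, which is the property the real dataset is built around.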

Real predictions

The trained model, on unseen samples

Each tile is a freshly rendered image (seeds the model never trained on), with the model's actual call and confidence. Generated by predict.py from the trained checkpoint.

[Image tiles: four rendered samples with the model's call and confidence]
🐶 chihuahua sample: predicted chihuahua, 98% conf
🐶 chihuahua sample: predicted chihuahua, 100% conf
🧁 muffin sample: predicted muffin, 100% conf
🧁 muffin sample: predicted muffin, 100% conf
[Figure: grid of test-set predictions, green titles for correct]
A larger grid sampled from the held-out test split (green = correct, red = wrong). Produced by train.py.
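The confidence shown on each tile is most naturally read as the softmax probability of the winning class. The sketch below shows that conversion from two logits to a label and a percentage. It is an assumption about how the figure is produced, not code taken from predict.py, and the class index order in `CLASSES` is also assumed.

```python
import math

CLASSES = ("chihuahua", "muffin")  # assumed index order, not confirmed by the source


def softmax(logits):
    """Numerically stable softmax over a sequence of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def call_and_confidence(logits):
    """Return (label, confidence); confidence is the top class probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


label, conf = call_and_confidence([2.0, -2.0])
print(f"{label} {conf:.0%} conf")  # chihuahua 98% conf
```

With two classes, the confidence depends only on the gap between the logits: a gap of 4 gives about 98%, and gaps above roughly 6 already round to 100%.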

By the numbers

One real training run

Held-out test accuracy: 99.4% (179 / 180)
Chance baseline (2 balanced classes): 50%
Trainable parameters: 25.8k
Training time (15 epochs, CPU): ~39s
Rendered images: 1,200 (600 / class)
Passing pytest tests: 15
[Figure: training curve, loss falling and validation accuracy rising from chance]
Training loss (orange) and validation accuracy (blue). The model sits at chance for the first ~2 epochs, then climbs once it discovers the structural cues: honest, slightly wobbly, real.
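The headline figures are mutually consistent, which a few lines of arithmetic confirm. The check below uses only numbers stated on this page (1,200 images, a 70/15/15 split, 179 correct test predictions); the variable names are illustrative.

```python
TOTAL_IMAGES = 1200            # 600 per class
SPLIT_PERCENT = (70, 15, 15)   # train / val / test

train_n, val_n, test_n = (TOTAL_IMAGES * p // 100 for p in SPLIT_PERCENT)
print(train_n, val_n, test_n)          # 840 180 180

correct = 179
accuracy = correct / test_n
print(f"{accuracy:.1%}")               # 99.4%
print(test_n - correct, "test image misclassified")
```

So the 180-image test set is exactly the 15% share of the dataset, and 99.4% corresponds to a single error.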

Architecture

TinyCNN: three conv blocks and a small head

A compact network: three conv → batch-norm → ReLU → max-pool blocks shrink the 64×64 image to 8×8 while growing channels 3→16→32→64, then global average pooling feeds a small dense head to two logits. Trained with Adam, cross-entropy loss and light augmentation (horizontal flip, brightness jitter, noise). Best-on-validation weights are restored before the final test evaluation.

# input 3 x 64 x 64
conv3x3(3  -> 16) + BN + ReLU + maxpool2   # -> 16 x 32 x 32
conv3x3(16 -> 32) + BN + ReLU + maxpool2   # -> 32 x 16 x 16
conv3x3(32 -> 64) + BN + ReLU + maxpool2   # -> 64 x 8 x 8
global average pool                        # -> 64
dropout -> linear(64->32) -> ReLU -> linear(32->2)  # -> 2 logits
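The layer listing above translates fairly directly into PyTorch. The following is one possible reading of it, not the project's actual model file: `padding=1`, `bias=False` on the convolutions, the dropout rate `p_drop` and the helper name `conv_block` are all assumptions that the listing does not specify.

```python
import torch
from torch import nn


def conv_block(c_in, c_out):
    """conv3x3 -> batch-norm -> ReLU -> 2x2 max-pool (halves height and width)."""
    return nn.Sequential(
        # bias=False is an assumption: batch-norm supplies its own shift term.
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )


class TinyCNN(nn.Module):
    def __init__(self, num_classes=2, p_drop=0.2):  # p_drop is a guessed value
        super().__init__()
        self.features = nn.Sequential(
            conv_block(3, 16),    # 3 x 64 x 64  -> 16 x 32 x 32
            conv_block(16, 32),   # -> 32 x 16 x 16
            conv_block(32, 64),   # -> 64 x 8 x 8
        )
        self.pool = nn.AdaptiveAvgPool2d(1)  # global average pool -> 64 x 1 x 1
        self.head = nn.Sequential(
            nn.Dropout(p_drop),
            nn.Linear(64, 32),
            nn.ReLU(inplace=True),
            nn.Linear(32, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = self.pool(x).flatten(1)  # (N, 64, 1, 1) -> (N, 64)
        return self.head(x)


model = TinyCNN().eval()
with torch.no_grad():
    logits = model(torch.zeros(4, 3, 64, 64))
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(tuple(logits.shape), n_params)  # (4, 2) 25842
```

Under these assumptions the model has 25,842 trainable parameters, which agrees with the 25.8k reported above. Giving the convolutions bias terms would add 16 + 32 + 64 = 112 parameters, for 25,954, so the bias-free reading appears to be the closer match.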

Pipeline

From rendered folder to confident call

  1. Generate

    data/generate.py renders the two classes into train/val/test folders (70/15/15, class-balanced).

  2. Augment

    Train-split images get random flips, brightness jitter and noise so the model learns the class, not the pixels.

  3. Train

    train.py fits TinyCNN with Adam + cross-entropy, selecting the best epoch on the validation split.

  4. Evaluate

    The untouched test split gives the real accuracy and a confusion matrix, written to results.json.

  5. Predict

    predict.py classifies any single image and returns a label plus class probabilities.
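Steps 2 and 4 are simple enough to sketch in numpy. The functions below are hypothetical stand-ins, not the code in train.py: the jitter range, the noise level and the names `augment` and `confusion_matrix` are assumptions, and the labels at the bottom are toy values rather than the project's results.

```python
import numpy as np


def augment(img, rng):
    """Randomly flip, brightness-jitter and add noise to one HxWx3 uint8 image."""
    out = img.astype(np.float32)
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                      # horizontal flip
    out = out * rng.uniform(0.85, 1.15)            # brightness jitter (assumed range)
    out = out + rng.normal(0.0, 4.0, out.shape)    # Gaussian noise (assumed sigma)
    return np.clip(out, 0, 255).astype(np.uint8)


def confusion_matrix(y_true, y_pred, num_classes=2):
    """Count predictions; rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm


rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
aug = augment(img, rng)

# Toy labels only, not the project's test results: 0 = chihuahua, 1 = muffin.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1]
cm = confusion_matrix(y_true, y_pred)
accuracy = np.trace(cm) / cm.sum()
print(cm.tolist(), round(float(accuracy), 3))  # [[2, 1], [0, 3]] 0.833
```

Augmentation is applied to the train split only, so the validation and test images stay exactly as rendered and the reported accuracy reflects untouched data.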