Visual Place Recognition — Built From Scratch

What it is

Knowing a place from a single picture

Visual place recognition turns "where am I?" into a search problem. Every known location is stored not as a coordinate but as a compact numerical fingerprint of what it looks like. When a new query image arrives, you fingerprint it the same way and find its nearest neighbours in the database — the closest matches are the most likely places.

What makes it genuinely hard is that the same place rarely looks the same twice: lighting shifts, seasons change, viewpoints differ, people and cars come and go. A good descriptor has to ignore all of that and lock onto the structure that actually identifies the location — the building, the skyline, the layout.

1 : N

one query image searched against a whole database of places — recognition reframed as nearest-neighbour retrieval.

The stack

From pixels to a place ID

Two ideas do the heavy lifting: a classical descriptor, and an exact search over it. Built with OpenCV + NumPy + scikit-learn — no deep nets, no FAISS.

descriptor

Colour + HOG

An HSV colour histogram concatenated with a Histogram of Oriented Gradients — one L2-normalised vector per image capturing colour and coarse layout.

descriptor

Bag of Visual Words

ORB keypoints quantised against a KMeans vocabulary, encoded as a word-occurrence histogram — a structure-focused fingerprint.
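A hedged sketch of the quantise-and-count step. It assumes the local descriptors were already extracted upstream (for example with `cv2.ORB_create().detectAndCompute(gray, None)`, which yields an N x 32 uint8 array or `None`). Treating binary ORB descriptors as float vectors for KMeans is a common simplification and an assumption here, as are the helper names and the vocabulary size.

```python
import numpy as np
from sklearn.cluster import KMeans


def build_vocabulary(descriptor_sets, n_words=32, seed=0):
    """Fit KMeans on the local descriptors pooled from all training images."""
    pooled = np.vstack(descriptor_sets).astype(np.float32)
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(pooled)


def bovw_histogram(descriptors, vocabulary):
    """Encode one image as an L2-normalised histogram of visual-word counts."""
    n_words = vocabulary.n_clusters
    if descriptors is None or len(descriptors) == 0:
        return np.zeros(n_words, dtype=np.float32)  # image with no keypoints
    # Assign every local descriptor to its nearest cluster centre (its "word").
    words = vocabulary.predict(np.asarray(descriptors, dtype=np.float32))
    hist = np.bincount(words, minlength=n_words).astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-12)
```

The vocabulary is fitted once on the database images and then reused unchanged for every query, so all histograms share the same bins.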

index

Exact nearest-neighbour

Every database fingerprint is scored against the query by cosine or L2 distance (scikit-learn). Brute-force search returns exact rankings and is fast enough at this size; FAISS would be the next step at scale.
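One way to get this exact ranking from scikit-learn is `NearestNeighbors` with the brute-force algorithm. The helper names below are illustrative, not the project's own.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def build_exact_index(db_vectors, metric="cosine"):
    """Brute-force index: each query is compared with every database row."""
    return NearestNeighbors(algorithm="brute", metric=metric).fit(db_vectors)


def retrieve(index, query_vectors, k=5):
    """Return (distances, row indices) of the k closest database vectors."""
    return index.kneighbors(np.atleast_2d(query_vectors), n_neighbors=k)
```

For L2-normalised vectors the two metrics rank identically, because the squared L2 distance equals 2 minus twice the cosine similarity. Passing `metric="euclidean"` should therefore only change the reported distances, not the order.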

data

Controlled benchmark

Distinct synthetic place scenes, split into database and query views by photometric and geometric augmentation, so the true place of every query is known.
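A NumPy-only sketch of how such a split could be generated. The specific augmentations (a small translation plus a brightness and contrast change), their ranges, the number of queries per place, and the choice to keep the unmodified scene as the database view are all assumptions. The source does not say which transforms it used.

```python
import numpy as np


def augment_view(image, rng, max_shift=4, gain_range=(0.7, 1.3), bias_range=(-20, 20)):
    """Make a new view of an H x W x 3 scene: small shift plus brightness/contrast change."""
    # Geometric: translate by up to max_shift pixels, padding with edge values.
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    pad = ((max_shift, max_shift), (max_shift, max_shift), (0, 0))
    padded = np.pad(image, pad, mode="edge")
    h, w = image.shape[:2]
    y0, x0 = max_shift + dy, max_shift + dx
    shifted = padded[y0:y0 + h, x0:x0 + w]

    # Photometric: random gain (contrast) and bias (brightness), clipped to 8-bit.
    gain = rng.uniform(*gain_range)
    bias = rng.uniform(*bias_range)
    out = shifted.astype(np.float32) * gain + bias
    return np.clip(out, 0, 255).astype(np.uint8)


def make_benchmark(scenes, n_queries_per_place=3, seed=0):
    """Return database images/labels and augmented query images/labels."""
    rng = np.random.default_rng(seed)
    db_images, db_labels = list(scenes), list(range(len(scenes)))
    q_images, q_labels = [], []
    for place_id, scene in enumerate(scenes):
        for _ in range(n_queries_per_place):
            q_images.append(augment_view(scene, rng))
            q_labels.append(place_id)  # ground truth is known by construction
    return db_images, db_labels, q_images, q_labels
```

Because every query is derived from a known scene, the label comes for free. That is what makes the benchmark controlled.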

metric

Recall@K

The honest score: how often the true place lands in the top-K retrieved results, measured against a random-retrieval baseline.
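As a sketch, Recall@K reduces to a few lines once each query has a ranked list of retrieved place labels. The function name and the array layout (one row per query, best match first) are assumptions for illustration.

```python
import numpy as np


def recall_at_k(ranked_labels, true_labels, k):
    """Fraction of queries whose true place appears among the top-k retrieved labels."""
    ranked = np.asarray(ranked_labels)[:, :k]    # keep the k best per query
    truth = np.asarray(true_labels)[:, None]     # column vector for broadcasting
    return float((ranked == truth).any(axis=1).mean())
```

Recall@1 counts only exact top hits. Larger K is more forgiving, which is why the two are usually reported side by side.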

result

0.944 → 1.000

Measured Recall@1 of 0.944 (colour+HOG) and 1.000 (BoVW) against a 0.086 random baseline on a 12-place benchmark.
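For intuition about the baseline: if the database held exactly one entry per place and were ranked at random, the chance of the true place landing in the top K would be K/N, so about 0.083 for K = 1 and N = 12. The sketch below estimates that by simulation. The one-entry-per-place assumption, the query count and the trial count are mine; the source does not state how its 0.086 figure was computed.

```python
import numpy as np


def random_recall_at_k(n_places, n_queries, k, n_trials=200, seed=0):
    """Estimate Recall@K when the database is ranked in a random order."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        for _ in range(n_queries):
            true_place = rng.integers(n_places)
            ranking = rng.permutation(n_places)  # assumes one entry per place
            hits += int(true_place in ranking[:k])
    return hits / (n_trials * n_queries)


baseline = random_recall_at_k(n_places=12, n_queries=36, k=1)
```

A sampled estimate like this scatters around K/N, so a reported baseline slightly off 1/12 would not be surprising.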

Architecture

How a place is recognised

Every query runs the same describe-then-retrieve pipeline:

1. Build the database
Fingerprint every reference image once with the colour+HOG (or BoVW) descriptor and store the vectors in an exact NN index.
2. Fingerprint the query
Run the new photo through the same descriptor to get its global vector.
3. Retrieve
Score the query against every database vector by cosine / L2 distance and rank the closest: these are the candidate places.
4. Answer
Return the top match as the recognised place; report Recall@K over all queries.
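The four steps above can be wired together in a small class that takes the descriptor as a plug-in function. This is a sketch under stated assumptions: `PlaceRecogniser` and `toy_descriptor` are hypothetical names, and the toy descriptor (a grid of mean intensities) is a stand-in so the example runs without OpenCV, not the colour+HOG or BoVW descriptor the project used.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


class PlaceRecogniser:
    """Describe-then-retrieve: fingerprint references once, then look up queries."""

    def __init__(self, describe, metric="cosine"):
        self.describe = describe  # any function mapping an image to a 1-D vector
        self.metric = metric

    def build(self, images, place_ids):
        # Step 1: fingerprint every reference image and index the vectors.
        vectors = np.stack([self.describe(im) for im in images])
        self.place_ids = np.asarray(place_ids)
        self.index = NearestNeighbors(algorithm="brute", metric=self.metric).fit(vectors)
        return self

    def recognise(self, image, k=1):
        # Steps 2-4: fingerprint the query, rank the database, return place IDs.
        query = self.describe(image)[None, :]
        _, idx = self.index.kneighbors(query, n_neighbors=k)
        return self.place_ids[idx[0]]  # candidate places, best first


def toy_descriptor(image, grid=4):
    """Stand-in descriptor (not the source's): L2-normalised grid of mean intensities."""
    gray = image.astype(np.float32).mean(axis=2)
    h, w = gray.shape
    cells = gray[:h - h % grid, :w - w % grid].reshape(grid, h // grid, grid, w // grid)
    vec = cells.mean(axis=(1, 3)).ravel()
    return vec / (np.linalg.norm(vec) + 1e-12)
```

Swapping in a stronger descriptor changes only the function passed to the constructor; the retrieval code stays the same.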

Reflection

What rebuilding it taught me

Recognition is retrieval. Reframing "what place is this" as "find the nearest fingerprint" is the whole conceptual move.
The descriptor is everything. A representation that survives lighting, season and viewpoint changes is the difference between working and useless.
Classical goes a long way. Colour + HOG and a Bag of Visual Words already separate distinct places cleanly under mild appearance change — no deep net required to learn the core idea.
Controlled means honest. The strong Recall@K here comes from a controlled benchmark; a real deployment with season and lighting change is much harder, and that gap is the point.
Recall@K keeps you honest. A single "it matched!" demo means nothing; measuring top-K recall across a dataset against a random baseline does.