From-Scratch Build · Computer Vision

Visual Place Recognition

Show the system a photo and it answers a deceptively hard question: where was this taken? Not from GPS — from the pixels alone, by matching the image against a database of places it has seen before. I rebuilt it from scratch because it's the relocalisation trick that lets a robot or a phone know where it is when the satellite signal is gone.

Python · OpenCV · Image retrieval · Colour + HOG · Bag of Visual Words · Exact NN search

What it is

Knowing a place from a single picture

Visual place recognition turns "where am I?" into a search problem. Every known location is stored not as a coordinate but as a compact numerical fingerprint of what it looks like. When a new query image arrives, you fingerprint it the same way and find its nearest neighbours in the database — the closest matches are the most likely places.

What makes it genuinely hard is that the same place rarely looks the same twice: lighting shifts, seasons change, viewpoints differ, people and cars come and go. A good descriptor has to ignore all of that and lock onto the structure that actually identifies the location — the building, the skyline, the layout.

1 : N
one query image searched against a whole database of places — recognition reframed as nearest-neighbour retrieval.

The stack

From pixels to a place ID

Two ideas do the heavy lifting: a classical descriptor, and an exact search over it. Built with OpenCV + NumPy + scikit-learn — no deep nets, no FAISS.

descriptor

Colour + HOG

An HSV colour histogram concatenated with a Histogram of Oriented Gradients — one L2-normalised vector per image capturing colour and coarse layout.
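To make the colour + HOG idea concrete, here is a minimal NumPy-only sketch. It is not the project's code: the function names, the bin counts (8/4/4 for H/S/V, 9 orientation bins), the 4 x 4 cell grid and the choice to normalise each half before concatenating are all assumptions for illustration, and the HOG part is simplified (no overlapping block normalisation, which OpenCV's HOG does apply).

```python
import numpy as np

def rgb_to_hsv(rgb):
    """Vectorised RGB -> HSV for a float image in [0, 1], shape (H, W, 3)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    v = rgb.max(axis=-1)
    c = v - rgb.min(axis=-1)                       # chroma
    s = np.where(v > 0, c / np.maximum(v, 1e-12), 0.0)
    safe_c = np.maximum(c, 1e-12)                  # avoids division by zero on grey pixels
    h = np.where(v == r, ((g - b) / safe_c) % 6,
        np.where(v == g, (b - r) / safe_c + 2, (r - g) / safe_c + 4))
    h = np.where(c > 0, h / 6.0, 0.0)              # hue scaled to [0, 1)
    return np.stack([h, s, v], axis=-1)

def hog_cells(gray, grid=4, n_bins=9):
    """Simplified HOG: one magnitude-weighted orientation histogram per cell."""
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi               # unsigned orientation
    bins = np.minimum((ang / np.pi * n_bins).astype(int), n_bins - 1)
    h, w = gray.shape
    ys = np.linspace(0, h, grid + 1).astype(int)
    xs = np.linspace(0, w, grid + 1).astype(int)
    feats = []
    for i in range(grid):
        for j in range(grid):
            b = bins[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
            m = mag[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].ravel()
            feats.append(np.bincount(b, weights=m, minlength=n_bins))
    return np.concatenate(feats)

def l2_normalise(v, eps=1e-12):
    return v / (np.linalg.norm(v) + eps)

def colour_hog_descriptor(rgb_uint8, hsv_bins=(8, 4, 4), grid=4, n_bins=9):
    """Concatenate an HSV histogram with cell-wise HOG into one unit vector."""
    rgb = rgb_uint8.astype(np.float64) / 255.0
    hsv = rgb_to_hsv(rgb)
    colour = np.concatenate([
        np.histogram(hsv[..., ch], bins=n, range=(0.0, 1.0))[0]
        for ch, n in enumerate(hsv_bins)
    ]).astype(np.float64)
    hog = hog_cells(rgb.mean(axis=-1), grid, n_bins)
    # Assumption: normalise each half first so neither dominates the other.
    return l2_normalise(np.concatenate([l2_normalise(colour), l2_normalise(hog)]))

# Usage on a random stand-in image (a real photo would be loaded with OpenCV).
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8)
descriptor = colour_hog_descriptor(image)
print(descriptor.shape)
```

The colour half is invariant to where things sit in the frame, while the cell grid in the HOG half keeps a coarse sense of layout, which is the combination the card describes.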

descriptor

Bag of Visual Words

ORB keypoints quantised against a KMeans vocabulary, encoded as a word-occurrence histogram — a structure-focused fingerprint.
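The quantise-then-count step can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the tiny Lloyd's-algorithm loop stands in for scikit-learn's KMeans, the random 32-byte vectors stand in for real ORB descriptors (which OpenCV returns as 32 uint8 values per keypoint), and the names `fit_vocabulary` and `bovw_histogram` and the vocabulary size of 16 are hypothetical.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a (n, d) and b (k, d)."""
    return (a ** 2).sum(1)[:, None] - 2.0 * a @ b.T + (b ** 2).sum(1)[None, :]

def fit_vocabulary(descriptors, k, n_iter=20, seed=0):
    """Minimal k-means (Lloyd's algorithm); the centres are the visual words."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=k, replace=False)
    centres = descriptors[start].copy()
    for _ in range(n_iter):
        labels = pairwise_sq_dists(descriptors, centres).argmin(axis=1)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):                # keep the old centre if a cluster empties
                centres[j] = members.mean(axis=0)
    return centres

def bovw_histogram(local_descriptors, vocabulary):
    """Assign each local descriptor to its nearest word and count occurrences."""
    k = len(vocabulary)
    if len(local_descriptors) == 0:         # image with no keypoints
        return np.zeros(k)
    words = pairwise_sq_dists(local_descriptors, vocabulary).argmin(axis=1)
    hist = np.bincount(words, minlength=k).astype(np.float64)
    return hist / (np.linalg.norm(hist) + 1e-12)

# Usage with random stand-ins for ORB descriptors, cast to float for clustering.
rng = np.random.default_rng(0)
training = rng.integers(0, 256, size=(500, 32)).astype(np.float64)
vocabulary = fit_vocabulary(training, k=16)
image_descriptors = rng.integers(0, 256, size=(120, 32)).astype(np.float64)
fingerprint = bovw_histogram(image_descriptors, vocabulary)
print(fingerprint.shape)
```

Because the histogram only counts which words occur, it ignores where the keypoints sit in the image, so it tolerates viewpoint shifts better than a grid-based descriptor at the cost of losing layout.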

index

Exact nearest-neighbour

Every database fingerprint is scored against the query by cosine or L2 distance (scikit-learn). Brute-force search gives exact rankings and is fast enough at this database size; a library such as FAISS would be the next step at larger scale.
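A brute-force search is only a few lines. The sketch below uses plain NumPy rather than scikit-learn so it stays self-contained; the function names are hypothetical. It also shows why the choice between cosine and L2 matters less than it seems: for unit-length vectors, ||a - b||^2 = 2 - 2 a·b, so both metrics produce the same ranking.

```python
import numpy as np

def rank_by_cosine(query, database):
    """Database indices sorted from most to least similar (cosine)."""
    q = query / (np.linalg.norm(query) + 1e-12)
    db = database / (np.linalg.norm(database, axis=1, keepdims=True) + 1e-12)
    return np.argsort(-(db @ q), kind="stable")

def rank_by_l2(query, database):
    """Database indices sorted from nearest to farthest (Euclidean)."""
    return np.argsort(np.linalg.norm(database - query, axis=1), kind="stable")

# Usage: 50 random unit vectors, and a query that is a noisy copy of entry 7.
rng = np.random.default_rng(0)
database = rng.normal(size=(50, 16))
database /= np.linalg.norm(database, axis=1, keepdims=True)
query = database[7] + 0.05 * rng.normal(size=16)
query /= np.linalg.norm(query)

print(rank_by_cosine(query, database)[:3])
print(rank_by_l2(query, database)[:3])
```

Every query costs one pass over the whole database, which is what makes the result exact and also what stops it scaling indefinitely.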

data

Controlled benchmark

Distinct synthetic place scenes, split into database and query views by photometric and geometric augmentation — known ground truth.
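One way such a split could be generated is sketched below. The specific augmentations (a random gain and bias for the photometric change, a small edge-padded translation for the geometric one), their ranges, and the number of views per place are assumptions for illustration; the benchmark's real settings are not stated here.

```python
import numpy as np

def photometric(image, rng, gain=(0.7, 1.3), bias=(-20, 20)):
    """Random contrast and brightness change on a uint8 image."""
    a = rng.uniform(*gain)
    b = rng.uniform(*bias)
    return np.clip(image.astype(np.float64) * a + b, 0, 255).astype(np.uint8)

def geometric(image, rng, max_shift=6):
    """Random small translation; vacated pixels are filled by edge replication."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    pad = ((max_shift, max_shift), (max_shift, max_shift), (0, 0))
    padded = np.pad(image, pad, mode="edge")
    h, w = image.shape[:2]
    y0, x0 = max_shift + dy, max_shift + dx
    return padded[y0:y0 + h, x0:x0 + w]

def make_views(scene, n_views, rng):
    return [geometric(photometric(scene, rng), rng) for _ in range(n_views)]

def build_split(scenes, n_db=2, n_query=3, seed=0):
    """Return (database, query) lists of (image, place_id) pairs."""
    rng = np.random.default_rng(seed)
    database, query = [], []
    for place_id, scene in enumerate(scenes):
        database += [(v, place_id) for v in make_views(scene, n_db, rng)]
        query += [(v, place_id) for v in make_views(scene, n_query, rng)]
    return database, query

# Usage: 12 random stand-in scenes, one per place.
rng = np.random.default_rng(42)
scenes = [rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8) for _ in range(12)]
database, query = build_split(scenes)
print(len(database), len(query))
```

Because every view is derived from a known scene, the place label of each query is known exactly, which is what makes the recall numbers trustworthy.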

metric

Recall@K

The honest score: how often the true place lands in the top-K retrieved results, measured against a random-retrieval baseline.
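Recall@K and a random baseline are both short to write down. In this sketch the function names, the number of random trials, and the layout of one database image per place are assumptions, not the benchmark's actual configuration. Under that layout a uniformly random ranking is expected to score K/N for N places, and a measured baseline will sit near that value with some sampling noise.

```python
import numpy as np

def recall_at_k(rankings, db_labels, query_labels, k):
    """Fraction of queries whose true place appears in the top-k results.

    rankings: (n_queries, n_db) array of database indices, best match first.
    """
    top_k_labels = db_labels[rankings[:, :k]]
    hits = (top_k_labels == query_labels[:, None]).any(axis=1)
    return float(hits.mean())

def random_baseline(db_labels, query_labels, k, n_trials=200, seed=0):
    """Average Recall@K when every query gets a random permutation of the database."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(n_trials):
        rankings = np.stack([rng.permutation(len(db_labels)) for _ in query_labels])
        scores.append(recall_at_k(rankings, db_labels, query_labels, k))
    return float(np.mean(scores))

# Usage: 12 places, one database image each, three queries per place.
db_labels = np.arange(12)
query_labels = np.repeat(np.arange(12), 3)
print(random_baseline(db_labels, query_labels, k=1))
```

Reporting the baseline next to the real score is what shows the descriptor is doing the work rather than the benchmark being easy.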

result

0.944 → 1.000

Measured Recall@1 of 0.944 (colour+HOG) and 1.000 (BoVW), against a 0.086 random baseline on a 12-place benchmark.

Architecture

How a place is recognised

Every query runs the same describe-then-retrieve pipeline:

  1. Build the database

    Fingerprint every reference image once with the colour+HOG (or BoVW) descriptor and store the vectors in an exact NN index.

  2. Fingerprint the query

    Run the new photo through the same descriptor to get its global vector.

  3. Retrieve

    Score the query against every database vector by cosine / L2 distance and rank the closest — the candidate places.

  4. Answer

    Return the top match as the recognised place; report Recall@K over all queries.
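The four steps above can be sketched as one small class. This is an illustrative outline rather than the project's code: `PlaceIndex` and `toy_descriptor` are hypothetical names, and the grey-level histogram is only a stand-in for the colour+HOG or BoVW fingerprint, which would be passed in as the `describe` function instead.

```python
import numpy as np

def toy_descriptor(image, n_bins=16):
    """Stand-in fingerprint: an L2-normalised grey-level histogram."""
    hist, _ = np.histogram(image, bins=n_bins, range=(0, 256))
    hist = hist.astype(np.float64)
    return hist / (np.linalg.norm(hist) + 1e-12)

class PlaceIndex:
    """Describe-then-retrieve with an exact (brute-force) search."""

    def __init__(self, describe):
        self.describe = describe
        self.vectors = None
        self.place_ids = None

    def build(self, images, place_ids):
        # Step 1: fingerprint every reference image once and store the vectors.
        self.vectors = np.stack([self.describe(im) for im in images])
        self.place_ids = np.asarray(place_ids)
        return self

    def query(self, image, k=1):
        # Step 2: fingerprint the query with the same descriptor.
        q = self.describe(image)
        # Step 3: score against every database vector (dot product equals
        # cosine similarity because all vectors are unit length) and rank.
        scores = self.vectors @ q
        order = np.argsort(-scores)[:k]
        # Step 4: the first entry is the recognised place.
        return self.place_ids[order], scores[order]

# Usage: four toy "places" that differ in brightness range.
rng = np.random.default_rng(0)
make_scene = lambda p: rng.integers(p * 60, p * 60 + 60, size=(32, 32))
index = PlaceIndex(toy_descriptor).build([make_scene(p) for p in range(4)], [0, 1, 2, 3])
place_ids, scores = index.query(make_scene(2), k=2)
print(place_ids[0])
```

Keeping the descriptor as a plug-in function means the same index serves both fingerprints, so the two can be compared on identical queries.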

Reflection

What rebuilding it taught me