BCSAI · AI: Computer Vision · Final Project

Show it a photo. It tells you where you are.

A fast image-retrieval system for place recognition around the IE Tower. Given a query photo, it returns the top-K most similar gallery images and predicts the location — with three retrieval tracks (classical, deep, and CNN) sharing one FAISS index and evaluation harness.


Visual Place Recognition

Interactive demo: pick a query photo and press Retrieve to see the top-5 retrieved gallery images.

L2-normalized embeddings · cosine similarity · FAISS top-K

The demo is a scripted illustration: tiles stand in for IE Tower gallery photos, and switching tracks shows how rankings shift.

01 — Three retrieval tracks

Three ways to embed a place, one index.

Every track produces L2-normalized vectors that plug into the same FAISS index, so they're compared on equal footing through one shared data loader and evaluation harness.
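Why L2 normalization puts the tracks on equal footing can be sketched in a few lines: for unit-length vectors, the inner product equals cosine similarity, so an inner-product search ranks by cosine. The sketch below is a brute-force, standard-library stand-in for that search, not the project's code; in FAISS the same ranking would typically come from an inner-product index such as `faiss.IndexFlatIP` over normalized vectors. The function names and the toy vectors are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]


def top_k(query, gallery, k=5):
    """Return (gallery index, cosine similarity) pairs for the k best matches."""
    q = l2_normalize(query)
    scores = []
    for i, g in enumerate(gallery):
        g = l2_normalize(g)
        # For unit vectors the inner product equals cosine similarity.
        scores.append((i, sum(a * b for a, b in zip(q, g))))
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Toy 2-D "embeddings" (hypothetical values, for illustration only).
gallery = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
results = top_k([2.0, 0.1], gallery, k=2)
```

Because every vector is normalized before scoring, the magnitude of a raw embedding (which differs widely between VLAD, DINOv2, and CNN features) has no effect on the ranking.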

CLASSICAL

SIFT / ORB → VLAD

Local hand-crafted features aggregated into a single VLAD descriptor per image.
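As a rough illustration of the aggregation step, the sketch below builds a VLAD vector from a set of local descriptors and a small codebook: each descriptor is assigned to its nearest centroid, the residuals are summed per centroid, and the concatenated result is L2-normalized. It assumes the codebook already exists (in practice it would usually come from k-means over training descriptors) and omits refinements such as power or intra-normalization; the project's actual choices are not stated here. The function name `vlad` and the toy data are hypothetical.

```python
import math


def vlad(descriptors, centroids):
    """Aggregate local descriptors into one L2-normalized VLAD vector."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for d in descriptors:
        # Hard-assign the descriptor to its nearest centroid (squared Euclidean).
        nearest = min(
            range(len(centroids)),
            key=lambda c: sum((x - y) ** 2 for x, y in zip(d, centroids[c])),
        )
        # Accumulate the residual (descriptor minus centroid) for that cluster.
        for j in range(dim):
            residuals[nearest][j] += d[j] - centroids[nearest][j]
    # Concatenate per-cluster residual sums, then L2-normalize the result.
    flat = [v for row in residuals for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return flat if norm == 0 else [v / norm for v in flat]


# Toy 2-D descriptors and a 2-word codebook (hypothetical values).
centroids = [[0.0, 0.0], [10.0, 10.0]]
descriptors = [[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]]
vec = vlad(descriptors, centroids)
```

The output length is always (number of centroids) × (descriptor dimension), regardless of how many keypoints an image yields, which is what lets a variable-size set of SIFT or ORB features become a single fixed-size vector.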

DEEP

DINOv2 ViT-S/14

Self-supervised global embeddings (with a ResNet50 fallback) — strong out of the box.

CNN BASELINE

Supervised CNN

A small CNN trained on gallery labels, then reused as an embedding extractor for retrieval.

02 — Pipeline

From query to location.

Query photo (one image in) → Embed (L2-normalized vector) → FAISS (top-K nearest) → Predict (location label)
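The final step, turning top-K neighbors into a location label, could be done in several ways; the page does not say which rule the project uses. One plausible sketch is a similarity-weighted vote, where each retrieved image adds its cosine score to its label's tally. The function name `predict_location`, the label strings, and the scores below are all hypothetical.

```python
from collections import defaultdict


def predict_location(neighbors, gallery_labels):
    """Pick the label with the highest summed similarity among the top-K hits."""
    votes = defaultdict(float)
    for index, similarity in neighbors:
        votes[gallery_labels[index]] += similarity
    # On a tie, max() keeps the label that was inserted first,
    # i.e. the one whose best hit ranked highest.
    return max(votes.items(), key=lambda item: item[1])[0]


# Hypothetical gallery labels and (gallery index, cosine similarity) results.
gallery_labels = ["lobby", "lobby", "terrace", "library"]
neighbors = [(2, 0.91), (0, 0.88), (1, 0.80), (3, 0.40)]
label = predict_location(neighbors, gallery_labels)
```

Here the single best match is a terrace photo, but two strong lobby matches outweigh it. A plain top-1 rule would answer differently, so the choice of voting rule is worth evaluating alongside the embedding method.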

Reproducible end to end: prepare_data → build_index (per method) → run_eval against a held-out test set → a Streamlit demo UI. Evaluated with regression, classification, and ranking metrics.
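The specific metrics are not listed on this page. As an illustration of what ranking metrics for retrieval commonly look like, the sketch below computes Recall@K and mean reciprocal rank (MRR) from the labels of each query's ranked results. These two metrics, the function names, and the toy data are assumptions and may not match what run_eval reports.

```python
def recall_at_k(ranked_labels, true_labels, k):
    """Fraction of queries whose true label appears in the top-k retrieved labels."""
    hits = sum(
        1 for ranked, truth in zip(ranked_labels, true_labels) if truth in ranked[:k]
    )
    return hits / len(true_labels)


def mean_reciprocal_rank(ranked_labels, true_labels):
    """Average of 1/rank of the first correct result; a miss contributes 0."""
    total = 0.0
    for ranked, truth in zip(ranked_labels, true_labels):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(true_labels)


# Hypothetical results: one ranked label list per query, best match first.
ranked = [
    ["lobby", "terrace", "library"],
    ["terrace", "library", "lobby"],
    ["library", "lobby", "lobby"],
]
truth = ["lobby", "library", "terrace"]

r1 = recall_at_k(ranked, truth, 1)
r3 = recall_at_k(ranked, truth, 3)
mrr = mean_reciprocal_rank(ranked, truth)
```

Since all three tracks return ranked lists over the same gallery, functions of this shape can score them identically, which is the point of a shared evaluation harness.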

03 — Stack

Built with.

🐍 Python 3.11 🔎 FAISS 🧠 DINOv2 (ViT-S/14) 🖼 OpenCV · SIFT/ORB/VLAD 🔥 PyTorch (CNN / ResNet50) 📊 Streamlit 🧪 pytest