From-Scratch Build · Observability
A tiny service that quietly polls a list of websites, asks each one "are you up?", and publishes the answers as metrics a monitoring system can graph and alert on. Built from scratch to learn how uptime monitoring actually works under the hood.
What it is
When you run several websites, the question that matters most is the simplest: are they up right now? This build answers it continuously. It holds a small list of target URLs, and whenever asked, it sends each one a quick HTTP request and records whether it responded — and with what status code.
Crucially, it doesn't store or graph anything itself. Instead it exposes a single /metrics endpoint that speaks the Prometheus exposition format — plain text lines like http_status{target="site"} 200. A monitoring system scrapes that endpoint on a schedule, builds the history, draws the dashboards and fires the alerts. The checker's whole job is to be the honest little sensor at the bottom of that stack.
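To make the format concrete, here is a minimal sketch of how a set of results could be turned into exposition text. The function name `render_metrics` and the HELP wording are my own illustration, not taken from the original script; the metric name `http_status` and the `target` label come from the example line above.

```python
def render_metrics(statuses: dict[str, int]) -> str:
    """Render one labelled sample per target in Prometheus text exposition format."""
    lines = [
        # HELP and TYPE lines are optional metadata that scrapers understand.
        "# HELP http_status Last HTTP status code seen for each target.",
        "# TYPE http_status gauge",
    ]
    for target, code in statuses.items():
        # Label values must escape backslash, double quote and newline.
        safe = (
            target.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )
        lines.append(f'http_status{{target="{safe}"}} {code}')
    # The format expects the payload to end with a newline.
    return "\n".join(lines) + "\n"
```

Calling `render_metrics({"site": 200})` produces the two metadata lines followed by `http_status{target="site"} 200`.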
The core idea I wanted to learn: good monitoring separates measuring from storing. This service only measures — it exports a number on demand. By following Prometheus's pull model, a fifty-line script plugs straight into a full observability pipeline.
The stack
The point of this rebuild was how little it takes to be a proper metrics source. Here is what each piece does.
A minimal Python web framework. It exists here to serve exactly one route — /metrics — and nothing more.
For each target the checker sends a GET request with a timeout and reads back the status code, or marks the target as down if nothing answers.
Results are emitted as Prometheus exposition text — one labelled line per target — the lingua franca of modern monitoring.
A simple map of name-to-URL defines what gets checked. Adding a site to monitor is a one-line edit.
The whole thing ships as a container, so it drops into any host or monitoring network with no Python setup required.
It joins the monitoring stack's own network, so the scraper can reach it by name and pull metrics on its schedule.
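The target map and the per-target request could look roughly like the sketch below. It uses the standard library's `urllib` purely so the example has no dependencies; the write-up does not say which HTTP client the original uses, and the names `TARGETS`, `check_target`, `check_all`, the example URLs and the 5-second timeout are all assumptions.

```python
import http.client
import urllib.error
import urllib.request

# Hypothetical name-to-URL map; adding a site to monitor is one more entry.
TARGETS = {
    "example": "https://example.com/",
    "docs": "https://example.org/",
}


def check_target(url: str, timeout: float = 5.0) -> int:
    """Send one GET and return the status code, or 0 if nothing answers."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx, but the target did answer: keep its code.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # DNS failure, refused connection, timeout or malformed URL: report down.
        return 0


def check_all(targets: dict[str, str]) -> dict[str, int]:
    """Check every target one after another and collect the codes by name."""
    return {name: check_target(url) for name, url in targets.items()}
```

Note that `urlopen` follows redirects by default, so a target that redirects reports the status of the final response. Whether the original checker does the same is not stated.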
How a check works
Every time the monitoring system comes knocking, the same quick cycle runs — and it's deliberately stateless:
The monitoring system hits the /metrics endpoint on its regular interval.
The checker sends an HTTP request to every URL in its list, one after another, with a timeout guarding each.
It captures each response's status code — or a zero when a target fails to answer at all.
The results become labelled Prometheus lines, one per target, ready to parse.
The endpoint replies with plain text, and the scraper stores the snapshot in its time-series database.
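The cycle above can be sketched as a single request handler. This version uses the standard library's `http.server` rather than the Python web framework the build actually uses, so treat it as an illustration of the flow, not the original code. `make_handler`, `serve`, the port number and the injected `check_fn` (any function that maps a URL to a status code, returning 0 on failure) are hypothetical names.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable


def make_handler(targets: dict[str, str], check_fn: Callable[[str], int]):
    """Build a handler class that runs a fresh check cycle on every scrape."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Only one route exists; everything else is a 404.
            if self.path != "/metrics":
                self.send_error(404)
                return
            # Check each target in order and emit one labelled line per target.
            # Nothing is cached between requests, so the handler stays stateless.
            # Target names are assumed to need no label escaping.
            lines = [
                f'http_status{{target="{name}"}} {check_fn(url)}'
                for name, url in targets.items()
            ]
            body = ("\n".join(lines) + "\n").encode("utf-8")
            # Reply with plain text in the exposition format's content type.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # silence per-request logging in this sketch

    return MetricsHandler


def serve(targets: dict[str, str], check_fn: Callable[[str], int], port: int = 8000):
    """Serve /metrics until interrupted."""
    ThreadingHTTPServer(("0.0.0.0", port), make_handler(targets, check_fn)).serve_forever()
```

Because the checks run inside the request, the slowest possible response is roughly the number of targets times the per-request timeout. That total needs to stay under the scraper's own scrape timeout.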
Future work: also exporting response time and certificate expiry per target. For now the checker reports only up/down status.
Why the pull model
Keeping the checker deliberately dumb is the design — and it pays off in three ways:
A failed target reports 0 instead of crashing — so "the site is down" is itself a clean, alertable signal rather than a gap in the data.
Reflection
0 for an unreachable target turns an outage into a first-class signal you can alert on, instead of a hole in the graph.