Overview — goal, sessions exercised, stack
The deliverable is a single REST microservice — a URL-shortener API — taken all the way from a green-field repository to a deployed, observed, and operated service. It is the same arc the syllabus draws across the semester: Individual Assignment 1 (a minimal app) becomes the substrate that Assignment 2 and the group project improve with DevOps practices, so nothing is throwaway.
We pick a URL shortener because it is small enough to read in one sitting yet rich enough to demonstrate every operational concern: it has state (a key→URL map), an obvious correctness test, a natural place to apply a design pattern (the key-generation algorithm), and a clear service-level objective (redirects must be fast and available).
Which parts of the course this exercises
- Sessions 1–2 (SDLC): the project moves through requirements → design → build → test → deploy → operate, explicitly.
- Sessions 2–3 (Git & tooling): a Git-flow branching model with pull-request review gates.
- Sessions 4–6 (Design patterns): the Strategy pattern for swappable key-generation algorithms.
- Session 6 (Testing): a unit + integration test suite with a coverage gate in CI.
- Session 7 (Scrum): the work is planned and demoed as a sprint.
- Session 8 (Backend): a web/app server, an in-process store, and a health endpoint.
- Session 10 (Deployment): the build-once-deploy-many pipeline and container artifact.
- Session 11 (Security): non-root container, pinned dependencies, no secrets in the image.
- IaC objective: Terraform describes the runtime so the environment is reproducible.
The stack
- API: Python 3.12 · Flask · Gunicorn
- Tests: pytest · coverage
- Lint: ruff
- Container: Docker (multi-stage)
- CI/CD: GitHub Actions
- Registry: GHCR
- IaC: Terraform
- Runtime: AWS ECS Fargate + ALB
- Observe: /healthz · structured logs · Prometheus metrics
Repository layout
The app & a design pattern applied
The interesting design decision is how short keys are generated. A naïve service hard-codes one algorithm; ours treats key generation as a swappable Strategy (one of the behavioural patterns from Topic 03). Each algorithm implements the same interface, so the rest of the system depends on a stable contract rather than a concrete implementation — and we can switch from random keys to a deterministic counter-based scheme without touching the API layer.
# Strategy pattern: interchangeable key-generation algorithms
# behind one interface (Sessions 4-6: behavioural design patterns).
from __future__ import annotations

import secrets
import string
from typing import Protocol

ALPHABET = string.ascii_letters + string.digits


class KeyStrategy(Protocol):
    """The stable contract every algorithm must honour."""

    def generate(self) -> str: ...


class RandomKey:
    """Cryptographically-random short key. Good default: no coordination needed."""

    def __init__(self, length: int = 7) -> None:
        self.length = length

    def generate(self) -> str:
        return "".join(secrets.choice(ALPHABET) for _ in range(self.length))


class CounterKey:
    """Deterministic base-62 of a monotonic counter. Shortest keys, but needs a source of truth."""

    def __init__(self, start: int = 1000) -> None:
        self._n = start

    def generate(self) -> str:
        n, out = self._n, []
        self._n += 1
        if n == 0:
            return ALPHABET[0]
        while n > 0:
            n, r = divmod(n, len(ALPHABET))
            out.append(ALPHABET[r])
        return "".join(reversed(out))


def get_strategy(name: str) -> KeyStrategy:
    """Factory: pick the strategy by name (driven by config, not hard-coded)."""
    strategies = {"random": RandomKey, "counter": CounterKey}
    if name not in strategies:
        raise ValueError(f"unknown key strategy: {name!r}")
    return strategies[name]()
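To make the swap concrete, here is a minimal, self-contained sketch of a third strategy plugging into the same contract. HexKey, the module-level STRATEGIES registry, and the allocate() stand-in for a call site are hypothetical names introduced for illustration; the project's own get_strategy builds its dict inline.

```python
import secrets
from typing import Callable, Protocol


class KeyStrategy(Protocol):
    """Same contract as in the listing above: one method, no arguments."""

    def generate(self) -> str: ...


class HexKey:
    """Hypothetical third strategy: a random lowercase-hex key."""

    def __init__(self, nbytes: int = 4) -> None:
        self.nbytes = nbytes

    def generate(self) -> str:
        # secrets.token_hex(n) returns 2 * n hexadecimal characters
        return secrets.token_hex(self.nbytes)


# Assumption: a module-level registry, so adding a scheme is one new entry.
STRATEGIES: dict[str, Callable[[], KeyStrategy]] = {"hex": HexKey}


def get_strategy(name: str) -> KeyStrategy:
    """Look the strategy up by name and instantiate it."""
    try:
        return STRATEGIES[name]()
    except KeyError:
        raise ValueError(f"unknown key strategy: {name!r}") from None


def allocate(keygen: KeyStrategy) -> str:
    """Stand-in for a call site: it only knows about generate()."""
    return keygen.generate()


if __name__ == "__main__":
    print(allocate(get_strategy("hex")))
```

Because KeyStrategy is a typing.Protocol, HexKey does not need to inherit from anything: having a matching generate() method is enough for a type checker to accept it.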
The API layer only ever calls .generate(). Adding a third scheme (say, a hash of the long URL) is a one-class change that touches no existing call site: the Open/Closed principle in action.

# The HTTP layer. Depends on the KeyStrategy interface, not a concrete class.
from flask import Blueprint, request, jsonify, redirect, current_app

api = Blueprint("api", __name__)


@api.post("/shorten")
def shorten():
    data = request.get_json(silent=True)
    if not isinstance(data, dict):
        data = {}  # tolerate a missing body or a non-object JSON payload
    url = data.get("url")
    # reject non-string values too, so bad input is a 400 rather than a 500
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        return jsonify(error="a valid http(s) url is required"), 400
    store = current_app.config["STORE"]
    keygen = current_app.config["KEYGEN"]
    # retry on the rare key collision (random strategy)
    for _ in range(5):
        key = keygen.generate()
        if store.put_if_absent(key, url):
            short = f"{request.host_url}{key}"
            return jsonify(key=key, short_url=short, url=url), 201
    return jsonify(error="could not allocate a key"), 503


@api.get("/<key>")
def resolve(key: str):
    url = current_app.config["STORE"].get(key)
    if url is None:
        return jsonify(error="not found"), 404
    return redirect(url, code=302)
# Application factory: wires config, the store, the chosen strategy,
# and the observability endpoints. Config is injected, never hard-coded.
import os

from flask import Flask

from .api import api
from .store import MemoryStore
from .keygen import get_strategy
from .observability import register_observability


def create_app() -> Flask:
    app = Flask(__name__)
    app.config["STORE"] = MemoryStore()
    app.config["KEYGEN"] = get_strategy(os.environ.get("KEY_STRATEGY", "random"))
    app.register_blueprint(api)
    register_observability(app)
    return app
The MemoryStore in store.py is a thread-safe dict wrapper with a
put_if_absent method; swapping it for Redis or Postgres later is, again, a one-class change
because the API depends only on its small interface (see Topic 05
on keeping the persistence layer behind a port).
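store.py itself is not shown, so the following is only a sketch of what such a store might look like, assuming a plain dict guarded by a threading.Lock and the two methods the API layer calls (put_if_absent and get).

```python
import threading
from typing import Optional


class MemoryStore:
    """Thread-safe, in-process key -> URL map. Contents are lost on restart."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, url: str) -> bool:
        """Store url under key only if key is unused; return True on success."""
        with self._lock:
            if key in self._data:
                return False  # collision: the caller should try another key
            self._data[key] = url
            return True

    def get(self, key: str) -> Optional[str]:
        """Return the stored URL, or None if the key is unknown."""
        with self._lock:
            return self._data.get(key)


if __name__ == "__main__":
    store = MemoryStore()
    print(store.put_if_absent("abc", "https://example.com"))
    print(store.put_if_absent("abc", "https://other.example"))
    print(store.get("abc"))
```

The check and the write happen under one lock, which is what makes put_if_absent safe to use as the collision test in the retry loop. An in-process store is also per process: each Gunicorn worker and each ECS task holds its own map, which is one reason to move to a shared backend behind the same interface.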
Containerization — a multi-stage Dockerfile
The image is the unit of deployment. A multi-stage build compiles and installs dependencies in a fat builder stage, then copies only the resulting virtual environment into a slim runtime stage. The shipped image carries no compilers, no build caches, and no shell history — smaller attack surface, faster pulls.
# ---------- Stage 1: builder ----------
FROM python:3.12-slim AS builder
ENV PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=1
WORKDIR /build
COPY requirements.txt .
# build the dependency tree into an isolated venv we can copy wholesale
RUN python -m venv /opt/venv \
    && /opt/venv/bin/pip install --upgrade pip \
    && /opt/venv/bin/pip install -r requirements.txt

# ---------- Stage 2: runtime ----------
FROM python:3.12-slim AS runtime
# run as an unprivileged user (Session 11: never run containers as root)
RUN useradd --create-home --uid 10001 appuser
ENV PATH="/opt/venv/bin:$PATH" \
    PYTHONUNBUFFERED=1 \
    KEY_STRATEGY=random \
    PORT=8080
WORKDIR /app
COPY --from=builder /opt/venv /opt/venv
COPY app/ ./app/
USER appuser
EXPOSE 8080
# container-native health check: the orchestrator polls this
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD python -c "import urllib.request,os; \
    urllib.request.urlopen(f'http://127.0.0.1:{os.environ[\"PORT\"]}/healthz').read()" || exit 1
# Gunicorn: a production WSGI server, not Flask's dev server
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "2", \
     "--access-logfile", "-", "app:create_app()"]
Pin the base image by digest (python:3.12-slim@sha256:…) so a moving tag can't silently change your image. Order layers by change frequency: copy requirements.txt and install before copying app code, so dependency layers stay cached when only source changes. Never COPY . . blindly: a .dockerignore keeps .git, tests, and any .env out of the image (Session 11).
CI/CD pipeline — GitHub Actions, stage by stage
The pipeline is the spine of the project: every change runs the same gates, and only a green build on
main reaches production. This is the build-once-deploy-many and
continuous delivery discipline from Topic 10.
- Build: install deps in a clean runner
- Test: pytest with a coverage gate
- Lint: ruff static checks
- Image: build & push to GHCR
- Deploy: Terraform apply on main
name: ci

on:
  push:
    branches: [main]
  pull_request:

permissions:
  contents: read
  packages: write   # push the image to GHCR
  id-token: write   # let the deploy job request an OIDC token for AWS

env:
  IMAGE: ghcr.io/${{ github.repository }}

jobs:
  # ---------- test + lint run on every push and PR ----------
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest coverage ruff
      - name: Lint
        run: ruff check app tests
      - name: Test with coverage gate
        run: |
          coverage run -m pytest -q
          coverage report --fail-under=80   # fail the build below 80%

  # ---------- build the image; on main, also push it ----------
  image:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GHCR
        if: github.ref == 'refs/heads/main'
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: .
          push: ${{ github.ref == 'refs/heads/main' }}
          tags: |
            ${{ env.IMAGE }}:${{ github.sha }}
            ${{ env.IMAGE }}:latest

  # ---------- deploy only from main, after a green image ----------
  deploy:
    needs: image
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production   # gate: require an approval here
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}   # OIDC, no static keys
          aws-region: eu-west-1
      - name: Terraform apply
        working-directory: infra
        run: |
          terraform init
          terraform apply -auto-approve \
            -var="image=${{ env.IMAGE }}:${{ github.sha }}"
The image job requires test to pass, builds on every run for parity, but only pushes on main. deploy runs only from main, behind a GitHub environment that can require a human approval, and uses OIDC role assumption so no long-lived AWS keys ever live in the repo (Session 11). The image tag is the immutable commit SHA: the exact artifact tested is the exact artifact deployed.
IaC & deploy — Terraform
The runtime is described declaratively so it is reproducible and reviewable: the same
terraform apply recreates the whole environment from scratch, and every change to the
infrastructure goes through a pull request like any other code. Below, an ECS Fargate service sits behind an
Application Load Balancer; the ALB health check points at the very /healthz endpoint the app exposes.
terraform {
  required_version = ">= 1.6"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  # remote state so the team shares one source of truth
  backend "s3" {
    bucket = "urlshort-tfstate"
    key    = "prod/terraform.tfstate"
    region = "eu-west-1"
  }
}

provider "aws" {
  region = "eu-west-1"
}

variable "image" {
  description = "Container image tag to deploy (passed by CI)"
  type        = string
}

resource "aws_ecs_cluster" "main" {
  name = "urlshort"
}

resource "aws_ecs_task_definition" "app" {
  family                   = "urlshort"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = aws_iam_role.exec.arn
  container_definitions = jsonencode([{
    name         = "urlshort"
    image        = var.image
    essential    = true
    portMappings = [{ containerPort = 8080 }]
    environment  = [{ name = "KEY_STRATEGY", value = "random" }]
    healthCheck = {
      command     = ["CMD-SHELL", "python -c \"import urllib.request;urllib.request.urlopen('http://127.0.0.1:8080/healthz')\""]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 10
    }
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.app.name
        "awslogs-region"        = "eu-west-1"
        "awslogs-stream-prefix" = "urlshort"
      }
    }
  }])
}

resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/urlshort"
  retention_in_days = 30
}

resource "aws_ecs_service" "app" {
  name            = "urlshort"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2 # two tasks: rolling deploys with no downtime
  launch_type     = "FARGATE"
  network_configuration {
    subnets         = var.private_subnets
    security_groups = [aws_security_group.app.id]
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "urlshort"
    container_port   = 8080
  }
}

resource "aws_lb_target_group" "app" {
  name        = "urlshort"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"
  health_check {
    path                = "/healthz" # the ALB polls the app's health endpoint
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
  }
}
On each deploy, ECS starts new tasks, waits for them to pass the /healthz check, then drains the old ones: zero-downtime releases (the deployment strategies from Topic 10). The image variable is the seam where CI hands the freshly-built, tested artifact to the infrastructure.
Observability & reliability
A service you can't see is a service you can't operate. We expose three things — a health check, structured logs, and metrics — and then define an explicit SLO with an error budget so "is it healthy enough?" becomes a number, not an argument.
# Health, structured logging, and Prometheus metrics in one place.
import json
import logging
import sys
import time

from flask import request, Response

_REQUESTS: dict[tuple[str, int], int] = {}
_LATENCY_SUM = 0.0
_LATENCY_COUNT = 0


def register_observability(app):
    _configure_json_logging()

    @app.get("/healthz")
    def healthz():
        # liveness: cheap, dependency-free, polled by Docker/ECS/ALB
        return {"status": "ok"}, 200

    @app.before_request
    def _start_timer():
        request._t0 = time.perf_counter()

    @app.after_request
    def _record(resp):
        global _LATENCY_SUM, _LATENCY_COUNT
        dt = time.perf_counter() - getattr(request, "_t0", time.perf_counter())
        key = (request.endpoint or "unknown", resp.status_code)
        _REQUESTS[key] = _REQUESTS.get(key, 0) + 1
        _LATENCY_SUM += dt
        _LATENCY_COUNT += 1
        app.logger.info(json.dumps({
            "msg": "request", "path": request.path, "method": request.method,
            "status": resp.status_code, "duration_ms": round(dt * 1000, 2),
        }))
        return resp

    @app.get("/metrics")
    def metrics():
        # Prometheus text exposition format, scraped on an interval
        lines = ["# TYPE http_requests_total counter"]
        for (endpoint, status), n in _REQUESTS.items():
            lines.append(f'http_requests_total{{endpoint="{endpoint}",status="{status}"}} {n}')
        avg = (_LATENCY_SUM / _LATENCY_COUNT) if _LATENCY_COUNT else 0.0
        lines += ["# TYPE http_request_duration_seconds_avg gauge",
                  f"http_request_duration_seconds_avg {avg:.6f}"]
        return Response("\n".join(lines) + "\n", mimetype="text/plain")


def _configure_json_logging():
    handler = logging.StreamHandler(sys.stdout)  # logs go to stdout; the platform collects them
    logging.getLogger().addHandler(handler)
    logging.getLogger().setLevel(logging.INFO)
Health vs. readiness. /healthz answers liveness — "is the process up?"
— and must stay cheap and dependency-free, or a slow database makes the orchestrator kill a perfectly live
container. A separate readiness probe (checking dependencies) gates whether traffic is routed in.
Logs go to stdout as one JSON object per line, so the platform (CloudWatch here) collects and
indexes them — the app never owns log files. Metrics are exposed in Prometheus format and
scraped on an interval, feeding dashboards and alerts.
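In the observability listing the handler is a plain StreamHandler, so the one-JSON-object-per-line shape comes from the json.dumps call in the request hook. One way to get the same shape for every log record is a small formatter. JsonFormatter, configure_json_logging, and the extra={"fields": ...} convention are assumptions made for this sketch, not part of the project.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # Hypothetical convention: structured fields arrive via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def configure_json_logging(stream=sys.stdout) -> logging.Handler:
    """Attach a JSON-formatting handler to the root logger and return it."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)
    return handler


if __name__ == "__main__":
    configure_json_logging()
    logging.getLogger("urlshort").info(
        "request", extra={"fields": {"status": 302, "duration_ms": 1.8}}
    )
```

Keeping the formatting in the handler means call sites log ordinary messages plus a dict of fields, and the line format can change in one place.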
An SLO and an error budget
The SLI (indicator) is the fraction of redirect requests served successfully in under 200 ms. The SLO (objective) is the target we promise; the error budget is the amount of failure that target permits — and it is permission to take risk, not a goal of zero.
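The arithmetic behind an error budget is short enough to sketch. The figures used here (a 99.9% SLO, a 30-day window, one million redirects, 250 of them bad) are illustrative assumptions, since the text does not fix a target, and the function names are likewise hypothetical.

```python
def error_budget_requests(slo: float, total_requests: int) -> int:
    """How many requests may miss the SLI while the SLO is still met."""
    return round((1.0 - slo) * total_requests)


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """The same budget expressed as minutes of total outage in the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, bad_requests: int) -> float:
    """Fraction of the budget still unspent; negative means the SLO is blown."""
    budget = (1.0 - slo) * total_requests
    if budget == 0:
        return 1.0 if bad_requests == 0 else float("-inf")
    return 1.0 - bad_requests / budget


if __name__ == "__main__":
    slo = 0.999  # assumed target: 99.9% of redirects succeed in under 200 ms
    print(error_budget_requests(slo, 1_000_000))
    print(allowed_downtime_minutes(slo))
    print(budget_remaining(slo, 1_000_000, bad_requests=250))
```

Under those assumptions the budget is 1,000 failed or slow redirects per million, or roughly 43 minutes of full outage per 30 days, and 250 bad requests leave about three quarters of it unspent. A team with budget left can ship riskier changes; a team that has spent it slows down and invests in stability.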
The Agile / Git workflow
The work is run as a single Scrum sprint (Topic 07) on a
Git-flow branching model (Topic 02). The two reinforce each
other: the sprint defines what a unit of value is, and the branching model defines how that
value safely reaches main.
| Scrum element | How it shows up in this project |
|---|---|
| Product backlog | Issues: "shorten endpoint", "redirect endpoint", "Dockerfile", "CI pipeline", "Terraform service", "health + metrics". |
| Sprint goal | "A user can shorten a URL and be redirected, served by a deployed, observed container." |
| Sprint backlog | The subset above committed for this sprint, each issue sized in points. |
| Daily standup | Async on the PR board: what merged, what's blocked, what's next. |
| Increment | The green build on main deployed to production — a demonstrable redirect. |
| Sprint review | Live demo: shorten a URL, follow the redirect, show the Grafana dashboard. |
| Retrospective | What to keep (coverage gate caught a bug) and improve (flaky integration test). |
Each backlog item becomes a short-lived feature branch off main, opened as a pull request. The
PR is the quality gate: CI must be green (tests, coverage, lint) and a teammate must approve before merge —
exactly the review surface from Session 3.
# branch off main for one issue
git switch -c feat/shorten-endpoint
# conventional commits make the history (and changelog) readable
git commit -m "feat(api): add POST /shorten with strategy-based keys"
git commit -m "test(api): cover collision retry path"
# publish and open a PR — CI runs test + lint on the PR automatically
git push -u origin feat/shorten-endpoint
gh pr create --fill # review + green CI required before merge
# after approval + squash-merge, main builds the image and (with approval) deploys
Mapping to learning outcomes
Read against the course's stated objectives (see Course · Learning objectives), this one project touches every one:
| Objective | Where this project demonstrates it |
|---|---|
| Holistic vision | One feature carried end-to-end: design → build → test → ship → operate. |
| Agile methodology | §7 — Scrum sprint with review and retrospective on a Git-flow model. |
| Architecture & patterns | §2 — Strategy pattern; store behind a port; stateless container. |
| Testing plan | §4 — unit + integration suite with an 80% coverage gate in CI. |
| Core DevOps, any vendor | §4–6 — CI/CD, build-once-deploy-many, monitoring; portable concepts. |
| Management trade-offs | §6 — the error budget arbitrates ship-fast vs. stay-stable. |
| Infrastructure as Code | §5 — Terraform for the ECS service, ALB, logging, and state. |
| Continuous improvement | §7 retro + §6 SLO review feed the next sprint's backlog. |
| Cloud-native computing | §3,§5 — containerized, orchestrated, declaratively provisioned. |
Extensions
- Persistence: swap MemoryStore for Redis or Postgres behind the same interface, and add a readiness probe that checks the connection.
- Blue/green or canary deploys: route a small traffic slice to the new task set and promote only if its error rate stays inside the budget.
- Real dashboards & alerts: wire /metrics into Prometheus + Grafana and alert on burn-rate against the error budget (Assignment 2's brief).
- Supply-chain security: add a Trivy image scan and dependency review to the pipeline, and sign the image (Session 11 / DevSecOps).
- Autoscaling: add an ECS target-tracking policy on CPU or request count to handle load spikes.
- Rate limiting: protect /shorten from abuse with a per-IP limit applied as a Decorator over the route, a structural pattern in practice.
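As a rough illustration of the rate-limiting extension, the sketch below wraps a handler in a per-client sliding-window limit. It is framework-free so it runs on its own: rate_limit, RateLimited, the client_id argument, and the injectable clock are all hypothetical, and in the Flask app the client id would presumably come from the request's remote address.

```python
import functools
import threading
import time
from typing import Callable


class RateLimited(Exception):
    """Raised when a client exceeds its allowance."""


def rate_limit(max_calls: int, per_seconds: float,
               clock: Callable[[], float] = time.monotonic):
    """Allow at most max_calls per client within a sliding window."""

    def decorator(func):
        calls: dict[str, list[float]] = {}  # client id -> recent call times
        lock = threading.Lock()

        @functools.wraps(func)
        def wrapper(client_id: str, *args, **kwargs):
            now = clock()
            with lock:
                # keep only the timestamps still inside the window
                recent = [t for t in calls.get(client_id, []) if now - t < per_seconds]
                if len(recent) >= max_calls:
                    calls[client_id] = recent
                    raise RateLimited(client_id)
                recent.append(now)
                calls[client_id] = recent
            return func(client_id, *args, **kwargs)

        return wrapper

    return decorator


@rate_limit(max_calls=2, per_seconds=60.0)
def shorten(client_id: str, url: str) -> str:
    """Stand-in for the real handler."""
    return f"shortened {url}"


if __name__ == "__main__":
    print(shorten("203.0.113.7", "https://example.com/a"))
    print(shorten("203.0.113.7", "https://example.com/b"))
    try:
        shorten("203.0.113.7", "https://example.com/c")
    except RateLimited:
        print("429 Too Many Requests")
```

The wrapped function is unchanged, which is the point of the pattern: the limit can be added or removed without editing the handler. The counters live in process memory, so with several workers or tasks a shared store would be needed for a true global limit.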
References
The course notes this project ties together:
- 01 · Software Development Life Cycle — the phase arc.
- 02 · Git Basics — branching, PRs, the Git-flow model.
- 03 · Design Patterns — Strategy (and Decorator) applied.
- 05 · Software Architectures — ports, statelessness.
- 06 · Testing — the pyramid and the coverage gate.
- 07 · Scrum — roles, events, artifacts.
- 08 · Backend Components — web/app server, store, health.
- 10 · Software Deployment — pipelines, artifacts, strategies.
- 11 · DevOps Security — non-root images, secrets, scanning.
- Course structure — the full 35-session program.
- Readings — Accelerate and The DevOps Handbook for SLOs, error budgets, and continuous delivery.
This worked example is the pattern behind Individual Assignments 1 & 2 and the group project.