S SDDO Notes · IE BCSAI 2025
Worked Example · End-to-end

Ship a containerized web app with a full CI/CD + IaC pipeline.

One project that exercises the whole course: a small Flask REST API built around a clean design pattern, packaged in a multi-stage Docker image, tested and shipped by a GitHub Actions pipeline, provisioned with Terraform, made observable with health checks, structured logs and an SLO, and run as Scrum on a Git-flow branching model. Every snippet below is real, runnable configuration — read it top to bottom as the story of one feature going to production.

10 sections · 6 real config files · Sessions 1–35 exercised
01 · The brief

Overview — goal, sessions exercised, stack

The deliverable is a single REST microservice — a URL-shortener API — taken all the way from a green-field repository to a deployed, observed, and operated service. It is the same arc the syllabus draws across the semester: Individual Assignment 1 (a minimal app) becomes the substrate that Assignment 2 and the group project improve with DevOps practices, so nothing is throwaway.

We pick a URL shortener because it is small enough to read in one sitting yet rich enough to demonstrate every operational concern: it has state (a key→URL map), an obvious correctness test, a natural place to apply a design pattern (the key-generation algorithm), and a clear service-level objective (redirects must be fast and available).

Which parts of the course this exercises

The stack

Build once, deploy many. The single Docker image produced by CI is the only artifact that travels between environments. Staging and production differ only in configuration (environment variables, the Terraform workspace), never in the artifact. This is the central deployment discipline from Session 10.

Repository layout

# the shape of the repo this page describes
urlshort/
├── app/
│   ├── __init__.py        # app factory + route wiring
│   ├── api.py             # REST endpoints
│   ├── store.py           # in-process key→URL store
│   ├── keygen.py          # Strategy pattern: key generators
│   └── observability.py   # /healthz, /metrics, logging
├── tests/
│   ├── test_keygen.py     # unit tests (the strategies)
│   └── test_api.py        # integration tests (the HTTP layer)
├── infra/
│   └── main.tf            # Terraform: ECS service + ALB
├── .github/workflows/
│   └── ci.yml             # build → test → lint → image → deploy
├── Dockerfile             # multi-stage build
├── requirements.txt
└── README.md
02 · The application

The app & a design pattern applied

The interesting design decision is how short keys are generated. A naïve service hard-codes one algorithm; ours treats key generation as a swappable Strategy (one of the behavioural patterns from Topic 03). Each algorithm implements the same interface, so the rest of the system depends on a stable contract rather than a concrete implementation — and we can switch from random keys to a deterministic counter-based scheme without touching the API layer.

app/keygen.py
# Strategy pattern — interchangeable key-generation algorithms
# behind one interface (Sessions 4-6: behavioural design patterns).
from __future__ import annotations
import secrets
import string
from typing import Protocol

ALPHABET = string.ascii_letters + string.digits


class KeyStrategy(Protocol):
    """The stable contract every algorithm must honour."""
    def generate(self) -> str: ...


class RandomKey:
    """Cryptographically-random short key. Good default: no coordination needed."""
    def __init__(self, length: int = 7) -> None:
        self.length = length

    def generate(self) -> str:
        return "".join(secrets.choice(ALPHABET) for _ in range(self.length))


class CounterKey:
    """Deterministic base-62 of a monotonic counter. Shortest keys, but needs a source of truth."""
    def __init__(self, start: int = 1000) -> None:
        self._n = start

    def generate(self) -> str:
        n, out = self._n, []
        self._n += 1
        if n == 0:
            return ALPHABET[0]
        while n > 0:
            n, r = divmod(n, len(ALPHABET))
            out.append(ALPHABET[r])
        return "".join(reversed(out))


def get_strategy(name: str) -> KeyStrategy:
    """Factory: pick the strategy by name (driven by config, not hard-coded)."""
    strategies = {"random": RandomKey, "counter": CounterKey}
    if name not in strategies:
        raise ValueError(f"unknown key strategy: {name!r}")
    return strategies[name]()
Why a pattern here. The Strategy keeps the varying part, the algorithm, isolated and independently testable. The API never knows which generator it holds; it only calls .generate(). Adding a third scheme is a one-class change plus one entry in the get_strategy registry, and it touches no existing call site: the Open/Closed principle in action. (A scheme that needs input, such as a hash of the long URL, would also require widening generate() to accept the URL.)
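To make the two strategies concrete, the sketch below reproduces the base-62 encoding that CounterKey performs, adds a decoder, and computes the size of the RandomKey keyspace. The helper names encode_base62 and decode_base62 are illustrative only and are not part of app/keygen.py.

```python
import string

ALPHABET = string.ascii_letters + string.digits  # same 62-character alphabet as keygen.py


def encode_base62(n: int) -> str:
    """Encode a non-negative integer the way CounterKey.generate() does."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    out = []
    while n > 0:
        n, r = divmod(n, len(ALPHABET))
        out.append(ALPHABET[r])
    return "".join(reversed(out))


def decode_base62(key: str) -> int:
    """Inverse of encode_base62 (hypothetical helper, handy in tests)."""
    n = 0
    for ch in key:
        n = n * len(ALPHABET) + ALPHABET.index(ch)
    return n


first_key = encode_base62(1000)        # CounterKey's default start value
round_trip = decode_base62(first_key)  # back to 1000
random_keyspace = len(ALPHABET) ** 7   # distinct 7-character random keys

print(first_key, round_trip, random_keyspace)  # qi 1000 3521614606208
```

With n keys already stored, a fresh random key collides with probability n / 62^7, which is why the handful of retries in the API layer is normally enough.
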
app/api.py
# The HTTP layer. Depends on the KeyStrategy interface, not a concrete class.
from flask import Blueprint, request, jsonify, redirect, current_app

api = Blueprint("api", __name__)


@api.post("/shorten")
def shorten():
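    data = request.get_json(silent=True)
    url = data.get("url") if isinstance(data, dict) else None
    # reject non-string values too: a body like {"url": 42} would otherwise crash .startswith
    if not isinstance(url, str) or not url.startswith(("http://", "https://")):
        return jsonify(error="a valid http(s) url is required"), 400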

    store = current_app.config["STORE"]
    keygen = current_app.config["KEYGEN"]

    # retry on the rare key collision (random strategy)
    for _ in range(5):
        key = keygen.generate()
        if store.put_if_absent(key, url):
            short = f"{request.host_url}{key}"
            return jsonify(key=key, short_url=short, url=url), 201
    return jsonify(error="could not allocate a key"), 503


@api.get("/<key>")
def resolve(key: str):
    url = current_app.config["STORE"].get(key)
    if url is None:
        return jsonify(error="not found"), 404
    return redirect(url, code=302)
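
The prefix check in shorten() accepts strings such as "http://" that name no host. If stricter validation is wanted, one possible approach is to parse the value with the standard library. The function name is_valid_http_url is hypothetical and does not appear in app/api.py.

```python
from urllib.parse import urlsplit


def is_valid_http_url(value: object) -> bool:
    """Return True only for http(s) URLs that also name a host."""
    if not isinstance(value, str):
        return False
    try:
        parts = urlsplit(value)
    except ValueError:
        # urlsplit rejects some malformed inputs (for example a broken IPv6 literal)
        return False
    return parts.scheme in ("http", "https") and bool(parts.hostname)


print(is_valid_http_url("https://example.com/a?b=1"))  # True
print(is_valid_http_url("http://"))                    # False
```

The handler could call this in place of the startswith test; the 400 response and the rest of the flow would stay the same.
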
app/__init__.py
# Application factory — wires config, the store, the chosen strategy,
# and the observability endpoints. Config is injected, never hard-coded.
import os
from flask import Flask
from .api import api
from .store import MemoryStore
from .keygen import get_strategy
from .observability import register_observability


def create_app() -> Flask:
    app = Flask(__name__)
    app.config["STORE"] = MemoryStore()
    app.config["KEYGEN"] = get_strategy(os.environ.get("KEY_STRATEGY", "random"))
    app.register_blueprint(api)
    register_observability(app)
    return app

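The MemoryStore in store.py is a thread-safe dict wrapper with a put_if_absent method; swapping it for Redis or Postgres later is, again, a one-class change because the API depends only on its small interface (see Topic 05 on keeping the persistence layer behind a port). Until then, the map lives in process memory: each Gunicorn worker and each ECS task holds its own copy, so a key created by one worker is unknown to the others, and a shared store is needed before the service runs with more than one worker.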
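
store.py itself is not reproduced on this page. A minimal sketch that matches the description above (a lock-guarded dict exposing put_if_absent, plus the get that api.py calls) might look like the following; the use of threading.Lock and the internal attribute names are assumptions.

```python
import threading
from typing import Optional


class MemoryStore:
    """In-process key→URL map guarded by a lock (a sketch of app/store.py)."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, url: str) -> bool:
        """Store url under key only if key is unused; report whether it was stored."""
        with self._lock:
            if key in self._data:
                return False
            self._data[key] = url
            return True

    def get(self, key: str) -> Optional[str]:
        """Return the stored URL, or None when the key is unknown."""
        with self._lock:
            return self._data.get(key)


store = MemoryStore()
first = store.put_if_absent("qi", "https://example.com")     # key is free
second = store.put_if_absent("qi", "https://other.example")  # key already taken
```

Doing the check and the write under one lock is what makes the collision retry in shorten() safe: two requests can never both claim the same key.
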

03 · Packaging

Containerization — a multi-stage Dockerfile

The image is the unit of deployment. A multi-stage build compiles and installs dependencies in a fat builder stage, then copies only the resulting virtual environment into a slim runtime stage. The shipped image carries no compilers, no build caches, and no shell history — smaller attack surface, faster pulls.

Dockerfile
# ---------- Stage 1: builder ----------
FROM python:3.12-slim AS builder

ENV PYTHONDONTWRITEBYTECODE=1 \
    PIP_NO_CACHE_DIR=1

WORKDIR /build
COPY requirements.txt .
# build the dependency tree into an isolated venv we can copy wholesale
RUN python -m venv /opt/venv \
 && /opt/venv/bin/pip install --upgrade pip \
 && /opt/venv/bin/pip install -r requirements.txt

# ---------- Stage 2: runtime ----------
FROM python:3.12-slim AS runtime

# run as an unprivileged user (Session 11: never run containers as root)
RUN useradd --create-home --uid 10001 appuser
ENV PATH="/opt/venv/bin:$PATH" \
    PYTHONUNBUFFERED=1 \
    KEY_STRATEGY=random \
    PORT=8080

WORKDIR /app
COPY --from=builder /opt/venv /opt/venv
COPY app/ ./app/

USER appuser
EXPOSE 8080

# container-native health check — the orchestrator polls this
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
  CMD python -c "import urllib.request,os; \
      urllib.request.urlopen(f'http://127.0.0.1:{os.environ[\"PORT\"]}/healthz').read()" || exit 1

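# Gunicorn: a production WSGI server, not Flask's dev server.
# Exec-form CMD does not expand $PORT, so the bind port is written out here;
# change it together with PORT above, which the HEALTHCHECK reads.
CMD ["gunicorn", "--bind", "0.0.0.0:8080", "--workers", "2", \
     "--access-logfile", "-", "app:create_app()"]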
Notes & trade-offs. Pin the base by digest in production (python:3.12-slim@sha256:…) so a moving tag can't silently change your image. Order layers by change frequency: copy requirements.txt and install before copying app code, so dependency layers stay cached when only source changes. Never COPY . . blindly: a .dockerignore keeps .git, tests, and any .env out of the image (Session 11).
04 · Automation

CI/CD pipeline — GitHub Actions, stage by stage

The pipeline is the spine of the project: every change runs the same gates, and only a green build on main reaches production. This is the build-once-deploy-many and continuous delivery discipline from Topic 10.

.github/workflows/ci.yml
name: ci

on:
  push:
    branches: [main]
  pull_request:

permissions:
  contents: read
  packages: write          # push the image to GHCR
  id-token: write          # let the deploy job request an OIDC token for AWS

env:
  IMAGE: ghcr.io/${{ github.repository }}

jobs:
  # ---------- test + lint run on every PR and on pushes to main ----------
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
          cache: pip
      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest coverage ruff
      - name: Lint
        run: ruff check app tests
      - name: Test with coverage gate
        run: |
          coverage run -m pytest -q
          coverage report --fail-under=80   # fail the build below 80%

  # ---------- build the image; on main, also push it ----------
  image:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Log in to GHCR
        if: github.ref == 'refs/heads/main'
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - name: Build and push
        uses: docker/build-push-action@v6
        with:
          context: .
          push: ${{ github.ref == 'refs/heads/main' }}
          tags: |
            ${{ env.IMAGE }}:${{ github.sha }}
            ${{ env.IMAGE }}:latest

  # ---------- deploy only from main, after a green image ----------
  deploy:
    needs: image
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production      # gate: require an approval here
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_DEPLOY_ROLE }}   # OIDC, no static keys
          aws-region: eu-west-1
      - name: Terraform apply
        working-directory: infra
        run: |
          terraform init
          terraform apply -auto-approve \
            -var="image=${{ env.IMAGE }}:${{ github.sha }}"
Stage by stage. test runs on every pull request and on every push to main: lint, then a coverage gate that fails the build below 80% (Session 6 says coverage is a floor, not a ceiling). image needs test to pass and builds on every run for parity, but only pushes on main. deploy runs only from main, behind a GitHub environment that can require a human approval, and uses OIDC role assumption so no long-lived AWS keys ever live in the repo (Session 11). The image tag is the immutable commit SHA, so the exact artifact tested is the exact artifact deployed.
05 · Infrastructure as code

IaC & deploy — Terraform

The runtime is described declaratively so it is reproducible and reviewable: the same terraform apply recreates the whole environment from scratch, and every change to the infrastructure goes through a pull request like any other code. Below, an ECS Fargate service sits behind an Application Load Balancer; the ALB health check points at the very /healthz endpoint the app exposes.

infra/main.tf
terraform {
  required_version = ">= 1.6"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.0" }
  }
  # remote state so the team shares one source of truth
  backend "s3" {
    bucket = "urlshort-tfstate"
    key    = "prod/terraform.tfstate"
    region = "eu-west-1"
  }
}

provider "aws" {
  region = "eu-west-1"
}

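variable "image" {
  description = "Full container image reference to deploy (passed by CI)"
  type        = string
}

# Network inputs referenced further down; supply them via a .tfvars file or TF_VAR_* variables.
variable "vpc_id" {
  description = "VPC that hosts the target group"
  type        = string
}

variable "private_subnets" {
  description = "Private subnet IDs for the Fargate tasks"
  type        = list(string)
}

# Not shown on this page: aws_iam_role.exec, aws_security_group.app, and the
# aws_lb / aws_lb_listener that front the target group.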

resource "aws_ecs_cluster" "main" {
  name = "urlshort"
}

resource "aws_ecs_task_definition" "app" {
  family                   = "urlshort"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512
  execution_role_arn       = aws_iam_role.exec.arn

  container_definitions = jsonencode([{
    name      = "urlshort"
    image     = var.image
    essential = true
    portMappings = [{ containerPort = 8080 }]
    environment  = [{ name = "KEY_STRATEGY", value = "random" }]
    healthCheck = {
      command     = ["CMD-SHELL", "python -c \"import urllib.request;urllib.request.urlopen('http://127.0.0.1:8080/healthz')\""]
      interval    = 30
      timeout     = 5
      retries     = 3
      startPeriod = 10
    }
    logConfiguration = {
      logDriver = "awslogs"
      options = {
        "awslogs-group"         = aws_cloudwatch_log_group.app.name
        "awslogs-region"        = "eu-west-1"
        "awslogs-stream-prefix" = "urlshort"
      }
    }
  }])
}

resource "aws_cloudwatch_log_group" "app" {
  name              = "/ecs/urlshort"
  retention_in_days = 30
}

resource "aws_ecs_service" "app" {
  name            = "urlshort"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 2          # two tasks: rolling deploys with no downtime
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = var.private_subnets
    security_groups = [aws_security_group.app.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "urlshort"
    container_port   = 8080
  }
}

resource "aws_lb_target_group" "app" {
  name        = "urlshort"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    path                = "/healthz"   # the ALB polls the app's health endpoint
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
  }
}
IaC notes. Remote state (the S3 backend) lets the whole team share one source of truth and avoids the "works on my laptop" infrastructure drift. Two desired tasks behind the ALB give a rolling deploy: ECS starts new tasks, waits for them to pass the /healthz check, then drains the old ones, giving zero-downtime releases (the deployment strategies from Topic 10). The image variable is the seam where CI hands the freshly built, tested artifact to the infrastructure.
06 · Run it well

Observability & reliability

A service you can't see is a service you can't operate. We expose three things — a health check, structured logs, and metrics — and then define an explicit SLO with an error budget so "is it healthy enough?" becomes a number, not an argument.

app/observability.py
# Health, structured logging, and Prometheus metrics in one place.
import json
import logging
import sys
import time
from flask import request, Response

_REQUESTS: dict[tuple[str, int], int] = {}
_LATENCY_SUM = 0.0
_LATENCY_COUNT = 0


def register_observability(app):
    _configure_json_logging()

    @app.get("/healthz")
    def healthz():
        # liveness: cheap, dependency-free, polled by Docker/ECS/ALB
        return {"status": "ok"}, 200

    @app.before_request
    def _start_timer():
        request._t0 = time.perf_counter()

    @app.after_request
    def _record(resp):
        global _LATENCY_SUM, _LATENCY_COUNT
        dt = time.perf_counter() - getattr(request, "_t0", time.perf_counter())
        key = (request.endpoint or "unknown", resp.status_code)
        _REQUESTS[key] = _REQUESTS.get(key, 0) + 1
        _LATENCY_SUM += dt
        _LATENCY_COUNT += 1
        app.logger.info(json.dumps({
            "msg": "request", "path": request.path, "method": request.method,
            "status": resp.status_code, "duration_ms": round(dt * 1000, 2),
        }))
        return resp

    @app.get("/metrics")
    def metrics():
        # Prometheus text exposition format — scraped on an interval
        lines = ["# TYPE http_requests_total counter"]
        for (endpoint, status), n in _REQUESTS.items():
            lines.append(f'http_requests_total{{endpoint="{endpoint}",status="{status}"}} {n}')
        avg = (_LATENCY_SUM / _LATENCY_COUNT) if _LATENCY_COUNT else 0.0
        lines += ["# TYPE http_request_duration_seconds_avg gauge",
                  f"http_request_duration_seconds_avg {avg:.6f}"]
        return Response("\n".join(lines) + "\n", mimetype="text/plain")


def _configure_json_logging():
    handler = logging.StreamHandler(sys.stdout)   # logs go to stdout; the platform collects them
    logging.getLogger().addHandler(handler)
    logging.getLogger().setLevel(logging.INFO)
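
In the module above, the JSON shape comes from the json.dumps call in _record; the handler itself uses the default plain formatter. If every log call should come out as JSON, one option is a custom logging.Formatter. The class name JsonFormatter and the list of extra fields are assumptions made for this sketch.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render every log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        # fields passed via `extra=` land on the record as attributes
        for field in ("path", "method", "status", "duration_ms"):
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)


# demo: log to an in-memory stream instead of stdout so the result can be inspected
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
demo_logger = logging.getLogger("urlshort.demo")
demo_logger.addHandler(handler)
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False

demo_logger.info("request", extra={"path": "/abc", "method": "GET",
                                   "status": 302, "duration_ms": 1.25})
line = json.loads(stream.getvalue())
```

In the service, the same formatter could be attached to the stdout handler inside _configure_json_logging, and _record could then pass its fields through extra= rather than pre-serialising them.
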

Health vs. readiness. /healthz answers liveness — "is the process up?" — and must stay cheap and dependency-free, or a slow database makes the orchestrator kill a perfectly live container. A separate readiness probe (checking dependencies) gates whether traffic is routed in. Logs go to stdout as one JSON object per line, so the platform (CloudWatch here) collects and indexes them — the app never owns log files. Metrics are exposed in Prometheus format and scraped on an interval, feeding dashboards and alerts.
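
The readiness side of that distinction is not implemented in observability.py. As a rough sketch, a readiness handler could run a set of named dependency checks and answer 503 if any of them fails. The function name readiness and the check names are hypothetical; no such endpoint exists in the code on this page.

```python
from typing import Callable


def readiness(checks: dict[str, Callable[[], bool]]) -> tuple[dict, int]:
    """Run each dependency check; report 200 only if all pass, else 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # a crashing check counts as "not ready"; it must not crash the probe
            results[name] = False
    ready = all(results.values())
    body = {"status": "ready" if ready else "not ready", "checks": results}
    return body, 200 if ready else 503


def failing_check() -> bool:
    raise ConnectionError("store unreachable")


ok_body, ok_status = readiness({"store": lambda: True})
bad_body, bad_status = readiness({"store": lambda: True, "cache": failing_check})
```

The (dict, status) return shape is the same one healthz uses, so a Flask route could return the result directly. Liveness stays on /healthz and keeps ignoring dependencies.
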

An SLO and an error budget

The SLI (indicator) is the fraction of redirect requests served successfully in under 200 ms. The SLO (objective) is the target we promise; the error budget is the amount of failure that target permits — and it is permission to take risk, not a goal of zero.
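
As an illustration of how that indicator could be computed from request records, the sketch below counts a redirect as good when it is not a server error and finishes in under 200 ms. Treating 4xx responses as good is a choice made for this example, and the Redirect type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Redirect:
    status: int          # HTTP status returned to the client
    duration_ms: float   # server-side latency


def is_good(event: Redirect, threshold_ms: float = 200.0) -> bool:
    """Good means: no server-side failure and faster than the latency threshold."""
    return event.status < 500 and event.duration_ms < threshold_ms


def sli(events: list[Redirect]) -> float:
    """Fraction of good events; an empty window counts as fully compliant."""
    if not events:
        return 1.0
    return sum(is_good(e) for e in events) / len(events)


window = [
    Redirect(302, 12.0),    # good
    Redirect(302, 250.0),   # too slow
    Redirect(503, 5.0),     # server error
    Redirect(404, 8.0),     # unknown key: a client error, counted as good here
]
print(sli(window))  # 0.5
```

The average latency that /metrics exposes cannot answer this question on its own; the SLI needs per-request durations, or at least a count of requests under the threshold.
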

SLI: success < 200 ms
SLO (30 days): 99.9%
Error budget: 0.1%
≈ allowed downtime: 43m 12s / mo
How the budget drives decisions. A 99.9% monthly SLO permits ~43 minutes of "bad" time per 30 days. While budget remains, the team is free to ship fast and take risks; when the budget is spent, the policy flips: feature work pauses and the next sprint goes to reliability. The error budget turns the perennial Dev-vs-Ops tension ("ship faster" vs. "stay stable") into a shared, quantified rule instead of a fight, which is the cultural goal of DevOps from Session 1.
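
The budget arithmetic can be checked in a few lines. The function names below are illustrative, and the 21.6 "bad" minutes in the usage line is a made-up figure.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of "bad" time the SLO permits over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return (budget - bad_minutes) / budget


budget = error_budget_minutes(0.999)
minutes, seconds = divmod(round(budget * 60), 60)
print(f"{minutes}m {seconds}s")  # 43m 12s

# hypothetical month so far: 21.6 bad minutes means half the budget is left
half_left = budget_remaining(0.999, bad_minutes=21.6)
```

A policy check then reduces to a comparison: feature work continues while budget_remaining is positive and pauses once it reaches zero.
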
07 · How the team works

The Agile / Git workflow

The work is run as a single Scrum sprint (Topic 07) on a Git-flow branching model (Topic 02). The two reinforce each other: the sprint defines what a unit of value is, and the branching model defines how that value safely reaches main.

Scrum element | How it shows up in this project
Product backlog | Issues: "shorten endpoint", "redirect endpoint", "Dockerfile", "CI pipeline", "Terraform service", "health + metrics".
Sprint goal | "A user can shorten a URL and be redirected, served by a deployed, observed container."
Sprint backlog | The subset above committed for this sprint, each issue sized in points.
Daily standup | Async on the PR board: what merged, what's blocked, what's next.
Increment | The green build on main deployed to production: a demonstrable redirect.
Sprint review | Live demo: shorten a URL, follow the redirect, show the Grafana dashboard.
Retrospective | What to keep (coverage gate caught a bug) and improve (flaky integration test).

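Each backlog item becomes a short-lived feature branch off main, opened as a pull request. (Branching straight off main is, strictly speaking, GitHub flow; classic Git-flow adds a long-lived develop branch and release branches.) The PR is the quality gate: CI must be green (tests, coverage, lint) and a teammate must approve before merge, which is exactly the review surface from Session 3.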

the loop, per backlog item
# branch off main for one issue
git switch -c feat/shorten-endpoint main

# conventional commits make the history (and changelog) readable
git add app/api.py
git commit -m "feat(api): add POST /shorten with strategy-based keys"
git add tests/test_api.py
git commit -m "test(api): cover collision retry path"

# publish and open a PR — CI runs test + lint on the PR automatically
git push -u origin feat/shorten-endpoint
gh pr create --fill            # review + green CI required before merge

# after approval + squash-merge, main builds the image and (with approval) deploys
08 · The syllabus, exercised

Mapping to learning outcomes

Read against the course's stated objectives (see Course · Learning objectives), this one project touches every one:

Objective | Where this project demonstrates it
Holistic vision | One feature carried end-to-end: design → build → test → ship → operate.
Agile methodology | §7: Scrum sprint with review and retrospective on a Git-flow model.
Architecture & patterns | §2: Strategy pattern; store behind a port; stateless container.
Testing plan | §4: unit + integration suite with an 80% coverage gate in CI.
Core DevOps, any vendor | §4–6: CI/CD, build-once-deploy-many, monitoring; portable concepts.
Management trade-offs | §6: the error budget arbitrates ship-fast vs. stay-stable.
Infrastructure as Code | §5: Terraform for the ECS service, ALB, logging, and state.
Continuous improvement | §7 retro + §6 SLO review feed the next sprint's backlog.
Cloud-native computing | §3, §5: containerized, orchestrated, declaratively provisioned.
09 · Go further

Extensions

10 · Where this comes from

References

The course notes this project ties together:

Coursework briefs

This worked example is the pattern behind Individual Assignments 1 & 2 and the group project.

See the project briefs →