cloud-lab · worked example — a scalable 3-tier web app on the cloud

Worked example — design & "deploy" a scalable, reliable 3-tier web app on the cloud

This is the end-to-end project that runs as the common thread through the course: a group takes a single web application from a blank cloud account to a load-balanced, autoscaling, observable production deployment. Below is one fully worked instance of that project — the same decisions, code and numbers a team would produce — grounded in the concepts each session teaches. Every interactive demo in the lab corresponds to one decision made here.

Goal

Run "PhotoShare", a read-heavy image-sharing web app, on a public cloud so it (a) survives the loss of any single instance or availability zone, (b) absorbs a 10× traffic spike automatically, and (c) costs as little as possible at that reliability target. We target a 99.9% monthly availability SLO and a steady state of ~1,200 requests/second with bursts to ~9,000 req/s.

Stack (AWS naming; Azure equivalents in parentheses)

layer	AWS	Azure
edge / CDN	CloudFront	Front Door
load balancer	ALB	App Gateway
app tier	ECS Fargate	Container Apps
database	RDS Postgres	Azure DB
cache	ElastiCache	Cache for Redis
object store	S3	Blob Storage

IaC: Terraform (also OpenTofu)
delivery model: IaaS + PaaS
compute: containers

Course sessions exercised

S1 · NIST service models
S2 · virtualization
S3–4 · containers
S5–6 · clouds + FinOps
S7 · serverless
S8–9 · Linux
S10 · Ansible + Terraform
S11 · storage + CAP
S12 · LB / K8s
S12 · availability / SRE
S12 · DDoS

On this page: requirements & architecture · IaC, container & deploy · scaling & reliability · FinOps cost model · results & trade-offs · learning-outcome map · extensions · references

1 · Requirements & architecture

Functional & non-functional requirements

Stateless app tier. No request may depend on a specific instance, so any node can serve any user and the autoscaler can add/remove nodes freely. Session state lives in the cache, not in memory.
No single point of failure. Every tier spans at least two Availability Zones (AZs); the database runs Multi-AZ with a standby replica.
Elasticity. Capacity tracks load within a target utilization band (see autoscaling demo).
Durable user uploads. Images go to object storage (11 nines of durability), never to the local disk of an ephemeral container.
99.9% monthly availability SLO ⇒ ≤ 43.2 min downtime / 30-day month.
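
The stateless-tier requirement can be illustrated with a minimal sketch. It assumes a hypothetical `SessionStore` that stands in for the shared Redis cache (a plain dict here, so the example runs without a server); `AppInstance`, `login` and `whoami` are illustrative names, not PhotoShare's actual code.

```python
import secrets


class SessionStore:
    """Stand-in for the shared Redis cache (hypothetical; a dict keeps this runnable)."""

    def __init__(self):
        self._data = {}

    def put(self, session_id, user):
        self._data[session_id] = user

    def get(self, session_id):
        return self._data.get(session_id)


class AppInstance:
    """One stateless app task: it references the shared store and keeps no sessions itself."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user):
        session_id = secrets.token_hex(16)
        self.store.put(session_id, user)
        return session_id

    def whoami(self, session_id):
        return self.store.get(session_id)


store = SessionStore()
task_a = AppInstance("az-a", store)
task_b = AppInstance("az-b", store)

sid = task_a.login("alice")   # first request lands on the task in AZ-a
user = task_b.whoami(sid)     # next request lands on AZ-b and still resolves
```

Because neither task holds the session in its own memory, either one can be removed by the autoscaler without logging anyone out.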

Reference architecture (request flow, top → bottom)

Each tier is horizontally redundant across two AZs. Solid downward flow is the read/write request path; the cache and object store sit beside the app tier.

clients → edge

Browser / mobile
CloudFront CDN: static assets + image cache, TLS, WAF, rate limit

▼ cache-miss / dynamic requests

L7 load balancer (public subnets, 2 AZs)

Application Load Balancer: health checks · TLS termination · least-outstanding-requests

▼ round-robin / least-conn across healthy targets

app tier: stateless containers (private subnets, 2 AZs, autoscaled)

app task (AZ-a) · app task (AZ-b) · … N tasks (2 → 20)

▼ reads hit cache first; writes + cold reads hit DB

managed data tier (private subnets, 2 AZs)

Redis cache: sessions + hot reads
Postgres primary (AZ-a): writes
Postgres standby (AZ-b): sync replica, auto-failover
S3 object store: user uploads · 11 nines durability

The dotted boundary between "you manage" and "provider manages" lands differently per tier: the app tier is closest to IaaS/containers (you own the image and scaling policy), while Postgres, Redis and S3 are PaaS — the provider patches, replicates and backs them up. See the service-models demo for where that line falls.

CAP decision

The relational primary is the system of record and is run CP: under a partition the standby will not accept writes until failover completes, so we never split-brain the ledger of who owns which photo. The Redis read cache and the CDN are run AP: a stale thumbnail for a few seconds is acceptable. This is the same trade-off explored in the CAP demo.
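
The AP side of this decision can be sketched as a cache-aside read with a time-to-live. This is a minimal illustration, not the app's code: `TTLCache` is a hypothetical in-process stand-in for Redis, a dict stands in for the Postgres primary, and the 5-second TTL is an assumed value.

```python
class TTLCache:
    """Tiny in-process stand-in for the Redis read cache (hypothetical)."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._entries = {}  # key -> (value, expiry time)

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is None or now >= entry[1]:
            return None
        return entry[0]

    def set(self, key, value, now):
        self._entries[key] = (value, now + self.ttl)


def read_photo_owner(photo_id, db, cache, now):
    """Cache-aside read: try the cache, fall back to the system of record."""
    cached = cache.get(photo_id, now)
    if cached is not None:
        return cached
    value = db[photo_id]
    cache.set(photo_id, value, now)
    return value


db = {"p1": "alice"}              # stands in for the Postgres primary
cache = TTLCache(ttl_seconds=5)   # assumed TTL

first = read_photo_owner("p1", db, cache, now=0.0)   # miss: loads from the db
db["p1"] = "bob"                                      # a write goes to the primary
stale = read_photo_owner("p1", db, cache, now=2.0)   # cache still answers "alice"
fresh = read_photo_owner("p1", db, cache, now=6.0)   # TTL expired: reloads "bob"
```

The stale read at `now=2.0` is the accepted cost of the AP choice; the primary itself never disagrees about the owner.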

2 · Infrastructure as Code, container & deploy

The whole environment is declarative: one terraform apply creates the network, load balancer, container service, database and cache. Nothing is clicked in a console, so the environment is reproducible and drift is detectable (see the IaC drift demo).

main.tf — network, load balancer & autoscaling app service (Terraform / AWS)

terraform {
  required_version = ">= 1.6.0"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.40" }
  }
  backend "s3" {                 # remote state => team-safe, lockable
    bucket = "photoshare-tfstate"
    key    = "prod/terraform.tfstate"
    region = "eu-west-1"
    dynamodb_table = "tf-locks"  # state locking prevents concurrent applies
  }
}

provider "aws" { region = "eu-west-1" }

# --- networking: one VPC, public + private subnets in 2 AZs ----------
module "vpc" {
  source             = "terraform-aws-modules/vpc/aws"
  version            = "~> 5.8"
  name               = "photoshare"
  cidr               = "10.0.0.0/16"
  azs                = ["eu-west-1a", "eu-west-1b"]
  public_subnets     = ["10.0.0.0/24",  "10.0.1.0/24"]   # ALB lives here
  private_subnets    = ["10.0.10.0/24", "10.0.11.0/24"]  # app + data here
  enable_nat_gateway = true
  single_nat_gateway = false   # one NAT per AZ => no cross-AZ SPOF
}

# --- L7 load balancer across both public subnets --------------------
resource "aws_lb" "app" {
  name               = "photoshare-alb"
  load_balancer_type = "application"
  subnets            = module.vpc.public_subnets
  security_groups    = [aws_security_group.alb.id]
}

resource "aws_lb_target_group" "app" {
  name        = "photoshare-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = module.vpc.vpc_id
  target_type = "ip"
  health_check {
    path                = "/healthz"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
    timeout             = 5
  }
}

# --- stateless container service (ECS Fargate) ----------------------
resource "aws_ecs_service" "app" {
  name            = "photoshare"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 4                 # baseline; autoscaler overrides
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = module.vpc.private_subnets
    security_groups = [aws_security_group.app.id]
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "web"
    container_port   = 8080
  }
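  # Fargate spreads tasks across the AZs of the subnets above on a
  # best-effort basis. Placement constraints and strategies apply only to
  # the EC2 launch type, and "spread" is a strategy, not a constraint type,
  # so none is declared here.
  lifecycle {
    ignore_changes = [desired_count]   # let the autoscaler own the task count
  }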
}

# --- managed data tier: Multi-AZ Postgres + Redis + S3 --------------
resource "aws_db_instance" "pg" {
  identifier              = "photoshare-pg"
  engine                  = "postgres"
  engine_version          = "16.3"
  instance_class          = "db.r6g.large"
  allocated_storage       = 100
  multi_az                = true       # synchronous standby in AZ-b
  storage_encrypted       = true
  backup_retention_period = 7
  deletion_protection     = true
}

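resource "aws_elasticache_replication_group" "redis" {
  replication_group_id       = "photoshare-redis"
  description                = "PhotoShare sessions and hot-read cache"  # required argument
  node_type                  = "cache.r6g.large"
  num_cache_clusters         = 2       # primary + replica
  automatic_failover_enabled = true
  multi_az_enabled           = true    # keep primary and replica in different AZs
}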

resource "aws_s3_bucket" "uploads" {
  bucket = "photoshare-user-uploads"
}
resource "aws_s3_bucket_versioning" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  versioning_configuration { status = "Enabled" }
}

The app itself is packaged as a small, layer-cached container image. A multi-stage build keeps the runtime image tiny (no compiler, no build deps) — the density win you measured in the VM-vs-container demo.

Dockerfile — multi-stage build, non-root, minimal runtime

# ---- build stage: has the toolchain, thrown away afterwards --------
FROM node:20-bookworm-slim AS build
WORKDIR /app
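COPY package*.json ./
# full install: the build step usually needs devDependencies (compiler, bundler)
RUN npm ci
COPY . .
# build, then drop devDependencies so only runtime packages are copied below
RUN npm run build && npm prune --omit=dev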

# ---- runtime stage: only the artifacts + node runtime -------------
FROM gcr.io/distroless/nodejs20-debian12 AS runtime
WORKDIR /app
COPY --from=build /app/dist  ./dist
COPY --from=build /app/node_modules ./node_modules
ENV NODE_ENV=production PORT=8080
USER 1000:1000          # never run as root
EXPOSE 8080
HEALTHCHECK --interval=15s --timeout=3s \
  CMD ["/nodejs/bin/node", "dist/healthcheck.js"]
CMD ["dist/server.js"]

.github/workflows/deploy.yml — CI build + push + rolling deploy

name: deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write       # OIDC: no long-lived AWS keys in the repo
  contents: read

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gh-deploy
          aws-region: eu-west-1

      - name: Build, tag & push image
        run: |
          REG=123456789012.dkr.ecr.eu-west-1.amazonaws.com
          aws ecr get-login-password | docker login --username AWS --password-stdin $REG
          docker build -t $REG/photoshare:${{ github.sha }} .
          docker push $REG/photoshare:${{ github.sha }}

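      - name: Register new task definition revision & roll out
        run: |
          IMAGE=123456789012.dkr.ecr.eu-west-1.amazonaws.com/photoshare:${{ github.sha }}
          # Task definition revisions are integers, not git SHAs, so copy the
          # current definition, swap in the new image and register a revision.
          NEW_DEF=$(aws ecs describe-task-definition --task-definition photoshare \
            --query taskDefinition --output json \
            | jq --arg IMAGE "$IMAGE" '
                (.containerDefinitions[] | select(.name == "web") | .image) = $IMAGE
                | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
                      .compatibilities, .registeredAt, .registeredBy)')
          TASK_DEF_ARN=$(aws ecs register-task-definition --cli-input-json "$NEW_DEF" \
            --query taskDefinition.taskDefinitionArn --output text)
          aws ecs update-service \
            --cluster main --service photoshare \
            --task-definition "$TASK_DEF_ARN"
          # ECS drains old tasks only after new ones pass health checks
          aws ecs wait services-stable --cluster main --services photoshare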

why this is safe

The rollout is a rolling update at the task level: ECS starts new-version tasks, waits for them to pass the ALB /healthz check, registers them for traffic, then drains the old ones. A build that fails its health check never serves traffic; a build that passes the check but misbehaves still needs the error-budget policy in §3 to catch it. Config (DB host, Redis URL, bucket name) is injected as environment variables from Terraform outputs, the same Ansible/Terraform automation taught in S10.
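
The health-check gate can be modeled in a few lines. This is a simplified sketch of threshold-based health tracking using the `healthy_threshold = 2` and `unhealthy_threshold = 3` values from the target group in main.tf; the function name and the state model are illustrative assumptions, not the ALB's actual implementation.

```python
HEALTHY_THRESHOLD = 2     # consecutive passes before a target receives traffic
UNHEALTHY_THRESHOLD = 3   # consecutive failures before a target is removed


def track_health(probe_results, initially_healthy=False):
    """Return the target's health state after each probe (True = in service)."""
    healthy = initially_healthy
    streak = 0  # consecutive probes that disagree with the current state
    states = []
    for ok in probe_results:
        if ok == healthy:
            streak = 0
        else:
            streak += 1
            threshold = HEALTHY_THRESHOLD if ok else UNHEALTHY_THRESHOLD
            if streak >= threshold:
                healthy = ok
                streak = 0
        states.append(healthy)
    return states


new_task = track_health([True, True])                          # fresh task warming up
flaky_task = track_health([True, False, True, True])           # one failed probe resets the count
dying_task = track_health([False, False, False], initially_healthy=True)
```

With the 15-second interval from main.tf, a new task needs two passing probes before it takes traffic, and an in-service task is removed only after three consecutive failures.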

3 · Scaling & reliability

Autoscaling policy (target tracking)

The app tier uses target-tracking autoscaling: the scaler adds or removes tasks to hold average CPU near a 60% setpoint, with a floor of 2 and a ceiling of 20. CPU is the right signal here because PhotoShare's per-request work (image resize + JSON render) is CPU-bound. This is exactly the control loop in the autoscaling demo.

autoscaling.tf — target-tracking on average CPU

resource "aws_appautoscaling_target" "app" {
  service_namespace  = "ecs"
  resource_id        = "service/main/photoshare"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "track-cpu-60"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    target_value       = 60.0   # hold average CPU at 60%
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    scale_out_cooldown = 60      # add capacity quickly
    scale_in_cooldown  = 300     # remove it slowly => avoid flapping
  }
}

Sizing the band: each task saturates at about 150 req/s, so at the 60% CPU setpoint it carries about 90 req/s. The desired task count for an offered load $L$ (req/s) is therefore

$$ N \;=\; \left\lceil \frac{L}{150 \times 0.60} \right\rceil \;=\; \left\lceil \frac{L}{90} \right\rceil. $$

Steady state $L = 1{,}200$ req/s ⇒ $N = \lceil 13.3 \rceil = 14$ tasks.
That figure ignores the CDN, which absorbs most reads before they reach the origin. With a measured CDN cache-hit ratio of 92%, only ~8% of the 1,200 req/s (≈ 96 req/s) reaches the origin ⇒ $N = \lceil 96/90 \rceil = 2$ tasks at steady state.
A 9,000 req/s burst at the edge ⇒ ~720 req/s at origin (assuming the hit ratio holds during the burst) ⇒ $N = \lceil 720/90 \rceil = 8$ tasks, well inside the ceiling of 20.
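
The sizing arithmetic above can be reproduced with a short script. This is a sketch using the figures from the text (90 req/s per task at the setpoint, 92% hit ratio, floor 2, ceiling 20); the function and constant names are illustrative.

```python
import math

PER_TASK_RPS_AT_SETPOINT = 90   # the divisor in the sizing formula above
CDN_HIT_RATIO = 0.92
MIN_TASKS, MAX_TASKS = 2, 20


def origin_load(edge_rps, hit_ratio=CDN_HIT_RATIO):
    """Requests per second that miss the CDN and reach the origin."""
    # rounded to avoid floating-point noise pushing ceil() up by one
    return round(edge_rps * (1.0 - hit_ratio), 6)


def tasks_needed(origin_rps):
    """Task count for a given origin load, clamped to the autoscaling band."""
    n = math.ceil(origin_rps / PER_TASK_RPS_AT_SETPOINT)
    return max(MIN_TASKS, min(MAX_TASKS, n))


no_cdn = tasks_needed(1200)                  # every request hits the origin
steady = tasks_needed(origin_load(1200))     # 96 req/s at origin
burst = tasks_needed(origin_load(9000))      # 720 req/s at origin
```

Changing `CDN_HIT_RATIO` shows how sensitive the origin fleet is to cache effectiveness: the same burst with no CDN at all would need far more than the ceiling allows.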

The asymmetric cooldown (fast out, slow in) is deliberate: the cost of under-provisioning during a spike (dropped users) is far higher than the cost of a few extra tasks for five extra minutes.
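
The effect of the two cooldowns can be sketched as one step of a proportional scaler. This is a simplified model built on the 60% target and the 60 s / 300 s cooldowns from autoscaling.tf; it is an assumption-laden illustration and does not reproduce the actual Application Auto Scaling algorithm.

```python
import math

TARGET_CPU = 60.0
MIN_TASKS, MAX_TASKS = 2, 20
SCALE_OUT_COOLDOWN = 60    # seconds
SCALE_IN_COOLDOWN = 300    # seconds


def next_task_count(current, avg_cpu, seconds_since_last_change):
    """One step of a simplified proportional scaler with asymmetric cooldowns."""
    desired = math.ceil(current * avg_cpu / TARGET_CPU)
    desired = max(MIN_TASKS, min(MAX_TASKS, desired))
    if desired > current and seconds_since_last_change >= SCALE_OUT_COOLDOWN:
        return desired
    if desired < current and seconds_since_last_change >= SCALE_IN_COOLDOWN:
        return desired
    return current


spike = next_task_count(current=2, avg_cpu=90.0, seconds_since_last_change=60)
hold = next_task_count(current=8, avg_cpu=30.0, seconds_since_last_change=120)
shrink = next_task_count(current=8, avg_cpu=30.0, seconds_since_last_change=300)
```

The `hold` case is the anti-flapping behavior: load has dropped, but the scaler keeps the extra tasks until the longer scale-in cooldown has passed.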

SLO & error budget

The composed availability of the request path follows the series/parallel rules from the availability demo. The app tier is redundant (parallel) so it is far more available than any one task; the serial dependency chain is roughly CDN → ALB → app tier → database. With the database Multi-AZ pair as the weakest link (~99.95% effective), the realistic composed availability is ≈ 99.9%, which we adopt as the SLO.
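
The series/parallel rules can be made concrete with a small calculation. Only the database figure (99.95%) comes from the text; the CDN, ALB and single-task availabilities below are hypothetical values chosen for illustration, and the tasks are assumed to fail independently.

```python
from functools import reduce


def series(*availabilities):
    """Every component must be up: multiply the availabilities."""
    return reduce(lambda a, b: a * b, availabilities, 1.0)


def parallel(availability, replicas):
    """At least one of n independent replicas must be up."""
    return 1.0 - (1.0 - availability) ** replicas


cdn = 0.9999                             # hypothetical
alb = 0.9999                             # hypothetical
app_tier = parallel(0.995, replicas=2)   # hypothetical per-task figure, one task per AZ
database = 0.9995                        # Multi-AZ pair, from the text

composed = series(cdn, alb, app_tier, database)
```

Under these assumed inputs the chain composes to about 99.93%, and the result is dominated by the database term: improving the app tier further barely moves it, which is why the database is called the weakest link.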

A 99.9% monthly SLO converts to a concrete error budget. For a 30-day month ($43{,}200$ minutes):

$$ \text{budget} = (1 - 0.999)\times 43{,}200 \text{ min} = 0.001 \times 43{,}200 = 43.2 \text{ min/month}. $$

SLO	unavailable fraction	downtime / 30-day month	downtime / year
99.0% (two nines)	0.010	432 min	3.65 days
99.9% (three nines) — our SLO	0.001	43.2 min	8.77 hours
99.95%	0.0005	21.6 min	4.38 hours
99.99% (four nines)	0.0001	4.32 min	52.6 min

The error budget is a decision tool, the core of the SRE practice from S12. If a month burns less than 43.2 min of downtime, the team is free to ship features fast. If a single incident burns, say, 30 min, only 13.2 min remain — the team freezes risky deploys and spends the rest of the month on reliability work. Concretely, an incident's budget burn is

$$ \text{budget consumed} = \frac{\text{outage minutes} \times \text{fraction of users affected}}{43.2} \times 100\%. $$

A full outage has an affected fraction of 1. A 9-minute partial outage that degraded 50% of users consumes $\tfrac{9 \times 0.5}{43.2}\approx 10.4\%$ of the month's budget.
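
The budget arithmetic is easy to script for an incident review. This is a minimal sketch; the function names are illustrative, and weighting an outage by the fraction of users affected follows the partial-outage example above.

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime in minutes for an availability SLO over a period."""
    return (1.0 - slo) * days * 24 * 60


def budget_burn_percent(outage_minutes, affected_fraction=1.0, slo=0.999, days=30):
    """Share of the period's error budget consumed by one incident."""
    weighted_minutes = outage_minutes * affected_fraction
    return 100.0 * weighted_minutes / error_budget_minutes(slo, days)


monthly_budget = error_budget_minutes(0.999)                   # three nines, 30 days
partial_outage = budget_burn_percent(9, affected_fraction=0.5)
full_outage = budget_burn_percent(30)
remaining = monthly_budget - 30                                # after a 30-minute incident
```

Running it for the 30-minute incident from the text shows roughly 69% of the month's budget gone in one event, which is the point at which the deploy freeze applies.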

defending the budget — DDoS

A flood of requests could blow the SLO by exhausting capacity. The CDN/WAF runs a token-bucket rate limiter at the edge (explored in the DDoS demo): legitimate users stay under the refill rate and are admitted, while a volumetric attacker drains the bucket and is shed before it ever reaches the origin or burns the error budget.
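
A token bucket is small enough to show in full. The sketch below is a generic version of the algorithm with an injected clock so it runs deterministically; the capacity of 5 and refill rate of 2 tokens per second are assumed values, not the edge configuration.

```python
class TokenBucket:
    """Admit a request if a token is available; tokens refill at a fixed rate."""

    def __init__(self, capacity, refill_per_second):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.tokens = float(capacity)   # start full
        self.last = 0.0                 # time of the previous request

    def allow(self, now):
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# A legitimate client: one request every 0.5 s, which matches the refill rate.
legit_bucket = TokenBucket(capacity=5, refill_per_second=2)
legit = [legit_bucket.allow(now=i * 0.5) for i in range(10)]

# An attacker: 100 requests in the same instant.
attack_bucket = TokenBucket(capacity=5, refill_per_second=2)
attack = [attack_bucket.allow(now=0.0) for _ in range(100)]

admitted_legit = sum(legit)
admitted_attack = sum(attack)
```

The capacity sets how large a burst is tolerated and the refill rate sets the sustained limit, so the two can be tuned separately per client or per IP.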

4 · FinOps cost model

FinOps (S5–6) turns architecture into a monthly bill and then optimizes it. We model the steady-state footprint (2 app tasks average, the always-on data tier, plus per-use storage and egress), then apply the pricing-model levers from the cost demo: reserved/committed-use discounts on the always-on database, and the fact that the CDN moves most bytes off the expensive origin path.

Assumptions: 730 hours/month; eu-west-1 on-demand list prices (rounded, illustrative); 2 Fargate tasks averaged over the month (each 1 vCPU + 2 GB); 92% CDN hit ratio on 5 TB total egress; 2 TB of stored uploads.

on-demand baseline (no commitments)

component	unit	qty	$/unit	$/month
App — Fargate vCPU	vCPU-hr	1,460	0.0445	64.97
App — Fargate memory	GB-hr	2,920	0.0049	14.31
Postgres db.r6g.large (Multi-AZ; hourly rate already includes the ×2 standby)	hr	730	0.480	350.40
Postgres storage (gp3, 100 GB)	GB-mo	100	0.115	11.50
ElastiCache Redis (2 × r6g.large)	node-hr	1,460	0.226	329.96
ALB (hours + LCU)	mo	1	28.00	28.00
S3 storage (2 TB)	GB-mo	2,048	0.023	47.10
CloudFront egress (4.6 TB @ edge)	GB	4,710	0.085	400.35
Origin egress (0.4 TB, 8% miss)	GB	410	0.090	36.90
NAT gateways (2 AZ) + data	mo	1	75.00	75.00
on-demand total				1,358.49

optimized — 1-yr reserved on always-on tiers, CDN keeps egress off origin

lever	applies to	discount	on-demand	optimized
1-yr Compute Savings Plan	Fargate vCPU + mem	−40%	79.28	47.57
1-yr Reserved Instance	Postgres Multi-AZ	−42%	350.40	203.23
1-yr Reserved node	Redis (2 nodes)	−40%	329.96	197.98
CDN offload (92% hit)	egress path	already modeled	437.25	437.25
unchanged (storage, ALB, NAT)	—	0%	161.60	161.60
optimized total			1,358.49	1,047.63

Result: committing the always-on tiers (compute, database, cache) for one year cuts the bill from $1,358/mo to $1,048/mo — a 23% saving (~$3,730/yr) for zero architectural change, just a pricing-model decision. The largest remaining line is CDN egress; the biggest reliability-vs-cost lever is the Multi-AZ database, which doubles the DB compute line but is the difference between 99.9% and a single-AZ ~99.5%.
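
Both tables can be re-derived with a short script, which makes it easy to re-price the model when list prices change. This is a sketch using the rounded, illustrative quantities, prices and discounts from the tables above; the dictionary keys and function name are made up for the example.

```python
# (quantity, $ per unit) for each line of the on-demand baseline table
BASELINE = {
    "fargate_vcpu":   (1460, 0.0445),
    "fargate_memory": (2920, 0.0049),
    "postgres":       (730, 0.480),
    "postgres_gp3":   (100, 0.115),
    "redis":          (1460, 0.226),
    "alb":            (1, 28.00),
    "s3_storage":     (2048, 0.023),
    "cdn_egress":     (4710, 0.085),
    "origin_egress":  (410, 0.090),
    "nat":            (1, 75.00),
}

# 1-year commitment discounts from the optimized table
DISCOUNTS = {
    "fargate_vcpu": 0.40,
    "fargate_memory": 0.40,
    "postgres": 0.42,
    "redis": 0.40,
}


def monthly_lines(discounts=None):
    """Monthly cost per line, rounded to cents, with optional discounts applied."""
    discounts = discounts or {}
    lines = {}
    for name, (qty, unit_price) in BASELINE.items():
        cost = round(qty * unit_price, 2)
        lines[name] = round(cost * (1.0 - discounts.get(name, 0.0)), 2)
    return lines


on_demand_total = round(sum(monthly_lines().values()), 2)
optimized_total = round(sum(monthly_lines(DISCOUNTS).values()), 2)
saving_percent = 100.0 * (on_demand_total - optimized_total) / on_demand_total
annual_saving = 12 * (on_demand_total - optimized_total)
```

Setting a discount to zero or editing one unit price shows immediately which lines drive the bill; with these inputs the egress lines are untouched by any commitment.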

FinOps rule of thumb

Reserve what is always on (database, cache, baseline compute); leave bursty capacity on-demand (the autoscaled tasks above the floor); never reserve what you can let the CDN or autoscaler remove. A workload that ran 24/7 without any of these levers, the worst case in the cost demo, would cost at least the on-demand baseline above, about 30% more than the optimized bill ($1,358 vs $1,048 per month), for the same architecture.

5 · Results & trade-offs

What the design buys, and what it costs:

dimension	this design	trade-off accepted
Availability	≈ 99.9% (43 min/mo budget), survives 1 AZ loss	Multi-AZ doubles DB cost; 99.99% would need multi-region (extension A in §7)
Elasticity	2 → 20 tasks, scale-out in ~60 s	cold capacity is not instant; very spiky load benefits from a warm floor
Consistency	CP primary DB, AP cache/CDN	readers can see a few seconds of stale cached data
Cost	~$1,048/mo optimized	reservations lock in 1-yr commitment; CDN egress dominates
Operability	100% IaC, CI rolling deploys, error-budget driven	Terraform state + pipeline must themselves be maintained

The dominant tension is the classic one from the syllabus: availability and elasticity cost money, and the cloud's value is letting you dial each one to exactly the level the SLO requires — paying OPEX for redundancy and headroom only while you need them, rather than a fixed CAPEX data center sized for peak.

6 · Mapping to course learning outcomes

Each syllabus learning objective is exercised by a concrete part of this project:

Cloud architectures & service/delivery models — the 3-tier split across IaaS-style containers and PaaS data services; see §1 and the service-models demo.
Azure & AWS terminology — every component is named in both clouds in the Stack table at the top of the page.
Virtualization & containers — the multi-stage Dockerfile and stateless task design (§2), backed by the VM-vs-container and orchestration demos.
Automation technologies (IaC) — the full Terraform stack with remote state + locking, and a CI/CD pipeline (§2); ties to S10 Ansible/Terraform and the drift demo.
Architect solutions with design patterns — load balancer, autoscaling, cache-aside, CDN offload, Multi-AZ failover, token-bucket rate limiting (§1, §3).
FinOps / CAPEX→OPEX — the worked monthly cost model and reservation strategy (§4), extending the cost demo.
Reliability & SRE — the SLO, error-budget math and budget-burn policy (§3), extending the availability demo.

7 · Extensions

A. Multi-region active-passive (toward 99.99%)

To push past three nines you must remove the region itself as a single point of failure. Add a warm standby region with a cross-region database read replica that is promoted on failover, S3 cross-region replication for uploads, and DNS health-check failover at the edge:

route53.tf — health-checked DNS failover between regions

resource "aws_route53_record" "primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "photoshare.example.com"
  type           = "A"
  set_identifier = "eu-west-1"
  failover_routing_policy { type = "PRIMARY" }
  health_check_id = aws_route53_health_check.eu_west_1.id
  alias {
    name                   = aws_lb.app.dns_name
    zone_id                = aws_lb.app.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "photoshare.example.com"
  type           = "A"
  set_identifier = "eu-central-1"
  failover_routing_policy { type = "SECONDARY" }   # warm standby region
  alias {
    name                   = aws_lb.app_dr.dns_name
    zone_id                = aws_lb.app_dr.zone_id
    evaluate_target_health = true
  }
}

Trade-off: roughly doubles always-on cost and forces a CAP decision on the cross-region database (asynchronous replication ⇒ a small RPO window of writes can be lost on a hard regional failover).

B. Serverless variant (scale-to-zero)

For a low-or-bursty-traffic profile, swap the always-on app tier for FaaS: an API Gateway in front of functions, with the same Postgres/S3 behind. Capacity becomes per-request and the bill scales to zero when idle — at the price of cold starts (S7, and the serverless demo).

handler.py — AWS Lambda thumbnail generator (event-driven, S3 trigger)

import os
from io import BytesIO
from urllib.parse import unquote_plus

import boto3
from PIL import Image

s3 = boto3.client("s3")
# must be a different bucket from the uploads bucket, or the function re-triggers itself
THUMB_BUCKET = os.environ["THUMB_BUCKET"]

def handler(event, _context):
    # triggered by an S3 ObjectCreated event on the uploads bucket
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        # object keys arrive URL-encoded in S3 event notifications
        key        = unquote_plus(record["s3"]["object"]["key"])

        obj = s3.get_object(Bucket=src_bucket, Key=key)
        img = Image.open(BytesIO(obj["Body"].read()))
        img.thumbnail((256, 256))      # resizes in place, keeps aspect ratio
        img = img.convert("RGB")       # JPEG has no alpha channel; PNG uploads may

        buf = BytesIO()
        img.save(buf, format="JPEG", quality=82)
        buf.seek(0)
        s3.put_object(Bucket=THUMB_BUCKET, Key=f"thumb/{key}",
                      Body=buf, ContentType="image/jpeg")
    return {"thumbnails": len(event["Records"])}

Decision rule: FaaS wins below a crossover utilization where you'd otherwise keep servers idle; the always-on container tier wins for sustained high traffic where per-request pricing overtakes a reserved fleet. The cost demo and serverless demo together let you find that crossover for a given load.
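
The crossover can be estimated with one division. The sketch below compares the two on-demand Fargate tasks from §4 ($79.28/month) against a per-request FaaS cost. The function profile (100 ms at 0.5 GB) is hypothetical, the two unit prices are illustrative assumptions rather than a quote, and API Gateway charges are left out.

```python
HOURS_PER_MONTH = 730


def faas_cost_per_request(duration_s, memory_gb, price_per_gb_second, price_per_request):
    """Compute charge for one invocation plus the flat per-request charge."""
    return duration_s * memory_gb * price_per_gb_second + price_per_request


def crossover_requests_per_month(always_on_monthly, per_request_cost):
    """Monthly request volume at which FaaS costs the same as the always-on tier."""
    return always_on_monthly / per_request_cost


always_on = 79.28   # two on-demand Fargate tasks, from the baseline table
per_request = faas_cost_per_request(
    duration_s=0.100,                     # hypothetical function profile
    memory_gb=0.5,                        # hypothetical function profile
    price_per_gb_second=0.0000166667,     # illustrative price, not a quote
    price_per_request=0.20 / 1_000_000,   # illustrative price, not a quote
)

crossover = crossover_requests_per_month(always_on, per_request)
crossover_rps = crossover / (HOURS_PER_MONTH * 3600)
```

Under these assumed numbers the break-even sits at a sustained origin load of a few tens of requests per second; below that the function is cheaper, above it the container floor wins. The result shifts with function duration and memory, so it should be recomputed for a measured profile.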

8 · References

J.R. Storment & Mike Fuller (2023). Cloud FinOps, 2nd ed. O'Reilly. ISBN 9781492098355.
Kief Morris (2020). Infrastructure as Code, 2nd ed. O'Reilly. ISBN 9781098114671.
Charity Majors, Liz Fong-Jones & George Miranda (2022). Observability Engineering. O'Reilly. ISBN 9781492076445.
Thomas Erl & Eric Barceló Monroy (2023). Cloud Computing: Concepts, Technology, Security, and Architecture, 2nd ed. Pearson. ISBN 9780138052256.
Jeroen Mulder (2023). Multi-Cloud Strategy for Cloud Architects, 2nd ed. Packt. ISBN 9781804616734.
Beyer, Jones, Petoff & Murphy (eds., 2016). Site Reliability Engineering. Google / O'Reilly — SLOs & error budgets.
AWS Well-Architected Framework; HashiCorp Terraform & AWS provider documentation.

Prices are rounded, region-indicative figures for teaching the cost model, not a live quote. Cloud list prices change frequently — always re-price against the provider's calculator before committing.