cloud-lab worked example · 3-tier web app · IaC · SRE · FinOps

Worked example — design & "deploy" a scalable, reliable 3-tier web app on the cloud

This is the end-to-end project that runs as the common thread through the course: a group takes a single web application from a blank cloud account to a load-balanced, autoscaling, observable production deployment. Below is one fully worked instance of that project — the same decisions, code and numbers a team would produce — grounded in the concepts each session teaches. Every interactive demo in the lab corresponds to one decision made here.

Goal

Run "PhotoShare", a read-heavy image-sharing web app, on a public cloud so it (a) survives the loss of any single instance or availability zone, (b) absorbs a 10× traffic spike automatically, and (c) costs as little as possible at that reliability target. We target a 99.9% monthly availability SLO and a steady state of ~1,200 requests/second with bursts to ~9,000 req/s (7.5× steady state, inside the 10× design target).

Stack (AWS naming; Azure equivalents in parentheses)

edge / CDN: CloudFront (Azure: Front Door)
load balancer: ALB (Azure: App Gateway)
app tier: ECS Fargate (Azure: Container Apps)
database: RDS Postgres (Azure: Azure DB)
cache: ElastiCache (Azure: Cache for Redis)
object store: S3 (Azure: Blob Storage)
IaC: Terraform (also: OpenTofu)
delivery model: IaaS + PaaS
compute: containers

Course sessions exercised

1 · Requirements & architecture

Functional & non-functional requirements

Reference architecture (request flow, top → bottom)

Each tier is horizontally redundant across two AZs. Solid downward flow is the read/write request path; the cache and object store sit beside the app tier.

clients → edge
  Browser / mobile
  CloudFront CDN: static assets + image cache, TLS, WAF, rate limit
▼   cache-miss / dynamic requests
L7 load balancer (public subnets, 2 AZs)
  Application Load Balancer: health checks · TLS termination · least-outstanding-requests
▼   round-robin / least-conn across healthy targets
app tier: stateless containers (private subnets, 2 AZs, autoscaled)
  app task (AZ-a)
  app task (AZ-a)
  app task (AZ-b)
  app task (AZ-b)
  … N tasks (2 → 20)
▼   reads hit cache first; writes + cold reads hit DB
managed data tier (private subnets, 2 AZs)
  Redis cache: sessions + hot reads
  Postgres primary (AZ-a): writes
  Postgres standby (AZ-b): sync replica, auto-failover
  S3 object store: user uploads · 11 nines durability

The dotted boundary between "you manage" and "provider manages" lands differently per tier: the app tier is closest to IaaS/containers (you own the image and scaling policy), while Postgres, Redis and S3 are PaaS — the provider patches, replicates and backs them up. See the service-models demo for where that line falls.

CAP decision
The relational primary is the system of record and is run CP: under a partition the standby will not accept writes until failover completes, so we never split-brain the ledger of who owns which photo. The Redis read cache and the CDN are run AP: a stale thumbnail for a few seconds is acceptable. This is the same trade-off explored in the CAP demo.
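The AP half of this decision is commonly implemented as cache-aside with a short TTL. The sketch below is illustrative only: an in-memory dict stands in for Redis, `load_from_db` is a hypothetical callback for the read against the primary, and the 5-second TTL is an assumed staleness bound rather than a figure from the design.

```python
import time
from typing import Callable, Dict, Tuple


class CacheAside:
    """Cache-aside read path: try the cache, fall back to the system of record."""

    def __init__(self, load_from_db: Callable[[str], str], ttl_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db      # hypothetical read against the primary DB
        self._ttl_s = ttl_s            # assumed upper bound on staleness
        self._clock = clock
        self._store: Dict[str, Tuple[float, str]] = {}  # key -> (expiry, value)

    def get(self, key: str) -> str:
        now = self._clock()
        hit = self._store.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]              # cache hit: may be up to ttl_s stale
        value = self._load(key)        # miss or expired entry: read the primary
        self._store[key] = (now + self._ttl_s, value)
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read goes back to the primary."""
        self._store.pop(key, None)
```

Readers can see a value that is at most one TTL old, which is the "stale thumbnail for a few seconds" the text accepts; writes always go to the CP primary first and then invalidate the key.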

2 · Infrastructure as Code, container & deploy

The whole environment is declarative: one terraform apply creates the network, load balancer, container service, database and cache. Nothing is clicked in a console, so the environment is reproducible and drift is detectable (see the IaC drift demo).

main.tf — network, load balancer & autoscaling app service (Terraform / AWS)
terraform {
  required_version = ">= 1.6.0"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.40" }
  }
  backend "s3" {                 # remote state => team-safe, lockable
    bucket = "photoshare-tfstate"
    key    = "prod/terraform.tfstate"
    region = "eu-west-1"
    dynamodb_table = "tf-locks"  # state locking prevents concurrent applies
  }
}

provider "aws" { region = "eu-west-1" }

# --- networking: one VPC, public + private subnets in 2 AZs ----------
module "vpc" {
  source             = "terraform-aws-modules/vpc/aws"
  version            = "~> 5.8"
  name               = "photoshare"
  cidr               = "10.0.0.0/16"
  azs                = ["eu-west-1a", "eu-west-1b"]
  public_subnets     = ["10.0.0.0/24",  "10.0.1.0/24"]   # ALB lives here
  private_subnets    = ["10.0.10.0/24", "10.0.11.0/24"]  # app + data here
  enable_nat_gateway = true
  single_nat_gateway = false   # one NAT per AZ => no cross-AZ SPOF
}

# --- L7 load balancer across both public subnets --------------------
resource "aws_lb" "app" {
  name               = "photoshare-alb"
  load_balancer_type = "application"
  subnets            = module.vpc.public_subnets
  security_groups    = [aws_security_group.alb.id]
}

resource "aws_lb_target_group" "app" {
  name        = "photoshare-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = module.vpc.vpc_id
  target_type = "ip"
  health_check {
    path                = "/healthz"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
    timeout             = 5
  }
}

# --- stateless container service (ECS Fargate) ----------------------
resource "aws_ecs_service" "app" {
  name            = "photoshare"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 4                 # baseline; autoscaler overrides
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = module.vpc.private_subnets
    security_groups = [aws_security_group.app.id]
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "web"
    container_port   = 8080
  }
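  # Fargate spreads tasks across the AZs of the listed subnets on a best-effort
  # basis, so one AZ loss should cost about half the capacity, not all of it.
  # (placement constraints and strategies apply only to the EC2 launch type)
  lifecycle {
    ignore_changes = [desired_count]   # let the autoscaler own the live count
  }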
}

# --- managed data tier: Multi-AZ Postgres + Redis + S3 --------------
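resource "aws_db_instance" "pg" {
  identifier                  = "photoshare-pg"
  engine                      = "postgres"
  engine_version              = "16.3"
  instance_class              = "db.r6g.large"
  allocated_storage           = 100
  storage_type                = "gp3"         # matches the gp3 line in the cost model
  username                    = "photoshare"  # assumed name; a master user is required
  manage_master_user_password = true          # password generated into Secrets Manager
  multi_az                    = true          # synchronous standby in AZ-b
  storage_encrypted           = true
  backup_retention_period     = 7
  deletion_protection         = true
  # db_subnet_group_name and vpc_security_group_ids are omitted from this excerpt
}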

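resource "aws_elasticache_replication_group" "redis" {
  replication_group_id       = "photoshare-redis"
  description                = "PhotoShare sessions and hot-read cache"  # required argument
  node_type                  = "cache.r6g.large"
  num_cache_clusters         = 2       # primary + replica
  automatic_failover_enabled = true
  multi_az_enabled           = true    # place primary and replica in different AZs
}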

resource "aws_s3_bucket" "uploads" {
  bucket = "photoshare-user-uploads"
}
resource "aws_s3_bucket_versioning" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  versioning_configuration { status = "Enabled" }
}

The app itself is packaged as a small, layer-cached container image. A multi-stage build keeps the runtime image tiny (no compiler, no build deps) — the density win you measured in the VM-vs-container demo.

Dockerfile — multi-stage build, non-root, minimal runtime
# ---- build stage: has the toolchain, thrown away afterwards --------
FROM node:20-bookworm-slim AS build
WORKDIR /app
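COPY package*.json ./
# full install: the build step needs devDependencies (compiler, bundler)
RUN npm ci
COPY . .
# build, then drop devDependencies so only runtime modules are copied below
RUN npm run build && npm prune --omit=dev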

# ---- runtime stage: only the artifacts + node runtime -------------
FROM gcr.io/distroless/nodejs20-debian12 AS runtime
WORKDIR /app
COPY --from=build /app/dist  ./dist
COPY --from=build /app/node_modules ./node_modules
ENV NODE_ENV=production PORT=8080
USER 1000:1000          # never run as root
EXPOSE 8080
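# Used by plain Docker only: ECS ignores an image-level HEALTHCHECK and relies on
# the ALB target-group check (or a healthCheck set in the task definition).
HEALTHCHECK --interval=15s --timeout=3s \
  CMD ["/nodejs/bin/node", "dist/healthcheck.js"]
CMD ["dist/server.js"]
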
.github/workflows/deploy.yml — CI build + push + rolling deploy
name: deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write       # OIDC: no long-lived AWS keys in the repo
  contents: read

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gh-deploy
          aws-region: eu-west-1

      - name: Build, tag & push image
        run: |
          REG=123456789012.dkr.ecr.eu-west-1.amazonaws.com
          aws ecr get-login-password | docker login --username AWS --password-stdin $REG
          docker build -t $REG/photoshare:${{ github.sha }} .
          docker push $REG/photoshare:${{ github.sha }}

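      - name: Roll out new task definition
        run: |
          IMAGE=123456789012.dkr.ecr.eu-west-1.amazonaws.com/photoshare:${{ github.sha }}
          # Revisions are family:revision integers, not git SHAs: register a new
          # revision with the new image (assumes "web" is the first container).
          NEW_DEF=$(aws ecs describe-task-definition --task-definition photoshare \
            --query taskDefinition | jq --arg IMG "$IMAGE" '.containerDefinitions[0].image = $IMG
              | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
                    .compatibilities, .registeredAt, .registeredBy)')
          TASK_DEF_ARN=$(aws ecs register-task-definition --cli-input-json "$NEW_DEF" \
            --query taskDefinition.taskDefinitionArn --output text)
          aws ecs update-service \
            --cluster main --service photoshare \
            --task-definition "$TASK_DEF_ARN"
          # ECS drains old tasks only after new ones pass health checks
          aws ecs wait services-stable --cluster main --services photoshare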

why this is safe
The rollout is a rolling update at the task level: ECS starts new-version tasks, waits for them to pass the ALB /healthz check, registers them for traffic, then drains the old ones. A build that fails its health check never serves traffic, although /healthz only catches the failures it actually tests for. Config (DB host, Redis URL, bucket name) is injected as environment variables from Terraform outputs, the same Ansible/Terraform automation taught in S10.

3 · Scaling & reliability

Autoscaling policy (target tracking)

The app tier uses target-tracking autoscaling: the scaler adds or removes tasks to hold average CPU near a 60% setpoint, with a floor of 2 and a ceiling of 20. CPU is the right signal here because PhotoShare's per-request work (image resize + JSON render) is CPU-bound. This is exactly the control loop in the autoscaling demo.

autoscaling.tf — target-tracking on average CPU
resource "aws_appautoscaling_target" "app" {
  service_namespace  = "ecs"
  resource_id        = "service/main/photoshare"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "track-cpu-60"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    target_value       = 60.0   # hold average CPU at 60%
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    scale_out_cooldown = 60      # add capacity quickly
    scale_in_cooldown  = 300     # remove it slowly => avoid flapping
  }
}

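Sizing the band: each task saturates at about 150 req/s, and the scaler holds it at 60% CPU, so at the setpoint a task carries about 90 req/s (assuming CPU grows roughly linearly with request rate). The desired task count for an offered load $L$ (req/s reaching the app tier, i.e. after the CDN has absorbed its cache hits) is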

$$ N \;=\; \left\lceil \frac{L}{150 \times 0.60} \right\rceil \;=\; \left\lceil \frac{L}{90} \right\rceil. $$
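As a quick check of the sizing formula, the sketch below evaluates it for the steady-state and burst loads from the Goal section. It rests on an assumption that is not stated in the text: the 92% CDN hit ratio from the cost model (given there for egress bytes) is applied to request counts as well. The helper names are illustrative.

```python
import math

PER_TASK_RPS = 90.0           # 150 req/s x 0.60 CPU setpoint
MIN_TASKS, MAX_TASKS = 2, 20  # autoscaling floor and ceiling


def desired_tasks(load_rps: float) -> int:
    """N = ceil(L / 90), clamped to the autoscaling band."""
    n = math.ceil(load_rps / PER_TASK_RPS)
    return max(MIN_TASKS, min(MAX_TASKS, n))


def origin_rps(edge_rps: float, cdn_hit_ratio: float) -> float:
    """Requests that miss the CDN and reach the app tier (assumed model)."""
    return edge_rps * (1.0 - cdn_hit_ratio)


for edge in (1_200, 9_000):
    with_cdn = desired_tasks(origin_rps(edge, 0.92))
    without_cdn = desired_tasks(edge)
    print(f"{edge} req/s at the edge: {with_cdn} tasks with CDN, {without_cdn} without")
```

Under these assumptions the 1,200 req/s steady state reaches the origin as about 96 req/s and fits the 2-task floor, and a 9,000 req/s burst needs 8 tasks. Without the CDN the steady state alone would need 14 tasks and the burst would pin the service at its ceiling of 20.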

The asymmetric cooldown (fast out, slow in) is deliberate: the cost of under-provisioning during a spike (dropped users) is far higher than the cost of a few extra tasks for five extra minutes.
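The effect of the asymmetric cooldown can be seen in a toy control loop. This is a simplified proportional rule with a single shared cooldown timer, written for illustration; it is not the provider's actual target-tracking algorithm, and the class and field names are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Scaler:
    target_cpu: float = 60.0
    min_tasks: int = 2
    max_tasks: int = 20
    scale_out_cooldown_s: int = 60    # add capacity quickly
    scale_in_cooldown_s: int = 300    # remove it slowly
    tasks: int = 2
    last_change_s: float = float("-inf")

    def step(self, now_s: float, avg_cpu: float) -> int:
        """One evaluation: move the task count toward the CPU setpoint."""
        want = math.ceil(self.tasks * avg_cpu / self.target_cpu)
        want = max(self.min_tasks, min(self.max_tasks, want))
        elapsed = now_s - self.last_change_s
        if want > self.tasks and elapsed >= self.scale_out_cooldown_s:
            self.tasks, self.last_change_s = want, now_s
        elif want < self.tasks and elapsed >= self.scale_in_cooldown_s:
            self.tasks, self.last_change_s = want, now_s
        return self.tasks
```

Feeding it a spike (90% CPU) followed by a lull (30% CPU) shows the intended shape: capacity is added on the first evaluation after each 60-second window, while the scale-in is held back until 300 seconds have passed since the last change.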

SLO & error budget

The composed availability of the request path follows the series/parallel rules from the availability demo. The app tier is redundant (parallel) so it is far more available than any one task; the serial dependency chain is roughly CDN → ALB → app tier → database. With the database Multi-AZ pair as the weakest link (~99.95% effective), the realistic composed availability is ≈ 99.9%, which we adopt as the SLO.
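The series/parallel composition can be sketched in a few lines. Only the ~99.95% database figure comes from the text; the CDN, ALB and single-task availabilities below are hypothetical placeholders, and the parallel rule assumes replica failures are independent, which real AZ-level outages only approximate.

```python
from math import prod
from typing import Iterable


def series(avails: Iterable[float]) -> float:
    """Every component in the chain must be up: multiply availabilities."""
    return prod(avails)


def parallel(avail: float, n: int) -> float:
    """At least one of n independent replicas must be up."""
    return 1.0 - (1.0 - avail) ** n


app_tier = parallel(0.99, n=4)   # hypothetical 99% per task, 4 tasks
composed = series([
    0.9999,      # CDN (hypothetical)
    0.9999,      # ALB (hypothetical)
    app_tier,
    0.9995,      # Multi-AZ database pair, from the text
])
print(f"app tier {app_tier:.8f}, composed {composed:.4%}")
```

With these placeholder inputs the chain composes to roughly 99.93%, which is why 99.9% is a defensible SLO and 99.95% is not: the serial dependencies, not the redundant app tier, set the ceiling.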

A 99.9% monthly SLO converts to a concrete error budget. For a 30-day month ($43{,}200$ minutes):

$$ \text{budget} = (1 - 0.999)\times 43{,}200 \text{ min} = 0.001 \times 43{,}200 = 43.2 \text{ min/month}. $$

| SLO | unavailable fraction | downtime / 30-day month | downtime / year |
| --- | --- | --- | --- |
| 99.0% (two nines) | 0.010 | 432 min | 3.65 days |
| 99.9% (three nines), our SLO | 0.001 | 43.2 min | 8.77 hours |
| 99.95% | 0.0005 | 21.6 min | 4.38 hours |
| 99.99% (four nines) | 0.0001 | 4.32 min | 52.6 min |
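The table rows can be regenerated from the SLO alone. The sketch below assumes an average year of 365.25 days for the yearly column, which is the convention that reproduces the 8.77-hour figure (a 365-day year gives 8.76).

```python
from typing import Tuple

MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60   # 43,200
MINUTES_PER_YEAR = 365.25 * 24 * 60       # assumed average year


def downtime_budget(slo: float) -> Tuple[float, float]:
    """Return allowed downtime as (minutes per 30-day month, minutes per year)."""
    unavailable = 1.0 - slo
    return (unavailable * MINUTES_PER_30_DAY_MONTH,
            unavailable * MINUTES_PER_YEAR)


for slo in (0.99, 0.999, 0.9995, 0.9999):
    per_month, per_year = downtime_budget(slo)
    print(f"{slo:.2%}: {per_month:.2f} min/month, {per_year / 60:.2f} h/year")
```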

The error budget is a decision tool, the core of the SRE practice from S12. If a month burns less than 43.2 min of downtime, the team is free to ship features fast. If a single incident burns, say, 30 min, only 13.2 min remain — the team freezes risky deploys and spends the rest of the month on reliability work. Concretely, an incident's budget burn is

$$ \text{budget consumed} = \frac{\text{outage minutes}}{43.2} \times 100\%. $$

Partial outages are weighted by the fraction of users affected: a 9-minute outage that degraded 50% of users consumes $\tfrac{9 \times 0.5}{43.2}\approx 10.4\%$ of the month's budget.
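The burn arithmetic is simple enough to keep next to the incident log. A minimal sketch, with hypothetical function names, that weights each outage by the fraction of users affected:

```python
from typing import Iterable, Tuple

BUDGET_MIN = 43.2   # 99.9% SLO over a 30-day month


def budget_burn_pct(outage_min: float, affected_fraction: float = 1.0,
                    budget_min: float = BUDGET_MIN) -> float:
    """Share of the monthly error budget consumed by one incident, in percent."""
    return outage_min * affected_fraction / budget_min * 100.0


def remaining_min(incidents: Iterable[Tuple[float, float]],
                  budget_min: float = BUDGET_MIN) -> float:
    """Budget minutes left after a list of (outage_min, affected_fraction)."""
    used = sum(minutes * fraction for minutes, fraction in incidents)
    return budget_min - used


print(f"{budget_burn_pct(9, 0.5):.1f}% burned")            # the partial outage above
print(f"{remaining_min([(30, 1.0)]):.1f} min remaining")   # after a full 30-min outage
```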

defending the budget — DDoS
A flood of requests could blow the SLO by exhausting capacity. The CDN/WAF runs a token-bucket rate limiter at the edge (explored in the DDoS demo): legitimate users stay under the refill rate and are admitted, while a volumetric attacker drains the bucket and is shed before it ever reaches the origin or burns the error budget.
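A token bucket is small enough to sketch directly. The version below is a generic illustration of the technique, not the configuration of any particular WAF product; the refill rate and burst size are made-up values, and time is passed in explicitly so the behaviour is easy to test.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    rate_per_s: float                    # refill rate: sustained requests/second
    capacity: float                      # bucket size: largest tolerated burst
    tokens: float = field(init=False)
    last_s: float = 0.0

    def __post_init__(self) -> None:
        self.tokens = self.capacity      # start full

    def allow(self, now_s: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now_s - self.last_s)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_s)
        self.last_s = now_s
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True                  # admitted
        return False                     # shed at the edge


# A client at 5 req/s stays under a 10 req/s refill rate and is always admitted;
# a client at 1,000 req/s drains the bucket and is mostly shed.
legit = TokenBucket(rate_per_s=10, capacity=5)
flood = TokenBucket(rate_per_s=10, capacity=5)
legit_ok = sum(legit.allow(i * 0.2) for i in range(20))
flood_ok = sum(flood.allow(i * 0.001) for i in range(1000))
print(legit_ok, flood_ok)
```

In practice one bucket is kept per client key (for example per source IP), so a single abusive client exhausts only its own allowance.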

4 · FinOps cost model

FinOps (S5–6) turns architecture into a monthly bill and then optimizes it. We model the steady-state footprint (2 app tasks average, the always-on data tier, plus per-use storage and egress), then apply the pricing-model levers from the cost demo: reserved/committed-use discounts on the always-on database, and the fact that the CDN moves most bytes off the expensive origin path.

Assumptions: 730 hours/month; eu-west-1 on-demand list prices (rounded, illustrative); 2 Fargate tasks averaged over the month (each 1 vCPU + 2 GB); 92% CDN hit ratio on 5 TB total egress; 2 TB of stored uploads.

on-demand baseline (no commitments)

| component | unit | qty | USD/unit | USD/month |
| --- | --- | --- | --- | --- |
| App, Fargate vCPU | vCPU-hr | 1,460 | 0.0445 | 64.97 |
| App, Fargate memory | GB-hr | 2,920 | 0.0049 | 14.31 |
| Postgres db.r6g.large (Multi-AZ ⇒ ×2) | hr | 730 | 0.480 | 350.40 |
| Postgres storage (gp3, 100 GB) | GB-mo | 100 | 0.115 | 11.50 |
| ElastiCache Redis (2 × r6g.large) | node-hr | 1,460 | 0.226 | 329.96 |
| ALB (hours + LCU) | mo | 1 | 28.00 | 28.00 |
| S3 storage (2 TB) | GB-mo | 2,048 | 0.023 | 47.10 |
| CloudFront egress (4.6 TB @ edge) | GB | 4,710 | 0.085 | 400.35 |
| Origin egress (0.4 TB, 8% miss) | GB | 410 | 0.090 | 36.90 |
| NAT gateways (2 AZ) + data | mo | 1 | 75.00 | 75.00 |
| on-demand total | | | | 1,358.49 |

optimized: 1-yr reserved on always-on tiers, CDN keeps egress off origin

| lever | applies to | discount | on-demand (USD/month) | optimized (USD/month) |
| --- | --- | --- | --- | --- |
| 1-yr Compute Savings Plan | Fargate vCPU + mem | −40% | 79.28 | 47.57 |
| 1-yr Reserved Instance | Postgres Multi-AZ | −42% | 350.40 | 203.23 |
| 1-yr Reserved node | Redis (2 nodes) | −40% | 329.96 | 197.98 |
| CDN offload (92% hit) | egress path | already modeled | 437.25 | 437.25 |
| unchanged (storage, ALB, NAT) | | 0% | 161.60 | 161.60 |
| optimized total | | | 1,358.49 | 1,047.63 |

Result: committing the always-on tiers (compute, database, cache) for one year cuts the bill from $1,358/mo to $1,048/mo — a 23% saving (~$3,730/yr) for zero architectural change, just a pricing-model decision. The largest remaining line is CDN egress; the biggest reliability-vs-cost lever is the Multi-AZ database, which doubles the DB compute line but is the difference between 99.9% and a single-AZ ~99.5%.
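The two totals can be reproduced from the quantities and unit prices in the tables. The sketch below uses the same rounded, illustrative prices and discount rates; the component keys are hypothetical labels. It sums unrounded line items and rounds only the totals, which is what keeps the result within a cent of the tabulated figures.

```python
HOURS = 730

# (component, quantity, unit price in USD): rounded, illustrative figures
ON_DEMAND = [
    ("fargate_vcpu",      2 * 1 * HOURS, 0.0445),   # 2 tasks x 1 vCPU
    ("fargate_memory",    2 * 2 * HOURS, 0.0049),   # 2 tasks x 2 GB
    ("postgres_multi_az", HOURS,         0.480),
    ("postgres_storage",  100,           0.115),
    ("redis_nodes",       2 * HOURS,     0.226),
    ("alb",               1,             28.00),
    ("s3_storage",        2048,          0.023),
    ("cloudfront_egress", 4710,          0.085),
    ("origin_egress",     410,           0.090),
    ("nat_gateways",      1,             75.00),
]

# 1-year commitment discounts on the always-on tiers
DISCOUNTS = {
    "fargate_vcpu": 0.40,
    "fargate_memory": 0.40,
    "postgres_multi_az": 0.42,
    "redis_nodes": 0.40,
}


def monthly_total(discounts=None) -> float:
    """Sum qty x price per line, applying any per-component discount."""
    discounts = discounts or {}
    return sum(qty * price * (1.0 - discounts.get(name, 0.0))
               for name, qty, price in ON_DEMAND)


on_demand = monthly_total()
optimized = monthly_total(DISCOUNTS)
saving = on_demand - optimized
print(f"on-demand {on_demand:,.2f}  optimized {optimized:,.2f}  "
      f"saving {saving / on_demand:.0%} ({saving * 12:,.0f}/yr)")
```

Changing one entry (for example dropping the Postgres line to a single-AZ rate) shows immediately how much of the bill each reliability decision carries.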

FinOps rule of thumb
Reserve what is always on (database, cache, baseline compute); leave bursty capacity on-demand (the autoscaled tasks above the floor); never reserve what you can let the CDN or autoscaler remove. A workload that ran 24/7 always-on without any of these levers — the worst case in the cost demo — would cost meaningfully more for the same architecture.

5 · Results & trade-offs

What the design buys, and what it costs:

| dimension | this design | trade-off accepted |
| --- | --- | --- |
| Availability | ≈ 99.9% (43 min/mo budget), survives 1 AZ loss | Multi-AZ doubles DB cost; 99.99% would need multi-region (S8 extension) |
| Elasticity | 2 → 20 tasks, scale-out in ~60 s | cold capacity is not instant; very spiky load benefits from a warm floor |
| Consistency | CP primary DB, AP cache/CDN | readers can see a few seconds of stale cached data |
| Cost | ~$1,048/mo optimized | reservations lock in 1-yr commitment; CDN egress dominates |
| Operability | 100% IaC, CI rolling deploys, error-budget driven | Terraform state + pipeline must themselves be maintained |

The dominant tension is the classic one from the syllabus: availability and elasticity cost money, and the cloud's value is letting you dial each one to exactly the level the SLO requires — paying OPEX for redundancy and headroom only while you need them, rather than a fixed CAPEX data center sized for peak.

6 · Mapping to course learning outcomes

Each syllabus learning objective is exercised by a concrete part of this project:

  • Cloud architectures & service/delivery models — the 3-tier split across IaaS-style containers and PaaS data services; see §1 and the service-models demo.
  • Azure & AWS terminology: every component is named in both clouds in the Stack table that opens this page.
  • Virtualization & containers — the multi-stage Dockerfile and stateless task design (§2), backed by the VM-vs-container and orchestration demos.
  • Automation technologies (IaC) — the full Terraform stack with remote state + locking, and a CI/CD pipeline (§2); ties to S10 Ansible/Terraform and the drift demo.
  • Architect solutions with design patterns — load balancer, autoscaling, cache-aside, CDN offload, Multi-AZ failover, token-bucket rate limiting (§1, §3).
  • FinOps / CAPEX→OPEX — the worked monthly cost model and reservation strategy (§4), extending the cost demo.
  • Reliability & SRE — the SLO, error-budget math and budget-burn policy (§3), extending the availability demo.

7 · Extensions

A. Multi-region active-passive (toward 99.99%)

To push past three nines you must remove the region itself as a single point of failure. Add a warm standby region with the database cross-region read-replica promotable on failover, S3 cross-region replication for uploads, and DNS health-check failover at the edge:

route53.tf: health-checked failover routing between regions (PRIMARY / SECONDARY)
resource "aws_route53_record" "primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "photoshare.example.com"
  type           = "A"
  set_identifier = "eu-west-1"
  failover_routing_policy { type = "PRIMARY" }
  health_check_id = aws_route53_health_check.eu_west_1.id
  alias {
    name                   = aws_lb.app.dns_name
    zone_id                = aws_lb.app.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "photoshare.example.com"
  type           = "A"
  set_identifier = "eu-central-1"
  failover_routing_policy { type = "SECONDARY" }   # warm standby region
  alias {
    name                   = aws_lb.app_dr.dns_name
    zone_id                = aws_lb.app_dr.zone_id
    evaluate_target_health = true
  }
}

Trade-off: roughly doubles always-on cost and forces a CAP decision on the cross-region database (asynchronous replication ⇒ a small RPO window of writes can be lost on a hard regional failover).

B. Serverless variant (scale-to-zero)

For a low-or-bursty-traffic profile, swap the always-on app tier for FaaS: an API Gateway in front of functions, with the same Postgres/S3 behind. Capacity becomes per-request and the bill scales to zero when idle — at the price of cold starts (S7, and the serverless demo).

handler.py — AWS Lambda thumbnail generator (event-driven, S3 trigger)
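import os
from io import BytesIO
from urllib.parse import unquote_plus

import boto3
from PIL import Image

s3 = boto3.client("s3")
# Keep thumbnails in a separate bucket: writing back to the uploads bucket
# would re-trigger this function on its own output.
THUMB_BUCKET = os.environ["THUMB_BUCKET"]

def handler(event, _context):
    # triggered by an S3 ObjectCreated event on the uploads bucket
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        # object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record["s3"]["object"]["key"])

        obj = s3.get_object(Bucket=src_bucket, Key=key)
        img = Image.open(BytesIO(obj["Body"].read()))
        img = img.convert("RGB")      # JPEG cannot store alpha or palette modes
        img.thumbnail((256, 256))     # resizes in place, keeps aspect ratio

        buf = BytesIO()
        img.save(buf, format="JPEG", quality=82)
        buf.seek(0)
        s3.put_object(Bucket=THUMB_BUCKET, Key=f"thumb/{key}",
                      Body=buf, ContentType="image/jpeg")
    return {"thumbnails": len(event["Records"])}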

Decision rule: FaaS wins below a crossover utilization where you'd otherwise keep servers idle; the always-on container tier wins for sustained high traffic where per-request pricing overtakes a reserved fleet. The cost demo and serverless demo together let you find that crossover for a given load.

8 · References

Prices are rounded, region-indicative figures for teaching the cost model, not a live quote. Cloud list prices change frequently — always re-price against the provider's calculator before committing.