Worked example — design & "deploy" a scalable, reliable 3-tier web app on the cloud
This is the end-to-end project that runs as the common thread through the course: a group takes a single web application from a blank cloud account to a load-balanced, autoscaling, observable production deployment. Below is one fully worked instance of that project — the same decisions, code and numbers a team would produce — grounded in the concepts each session teaches. Every interactive demo in the lab corresponds to one decision made here.
Goal
Run "PhotoShare", a read-heavy image-sharing web app, on a public cloud so it (a) survives the loss of any single instance or availability zone, (b) absorbs a 10× traffic spike automatically, and (c) costs as little as possible at that reliability target. We target a 99.9% monthly availability SLO and a steady state of ~1,200 requests/second with bursts to ~9,000 req/s.
Stack (AWS naming; Azure equivalents in parentheses)
Course sessions exercised
On this page: requirements & architecture · IaC, container & deploy · scaling & reliability · FinOps cost model · results & trade-offs · learning-outcome map · extensions · references
1 · Requirements & architecture
Functional & non-functional requirements
- Stateless app tier. No request may depend on a specific instance, so any node can serve any user and the autoscaler can add/remove nodes freely. Session state lives in the cache, not in memory.
- No single point of failure. Every tier spans at least two Availability Zones (AZs); the database runs Multi-AZ with a standby replica.
- Elasticity. Capacity tracks load within a target utilization band (see autoscaling demo).
- Durable user uploads. Images go to object storage (11 nines of durability), never to the local disk of an ephemeral container.
- 99.9% monthly availability SLO ⇒ ≤ 43.2 min downtime / 30-day month.
Reference architecture (request flow, top → bottom)
Each tier is horizontally redundant across two AZs. Solid downward flow is the read/write request path; the cache and object store sit beside the app tier.
The dotted boundary between "you manage" and "provider manages" lands differently per tier: the app tier is closest to IaaS/containers (you own the image and scaling policy), while Postgres, Redis and S3 are PaaS — the provider patches, replicates and backs them up. See the service-models demo for where that line falls.
2 · Infrastructure as Code, container & deploy
The whole environment is declarative: one terraform apply creates the network, load balancer, container service, database and cache. Nothing is clicked in a console, so the environment is reproducible and drift is detectable (see the IaC drift demo).
terraform {
  required_version = ">= 1.6.0"
  required_providers {
    aws = { source = "hashicorp/aws", version = "~> 5.40" }
  }
  backend "s3" { # remote state => team-safe, lockable
    bucket         = "photoshare-tfstate"
    key            = "prod/terraform.tfstate"
    region         = "eu-west-1"
    dynamodb_table = "tf-locks" # state locking prevents concurrent applies
  }
}

provider "aws" { region = "eu-west-1" }
# --- networking: one VPC, public + private subnets in 2 AZs ----------
module "vpc" {
  source             = "terraform-aws-modules/vpc/aws"
  version            = "~> 5.8"
  name               = "photoshare"
  cidr               = "10.0.0.0/16"
  azs                = ["eu-west-1a", "eu-west-1b"]
  public_subnets     = ["10.0.0.0/24", "10.0.1.0/24"]   # ALB lives here
  private_subnets    = ["10.0.10.0/24", "10.0.11.0/24"] # app + data here
  enable_nat_gateway = true
  single_nat_gateway = false # one NAT per AZ => no cross-AZ SPOF
}
# --- L7 load balancer across both public subnets --------------------
resource "aws_lb" "app" {
  name               = "photoshare-alb"
  load_balancer_type = "application"
  subnets            = module.vpc.public_subnets
  security_groups    = [aws_security_group.alb.id]
}

resource "aws_lb_target_group" "app" {
  name        = "photoshare-tg"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = module.vpc.vpc_id
  target_type = "ip"

  health_check {
    path                = "/healthz"
    healthy_threshold   = 2
    unhealthy_threshold = 3
    interval            = 15
    timeout             = 5
  }
}
# --- stateless container service (ECS Fargate) ----------------------
resource "aws_ecs_service" "app" {
  name            = "photoshare"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.app.arn
  desired_count   = 4 # baseline; autoscaler overrides
  launch_type     = "FARGATE"

  network_configuration {
    subnets         = module.vpc.private_subnets
    security_groups = [aws_security_group.app.id]
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.app.arn
    container_name   = "web"
    container_port   = 8080
  }

  # Fargate accepts no placement constraints or strategies ("spread" is an
  # EC2-launch-type placement strategy). It spreads tasks across the AZs of
  # the subnets above on a best-effort basis, so losing one AZ should cost
  # about half the capacity rather than all of it.

  lifecycle {
    ignore_changes = [desired_count] # let the autoscaler own the task count
  }
}
# --- managed data tier: Multi-AZ Postgres + Redis + S3 --------------
resource "aws_db_instance" "pg" {
  identifier              = "photoshare-pg"
  engine                  = "postgres"
  engine_version          = "16.3"
  instance_class          = "db.r6g.large"
  allocated_storage       = 100
  multi_az                = true # synchronous standby in AZ-b
  storage_encrypted       = true
  backup_retention_period = 7
  deletion_protection     = true
}

resource "aws_elasticache_replication_group" "redis" {
  replication_group_id       = "photoshare-redis"
  description                = "PhotoShare session and page cache" # required by the provider
  node_type                  = "cache.r6g.large"
  num_cache_clusters         = 2 # primary + replica, 2 AZs
  automatic_failover_enabled = true
}

resource "aws_s3_bucket" "uploads" {
  bucket = "photoshare-user-uploads"
}

resource "aws_s3_bucket_versioning" "uploads" {
  bucket = aws_s3_bucket.uploads.id
  versioning_configuration { status = "Enabled" }
}
The app itself is packaged as a small, layer-cached container image. A multi-stage build keeps the runtime image tiny (no compiler, no build deps) — the density win you measured in the VM-vs-container demo.
# ---- build stage: has the toolchain, thrown away afterwards --------
FROM node:20-bookworm-slim AS build
WORKDIR /app
COPY package*.json ./
# install dev dependencies too: the build script typically needs them
RUN npm ci
COPY . .
# build, then prune dev dependencies so only runtime modules are copied on
RUN npm run build && npm prune --omit=dev
# ---- runtime stage: only the artifacts + node runtime -------------
FROM gcr.io/distroless/nodejs20-debian12 AS runtime
WORKDIR /app
COPY --from=build /app/dist ./dist
COPY --from=build /app/node_modules ./node_modules
ENV NODE_ENV=production PORT=8080
# never run as root (Dockerfile comments must sit on their own line)
USER 1000:1000
EXPOSE 8080
HEALTHCHECK --interval=15s --timeout=3s \
  CMD ["/nodejs/bin/node", "dist/healthcheck.js"]
CMD ["dist/server.js"]
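name: deploy
on:
  push:
    branches: [main]
permissions:
  id-token: write # OIDC: no long-lived AWS keys in the repo
  contents: read
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Configure AWS credentials (OIDC)
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gh-deploy
          aws-region: eu-west-1
      - name: Build, tag & push image
        run: |
          REG=123456789012.dkr.ecr.eu-west-1.amazonaws.com
          aws ecr get-login-password | docker login --username AWS --password-stdin $REG
          docker build -t $REG/photoshare:${{ github.sha }} .
          docker push $REG/photoshare:${{ github.sha }}
      - name: Roll out new task definition
        run: |
          # Task-definition revisions are integers, not git SHAs, so register a
          # new revision pointing at the pushed image (assumes the family is
          # "photoshare" and the web container is the first one defined).
          IMAGE=123456789012.dkr.ecr.eu-west-1.amazonaws.com/photoshare:${{ github.sha }}
          aws ecs describe-task-definition --task-definition photoshare \
            --query taskDefinition > td.json
          jq --arg img "$IMAGE" '.containerDefinitions[0].image = $img
            | del(.taskDefinitionArn, .revision, .status, .requiresAttributes,
                  .compatibilities, .registeredAt, .registeredBy)' td.json > td-new.json
          TD_ARN=$(aws ecs register-task-definition --cli-input-json file://td-new.json \
            --query taskDefinition.taskDefinitionArn --output text)
          aws ecs update-service --cluster main --service photoshare \
            --task-definition "$TD_ARN"
          # ECS drains old tasks only after new ones pass health checks
          aws ecs wait services-stable --cluster main --services photoshare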
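ECS runs the deployment as a rolling update: it starts the new tasks, waits for them to pass the /healthz check, shifts traffic, then drains the old ones. A build that fails its health check never serves traffic. Config (DB host, Redis URL, bucket name) is injected as environment variables from Terraform outputs, using the same Ansible/Terraform automation taught in S10.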
3 · Scaling & reliability
Autoscaling policy (target tracking)
The app tier uses target-tracking autoscaling: the scaler adds or removes tasks to hold average CPU near a 60% setpoint, with a floor of 2 and a ceiling of 20. CPU is the right signal here because PhotoShare's per-request work (image resize + JSON render) is CPU-bound. This is exactly the control loop in the autoscaling demo.
resource "aws_appautoscaling_target" "app" {
  service_namespace  = "ecs"
  resource_id        = "service/main/photoshare"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 20
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "track-cpu-60"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    target_value = 60.0 # hold average CPU at 60%
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    scale_out_cooldown = 60  # add capacity quickly
    scale_in_cooldown  = 300 # remove it slowly => avoid flapping
  }
}
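The control loop behind that policy can be approximated in a few lines. The sketch below is a simplified proportional model, not the algorithm Application Auto Scaling actually runs (which is driven by CloudWatch alarms and cooldowns); the function name `next_task_count` is hypothetical, while the 60% setpoint, floor of 2 and ceiling of 20 come from the policy above.

```python
import math

def next_task_count(current: int, avg_cpu: float, target: float = 60.0,
                    floor: int = 2, ceiling: int = 20) -> int:
    """One step of a proportional target-tracking model (illustrative only)."""
    if current < 1:
        return floor
    # Scale the fleet by the ratio of observed to target utilization,
    # rounding up so the new average CPU lands at or below the setpoint.
    desired = math.ceil(current * avg_cpu / target)
    # Clamp to the min/max capacity registered on the scalable target.
    return max(floor, min(ceiling, desired))

print(next_task_count(4, 90.0))    # 4 tasks running hot at 90% CPU
print(next_task_count(8, 15.0))    # 8 tasks mostly idle at 15% CPU
print(next_task_count(10, 200.0))  # demand far beyond the ceiling
```

Under this model the hot fleet grows, the idle fleet shrinks to the floor, and the last case is capped at the ceiling, which is the behaviour the min/max capacity settings are meant to guarantee.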
Sizing the band: each task saturates at about 150 req/s, so at the 60% CPU setpoint it carries about 150 × 0.60 = 90 req/s. The desired task count for an offered load $L$ (req/s) is therefore
$$ N \;=\; \left\lceil \frac{L}{150 \times 0.60} \right\rceil \;=\; \left\lceil \frac{L}{90} \right\rceil. $$
- Steady state $L = 1{,}200$ req/s ⇒ $N = \lceil 13.3 \rceil = 14$ tasks, if every request reached the origin.
- That overstates origin load, because the CDN absorbs most reads. With a measured CDN cache-hit ratio of 92%, only ~8% of the 1,200 req/s (≈ 96 req/s) reaches the origin ⇒ $N = \lceil 96/90 \rceil = 2$ tasks at steady state.
- A 9,000 req/s burst at the edge ⇒ ~720 req/s at origin ⇒ $N = \lceil 720/90 \rceil = 8$ tasks, well inside the ceiling of 20.
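The three cases above can be checked with a small helper. This is a sketch under the document's own numbers (90 req/s per task at the setpoint, 92% CDN hit ratio); the function name `tasks_needed` and the rounding guard are illustrative choices.

```python
import math

def tasks_needed(edge_rps: float, cdn_hit_ratio: float = 0.92,
                 per_task_rps_at_target: float = 90.0) -> int:
    """Tasks required to hold the CPU setpoint for a given edge load."""
    # Only cache misses reach the origin; round away float noise
    # (e.g. 719.9999999) before taking the ceiling.
    origin_rps = round(edge_rps * (1.0 - cdn_hit_ratio), 6)
    return math.ceil(origin_rps / per_task_rps_at_target)

print(tasks_needed(1200, cdn_hit_ratio=0.0))  # no CDN: every request hits origin
print(tasks_needed(1200))                     # steady state behind the CDN
print(tasks_needed(9000))                     # burst behind the CDN
```

Varying `cdn_hit_ratio` shows how sensitive the fleet size is to cache effectiveness: the hit ratio, not the raw request rate, is what keeps the burst inside the ceiling of 20.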
The asymmetric cooldown (fast out, slow in) is deliberate: the cost of under-provisioning during a spike (dropped users) is far higher than the cost of a few extra tasks for five extra minutes.
SLO & error budget
The composed availability of the request path follows the series/parallel rules from the availability demo. The app tier is redundant (parallel) so it is far more available than any one task; the serial dependency chain is roughly CDN → ALB → app tier → database. With the database Multi-AZ pair as the weakest link (~99.95% effective), the realistic composed availability is ≈ 99.9%, which we adopt as the SLO.
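The series/parallel rules can be sketched as follows. Only the ~99.95% database figure comes from the text; the CDN, ALB and single-task availabilities below are assumed placeholder values chosen for illustration, and the helper names `series` and `parallel` are hypothetical.

```python
import math

def series(*availabilities: float) -> float:
    """All components must be up: availabilities multiply."""
    return math.prod(availabilities)

def parallel(*availabilities: float) -> float:
    """At least one replica must be up: unavailabilities multiply."""
    return 1.0 - math.prod(1.0 - a for a in availabilities)

# ASSUMED inputs (not from the source), except the database figure.
cdn = 0.9999
alb = 0.9999
single_task = 0.99
database = 0.9995                                 # Multi-AZ pair, from the text

app_tier = parallel(single_task, single_task)     # two redundant tasks
composed = series(cdn, alb, app_tier, database)

print(f"app tier: {app_tier:.4%}")
print(f"composed: {composed:.2%}")                # about 99.92% with these inputs
```

With these inputs the redundant app tier is far more available than one task, and the composed figure sits just above three nines, dominated by the database term.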
A 99.9% monthly SLO converts to a concrete error budget. For a 30-day month ($43{,}200$ minutes):
$$ \text{budget} = (1 - 0.999)\times 43{,}200 \text{ min} = 0.001 \times 43{,}200 = 43.2 \text{ min/month}. $$
| SLO | unavailable fraction | downtime / 30-day month | downtime / year |
|---|---|---|---|
| 99.0% (two nines) | 0.010 | 432 min | 3.65 days |
| 99.9% (three nines) — our SLO | 0.001 | 43.2 min | 8.77 hours |
| 99.95% | 0.0005 | 21.6 min | 4.38 hours |
| 99.99% (four nines) | 0.0001 | 4.32 min | 52.6 min |
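The table can be regenerated from the budget formula. This sketch assumes an average year of 365.25 days, which is what the 8.77-hour figure implies; the function name `downtime_minutes` is illustrative.

```python
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60   # 43,200
MINUTES_PER_YEAR = 365.25 * 24 * 60       # 525,960 (assumed average year)

def downtime_minutes(slo: float, period_minutes: float) -> float:
    """Allowed downtime for an SLO expressed as a fraction (0.999 = 99.9%)."""
    return (1.0 - slo) * period_minutes

for slo in (0.99, 0.999, 0.9995, 0.9999):
    month = downtime_minutes(slo, MINUTES_PER_30_DAY_MONTH)
    year = downtime_minutes(slo, MINUTES_PER_YEAR)
    print(f"{slo:.2%}  {month:7.2f} min/month  {year / 60:7.2f} h/year")
```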
The error budget is a decision tool, the core of the SRE practice from S12. If a month burns less than 43.2 min of downtime, the team is free to ship features fast. If a single incident burns, say, 30 min, only 13.2 min remain — the team freezes risky deploys and spends the rest of the month on reliability work. Concretely, an incident's budget burn is
$$ \text{budget consumed} = \frac{\text{outage minutes}}{43.2} \times 100\%. $$
For a partial outage, weight the outage minutes by the fraction of users affected: a 9-minute outage that degraded 50% of users consumes $\tfrac{9 \times 0.5}{43.2}\approx 10.4\%$ of the month's budget.
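A team could track budget burn with something as small as the sketch below. The weighting by affected-user fraction follows the worked example; the names `budget_burn` and `remaining_minutes` and the incident tuple layout are assumptions for illustration.

```python
MONTHLY_BUDGET_MIN = 43.2  # 99.9% SLO over a 30-day month

def budget_burn(outage_minutes: float, affected_fraction: float = 1.0,
                budget_minutes: float = MONTHLY_BUDGET_MIN) -> float:
    """Fraction of the monthly error budget consumed by one incident."""
    return outage_minutes * affected_fraction / budget_minutes

def remaining_minutes(incidents, budget_minutes: float = MONTHLY_BUDGET_MIN) -> float:
    """Budget left after a list of (outage_minutes, affected_fraction) incidents."""
    used = sum(minutes * fraction for minutes, fraction in incidents)
    return budget_minutes - used

print(f"{budget_burn(9, 0.5):.1%}")                   # partial outage from the text
print(f"{remaining_minutes([(30, 1.0)]):.1f} min")    # after one 30-minute full outage
```

A deploy-freeze rule then becomes a one-line comparison against whatever remaining-budget threshold the team agrees on.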
4 · FinOps cost model
FinOps (S5–6) turns architecture into a monthly bill and then optimizes it. We model the steady-state footprint (2 app tasks average, the always-on data tier, plus per-use storage and egress), then apply the pricing-model levers from the cost demo: reserved/committed-use discounts on the always-on database, and the fact that the CDN moves most bytes off the expensive origin path.
Assumptions: 730 hours/month; eu-west-1 on-demand list prices (rounded, illustrative); 2 Fargate tasks averaged over the month (each 1 vCPU + 2 GB); 92% CDN hit ratio on 5 TB total egress; 2 TB of stored uploads.
| component | unit | qty | $/unit | $/month |
|---|---|---|---|---|
| App — Fargate vCPU | vCPU-hr | 1,460 | 0.0445 | 64.97 |
| App — Fargate memory | GB-hr | 2,920 | 0.0049 | 14.31 |
| Postgres db.r6g.large (Multi-AZ ⇒ ×2) | hr | 730 | 0.480 | 350.40 |
| Postgres storage (gp3, 100 GB) | GB-mo | 100 | 0.115 | 11.50 |
| ElastiCache Redis (2 × r6g.large) | node-hr | 1,460 | 0.226 | 329.96 |
| ALB (hours + LCU) | mo | 1 | 28.00 | 28.00 |
| S3 storage (2 TB) | GB-mo | 2,048 | 0.023 | 47.10 |
| CloudFront egress (4.6 TB @ edge) | GB | 4,710 | 0.085 | 400.35 |
| Origin egress (0.4 TB, 8% miss) | GB | 410 | 0.090 | 36.90 |
| NAT gateways (2 AZ) + data | mo | 1 | 75.00 | 75.00 |
| on-demand total | | | | 1,358.49 |
| lever | applies to | discount | on-demand | optimized |
|---|---|---|---|---|
| 1-yr Compute Savings Plan | Fargate vCPU + mem | −40% | 79.28 | 47.57 |
| 1-yr Reserved Instance | Postgres Multi-AZ | −42% | 350.40 | 203.23 |
| 1-yr Reserved node | Redis (2 nodes) | −40% | 329.96 | 197.98 |
| CDN offload (92% hit) | egress path | already modeled | 437.25 | 437.25 |
| unchanged (storage, ALB, NAT) | — | 0% | 161.60 | 161.60 |
| optimized total | | | 1,358.49 | 1,047.63 |
Result: committing the always-on tiers (compute, database, cache) for one year cuts the bill from $1,358/mo to $1,048/mo — a 23% saving (~$3,730/yr) for zero architectural change, just a pricing-model decision. The largest remaining line is CDN egress; the biggest reliability-vs-cost lever is the Multi-AZ database, which doubles the DB compute line but is the difference between 99.9% and a single-AZ ~99.5%.
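Both totals can be reproduced from the line items. The sketch below copies quantities, unit prices and discounts from the two tables and rounds each line to cents as the tables do; the tuple layout and the name `monthly_totals` are illustrative.

```python
# (line item, quantity, $ per unit, discount fraction under a 1-yr commitment)
LINES = [
    ("fargate_vcpu",      1460, 0.0445, 0.40),
    ("fargate_memory",    2920, 0.0049, 0.40),
    ("postgres_multi_az",  730, 0.480,  0.42),
    ("postgres_storage",   100, 0.115,  0.00),
    ("redis_nodes",       1460, 0.226,  0.40),
    ("alb",                  1, 28.00,  0.00),
    ("s3_storage",        2048, 0.023,  0.00),
    ("cloudfront_egress", 4710, 0.085,  0.00),
    ("origin_egress",      410, 0.090,  0.00),
    ("nat_gateways",         1, 75.00,  0.00),
]

def monthly_totals(lines):
    """Return (on_demand, optimized) monthly totals in dollars."""
    on_demand = 0.0
    optimized = 0.0
    for _name, qty, unit_price, discount in lines:
        cost = round(qty * unit_price, 2)          # each line rounded to cents
        on_demand += cost
        optimized += round(cost * (1.0 - discount), 2)
    return round(on_demand, 2), round(optimized, 2)

on_demand, optimized = monthly_totals(LINES)
saving = (on_demand - optimized) / on_demand
print(f"on-demand ${on_demand:,.2f}  optimized ${optimized:,.2f}  saving {saving:.0%}")
```

Keeping the model as data makes it easy to re-price: change a unit price or a discount and the totals follow.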
5 · Results & trade-offs
What the design buys, and what it costs:
| dimension | this design | trade-off accepted |
|---|---|---|
| Availability | ≈ 99.9% (43 min/mo budget), survives 1 AZ loss | Multi-AZ doubles DB cost; 99.99% would need multi-region (S8 extension) |
| Elasticity | 2 → 20 tasks, scale-out in ~60 s | cold capacity is not instant; very spiky load benefits from a warm floor |
| Consistency | CP primary DB, AP cache/CDN | readers can see a few seconds of stale cached data |
| Cost | ~$1,048/mo optimized | reservations lock in 1-yr commitment; CDN egress dominates |
| Operability | 100% IaC, CI rolling deploys, error-budget driven | Terraform state + pipeline must themselves be maintained |
The dominant tension is the classic one from the syllabus: availability and elasticity cost money, and the cloud's value is letting you dial each one to exactly the level the SLO requires — paying OPEX for redundancy and headroom only while you need them, rather than a fixed CAPEX data center sized for peak.
6 · Mapping to course learning outcomes
Each syllabus learning objective is exercised by a concrete part of this project:
- Cloud architectures & service/delivery models — the 3-tier split across IaaS-style containers and PaaS data services; see §1 and the service-models demo.
- Azure & AWS terminology — every component is named in both clouds in the §0 stack table.
- Virtualization & containers — the multi-stage Dockerfile and stateless task design (§2), backed by the VM-vs-container and orchestration demos.
- Automation technologies (IaC) — the full Terraform stack with remote state + locking, and a CI/CD pipeline (§2); ties to S10 Ansible/Terraform and the drift demo.
- Architect solutions with design patterns — load balancer, autoscaling, cache-aside, CDN offload, Multi-AZ failover, token-bucket rate limiting (§1, §3).
- FinOps / CAPEX→OPEX — the worked monthly cost model and reservation strategy (§4), extending the cost demo.
- Reliability & SRE — the SLO, error-budget math and budget-burn policy (§3), extending the availability demo.
7 · Extensions
A. Multi-region active-passive (toward 99.99%)
To push past three nines you must remove the region itself as a single point of failure. Add a warm standby region with a cross-region database read replica that can be promoted on failover, S3 cross-region replication for uploads, and DNS health-check failover at the edge:
resource "aws_route53_record" "primary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "photoshare.example.com"
  type           = "A"
  set_identifier = "eu-west-1"
  failover_routing_policy { type = "PRIMARY" }
  health_check_id = aws_route53_health_check.eu_west_1.id
  alias {
    name                   = aws_lb.app.dns_name
    zone_id                = aws_lb.app.zone_id
    evaluate_target_health = true
  }
}

resource "aws_route53_record" "secondary" {
  zone_id        = aws_route53_zone.main.zone_id
  name           = "photoshare.example.com"
  type           = "A"
  set_identifier = "eu-central-1"
  failover_routing_policy { type = "SECONDARY" } # warm standby region
  alias {
    name                   = aws_lb.app_dr.dns_name
    zone_id                = aws_lb.app_dr.zone_id
    evaluate_target_health = true
  }
}
Trade-off: roughly doubles always-on cost and forces a CAP decision on the cross-region database (asynchronous replication ⇒ a small RPO window of writes can be lost on a hard regional failover).
B. Serverless variant (scale-to-zero)
For a low-or-bursty-traffic profile, swap the always-on app tier for FaaS: an API Gateway in front of functions, with the same Postgres/S3 behind. Capacity becomes per-request and the bill scales to zero when idle — at the price of cold starts (S7, and the serverless demo).
import os
from io import BytesIO
from urllib.parse import unquote_plus

import boto3
from PIL import Image

s3 = boto3.client("s3")
THUMB_BUCKET = os.environ["THUMB_BUCKET"]


def handler(event, _context):
    # triggered by an S3 ObjectCreated event on the uploads bucket
    for record in event["Records"]:
        src_bucket = record["s3"]["bucket"]["name"]
        # object keys arrive URL-encoded in S3 event notifications
        key = unquote_plus(record["s3"]["object"]["key"])
        obj = s3.get_object(Bucket=src_bucket, Key=key)
        img = Image.open(BytesIO(obj["Body"].read()))
        img.thumbnail((256, 256))  # resizes in place, preserving aspect ratio
        buf = BytesIO()
        # JPEG has no alpha channel, so convert RGBA/palette images first
        img.convert("RGB").save(buf, format="JPEG", quality=82)
        buf.seek(0)
        s3.put_object(Bucket=THUMB_BUCKET, Key=f"thumb/{key}",
                      Body=buf, ContentType="image/jpeg")
    return {"thumbnails": len(event["Records"])}
Decision rule: FaaS wins below a crossover utilization where you'd otherwise keep servers idle; the always-on container tier wins for sustained high traffic where per-request pricing overtakes a reserved fleet. The cost demo and serverless demo together let you find that crossover for a given load.
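A rough way to locate that crossover is to equate the two monthly bills. Everything on the FaaS side of this sketch is an assumption: the per-request and per-GB-second prices are placeholders rather than a quote, the 50 ms duration and 0.5 GB memory are invented for illustration, and API Gateway charges are ignored. Only the $79.28 fleet cost (two on-demand Fargate tasks) comes from the cost table.

```python
SECONDS_PER_MONTH = 730 * 3600  # same 730-hour month as the cost model

def faas_monthly_cost(avg_rps: float, duration_s: float = 0.05,
                      memory_gb: float = 0.5,
                      price_per_million_requests: float = 0.20,   # placeholder
                      price_per_gb_second: float = 0.0000166667   # placeholder
                      ) -> float:
    """Monthly FaaS bill for a constant average request rate."""
    requests = avg_rps * SECONDS_PER_MONTH
    per_request = (price_per_million_requests / 1e6
                   + duration_s * memory_gb * price_per_gb_second)
    return requests * per_request

def crossover_rps(fleet_monthly_cost: float, **faas_kwargs) -> float:
    """Average req/s at which FaaS costs the same as an always-on fleet."""
    # FaaS cost is linear in request rate, so one sample fixes the slope.
    return fleet_monthly_cost / faas_monthly_cost(1.0, **faas_kwargs)

fleet = 79.28  # two on-demand Fargate tasks, from the cost table
rps = crossover_rps(fleet)
print(f"crossover at about {rps:.0f} req/s average")
```

Below the printed rate the functions are cheaper; above it the fixed fleet wins. The model treats the fleet as a fixed cost, so it is only meaningful while the load still fits on those two tasks.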
8 · References
- J.R. Storment & Mike Fuller (2023). Cloud FinOps, 2nd ed. O'Reilly. ISBN 9781492098355.
- Kief Morris (2020). Infrastructure as Code, 2nd ed. O'Reilly. ISBN 9781098114671.
- Charity Majors, Liz Fong-Jones & George Miranda (2022). Observability Engineering. O'Reilly. ISBN 9781492076445.
- Thomas Erl & Eric Barceló Monroy (2023). Cloud Computing: Concepts, Technology, Security, and Architecture, 2nd ed. Pearson. ISBN 9780138052256.
- Jeroen Mulder (2023). Multi-Cloud Strategy for Cloud Architects, 2nd ed. Packt. ISBN 9781804616734.
- Beyer, Jones, Petoff & Murphy (eds., 2016). Site Reliability Engineering. Google / O'Reilly — SLOs & error budgets.
- AWS Well-Architected Framework; HashiCorp Terraform & AWS provider documentation.
Prices are rounded, region-indicative figures for teaching the cost model, not a live quote. Cloud list prices change frequently — always re-price against the provider's calculator before committing.