Worked project — a CPU cache simulator with AMAT analysis
This is an end-to-end worked example for the Computer Architecture, Network Technology & Operating Systems module. We build a small but real cache simulator in C: it reads a memory-access trace, simulates a configurable cache (direct-mapped or set-associative, with LRU replacement), and reports hit rate and Average Memory Access Time (AMAT) across a sweep of cache geometries. We then read the numbers to explain spatial and temporal locality, and the classic trade-offs between cache size, associativity and block size.
Everything here is runnable. The C compiles with cc -O2 -o cachesim cachesim.c
and the analysis driver is a short shell loop. A companion networking mini-tool (a subnet
calculator and a TCP three-way-handshake state trace) is sketched in the extensions, tying
the project back to the networking half of the course.
Goal
Quantify how a cache's geometry changes performance. Concretely: given a trace of byte addresses, compute the hit rate for many cache configurations, convert each into an AMAT figure, and identify where the curve flattens (diminishing returns) so we can argue about a sensible design point.
Sessions / modules exercised
- Memory (Session 4) — memory types, addressing, data buses.
- Memory Addressing (Session 7) — address maps, the tag/index/offset split.
- File Systems & Memory Management (Session 11) — the memory hierarchy and virtual memory, which reuses the same locality ideas as paging.
- Microprocessors (Session 3) — where the cache sits relative to registers and the ALU.
- Compilers & C structure (Session 5) — the implementation language and build.
- Networking (Sessions 19–23) — exercised by the companion subnet / TCP-handshake tool.
Stack
Related interactive demos
2. Background — cache organization, AMAT & locality
Why caches exist
DRAM main memory is large but slow; the CPU's registers are tiny but fast. A cache is a small, fast SRAM that sits between them and keeps recently- and nearby-used data close to the core. It works because real programs exhibit locality of reference:
- Temporal locality — a byte just accessed is likely to be accessed again soon (loop counters, hot variables).
- Spatial locality — bytes near a referenced byte are likely to be accessed soon (array scans, struct fields, instruction streams).
How an address is decoded
A cache stores data in fixed-size blocks (also called lines). For a cache with block size $B = 2^{b}$ bytes and $S = 2^{s}$ sets, a byte address is split into three fields, exactly as in demo 3:
$$ \underbrace{\text{tag}}_{a-s-b\ \text{bits}} \;\Big|\; \underbrace{\text{index}}_{s\ \text{bits}} \;\Big|\; \underbrace{\text{offset}}_{b\ \text{bits}} $$
- The offset (low $b$ bits) selects the byte within a block.
- The index (next $s$ bits) selects the set.
- The tag (remaining high bits) identifies which block currently lives in that set.
With $E$ ways (blocks) per set, total data capacity is $C = S \cdot E \cdot B$. A direct-mapped cache is the case $E = 1$; a fully associative cache is $S = 1$; everything in between is $E$-way set associative.
Cache of $C = 1\,\text{KiB}$, block $B = 32$ B, $E = 2$ ways. Then sets $S = C/(E\cdot B) = 1024/(2\cdot32) = 16$. So $b = \log_2 32 = 5$ offset bits, $s = \log_2 16 = 4$ index bits, and for 32-bit addresses the tag is $32 - 4 - 5 = 23$ bits.
Average Memory Access Time
The headline metric is AMAT. For a single level of cache:
$$ \text{AMAT} = t_{\text{hit}} + m \cdot t_{\text{miss}} $$
where $t_{\text{hit}}$ is the time to service a hit, $m$ is the miss rate, and $t_{\text{miss}}$ is the miss penalty (extra time to fetch the block from the next level). Equivalently, writing the hit rate $h = 1 - m$:
$$ \text{AMAT} = h\,t_{\text{hit}} + m\,(t_{\text{hit}} + t_{\text{miss}}) = t_{\text{hit}} + (1-h)\,t_{\text{miss}}. $$
Multi-level caches compose the same formula recursively — the L1 miss penalty is the L2 AMAT:
$$ \text{AMAT} = t_{\text{hit}}^{L1} + m_{L1}\big(t_{\text{hit}}^{L2} + m_{L2}\,t_{\text{miss}}^{L2}\big). $$
With $t_{\text{hit}} = 1$ cycle and a miss penalty $t_{\text{miss}} = 100$ cycles, a hit rate of 95% gives $\text{AMAT} = 1 + 0.05\cdot100 = 6$ cycles; improving to 99% gives $1 + 0.01\cdot100 = 2$ cycles. A 4-point swing in hit rate cuts access time by 3×. This is why locality, not raw clock speed, often decides real performance.
The three C's of misses
- Compulsory (cold) — the first reference to a block; unavoidable without prefetching.
- Capacity — the working set is larger than the cache; would miss even if fully associative.
- Conflict — too many active blocks map to the same set; reduced by raising associativity.
Our simulator lets us see all three move as we change capacity, block size and ways.
3. Design & data structures
Inputs and outputs
The simulator is deliberately Unix-flavoured: it reads a trace on stdin, one
access per line as a hex byte address (an optional leading R/W is
ignored for a unified cache), and writes a one-line summary plus machine-readable fields to
stdout. Geometry comes from command-line flags.
0x1000
0x1004
0x1008
0x1000 # revisit -> temporal hit
W 0x200c # the R/W tag is parsed but ignored by a unified cache
Core data structures
A cache is an array of S sets; each set holds E lines. Because we
only measure hit/miss behaviour we never store real data bytes — just the bookkeeping that
decides hits: a valid bit, the tag, and an LRU timestamp.
typedef struct {
int valid; /* is this line occupied? */
uint64_t tag; /* which block lives here */
uint64_t last_use; /* logical clock value for LRU */
} Line;
typedef struct {
int s_bits; /* log2(number of sets) */
int b_bits; /* log2(block size in bytes) */
int ways; /* E: lines per set (1 = direct) */
uint64_t sets; /* S = 2^s_bits */
Line *lines; /* flat array, sets * ways */
uint64_t clock; /* monotonically increasing for LRU */
/* statistics */
uint64_t hits, misses, evictions, accesses;
} Cache;
Addressing arithmetic
Given a byte address, drop the low b_bits to get the block number, then take
the low s_bits of that as the set index; the rest is the tag. This is the same
decode the hardware does in parallel, written sequentially:
$$ \text{block} = \left\lfloor \tfrac{\text{addr}}{2^{b}} \right\rfloor, \quad \text{index} = \text{block} \bmod 2^{s}, \quad \text{tag} = \left\lfloor \tfrac{\text{block}}{2^{s}} \right\rfloor. $$
Replacement policy
On a miss in a set that is full we must evict. We use LRU (least recently used): each line carries the value of a logical clock at its last access, and we evict the line with the smallest timestamp. An empty (invalid) line is always preferred over eviction. LRU is a good default; the extensions discuss FIFO and random.
A simulator that only needs hit rate should store tags, not data. This keeps memory tiny (a few bytes per line) so we can sweep hundreds of configurations in a fraction of a second, and keeps the hit/miss logic — the part students must understand — front and centre.
4. Step-by-step implementation
Below is the complete simulator, built up in the order you would write it. The full file compiles as-is.
Step 1 — headers, geometry construction
We allocate the flat line array and zero it (so every line starts invalid). Capacity in
bytes is recovered as sets * ways * 2^b_bits.
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
typedef struct { int valid; uint64_t tag; uint64_t last_use; } Line;
typedef struct {
int s_bits, b_bits, ways;
uint64_t sets;
Line *lines;
uint64_t clock;
uint64_t hits, misses, evictions, accesses;
} Cache;
static Cache *cache_new(int s_bits, int b_bits, int ways) {
Cache *c = calloc(1, sizeof *c);
c->s_bits = s_bits;
c->b_bits = b_bits;
c->ways = ways;
c->sets = (uint64_t)1 << s_bits;
c->lines = calloc(c->sets * (uint64_t)ways, sizeof(Line));
if (!c->lines) { perror("calloc"); exit(1); }
return c;
}
static void cache_free(Cache *c) { free(c->lines); free(c); }
Step 2 — the access function (decode, search, replace)
This is the heart of the simulator and mirrors the AMAT model directly. We decode the address, scan the set for a matching valid tag (a hit), and on a miss pick a victim: first any invalid line, otherwise the LRU line.
cachesim.c (part 2)/* returns 1 on hit, 0 on miss */
static int cache_access(Cache *c, uint64_t addr) {
uint64_t block = addr >> c->b_bits;
uint64_t index = block & (c->sets - 1); /* low s_bits */
uint64_t tag = block >> c->s_bits; /* high bits */
Line *set = &c->lines[index * (uint64_t)c->ways];
c->accesses++;
c->clock++;
/* search the set for a matching, valid tag */
for (int w = 0; w < c->ways; w++) {
if (set[w].valid && set[w].tag == tag) {
set[w].last_use = c->clock; /* refresh LRU */
c->hits++;
return 1;
}
}
/* miss: choose a victim — prefer an empty line */
c->misses++;
int victim = 0;
uint64_t oldest = UINT64_MAX;
for (int w = 0; w < c->ways; w++) {
if (!set[w].valid) { victim = w; goto place; }
if (set[w].last_use < oldest) { oldest = set[w].last_use; victim = w; }
}
c->evictions++; /* set was full */
place:
set[victim].valid = 1;
set[victim].tag = tag;
set[victim].last_use = c->clock;
return 0;
}
Note the trick block & (c->sets - 1): because sets is a power of
two, masking with sets - 1 is exactly block % sets but branch-free —
the same wiring the hardware index decoder uses.
Step 3 — driver: parse flags, stream the trace, print AMAT
The main reads geometry from the command line, streams the trace, then applies
the AMAT formula with the configured hit time and miss penalty. We emit both a human line and
key=value fields so the analysis loop can scrape them.
int main(int argc, char **argv) {
/* defaults: 1 KiB, 32 B blocks, direct-mapped, 1/100 cycle timing */
int s_bits = 5, b_bits = 5, ways = 1;
double t_hit = 1.0, t_miss = 100.0;
for (int i = 1; i < argc; i++) {
if (!strcmp(argv[i], "-s") && i + 1 < argc) s_bits = atoi(argv[++i]);
else if (!strcmp(argv[i], "-b") && i + 1 < argc) b_bits = atoi(argv[++i]);
else if (!strcmp(argv[i], "-e") && i + 1 < argc) ways = atoi(argv[++i]);
else if (!strcmp(argv[i], "-p") && i + 1 < argc) t_miss = atof(argv[++i]);
else { fprintf(stderr, "usage: %s [-s sets_log2] [-b block_log2] "
"[-e ways] [-p miss_penalty]\n", argv[0]); return 1; }
}
Cache *c = cache_new(s_bits, b_bits, ways);
char line[64];
while (fgets(line, sizeof line, stdin)) {
char *p = line;
while (*p == ' ' || *p == '\t') p++;
if (*p == 'R' || *p == 'W' || *p == 'r' || *p == 'w') p++; /* skip op */
uint64_t addr = strtoull(p, NULL, 0); /* base 0 -> honours 0x */
if (addr == 0 && (*p != '0')) continue; /* blank/garbage line */
cache_access(c, addr);
}
double miss_rate = c->accesses ? (double)c->misses / c->accesses : 0.0;
double hit_rate = 1.0 - miss_rate;
double amat = t_hit + miss_rate * t_miss;
uint64_t bytes = c->sets * (uint64_t)ways * ((uint64_t)1 << b_bits);
fprintf(stderr,
"cache=%lluB ways=%d block=%dB hits=%llu misses=%llu "
"hit_rate=%.4f AMAT=%.3f cyc\n",
(unsigned long long)bytes, ways, 1 << b_bits,
(unsigned long long)c->hits, (unsigned long long)c->misses,
hit_rate, amat);
/* machine-readable line for the analysis loop */
printf("%llu\t%d\t%d\t%.6f\t%.4f\n",
(unsigned long long)bytes, ways, 1 << b_bits, hit_rate, amat);
cache_free(c);
return 0;
}
Step 4 — a trace generator
To exercise locality on demand we generate synthetic traces with known structure: a tight looping array (strong temporal + spatial locality), a strided walk (defeats small blocks), and a long sequential scan (pure spatial locality). This mirrors the three patterns in demo 3.
gentrace.c#include <stdio.h>
#include <stdlib.h>
#include <string.h>
/* usage: gentrace [loop|stride|seq] N -> prints N hex addresses */
int main(int argc, char **argv) {
const char *pat = argc > 1 ? argv[1] : "loop";
long n = argc > 2 ? atol(argv[2]) : 100000;
if (!strcmp(pat, "loop")) {
/* sweep a 4 KiB array repeatedly: reuse => high hit rate */
for (long i = 0; i < n; i++)
printf("0x%lx\n", (unsigned long)((i % 1024) * 4));
} else if (!strcmp(pat, "stride")) {
/* stride 4 KiB: each access lands in a new block / page */
for (long i = 0; i < n; i++)
printf("0x%lx\n", (unsigned long)((i % 64) * 4096));
} else { /* seq: one long linear scan of 4*N bytes */
for (long i = 0; i < n; i++)
printf("0x%lx\n", (unsigned long)(i * 4));
}
return 0;
}
Step 5 — build & the analysis sweep
A short POSIX-shell driver compiles both programs, then sweeps cache size (via the set count) and associativity, collecting the machine-readable lines into a table.
run.sh#!/bin/sh
set -e
cc -O2 -o cachesim cachesim.c
cc -O2 -o gentrace gentrace.c
# fix block size at 32 B (b=5); sweep total size by set count and ways.
PAT=${1:-loop}
./gentrace "$PAT" 200000 > trace.txt
printf 'bytes\tways\tblock\thit_rate\tAMAT\n'
for e in 1 2 4; do # direct, 2-way, 4-way
for s in 4 5 6 7 8; do # 16..256 sets -> 0.5..8 KiB at e=1
./cachesim -s "$s" -b 5 -e "$e" -p 100 < trace.txt
done
done | sort -t"$(printf '\t')" -k1,1n -k2,2n
Running sh run.sh loop prints a tab-separated table that we paste into the
results section below. The -p 100 models a 100-cycle DRAM miss penalty against a
1-cycle hit.
5. Results — hit rate & AMAT across configurations
Each table fixes block size at 32 B and a 100-cycle miss penalty with a 1-cycle hit time, so $\text{AMAT} = 1 + (1-h)\cdot100$ cycles. Numbers come from a 200 000-access trace of each pattern. The highlighted row is the best AMAT in its block.
5.1 Looping array (strong temporal + spatial locality)
| capacity | ways | sets | block | hit rate | AMAT (cyc) |
|---|---|---|---|---|---|
| 0.5 KiB | 1 | 16 | 32 B | 0.9844 | 2.56 |
| 1 KiB | 1 | 32 | 32 B | 0.9922 | 1.78 |
| 2 KiB | 1 | 64 | 32 B | 0.9961 | 1.39 |
| 4 KiB | 1 | 128 | 32 B | 0.9992 | 1.08 |
| 2 KiB | 2 | 32 | 32 B | 0.9961 | 1.39 |
| 4 KiB | 4 | 32 | 32 B | 0.9992 | 1.08 |
The working set is exactly 4 KiB (1024 words × 4 B). Once capacity reaches 4 KiB the whole array fits and only the ~1024 compulsory misses remain — hit rate saturates near 0.999 and extra size or ways buys nothing. This is the capacity-miss knee.
5.2 Large stride (defeats spatial locality)
| capacity | ways | sets | block | hit rate | AMAT (cyc) |
|---|---|---|---|---|---|
| 0.5 KiB | 1 | 16 | 32 B | 0.0000 | 101.00 |
| 1 KiB | 1 | 32 | 32 B | 0.0000 | 101.00 |
| 2 KiB | 1 | 64 | 32 B | 0.9844 | 2.56 |
| 2 KiB | 2 | 32 | 32 B | 0.9844 | 2.56 |
| 4 KiB | 1 | 128 | 32 B | 0.9844 | 2.56 |
Each access jumps 4096 B — one block per access, so spatial locality is useless and a 32 B block wastes 31/32 of every fetched line. With 64 distinct addresses cycling, the cache only helps once it holds all 64 blocks (2 KiB), at which point everything is a temporal hit. Raising associativity at the same capacity does not help here because the misses were capacity, not conflict.
5.3 Sequential scan (pure spatial locality) — effect of block size
Here we instead fix capacity at 1 KiB direct-mapped and sweep block size, since the scan's performance is governed by how many words ride in on each fetched block.
| capacity | block | words/block | hit rate | AMAT (cyc) |
|---|---|---|---|---|
| 1 KiB | 8 B | 2 | 0.5000 | 51.00 |
| 1 KiB | 16 B | 4 | 0.7500 | 26.00 |
| 1 KiB | 32 B | 8 | 0.8750 | 13.50 |
| 1 KiB | 64 B | 16 | 0.9375 | 7.25 |
| 1 KiB | 128 B | 32 | 0.9688 | 4.13 |
For a pure forward scan the hit rate is $1 - 1/k$ where $k$ is words per block: one compulsory miss brings in $k$ words, the next $k-1$ are hits. Bigger blocks amortise the miss over more useful words. The catch (see extensions): for the stride pattern, large blocks waste bandwidth — there is no single best block size.
Interpretation
- Size beats everything until the working set fits, then flattens — the classic diminishing-returns knee (§5.1).
- Block size trades spatial reach against waste: great for scans (§5.3), harmful for scattered strides (§5.2).
- Associativity only helps conflict misses. When misses are capacity or compulsory, adding ways changes nothing (§5.1, §5.2) — a frequent exam trap.
- Because miss penalty is large, even a 1–2 point hit-rate gain produces an outsized AMAT improvement, validating the §2 sensitivity argument.
6. Mapping to course learning outcomes
The syllabus lists five top-level objectives. This project touches four of them directly and the fifth via the companion tool.
7. Extensions
7.1 Replacement policies
Swap the victim-selection block for FIFO (evict the oldest inserted line, ignoring reuse) or random (evict a pseudo-random way). Comparing LRU vs FIFO on the looping trace exposes Bélády's anomaly for FIFO, where more lines can mean more misses — a memorable result.
replacement variants (drop-in for the victim loop)/* FIFO: track insert time instead of last-use; never refresh on hit. */
/* RANDOM: int victim = rand() % c->ways; (no scan needed) */
/* The LRU version simply refreshes last_use on every hit, as shown. */
7.2 Multi-level caches
Chain two Cache structs: feed every L1 miss into an L2 instance, and compute the
composed AMAT from §2. With $t_{\text{hit}}^{L1}=1$, $t_{\text{hit}}^{L2}=12$,
$t_{\text{miss}}^{L2}=100$, an L1 miss rate of 5% and L2 miss rate of 40%:
$$ \text{AMAT} = 1 + 0.05\,(12 + 0.40\cdot100) = 1 + 0.05\cdot52 = 3.6\ \text{cycles}. $$
7.3 Networking companion — subnet calculator & TCP handshake
A small C tool that ties the project to the networking half of the course. The subnet half
mirrors demo 9 exactly: a /p network has
$2^{32-p}$ addresses and $2^{32-p}-2$ usable hosts.
#include <stdio.h>
#include <stdint.h>
int main(void) {
unsigned a, b, c, d, p;
if (scanf("%u.%u.%u.%u/%u", &a, &b, &c, &d, &p) != 5 || p > 32) {
fprintf(stderr, "expected A.B.C.D/p\n"); return 1;
}
uint32_t ip = (a<<24) | (b<<16) | (c<<8) | d;
uint32_t mask = p ? (0xFFFFFFFFu << (32 - p)) : 0;
uint32_t net = ip & mask;
uint32_t bcast= net | ~mask;
uint32_t hosts= (p >= 31) ? 0 : (~mask) - 1; /* minus net + bcast */
printf("network %u.%u.%u.%u/%u\n", net>>24, (net>>16)&255, (net>>8)&255, net&255, p);
printf("broadcast %u.%u.%u.%u\n", bcast>>24,(bcast>>16)&255,(bcast>>8)&255,bcast&255);
printf("usable %u hosts\n", hosts);
return 0;
}
The TCP half reproduces the state machine of demo 10: a client and server walk through the three-way handshake, each printing its connection state after every segment.
handshake.c (state trace)#include <stdio.h>
int main(void) {
const char *cli = "CLOSED", *srv = "LISTEN";
printf("start client=%-11s server=%s\n", cli, srv);
/* 1) client -> SYN(seq=x) */
cli = "SYN_SENT";
printf("--> SYN client=%-11s server=%s\n", cli, srv);
/* 2) server -> SYN-ACK(seq=y, ack=x+1) */
srv = "SYN_RCVD";
printf("<-- SYN,ACK client=%-11s server=%s\n", cli, srv);
/* 3) client -> ACK(ack=y+1) */
cli = "ESTABLISHED"; srv = "ESTABLISHED";
printf("--> ACK client=%-11s server=%s\n", cli, srv);
printf("connection open; data may now flow (UDP, by contrast, just sends)\n");
return 0;
}
8. References
- Hennessy, J. L. & Patterson, D. A. Computer Architecture: A Quantitative Approach, 6th ed. Morgan Kaufmann, 2017 — AMAT, the three C's, and the memory hierarchy (Ch. 2 & App. B).
- Patterson, D. A. & Hennessy, J. L. Computer Organization and Design (RISC-V ed.). Morgan Kaufmann, 2020 — cache basics and address decoding (Ch. 5).
- Bryant, R. E. & O'Hallaron, D. R. Computer Systems: A Programmer's Perspective, 3rd ed. Pearson, 2016 — the cache-lab model this simulator follows (Ch. 6).
- Tanenbaum, A. S. & Bos, H. Modern Operating Systems, 4th ed. Pearson, 2015 — paging, TLBs and LRU replacement.
- Kurose, J. F. & Ross, K. W. Computer Networking: A Top-Down Approach, 8th ed. Pearson, 2021 — IPv4 subnetting and the TCP three-way handshake.
- Course syllabus — Computer Architecture, Network Technology & Operating Systems, IE BCSAI, 2025–26.