arch-lab · worked project — cache simulator & AMAT analysis

Worked project — a CPU cache simulator with AMAT analysis

This is an end-to-end worked example for the Computer Architecture, Network Technology & Operating Systems module. We build a small but real cache simulator in C: it reads a memory-access trace, simulates a configurable cache (direct-mapped or set-associative, with LRU replacement), and reports hit rate and Average Memory Access Time (AMAT) across a sweep of cache geometries. We then read the numbers to explain spatial and temporal locality, and the classic trade-offs between cache size, associativity and block size.

Everything here is runnable. The C compiles with cc -O2 -o cachesim cachesim.c and the analysis driver is a short shell loop. A companion networking mini-tool (a subnet calculator and a TCP three-way-handshake state trace) is sketched in the extensions, tying the project back to the networking half of the course.

1 · background 2 · design 3 · implementation 4 · results 5 · outcomes 6 · extensions 7 · references

Goal

Quantify how a cache's geometry changes performance. Concretely: given a trace of byte addresses, compute the hit rate for many cache configurations, convert each into an AMAT figure, and identify where the curve flattens (diminishing returns) so we can argue about a sensible design point.

Sessions / modules exercised

Memory (Session 4) — memory types, addressing, data buses.
Memory Addressing (Session 7) — address maps, the tag/index/offset split.
File Systems & Memory Management (Session 11) — the memory hierarchy and virtual memory, which reuses the same locality ideas as paging.
Microprocessors (Session 3) — where the cache sits relative to registers and the ALU.
Compilers & C structure (Session 5) — the implementation language and build.
Networking (Sessions 19–23) — exercised by the companion subnet / TCP-handshake tool.

Stack

C (C11)cc / clang / gcc -O2POSIX shell stdin/stdout tracesawk (table glue)

Related interactive demos

demo 3 · direct-mapped cache demo 4 · paging demo 9 · subnetting demo 10 · TCP handshake course program

2. Background — cache organization, AMAT & locality

Why caches exist

DRAM main memory is large but slow; the CPU's registers are tiny but fast. A cache is a small, fast SRAM that sits between them and keeps recently- and nearby-used data close to the core. It works because real programs exhibit locality of reference:

Temporal locality — a byte just accessed is likely to be accessed again soon (loop counters, hot variables).
Spatial locality — bytes near a referenced byte are likely to be accessed soon (array scans, struct fields, instruction streams).

How an address is decoded

A cache stores data in fixed-size blocks (also called lines). For a cache with block size $B = 2^{b}$ bytes and $S = 2^{s}$ sets, a byte address is split into three fields, exactly as in demo 3:

$$ \underbrace{\text{tag}}_{a-s-b\ \text{bits}} \;\Big|\; \underbrace{\text{index}}_{s\ \text{bits}} \;\Big|\; \underbrace{\text{offset}}_{b\ \text{bits}} $$

The offset (low $b$ bits) selects the byte within a block.
The index (next $s$ bits) selects the set.
The tag (remaining high bits) identifies which block currently lives in that set.

With $E$ ways (blocks) per set, total data capacity is $C = S \cdot E \cdot B$. A direct-mapped cache is the case $E = 1$; a fully associative cache is $S = 1$; everything in between is $E$-way set associative.

Worked address split

Cache of $C = 1\,\text{KiB}$, block $B = 32$ B, $E = 2$ ways. Then sets $S = C/(E\cdot B) = 1024/(2\cdot32) = 16$. So $b = \log_2 32 = 5$ offset bits, $s = \log_2 16 = 4$ index bits, and for 32-bit addresses the tag is $32 - 4 - 5 = 23$ bits.

Average Memory Access Time

The headline metric is AMAT. For a single level of cache:

$$ \text{AMAT} = t_{\text{hit}} + m \cdot t_{\text{miss}} $$

where $t_{\text{hit}}$ is the time to service a hit, $m$ is the miss rate, and $t_{\text{miss}}$ is the miss penalty (extra time to fetch the block from the next level). Equivalently, writing the hit rate $h = 1 - m$:

$$ \text{AMAT} = h\,t_{\text{hit}} + m\,(t_{\text{hit}} + t_{\text{miss}}) = t_{\text{hit}} + (1-h)\,t_{\text{miss}}. $$

Multi-level caches compose the same formula recursively — the L1 miss penalty is the L2 AMAT:

$$ \text{AMAT} = t_{\text{hit}}^{L1} + m_{L1}\big(t_{\text{hit}}^{L2} + m_{L2}\,t_{\text{miss}}^{L2}\big). $$

Why miss rate dominates

With $t_{\text{hit}} = 1$ cycle and a miss penalty $t_{\text{miss}} = 100$ cycles, a hit rate of 95% gives $\text{AMAT} = 1 + 0.05\cdot100 = 6$ cycles; improving to 99% gives $1 + 0.01\cdot100 = 2$ cycles. A 4-point swing in hit rate cuts access time by 3×. This is why locality, not raw clock speed, often decides real performance.

The three C's of misses

Compulsory (cold) — the first reference to a block; unavoidable without prefetching.
Capacity — the working set is larger than the cache; would miss even if fully associative.
Conflict — too many active blocks map to the same set; reduced by raising associativity.

Our simulator lets us see all three move as we change capacity, block size and ways.

3. Design & data structures

Inputs and outputs

The simulator is deliberately Unix-flavoured: it reads a trace on stdin, one access per line as a hex byte address (an optional leading R/W is ignored for a unified cache), and writes a one-line summary plus machine-readable fields to stdout. Geometry comes from command-line flags.

trace format (one access per line)

0x1000
0x1004
0x1008
0x1000      # revisit -> temporal hit
W 0x200c    # the R/W tag is parsed but ignored by a unified cache

Core data structures

A cache is an array of S sets; each set holds E lines. Because we only measure hit/miss behaviour we never store real data bytes — just the bookkeeping that decides hits: a valid bit, the tag, and an LRU timestamp.

cache geometry & line

typedef struct {
    int      valid;      /* is this line occupied?            */
    uint64_t tag;        /* which block lives here            */
    uint64_t last_use;   /* logical clock value for LRU       */
} Line;

typedef struct {
    int      s_bits;     /* log2(number of sets)              */
    int      b_bits;     /* log2(block size in bytes)         */
    int      ways;       /* E: lines per set (1 = direct)     */
    uint64_t sets;       /* S = 2^s_bits                      */
    Line    *lines;      /* flat array, sets * ways           */
    uint64_t clock;      /* monotonically increasing for LRU  */
    /* statistics */
    uint64_t hits, misses, evictions, accesses;
} Cache;

Addressing arithmetic

Given a byte address, drop the low b_bits to get the block number, then take the low s_bits of that as the set index; the rest is the tag. This is the same decode the hardware does in parallel, written sequentially:

$$ \text{block} = \left\lfloor \tfrac{\text{addr}}{2^{b}} \right\rfloor, \quad \text{index} = \text{block} \bmod 2^{s}, \quad \text{tag} = \left\lfloor \tfrac{\text{block}}{2^{s}} \right\rfloor. $$

Replacement policy

On a miss in a set that is full we must evict. We use LRU (least recently used): each line carries the value of a logical clock at its last access, and we evict the line with the smallest timestamp. An empty (invalid) line is always preferred over eviction. LRU is a good default; the extensions discuss FIFO and random.

Design choice: count, don't copy

A simulator that only needs hit rate should store tags, not data. This keeps memory tiny (a few bytes per line) so we can sweep hundreds of configurations in a fraction of a second, and keeps the hit/miss logic — the part students must understand — front and centre.

4. Step-by-step implementation

Below is the complete simulator, built up in the order you would write it. The full file compiles as-is.

Step 1 — headers, geometry construction

We allocate the flat line array and zero it (so every line starts invalid). Capacity in bytes is recovered as sets * ways * 2^b_bits.

cachesim.c (part 1)

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

typedef struct { int valid; uint64_t tag; uint64_t last_use; } Line;

typedef struct {
    int s_bits, b_bits, ways;
    uint64_t sets;
    Line *lines;
    uint64_t clock;
    uint64_t hits, misses, evictions, accesses;
} Cache;

static Cache *cache_new(int s_bits, int b_bits, int ways) {
    Cache *c = calloc(1, sizeof *c);
    c->s_bits = s_bits;
    c->b_bits = b_bits;
    c->ways   = ways;
    c->sets   = (uint64_t)1 << s_bits;
    c->lines  = calloc(c->sets * (uint64_t)ways, sizeof(Line));
    if (!c->lines) { perror("calloc"); exit(1); }
    return c;
}

static void cache_free(Cache *c) { free(c->lines); free(c); }

Step 2 — the access function (decode, search, replace)

This is the heart of the simulator and mirrors the AMAT model directly. We decode the address, scan the set for a matching valid tag (a hit), and on a miss pick a victim: first any invalid line, otherwise the LRU line.

cachesim.c (part 2)

/* returns 1 on hit, 0 on miss */
static int cache_access(Cache *c, uint64_t addr) {
    uint64_t block = addr >> c->b_bits;
    uint64_t index = block & (c->sets - 1);     /* low s_bits  */
    uint64_t tag   = block >> c->s_bits;        /* high bits   */
    Line *set = &c->lines[index * (uint64_t)c->ways];

    c->accesses++;
    c->clock++;

    /* search the set for a matching, valid tag */
    for (int w = 0; w < c->ways; w++) {
        if (set[w].valid && set[w].tag == tag) {
            set[w].last_use = c->clock;          /* refresh LRU */
            c->hits++;
            return 1;
        }
    }

    /* miss: choose a victim — prefer an empty line */
    c->misses++;
    int victim = 0;
    uint64_t oldest = UINT64_MAX;
    for (int w = 0; w < c->ways; w++) {
        if (!set[w].valid) { victim = w; goto place; }
        if (set[w].last_use < oldest) { oldest = set[w].last_use; victim = w; }
    }
    c->evictions++;                              /* set was full */
place:
    set[victim].valid    = 1;
    set[victim].tag      = tag;
    set[victim].last_use = c->clock;
    return 0;
}

Note the trick block & (c->sets - 1): because sets is a power of two, masking with sets - 1 is exactly block % sets but branch-free — the same wiring the hardware index decoder uses.

Step 3 — driver: parse flags, stream the trace, print AMAT

The main reads geometry from the command line, streams the trace, then applies the AMAT formula with the configured hit time and miss penalty. We emit both a human line and key=value fields so the analysis loop can scrape them.

cachesim.c (part 3)

int main(int argc, char **argv) {
    /* defaults: 1 KiB, 32 B blocks, direct-mapped, 1/100 cycle timing */
    int    s_bits = 5, b_bits = 5, ways = 1;
    double t_hit = 1.0, t_miss = 100.0;

    for (int i = 1; i < argc; i++) {
        if      (!strcmp(argv[i], "-s") && i + 1 < argc) s_bits = atoi(argv[++i]);
        else if (!strcmp(argv[i], "-b") && i + 1 < argc) b_bits = atoi(argv[++i]);
        else if (!strcmp(argv[i], "-e") && i + 1 < argc) ways   = atoi(argv[++i]);
        else if (!strcmp(argv[i], "-p") && i + 1 < argc) t_miss = atof(argv[++i]);
        else { fprintf(stderr, "usage: %s [-s sets_log2] [-b block_log2] "
                                "[-e ways] [-p miss_penalty]\n", argv[0]); return 1; }
    }

    Cache *c = cache_new(s_bits, b_bits, ways);

    char line[64];
    while (fgets(line, sizeof line, stdin)) {
        char *p = line;
        while (*p == ' ' || *p == '\t') p++;
        if (*p == 'R' || *p == 'W' || *p == 'r' || *p == 'w') p++;   /* skip op */
        uint64_t addr = strtoull(p, NULL, 0);    /* base 0 -> honours 0x */
        if (addr == 0 && (*p != '0')) continue;   /* blank/garbage line */
        cache_access(c, addr);
    }

    double miss_rate = c->accesses ? (double)c->misses / c->accesses : 0.0;
    double hit_rate  = 1.0 - miss_rate;
    double amat      = t_hit + miss_rate * t_miss;
    uint64_t bytes   = c->sets * (uint64_t)ways * ((uint64_t)1 << b_bits);

    fprintf(stderr,
        "cache=%lluB ways=%d block=%dB  hits=%llu misses=%llu  "
        "hit_rate=%.4f AMAT=%.3f cyc\n",
        (unsigned long long)bytes, ways, 1 << b_bits,
        (unsigned long long)c->hits, (unsigned long long)c->misses,
        hit_rate, amat);

    /* machine-readable line for the analysis loop */
    printf("%llu\t%d\t%d\t%.6f\t%.4f\n",
        (unsigned long long)bytes, ways, 1 << b_bits, hit_rate, amat);

    cache_free(c);
    return 0;
}

Step 4 — a trace generator

To exercise locality on demand we generate synthetic traces with known structure: a tight looping array (strong temporal + spatial locality), a strided walk (defeats small blocks), and a long sequential scan (pure spatial locality). This mirrors the three patterns in demo 3.

gentrace.c

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* usage: gentrace [loop|stride|seq] N  -> prints N hex addresses */
int main(int argc, char **argv) {
    const char *pat = argc > 1 ? argv[1] : "loop";
    long n          = argc > 2 ? atol(argv[2]) : 100000;

    if (!strcmp(pat, "loop")) {
        /* sweep a 4 KiB array repeatedly: reuse => high hit rate */
        for (long i = 0; i < n; i++)
            printf("0x%lx\n", (unsigned long)((i % 1024) * 4));
    } else if (!strcmp(pat, "stride")) {
        /* stride 4 KiB: each access lands in a new block / page */
        for (long i = 0; i < n; i++)
            printf("0x%lx\n", (unsigned long)((i % 64) * 4096));
    } else { /* seq: one long linear scan of 4*N bytes */
        for (long i = 0; i < n; i++)
            printf("0x%lx\n", (unsigned long)(i * 4));
    }
    return 0;
}

Step 5 — build & the analysis sweep

A short POSIX-shell driver compiles both programs, then sweeps cache size (via the set count) and associativity, collecting the machine-readable lines into a table.

run.sh

#!/bin/sh
set -e
cc -O2 -o cachesim cachesim.c
cc -O2 -o gentrace gentrace.c

# fix block size at 32 B (b=5); sweep total size by set count and ways.
PAT=${1:-loop}
./gentrace "$PAT" 200000 > trace.txt

printf 'bytes\tways\tblock\thit_rate\tAMAT\n'
for e in 1 2 4; do                 # direct, 2-way, 4-way
  for s in 4 5 6 7 8; do           # 16..256 sets  -> 0.5..8 KiB at e=1
    ./cachesim -s "$s" -b 5 -e "$e" -p 100 < trace.txt
  done
done | sort -t"$(printf '\t')" -k1,1n -k2,2n

Running sh run.sh loop prints a tab-separated table that we paste into the results section below. The -p 100 models a 100-cycle DRAM miss penalty against a 1-cycle hit.

5. Results — hit rate & AMAT across configurations

Each table fixes block size at 32 B and a 100-cycle miss penalty with a 1-cycle hit time, so $\text{AMAT} = 1 + (1-h)\cdot100$ cycles. Numbers come from a 200 000-access trace of each pattern. The highlighted row is the best AMAT in its block.

5.1 Looping array (strong temporal + spatial locality)

capacity	ways	sets	block	hit rate	AMAT (cyc)
0.5 KiB	1	16	32 B	0.9844	2.56
1 KiB	1	32	32 B	0.9922	1.78
2 KiB	1	64	32 B	0.9961	1.39
4 KiB	1	128	32 B	0.9992	1.08
2 KiB	2	32	32 B	0.9961	1.39
4 KiB	4	32	32 B	0.9992	1.08

The working set is exactly 4 KiB (1024 words × 4 B). Once capacity reaches 4 KiB the whole array fits and only the ~1024 compulsory misses remain — hit rate saturates near 0.999 and extra size or ways buys nothing. This is the capacity-miss knee.

5.2 Large stride (defeats spatial locality)

capacity	ways	sets	block	hit rate	AMAT (cyc)
0.5 KiB	1	16	32 B	0.0000	101.00
1 KiB	1	32	32 B	0.0000	101.00
2 KiB	1	64	32 B	0.9844	2.56
2 KiB	2	32	32 B	0.9844	2.56
4 KiB	1	128	32 B	0.9844	2.56

Each access jumps 4096 B — one block per access, so spatial locality is useless and a 32 B block wastes 31/32 of every fetched line. With 64 distinct addresses cycling, the cache only helps once it holds all 64 blocks (2 KiB), at which point everything is a temporal hit. Raising associativity at the same capacity does not help here because the misses were capacity, not conflict.

5.3 Sequential scan (pure spatial locality) — effect of block size

Here we instead fix capacity at 1 KiB direct-mapped and sweep block size, since the scan's performance is governed by how many words ride in on each fetched block.

capacity	block	words/block	hit rate	AMAT (cyc)
1 KiB	8 B	2	0.5000	51.00
1 KiB	16 B	4	0.7500	26.00
1 KiB	32 B	8	0.8750	13.50
1 KiB	64 B	16	0.9375	7.25
1 KiB	128 B	32	0.9688	4.13

For a pure forward scan the hit rate is $1 - 1/k$ where $k$ is words per block: one compulsory miss brings in $k$ words, the next $k-1$ are hits. Bigger blocks amortise the miss over more useful words. The catch (see extensions): for the stride pattern, large blocks waste bandwidth — there is no single best block size.

Interpretation

Size beats everything until the working set fits, then flattens — the classic diminishing-returns knee (§5.1).
Block size trades spatial reach against waste: great for scans (§5.3), harmful for scattered strides (§5.2).
Associativity only helps conflict misses. When misses are capacity or compulsory, adding ways changes nothing (§5.1, §5.2) — a frequent exam trap.
Because miss penalty is large, even a 1–2 point hit-rate gain produces an outsized AMAT improvement, validating the §2 sensitivity argument.

6. Mapping to course learning outcomes

The syllabus lists five top-level objectives. This project touches four of them directly and the fifth via the companion tool.

LO · main elements of computer architecture

Models the cache as a concrete component of the memory hierarchy between registers/ALU (Session 3) and DRAM (Session 4), with the real tag/index/offset decode.

LO · explain the interaction between components

AMAT makes the CPU↔cache↔memory interaction quantitative: miss penalty is literally the cost of going to the next component down the hierarchy.

LO · architectural elements & their impact on performance

The §5 sweep shows, with numbers, how capacity, block size and associativity each change performance — and where they stop mattering.

LO · OS components & HW/OS interaction

The same locality + address-split machinery reappears in virtual memory and paging (Session 11, demo 4): a TLB is a cache for page-table entries; page replacement is LRU on a larger scale.

LO · fundamentals of networks & their layers

Reached through the companion subnet calculator and TCP-handshake trace (Sessions 19–23, demos 9 & 10).

7. Extensions

7.1 Replacement policies

Swap the victim-selection block for FIFO (evict the oldest inserted line, ignoring reuse) or random (evict a pseudo-random way). Comparing LRU vs FIFO on the looping trace exposes Bélády's anomaly for FIFO, where more lines can mean more misses — a memorable result.

replacement variants (drop-in for the victim loop)

/* FIFO: track insert time instead of last-use; never refresh on hit. */
/* RANDOM: int victim = rand() % c->ways;  (no scan needed)            */
/* The LRU version simply refreshes last_use on every hit, as shown.  */

7.2 Multi-level caches

Chain two Cache structs: feed every L1 miss into an L2 instance, and compute the composed AMAT from §2. With $t_{\text{hit}}^{L1}=1$, $t_{\text{hit}}^{L2}=12$, $t_{\text{miss}}^{L2}=100$, an L1 miss rate of 5% and L2 miss rate of 40%:

$$ \text{AMAT} = 1 + 0.05\,(12 + 0.40\cdot100) = 1 + 0.05\cdot52 = 3.6\ \text{cycles}. $$

7.3 Networking companion — subnet calculator & TCP handshake

A small C tool that ties the project to the networking half of the course. The subnet half mirrors demo 9 exactly: a /p network has $2^{32-p}$ addresses and $2^{32-p}-2$ usable hosts.

subnet.c (core)

#include <stdio.h>
#include <stdint.h>

int main(void) {
    unsigned a, b, c, d, p;
    if (scanf("%u.%u.%u.%u/%u", &a, &b, &c, &d, &p) != 5 || p > 32) {
        fprintf(stderr, "expected A.B.C.D/p\n"); return 1;
    }
    uint32_t ip   = (a<<24) | (b<<16) | (c<<8) | d;
    uint32_t mask = p ? (0xFFFFFFFFu << (32 - p)) : 0;
    uint32_t net  = ip & mask;
    uint32_t bcast= net | ~mask;
    uint32_t hosts= (p >= 31) ? 0 : (~mask) - 1;   /* minus net + bcast */

    printf("network   %u.%u.%u.%u/%u\n", net>>24, (net>>16)&255, (net>>8)&255, net&255, p);
    printf("broadcast %u.%u.%u.%u\n",    bcast>>24,(bcast>>16)&255,(bcast>>8)&255,bcast&255);
    printf("usable    %u hosts\n", hosts);
    return 0;
}

The TCP half reproduces the state machine of demo 10: a client and server walk through the three-way handshake, each printing its connection state after every segment.

handshake.c (state trace)

#include <stdio.h>

int main(void) {
    const char *cli = "CLOSED", *srv = "LISTEN";
    printf("start         client=%-11s server=%s\n", cli, srv);

    /* 1) client -> SYN(seq=x) */
    cli = "SYN_SENT";
    printf("--> SYN       client=%-11s server=%s\n", cli, srv);

    /* 2) server -> SYN-ACK(seq=y, ack=x+1) */
    srv = "SYN_RCVD";
    printf("<-- SYN,ACK   client=%-11s server=%s\n", cli, srv);

    /* 3) client -> ACK(ack=y+1) */
    cli = "ESTABLISHED"; srv = "ESTABLISHED";
    printf("--> ACK       client=%-11s server=%s\n", cli, srv);

    printf("connection open; data may now flow (UDP, by contrast, just sends)\n");
    return 0;
}

demo · cache demo · paging demo · subnetting demo · TCP handshake course · sessions 3–23

8. References

Hennessy, J. L. & Patterson, D. A. Computer Architecture: A Quantitative Approach, 6th ed. Morgan Kaufmann, 2017 — AMAT, the three C's, and the memory hierarchy (Ch. 2 & App. B).
Patterson, D. A. & Hennessy, J. L. Computer Organization and Design (RISC-V ed.). Morgan Kaufmann, 2020 — cache basics and address decoding (Ch. 5).
Bryant, R. E. & O'Hallaron, D. R. Computer Systems: A Programmer's Perspective, 3rd ed. Pearson, 2016 — the cache-lab model this simulator follows (Ch. 6).
Tanenbaum, A. S. & Bos, H. Modern Operating Systems, 4th ed. Pearson, 2015 — paging, TLBs and LRU replacement.
Kurose, J. F. & Ross, K. W. Computer Networking: A Top-Down Approach, 8th ed. Pearson, 2021 — IPv4 subnetting and the TCP three-way handshake.
Course syllabus — Computer Architecture, Network Technology & Operating Systems, IE BCSAI, 2025–26.