Cache Lines
Here’s a fact that will reorganize how you think about performance: the CPU never reads a single byte from RAM. It always reads 64 bytes — a chunk called a cache line. Even if your code asks for one float, the hardware drags 64 bytes of RAM into the on-chip cache and hands you 4 of those bytes.
This is why your program’s memory access pattern matters more than its number of operations. A loop that reads adjacent elements gets ~16 floats per cache fetch. A loop that jumps around fetches 64 bytes and uses 4. Same FLOPs; vastly different wall-clock time.
The whole story of CPU performance (the L1/L2/L3 hierarchy, why arrays beat linked lists, why struct layout matters, why false sharing ruins multi-threaded counters) comes down to this single picture: 64 bytes at a time, getting more expensive the further from the core you go.
TL;DR
- A cache line is the unit of memory transfer between RAM and CPU caches: typically 64 bytes on x86 / ARM, occasionally 128. You never read 1 byte from RAM; you always read 64.
- Sequential access uses every byte of each cache line you fetch → bandwidth-efficient. Strided access wastes 87% of the bandwidth (using 8 bytes of every 64-byte fetch).
- L1 (32–64 KB), L2 (256 KB – 2 MB), L3 (32 MB – 100+ MB), DRAM (TB) — each level is ~5–10× slower and ~10× larger than the one above. Knowing this hierarchy is the entire CPU performance story.
- Cache-friendly = arrays > pointer chasing. Linked lists, hash tables with chaining, tree traversal all suffer from cache misses; flat arrays and contiguous structs win on modern hardware.
- ML systems engineering applies the same idea on GPU: HBM cache lines are 32–128 bytes; the same locality discipline that wins on CPU wins on GPU coalescing.
Why this matters
The single largest source of “code that should be fast but isn’t” on modern hardware is cache misses. A pointer-chase that misses L1 is ~5 ns. Missing L2 is ~15 ns. Missing L3 (a true DRAM read) is ~100–300 ns. A function that does one DRAM read per element on a 1M-element array spends 100 ms just waiting for memory. The compiler can’t fix this — it’s a data-layout issue. Knowing how cache lines work is the first lens for any “why is my code slow” investigation.
Mental model
Each level is roughly 10× slower than the one above it. Code that fits in L1 runs ~300× faster than code that misses to DRAM. That ratio is why a “fast algorithm” that miss-thrashes the cache often loses to a “slow algorithm” that streams memory linearly.
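A back-of-envelope sketch of that ratio, using representative latencies rather than measurements of any particular CPU:

```python
# Representative per-access latencies (ns) for each memory level.
# Illustrative values, not measurements of a specific machine.
LATENCY_NS = {"L1": 1, "L2": 3, "L3": 10, "DRAM": 100}

def total_ms(accesses: int, level: str) -> float:
    """Time spent waiting on memory if every access hits `level`."""
    return accesses * LATENCY_NS[level] / 1e6

# 1M dependent accesses, all hitting L1 vs. all missing to DRAM:
print(total_ms(1_000_000, "L1"))    # 1.0 ms
print(total_ms(1_000_000, "DRAM"))  # 100.0 ms
```

Same access count, two orders of magnitude apart; that gap is the whole argument for data layout.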
Concrete walkthrough
The 64-byte unit
When the CPU wants the byte at address 0x1234, the memory controller doesn’t fetch one byte: it fetches the whole 64-byte cache line containing 0x1234 and installs the entire line in L1. Subsequent reads of any byte in that line are then effectively free.
For a 4-byte float, one cache line = 16 floats.
```cpp
constexpr int N = 1 << 24;         // 16M floats, far larger than any cache
std::vector<float> data(N, 1.0f);  // contiguous storage

// Sequential walk: 1 cache miss per 16 floats
float sum = 0;
for (int i = 0; i < N; ++i) {
    sum += data[i];  // i=0 misses, i=1..15 hit in L1, i=16 misses again
}

// Strided walk: 1 cache miss per access once the stride spans a full
// 64-byte line (16 floats); here the stride is 32 floats.
sum = 0;
for (int i = 0; i < N; i += 32) {
    sum += data[i];  // every read lands on a fresh cache line
}
```

The strided walk does 1/32 as many additions, but every access misses a fresh cache line, versus one miss per 16 accesses in the sequential walk. In practice it is often ~10× slower per element touched, because it is bound by memory wait time rather than arithmetic.
Real numbers
| Cache level | Size | Latency | Bandwidth (per core) |
|---|---|---|---|
| L1d | 32–48 KB | 1 ns | ~200 GB/s |
| L2 | 256 KB – 2 MB | 3 ns | ~150 GB/s |
| L3 | 32 MB – 100 MB | 10 ns | ~80 GB/s |
| DRAM | TB | 100–300 ns | ~30 GB/s (single core) |
These vary by CPU generation but the ratios are stable. The latency gap from L1 to DRAM is the gap between “fast” and “slow” code.
Why arrays beat linked lists
Linked list traversal:
```cpp
struct Node { int val; Node* next; };

Node* p = head;
while (p) { sum += p->val; p = p->next; }
```

Each `p->next` is typically a different DRAM address (heap allocations scatter the nodes). Every iteration is a cache miss: for a 1M-node list, ~1M × 100 ns ≈ 100 ms of pure DRAM-wait time.
Array traversal:
```cpp
std::vector<int> v(1'000'000);
for (int x : v) sum += x;
```

Sequential reads: 1 miss per 16 ints, and the hardware prefetcher hides most of those. For 1M elements: ~62K misses × 100 ns ≈ 6 ms. Roughly 17× faster on the same total data.
This is why C++ veterans always reach for std::vector over std::list. Linked lists are correct; they’re almost never fast.
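The effect is visible even from Python, where a shuffled next-index chain stands in for a linked list. This is a hypothetical micro-benchmark, not from the original text; Python overhead shrinks the gap, and absolute times depend on the machine:

```python
import random
import time
from array import array

N = 1_000_000

# "Linked list": nxt[i] is the index of the next node. Shuffling makes
# each hop land far from the previous one, like scattered heap nodes.
order = list(range(1, N)) + [0]   # a permutation of 0..N-1
random.shuffle(order)
nxt = array("i", order)

t0 = time.perf_counter()
i, hops = 0, 0
while hops < N:                   # N dependent, scattered reads
    i = nxt[i]
    hops += 1
chase = time.perf_counter() - t0

# "Array": the same N reads, but sequential.
t0 = time.perf_counter()
s = 0
for v in nxt:
    s += v
seq = time.perf_counter() - t0

print(f"pointer-chase: {chase:.3f}s  sequential: {seq:.3f}s")
```

The sequential pass also lets the interpreter iterate the buffer directly, so part of its win here is Python-level; in compiled code the cache effect dominates.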
Struct layout matters
```cpp
// Bad: sizeof = 80. Reading {x, y, z, mass} each step pulls TWO cache lines:
// the cold tag/padding sits between the hot fields and pushes `mass` past byte 64.
struct Particle {
    double x, y, z;      // bytes 0..23  (line 1)
    char   tag;          // byte  24
    char   padding[47];  // bytes 25..71 (cold metadata)
    double mass;         // bytes 72..79 (line 2: a second cache miss per step)
};

// Good: hot fields adjacent. sizeof = 32, fits in one 64-byte line. Cold
// metadata moves to a parallel array or a trailing struct.
struct ParticleHot {
    double x, y, z, mass;  // bytes 0..31, all in one cache line
};
```

Order struct fields by access frequency: hot fields together, cold fields after. The Linux tool pahole (a struct-layout pretty-printer from the dwarves package) shows the actual offsets and padding.
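You can check the layout claim without a compiler, using Python's ctypes (a sketch; the field names mirror the C++ structs above):

```python
import ctypes

# Mirror of the "bad" Particle layout above.
class ParticleBad(ctypes.Structure):
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double),
                ("z", ctypes.c_double),
                ("tag", ctypes.c_char),
                ("padding", ctypes.c_char * 47),
                ("mass", ctypes.c_double)]

# Mirror of the "good" layout: hot fields only, adjacent.
class ParticleHot(ctypes.Structure):
    _fields_ = [("x", ctypes.c_double), ("y", ctypes.c_double),
                ("z", ctypes.c_double), ("mass", ctypes.c_double)]

print(ctypes.sizeof(ParticleBad))   # 80
print(ParticleBad.mass.offset)      # 72 -- past byte 64: a second cache line
print(ctypes.sizeof(ParticleHot))   # 32 -- fits in one 64-byte line
```

The `offset` attribute on each field descriptor is exactly what pahole would report for the compiled struct.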
Structure-of-Arrays (SoA) vs Array-of-Structures (AoS)
```cpp
// AoS (Array of Structures): memory layout is x0, y0, z0, x1, y1, z1, ...
struct Particle { float x, y, z; };
std::vector<Particle> particles;

// SoA (Structure of Arrays): memory layout is x0, x1, ..., y0, y1, ..., z0, z1, ...
struct Particles {
    std::vector<float> x, y, z;
};
```

If your hot loop only touches x (e.g., a sort by x), SoA is much better: every cache line is full of x-values. If your hot loop touches all three, AoS may be better: each iteration reads a contiguous (x, y, z) triple.
Most ML / numerics code is SoA-shaped. Most game code is AoS or a mix. The choice depends on your access pattern.
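A minimal stdlib sketch of the two layouts (numpy structured arrays would be the idiomatic tool; `array` is used here just to make the byte layout explicit):

```python
from array import array

n = 4
# AoS: one interleaved buffer, x0 y0 z0 x1 y1 z1 ...
aos = array("f", [v for i in range(n) for v in (i, 10 + i, 20 + i)])

# SoA: three dense buffers, each contiguous.
soa_x = array("f", [i for i in range(n)])
soa_y = array("f", [10 + i for i in range(n)])
soa_z = array("f", [20 + i for i in range(n)])

# Summing just x in AoS strides over every 12-byte (x, y, z) triple:
sum_x_aos = sum(aos[i * 3] for i in range(n))
# In SoA the same sum walks one dense buffer, using every byte fetched:
sum_x_soa = sum(soa_x)

print(sum_x_aos, sum_x_soa)  # 6.0 6.0 -- same answer, different bytes touched
```

Both layouts give the same result; what differs is how much of each fetched cache line the hot loop actually uses.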
Cache lines on GPU
A GPU “cache line” is the memory transaction size — 32 bytes (Hopper L2) or 128 bytes (Hopper HBM). For coalesced loads to be efficient, the 32 threads of a warp must read 32 adjacent 4-byte words → one transaction.
This is the same cache-line discipline, scaled up: on CPU, sequential access wins; on GPU, coalesced access wins. The principles transfer directly.
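A back-of-envelope model of coalescing (the 128-byte transaction size and 32-thread warp follow the text; everything else is deliberately simplified):

```python
def transactions(addresses, txn_bytes=128):
    """Count the distinct txn_bytes-sized segments a warp's loads touch."""
    return len({addr // txn_bytes for addr in addresses})

WARP = 32

# Coalesced: 32 threads read 32 adjacent 4-byte words -> 128 contiguous bytes.
coalesced = [tid * 4 for tid in range(WARP)]
# Strided: each thread's word sits 128 bytes from its neighbor's.
strided = [tid * 128 for tid in range(WARP)]

print(transactions(coalesced))  # 1
print(transactions(strided))    # 32
```

Same 128 bytes of useful data either way; the strided pattern just pays for 32 transactions to get it.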
False sharing
A subtle multi-threaded gotcha:
```cpp
// Two threads each update their own counter
struct Counters {
    std::atomic<int> a, b;  // both fit in one 64-byte cache line!
};

// Thread 1 increments .a; Thread 2 increments .b. The cache line ping-pongs
// between the two CPU cores, even though they're updating different fields.
```

Fix: pad so that each counter gets its own cache line:
```cpp
struct alignas(64) PaddedCounter { std::atomic<int> v; };

struct Counters {
    PaddedCounter a;
    PaddedCounter b;
};
```

The alignas(64) guarantees each counter sits on its own cache line. Slowdowns from false sharing can be 2–10×; the fix is cheap.
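ctypes has no alignas, but manual padding reproduces the layout the fix creates (a sketch; real C++ would use alignas(64) or std::hardware_destructive_interference_size rather than a hand-counted pad):

```python
import ctypes

# Unpadded: both counters land in the first 64 bytes -- one shared line.
class Unpadded(ctypes.Structure):
    _fields_ = [("a", ctypes.c_int), ("b", ctypes.c_int)]

# Padded: each counter occupies a full 64-byte slot (4-byte int + 60-byte pad).
class PaddedCounter(ctypes.Structure):
    _fields_ = [("v", ctypes.c_int), ("pad", ctypes.c_char * 60)]

class Padded(ctypes.Structure):
    _fields_ = [("a", PaddedCounter), ("b", PaddedCounter)]

def line(offset):
    """Which 64-byte cache line a byte offset falls in."""
    return offset // 64

print(line(Unpadded.a.offset), line(Unpadded.b.offset))  # 0 0 -- shared line
print(line(Padded.a.offset), line(Padded.b.offset))      # 0 1 -- separate lines
```

The padded layout only eliminates false sharing if the whole struct starts on a 64-byte boundary, which is what alignas(64) guarantees in the C++ version.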
Run it in your browser — sequential vs strided cost
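A stdlib stand-in for the in-browser demo (a hypothetical micro-benchmark; absolute numbers depend on the machine and interpreter):

```python
import time
from array import array

N = 1 << 22                      # 4M floats (~16 MB): bigger than most L2 caches
data = array("f", bytes(4 * N))  # zero-filled, contiguous buffer

def ns_per_element(stride):
    """Time one pass at the given stride; return ns per element actually read."""
    t0 = time.perf_counter()
    s = 0.0
    for i in range(0, N, stride):
        s += data[i]
    dt = time.perf_counter() - t0
    return dt * 1e9 / len(range(0, N, stride))

for stride in (1, 16, 64):
    print(f"stride {stride:3d}: {ns_per_element(stride):.0f} ns/element")
```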
The Python overhead damps the effect, but the shape (more strided → more misses → more time per byte) is right. In compiled code on a real CPU, the gap is 5–10×.
Key takeaways
- 64-byte cache lines are the unit of every memory transfer.
- Sequential access wins because it uses every byte of each line. Strided access wastes bandwidth.
- Latency hierarchy is brutal: L1 (1 ns) → DRAM (100+ ns). Code that fits in cache runs hundreds of times faster.
- Arrays beat linked lists for any hot path. AoS vs SoA is workload-specific.
- False sharing in multi-threaded code is the cache-line-aware gotcha — pad shared atomics to cache-line boundaries.
Go deeper
- Paper: “What Every Programmer Should Know About Memory”. The canonical 100-page treatment; Sections 3 and 6 are most relevant. Slightly dated on numbers, but the model is timeless.
- Blog: Algorithmica HPC, “CPU Cache”. Modern and hands-on, with microbenchmarks and a catalog of cache-friendly algorithms.
- Video: Mike Acton, “Data-Oriented Design and C++” (CppCon 2014). The definitive “the data layout is the program” talk.
- Docs: Intel Software Developer Manual. Volume 3 covers cache architecture. Reference, not tutorial.
- Docs: pahole (dwarves). The struct-layout introspection tool; run it on a binary to see padding, alignment, and cache-line straddling.