Stack vs Heap
In a managed language — Python, JavaScript, Java, Go — when you write data = [1, 2, 3], the runtime quietly puts the array on the heap, hands you back a reference, and a garbage collector eventually frees it for you. You don’t decide where it lives. You don’t decide when it dies. The runtime hides the hardware from you.
C++ peels that hiding away. Every variable has to live somewhere — and you’re the one choosing. There are two regions to choose from: the stack and the heap. Picking the right one is the single biggest lever you have on systems-code performance.
This lesson is about the choice and what it costs.
TL;DR
- The stack is fast, automatic, and tiny — like a stack of plates that the function call itself manages.
- The heap is large, flexible, and slow to allocate from — you ask for memory, you (or RAII) give it back.
- A stack allocation is ~1 ns; a heap allocation is ~50–500 ns. The 100× gap is the lesson.
- ML runtimes obey one rule on the hot path: don’t touch the allocator. Pre-allocate everything at startup and reuse.
The two regions, in plain English
The stack is a small, strictly organized block of fast memory tied to your function calls. When a function starts, it gets a chunk of stack at the top. When it returns, the chunk is instantly reclaimed. That’s it.
You can think of it literally as a stack of plates: you can only add to the top, you can only take from the top, and the plates leave in the reverse order they arrived. Computer scientists call this LIFO — Last-In, First-Out.
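To make the LIFO picture concrete, here is a minimal sketch (the function and variable names are purely illustrative): two sibling calls each get their locals carved out of the stack on entry and give them back on return, and the second call will typically land on the very addresses the first one just vacated.

```cpp
#include <cstdio>

// Each call gets its locals from the stack on entry; they are reclaimed
// the instant the call returns. No allocator, no cleanup code.
void first()  { int a[4] = {1, 2, 3, 4}; std::printf("first:  %p\n", (void*)a); }
void second() { int b[4] = {5, 6, 7, 8}; std::printf("second: %p\n", (void*)b); }

int main() {
    first();   // a lives only while first() runs
    second();  // b will often reuse the address a just vacated
}
```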
The heap is the rest of your program’s memory — a much larger, free-form pool. You ask an allocator for some, you get a pointer back, and you’re responsible for handing it back later. Or, in modern C++, an RAII wrapper hands it back for you when its enclosing scope ends.
| | Stack | Heap |
|---|---|---|
| Size | small (~8 MB per thread) | huge (gigabytes) |
| Allocation cost | ~1 nanosecond | ~50–500 ns |
| Lifetime | bound to a function call | until you free it |
| Who cleans up | the language, automatically | you (or RAII / smart pointers) |
| Failure mode | overflow if you exceed 8 MB | leak if you forget to free |
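The two failure modes in the last row of the table, sketched as a pair of deliberately broken functions (the sizes are illustrative, and the 8 MB limit is a common Linux default, not a universal constant):

```cpp
void stack_overflow() {
    float big[4 * 1024 * 1024];   // 16 MB of locals: blows past a typical 8 MB stack
    big[0] = 1.0f;                // touching it usually crashes with SIGSEGV
}

void heap_leak() {
    float* buf = new float[4 * 1024 * 1024];  // 16 MB on the heap: perfectly legal
    buf[0] = 1.0f;
    // returning without `delete[] buf` doesn't crash; it just leaks the memory
}
```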
The 100× cost gap is what this lesson is really about.
Why this matters for ML systems
When you write vector<float> x(1024) inside a token-generation loop, you’re not just creating a vector. You’re calling the heap allocator, which may grab a global lock, which may ask the OS for a fresh memory page, which triggers a page fault. Do this once per token across 32 layers and you’ve capped your throughput long before any matrix multiplication runs.
Understanding which region a variable lives in is what separates code that runs at 2 ms per token from code that runs at 200 ms per token. It’s the first lens through which to read systems code.
Mental model
The stack is a contiguous region that grows and shrinks with function calls — push on entry, pop on return. The heap is a free-form pool managed by an allocator (typically glibc’s ptmalloc, or a library replacement like jemalloc / tcmalloc).
The stack is fast because allocation is just one CPU instruction: move the stack pointer down by N bytes. Done.
The heap is slow because the allocator has to find a free block of the right size, possibly split or merge blocks, and update bookkeeping structures — all while being thread-safe.
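A toy illustration of why the gap exists (this is not any real allocator, just a sketch): stack-style allocation is one pointer move plus a bounds check, which is roughly what the compiler does with the stack pointer on function entry, while a general-purpose malloc has to search for a suitable free block and update shared bookkeeping.

```cpp
#include <cstddef>

// Toy bump allocator: allocation is one pointer move, like the stack.
struct BumpArena {
    std::byte* top;    // next free byte
    std::byte* limit;  // end of the arena

    void* alloc(std::size_t n) {
        if (n > static_cast<std::size_t>(limit - top)) return nullptr;  // out of room
        void* p = top;
        top += n;       // the entire "allocation"
        return p;
    }
    // No per-object free: like the stack, everything is reclaimed at once
    // by resetting `top`, which is also why it can't serve arbitrary lifetimes.
};
```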
Where each variable actually lives
Look at where each variable in this function ends up:
```cpp
void run_inference() {
    float scratch[256];                       // STACK — 1024 bytes, free
    std::array<float, 256> arr;               // STACK — same as above
    Tensor t(256);                            // STACK header, but t's *data* is HEAP
    std::vector<float> v(256);                // STACK header, HEAP buffer
    auto p = std::make_unique<float[]>(256);  // STACK pointer, HEAP buffer

    // Cost roughly:
    //   scratch → 1 instruction
    //   arr     → 1 instruction
    //   t       → ~100 ns malloc (inside Tensor's constructor)
    //   v       → ~100 ns malloc
    //   p       → ~100 ns malloc
}
// All five are cleaned up automatically when run_inference returns.
// scratch and arr never touched the heap — pure stack work.
```

Notice the gotcha on std::vector<float> v(256): the header (size, capacity, pointer) lives on the stack, but the data buffer the pointer points to lives on the heap. The variable looks local — its cost isn’t.
This is the most common pattern in C++: a stack-allocated wrapper holding a heap-allocated buffer. The wrapper handles the cleanup automatically (that’s what destructors do), but you still pay the heap cost on construction.
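The Tensor used in the walkthrough above isn’t defined anywhere in this lesson; a minimal version of that wrapper pattern could look like the sketch below (an illustration, not any particular library’s Tensor):

```cpp
#include <cstddef>

// A small header that owns a heap buffer, the same shape as std::vector:
// cheap to declare as a local, but construction still pays the malloc cost.
class Tensor {
public:
    explicit Tensor(std::size_t n) : size_(n), data_(new float[n]()) {}  // heap hit here
    ~Tensor() { delete[] data_; }   // destructor frees the buffer when the scope ends

    Tensor(const Tensor&) = delete;            // keep the sketch simple: no copies
    Tensor& operator=(const Tensor&) = delete;

    float*      data()       { return data_; }
    std::size_t size() const { return size_; }

private:
    std::size_t size_;  // stored inline in the object (on the stack when it's a local)
    float*      data_;  // the pointer is inline too; what it points to is on the heap
};
```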
The pattern: pre-allocate once, reuse forever
The stack is too small for tensors. The heap is too slow to touch repeatedly. So production ML runtimes do something specific: pay the heap cost once at startup, then never touch the allocator again on the hot path.
```cpp
class InferenceEngine {
    // Allocated ONCE at startup. Lives on the heap, but stays put.
    std::vector<float> kv_cache_;
    std::vector<float> scratch_;

    void forward_token(/* ... */) {
        // Hot path: zero allocations.
        // Use scratch_.data() as a working buffer.
    }
};
```

This pattern — hoist allocations out of the hot path — is everywhere in production ML runtimes (vLLM, llama.cpp, ONNX Runtime, TensorRT). It’s why the KV cache is sized at startup based on the max batch and max sequence length: one big allocation, then pure arithmetic on stable buffers for the rest of the program’s life.
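What that startup sizing might look like in code. The dimension names (max_batch, max_seq, n_layers, hidden_dim) and the 2× factor for keys and values are illustrative assumptions, not the layout of any specific engine:

```cpp
#include <cstddef>
#include <vector>

class InferenceEngine {
public:
    // One big allocation per buffer at construction time; after this,
    // forward_token() only reads and writes memory that already exists.
    InferenceEngine(std::size_t max_batch, std::size_t max_seq,
                    std::size_t n_layers, std::size_t hidden_dim)
        : kv_cache_(2 * n_layers * max_batch * max_seq * hidden_dim),  // keys + values
          scratch_(max_batch * hidden_dim) {}

    void forward_token(/* ... */) {
        // Hot path: index into kv_cache_ and scratch_, never allocate.
    }

private:
    std::vector<float> kv_cache_;
    std::vector<float> scratch_;
};
```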
The numbers
Rough cost model on a modern x86 server, single thread, allocator warm:
| Operation | Cost |
|---|---|
| Stack allocation (int x;) | ~0.3 ns |
| malloc(64) / new int | ~30–80 ns |
| malloc(1 MB) (likely needs mmap) | ~1–10 µs |
| Page fault on first touch (4 KB page) | ~1 µs |
| First-time TLB miss after allocation | ~100 ns |
If your generation loop runs 100 tokens/sec and each token does 5 small mallocs, that’s 500 allocations per second — fine. But if it does 5 per layer per token across 32 layers, that’s 160 per token, or 16 000 per second. At ~100 ns each that is roughly 16 µs of pure allocator overhead per token, and if even a fraction of those calls hit a slow path (a lock, an mmap, a page fault) the cost climbs toward a millisecond per token, eating a meaningful chunk of a latency budget that should belong to the actual matrix multiplications.
That’s the punchline. Allocations don’t show up in the math, but they show up in the wall clock.
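A low-tech way to check whether a loop really is allocation-free is to route the global operator new through a counter and read it around the region you care about. This is a debugging sketch, not production instrumentation (for full coverage you would also replace the array and aligned forms):

```cpp
#include <atomic>
#include <cstdio>
#include <cstdlib>
#include <new>

// Every `new` in the program bumps this counter.
static std::atomic<std::size_t> g_allocs{0};

void* operator new(std::size_t n) {
    g_allocs.fetch_add(1, std::memory_order_relaxed);
    if (void* p = std::malloc(n ? n : 1)) return p;
    throw std::bad_alloc{};
}
void operator delete(void* p) noexcept { std::free(p); }

int main() {
    const std::size_t before = g_allocs.load();
    // ... run one token's worth of work here ...
    const std::size_t after = g_allocs.load();
    std::printf("allocations on the hot path: %zu\n", after - before);
}
```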
Hands-on (optional)
A small benchmark you can run locally to feel the gap yourself:
```cpp
#include <chrono>
#include <cstdio>

int main() {
    using clk = std::chrono::high_resolution_clock;
    constexpr int N = 1'000'000;

    auto t0 = clk::now();
    for (int i = 0; i < N; i++) { volatile float a[64]; (void)a; }
    auto t1 = clk::now();
    for (int i = 0; i < N; i++) { auto* a = new float[64]; delete[] a; }
    auto t2 = clk::now();

    std::printf("stack: %.1f ns/iter\n", std::chrono::duration<double, std::nano>(t1 - t0).count() / N);
    std::printf("heap: %.1f ns/iter\n", std::chrono::duration<double, std::nano>(t2 - t1).count() / N);
}
```

Compile with g++ -O2 stackheap.cpp -o stackheap && ./stackheap. Expect a 30–100× ratio. That ratio is the number to keep in your head whenever you’re reading systems code.
Key takeaways
- Stack alloc ≈ free; heap alloc ≈ 100 ns. The 100× gap is what makes “no allocations on the hot path” a hard rule in ML runtimes.
- A std::vector<float> v puts the header on the stack but the buffer on the heap. Don’t be fooled by the variable’s apparent locality.
- Pre-allocate once, reuse forever. Production ML engines hoist all allocations to startup or to a memory arena.
- The heap can fault, fragment, and lock. It’s not just slow — it’s also unpredictable. The stack is deterministic.
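One standard way to get the memory-arena behavior from the bullet above, available since C++17, is std::pmr (see Further reading): back a container with a monotonic_buffer_resource over a buffer you already own, and its allocations become pointer bumps into that buffer instead of calls to the global heap allocator.

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

int main() {
    // A fixed buffer; in a real engine this could itself be one
    // startup-time heap allocation.
    std::array<std::byte, 64 * 1024> buffer;

    // Allocations from this resource bump a pointer inside `buffer`;
    // nothing is freed individually until the resource is destroyed.
    std::pmr::monotonic_buffer_resource arena(buffer.data(), buffer.size());

    // Same std::vector API, but its element storage comes from the arena.
    std::pmr::vector<float> scratch(&arena);
    scratch.resize(256);
}
```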
Further reading
- What every programmer should know about memory — Ulrich Drepper, the canonical reference. Long, but the early chapters on memory hierarchy reward rereading.
- jemalloc design notes — how a production allocator actually works under contention.
- std::pmr — Polymorphic Memory Resources — the modern C++ way to plug in your own allocator on a per-container basis.