Branch Prediction

Prereq: Cache Lines. Branch prediction is the other major non-obvious cost on modern CPUs.
A modern CPU is, deep down, a gambler. Its pipeline is 15–20 stages deep, so at any given moment it has dozens of instructions in flight at different stages of execution: fetched, decoded, partway through, almost done. This is the trick that lets a 3 GHz chip retire multiple instructions per clock tick.
The catch: when the CPU hits an if, a for, or a while, it doesn’t yet know which branch will run. Stalling the pipeline to wait would burn the whole speedup. So instead, the CPU guesses, starts speculatively running the predicted path, and commits the work if it was right.
Right guess: ~free. Wrong guess: gets flushed, ~15–20 cycles wasted. Across a billion-iteration loop, that’s the difference between 1 second and 6 seconds of wall time, on the same code, on the same data.
This lesson is about the gambler — when it wins, when it loses, and what to write so it wins more often. The fix is sometimes algorithmic (avoid the branch entirely), sometimes a data-layout choice (group similar branches together), sometimes intrinsic-level (cmov, predication).
TL;DR
- Modern CPUs are deeply pipelined (15–20 stages), with many instructions in flight simultaneously. When the CPU hits a branch (if, for, while), it doesn’t know which path will execute, so it guesses and starts speculatively running the predicted path.
- If the guess is right: ~free, the speculation commits.
- If the guess is wrong: the pipeline gets flushed, ~15–20 cycles wasted. A misprediction costs 15–20× a simple add.
- Modern predictors are really good — branches that follow patterns (always taken, predictable loops, simple conditions) hit ~99% accuracy.
- Branches that depend on data with no pattern (sorting random values, hash lookups, RNG branches) hit ~50%. A 1B-iteration loop with 50% mispredict can spend half its time in pipeline stalls.
How the gambler works
Speculation is the engine that lets modern CPUs sustain ~3 instructions per cycle. Mispredicts break it.
The hardware tracks branch history: for each branch (identified by its instruction address), it remembers the recent pattern of taken / not-taken. The simplest predictors keep a small saturating counter per branch (Smith); two-level predictors (Yeh & Patt) add history, asking “for this branch, given the last 16 outcomes, which way did it go?”; modern Intel / AMD cores use even more elaborate schemes (TAGE, perceptron predictors). The upshot: patterns are easy. “Always taken” is trivial. “Alternating taken/not-taken” works. “Taken 3 times then not-taken” within a 16-entry history window works. Random-looking branches based on data values can’t be predicted; they hit the 50% baseline.
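To make the pattern claim concrete, here is a toy simulator (a sketch for intuition, assuming a made-up 8-bit local-history scheme, not how real hardware is organized): the branch’s last 8 outcomes index a table of 2-bit saturating counters. Repeating patterns converge to near-100% accuracy; random outcomes stay near 50%.

```cpp
#include <cstdio>
#include <cstdlib>
#include <vector>

// Toy local-history predictor: 8 bits of per-branch history select one of
// 256 two-bit saturating counters; counter >= 2 means "predict taken".
double accuracy(const std::vector<int>& outcomes) {
    unsigned history = 0;                         // last 8 outcomes, as bits
    std::vector<int> counter(256, 1);             // start weakly not-taken
    long correct = 0;
    for (int taken : outcomes) {
        int prediction = counter[history] >= 2;
        correct += (prediction == taken);
        if (taken  && counter[history] < 3) counter[history]++;   // train toward taken
        if (!taken && counter[history] > 0) counter[history]--;   // train toward not-taken
        history = ((history << 1) | taken) & 0xFF;
    }
    return 100.0 * correct / outcomes.size();
}

int main() {
    const int N = 1000000;
    std::vector<int> always(N, 1), pattern(N), random_data(N);
    for (int i = 0; i < N; ++i) pattern[i] = (i % 4 != 3);    // taken 3x, then not-taken
    for (int i = 0; i < N; ++i) random_data[i] = rand() & 1;  // no pattern at all
    std::printf("always taken : %.1f%%\n", accuracy(always));
    std::printf("3:1 pattern  : %.1f%%\n", accuracy(pattern));
    std::printf("random       : %.1f%%\n", accuracy(random_data));
}
```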
The famous Stack Overflow benchmark
The most-cited example, Stack Overflow circa 2012:
const unsigned arraySize = 32768;
int data[arraySize];
for (auto& x : data) x = rand() % 256;
// std::sort(data, data + arraySize); // <-- uncomment to make this 6× faster
int sum = 0;
for (int n = 0; n < 100000; ++n) {
for (int i = 0; i < arraySize; ++i) {
if (data[i] >= 128) sum += data[i];
}
}

With sorted data: the branch is predictable (first half always not-taken, second half always taken). Prediction accuracy ~100%. Runs fast.
With unsorted data: every data[i] >= 128 is a coin flip. Prediction accuracy ~50%. Same code runs 6× slower because half the branches stall the pipeline.
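A widely cited follow-up to that question removes the branch from the inner loop entirely, so there is nothing to predict; a minimal sketch of the trick (it relies on arithmetic right shift of a signed int, which every mainstream compiler provides):

```cpp
// Branch-free inner loop: turn the comparison into an all-ones / all-zero
// mask and AND it with the value, instead of branching on it.
for (int i = 0; i < arraySize; ++i) {
    int t = (data[i] - 128) >> 31;   // -1 (all ones) if data[i] < 128, else 0
    sum += ~t & data[i];             // adds data[i] only when data[i] >= 128
}
```

With this version, sorted and unsorted input run at essentially the same speed, because the work no longer depends on a guess.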
This single benchmark turned branch prediction from CPU-architecture trivia into something every working engineer learned to think about.
Real misprediction costs
| Operation | Cycles |
|---|---|
| Add / subtract | 1 |
| Multiply | 3 |
| L1 cache hit | 4 |
| Predicted branch | 0–1 |
| Mispredicted branch | 15–20 |
| L2 cache hit | 12 |
| L3 cache hit | 40 |
| DRAM access | 100–300 |
A misprediction is ~5× the cost of an L1 hit and ~15× the cost of an add. A loop running 1B iterations with a 50% mispredict rate wastes roughly 0.5B × 17 cycles ≈ 8.5B cycles, about 3 seconds of pure pipeline stall on a 3 GHz core. That’s why this matters.
Branchless code — when you can’t predict, don’t branch
When a branch is unpredictable, don’t branch at all:
// Branchful: misprediction-prone
int abs_branchful(int x) {
if (x < 0) return -x; else return x;
}
// Branchless: arithmetic + bitwise
int abs_branchless(int x) {
int mask = x >> 31; // -1 if x < 0, 0 otherwise (arithmetic shift)
return (x + mask) ^ mask; // negates if x < 0, else identity
}

The branchless version always runs the same instructions; no speculation needed. Faster on unpredictable inputs; slightly slower on predictable inputs (the branchful version’s misprediction cost amortizes to ~0 when prediction is correct).
The compiler often does this for you: cmov (conditional move) is an x86 instruction that selects one of two already-computed values based on a condition flag, with no control-flow branch to predict. Compilers emit cmov automatically when they think it’ll help.
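As a minimal illustration (exact codegen depends on compiler and flags), a side-effect-free ternary on ints is the pattern GCC and Clang most readily compile to a compare plus cmov instead of a jump:

```cpp
// Both arms are cheap and have no side effects, so the compiler can compute
// the value unconditionally and select it with cmov.
int clamp_to_limit(int x, int limit) {
    return x > limit ? limit : x;
}
```

If either arm had side effects or was expensive, the compiler would have to keep the real branch.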
Predication
Closely related: predication evaluates both branches and uses the condition as a mask:
// Naive
for (int i = 0; i < N; ++i) {
if (cond[i]) y[i] = a[i] * 2;
else y[i] = a[i] + 1;
}
// Predicated (branch-free, vectorizable)
for (int i = 0; i < N; ++i) {
int taken = cond[i];
int v_taken = a[i] * 2;
int v_else = a[i] + 1;
y[i] = taken ? v_taken : v_else; // compiles to cmov / select
}

The compiler replaces the per-iteration branch with a select, producing a vectorizable predicated loop. SIMD intrinsics (_mm256_blendv_ps etc.) are explicit predication.
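For explicit predication, here is a hedged AVX sketch of the same loop (assuming float arrays, AVX support, and a trip count that is a multiple of 8; the scalar tail is omitted): both results are computed for all eight lanes and _mm256_blendv_ps picks per lane.

```cpp
#include <immintrin.h>

// Explicit SIMD predication: no per-element branch, just a per-lane select.
void predicated_avx(const float* a, const float* cond, float* y, int n) {
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 va   = _mm256_loadu_ps(a + i);
        __m256 vc   = _mm256_loadu_ps(cond + i);
        __m256 vt   = _mm256_mul_ps(va, _mm256_set1_ps(2.0f));            // "taken" value: a*2
        __m256 ve   = _mm256_add_ps(va, _mm256_set1_ps(1.0f));            // "else" value: a+1
        __m256 mask = _mm256_cmp_ps(vc, _mm256_setzero_ps(), _CMP_NEQ_OQ); // cond != 0
        _mm256_storeu_ps(y + i, _mm256_blendv_ps(ve, vt, mask));          // pick per lane
    }
}
```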
Sort first, then process
When the branch outcome correlates with a sortable key, sorting first is sometimes a net win:
// Bad: random tags → mispredict-heavy loop
for (auto& item : items) {
if (item.tag == TAG_FAST) fast_path(item);
else slow_path(item);
}
// Better: sort by tag once, then two predictable loops
std::sort(items.begin(), items.end(), [](auto& a, auto& b) { return a.tag < b.tag; });
// assumes TAG_FAST is the smallest tag value, so the fast-path items sort to the front
auto split = std::find_if(items.begin(), items.end(), [](auto& i) { return i.tag != TAG_FAST; });
for (auto it = items.begin(); it != split; ++it) fast_path(*it);
for (auto it = split; it != items.end(); ++it) slow_path(*it);

The sort cost (O(N log N)) plus two perfectly-predicted loops (O(N)) sometimes beats the original O(N) mispredict-heavy loop. Profile to know.
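When there are only the two paths, a full sort is more than you need; a hedged variant using std::partition groups the items in O(N) (relative order within each group is not preserved):

```cpp
#include <algorithm>

// Partition instead of sort: all TAG_FAST items move to the front in O(N),
// then the same two perfectly predicted loops run over each group.
auto split = std::partition(items.begin(), items.end(),
                            [](const auto& i) { return i.tag == TAG_FAST; });
for (auto it = items.begin(); it != split; ++it) fast_path(*it);
for (auto it = split; it != items.end(); ++it) slow_path(*it);
```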
Branch hints (rarely needed)
Compilers accept hints:
if (__builtin_expect(rare_condition, 0)) {
// ...rare path...
}
if (common_condition) [[likely]] { /* ... */ }  // C++20 attribute

These let the compiler arrange code so the predicted-fast path stays in instruction cache. Mostly unnecessary: modern predictors learn the right behavior in microseconds. Useful in very hot inner loops or when you have data-driven knowledge the compiler doesn’t.
Run it in your browser — sorted vs unsorted branch cost
In compiled code, the same benchmark gives ~6× speedup just from sorting. Pure Python damps the effect because the interpreter overhead dominates the branch cost. On real CPU code, the gap is the headline number.
Key takeaways
- CPUs speculate aggressively. A right guess is ~free; a wrong guess costs 15–20 cycles.
- Patterns are predictable; random data isn’t. Modern predictors hit ~99% on patterns, ~50% on random.
- Branchless / predicated code is the workaround when you can’t make branches predictable.
- Sorting first can beat a mispredict-heavy loop on the cost-of-sort + cost-of-loop.
- The compiler often does the right thing with cmov and predication. Don’t reach for __builtin_expect until profiling shows the branch matters.
Go deeper
- Blog: Stack Overflow — “Why is processing a sorted array faster than an unsorted array?” The famous question. The accepted answer is the canonical introduction to branch prediction.
- Paper: “TAGE: A Case for Tagged Geometric History Length Branch Predictor”. The predictor design used in modern Intel CPUs. Heavy reading; useful only if you really care.
- Blog: Dan Luu — “Branch Prediction”. Practitioner overview with numbers from real CPUs.
- Blog: Algorithmica HPC — “Branchless Programming”. Hands-on guide to writing branch-free code with measured speedups.
- Docs: Intel SDM — Branch Prediction (Vol 1, §2.5.2). Reference. Not pedagogical.