SIMD
There’s a layer of parallelism underneath threads that most code doesn’t use. Your CPU has been able to multiply 8 floats in a single instruction since 2011, and 16 floats since 2017. Intel calls its version AVX, “Advanced Vector Extensions”; ARM calls its version NEON; Apple goes further and ships matrix-multiply tile instructions in its M-series chips. The general name for the technique is SIMD — Single Instruction, Multiple Data.
The reason most code doesn’t use it: the compiler has to prove the loop is safe to vectorize, and small details (pointer aliasing, conditional branches, floating-point ordering) trip it up. When auto-vectorization works, you get an 8–16× speedup essentially for free. When it doesn’t, you write intrinsics by hand — and that’s how high-performance ML kernels (llama.cpp, OpenBLAS, Eigen) actually deliver their numbers on CPU.
This lesson is about the third axis of parallelism — neither threads nor processes, just more lanes per instruction — and why it’s the difference between “0.5 tokens per second on a Raspberry Pi” and “5 tokens per second.”
TL;DR
- SIMD = Single Instruction, Multiple Data. One CPU instruction operates on a whole vector of values. AVX-512 does 16 FP32 ops per instruction; ARM SVE2 is similar; Apple AMX adds matrix-shaped ops.
- It’s a third level of parallelism below threads and processes — same core, multiple lanes per instruction.
- Compilers auto-vectorize simple loops with -O3 -march=native. They struggle with pointer aliasing, conditional branches, and non-trivial reductions. Anything they don’t vectorize, you write intrinsics for.
- Pyodide / numpy / __builtin_* intrinsics are the layered API. Numpy’s array operations dispatch to SIMD-optimized C; Eigen / xsimd / Highway / std::experimental::simd give portable C++ access.
- For ML on CPU: llama.cpp’s NEON / AVX-512 paths are the production reference. SIMD is what makes “decode a 7B Q4_K_M on a phone” runnable.
Mental model
Wider registers = more lanes per instruction = more throughput per cycle. The progression: scalar → 128-bit SSE/NEON (4 floats) → 256-bit AVX (8) → 512-bit AVX-512 (16) → matrix tiles.
Auto-vectorization — when the compiler does the work
void scale(float* x, float s, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        x[i] *= s;
    }
}

Compile with g++ -O3 -mavx2: the compiler emits AVX2 code processing 8 floats per iteration. With -mavx512f: 16 floats per iteration. Free 8–16× speedup if the loop is simple enough.
What blocks auto-vectorization:
- Pointer aliasing: if x and y could overlap, the compiler can’t reorder reads/writes safely. Mark the pointers __restrict__ to help.
- Conditional branches inside the loop: if (cond) y[i] = a; else y[i] = b; may force scalar fallback. Predication / select intrinsics fix this.
- Reductions with floating-point order constraints: vectorizing sum += x[i] reorders the additions, which changes results. Use -ffast-math if you accept that.
- Indirect addressing: y[i] = x[idx[i]] is hard to vectorize without gather instructions (AVX2 has them, but they’re slow).
Always check what the compiler did:
g++ -O3 -march=native -fopt-info-vec-all scale.cpp 2>&1 | head -10
# Reports which loops vectorized and which didn't (with reasons).

Or look at the assembly directly: objdump -d and search for vmul, vfma, etc.
When you write intrinsics
Some loops the compiler won’t vectorize. Write the SIMD by hand:
#include <immintrin.h>

void scale_avx2(float* x, float s, size_t n) {
    __m256 vs = _mm256_set1_ps(s);              // broadcast s into 8 lanes
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);      // load 8 floats
        v = _mm256_mul_ps(v, vs);               // multiply 8 floats in parallel
        _mm256_storeu_ps(x + i, v);             // store back
    }
    for (; i < n; ++i) x[i] *= s;               // scalar tail for non-multiple-of-8
}

Same effect as the compiler-vectorized version, but you control:
- Which registers / instructions get used.
- How the loop is unrolled.
- Cache prefetching (_mm_prefetch).
- Tail handling.
In production-quality SIMD code (oneMKL, OpenBLAS, Eigen, llama.cpp), every hot loop is hand-vectorized and aggressively tuned.
Portable SIMD — Highway, xsimd, std::experimental::simd
Intrinsics are non-portable: AVX-512 doesn’t run on ARM. Wrappers fix this:
#include <hwy/highway.h>   // Google's portable SIMD library
namespace hn = hwy::HWY_NAMESPACE;

template <typename D>
void scale(D d, float* x, float s, size_t n) {
    auto vs = hn::Set(d, s);
    size_t i = 0;
    for (; i + hn::Lanes(d) <= n; i += hn::Lanes(d)) {
        auto v = hn::LoadU(d, x + i);
        v = hn::Mul(v, vs);
        hn::StoreU(v, d, x + i);
    }
    for (; i < n; ++i) x[i] *= s;
}

Compiles to AVX-512 on x86, NEON on ARM, scalar on platforms without SIMD. C++26 standardizes this approach as std::simd.
Apple AMX and matrix-shaped instructions
Apple’s M-series chips include AMX — undocumented matrix-multiply instructions that operate on 32×32 tile blocks. Apple’s Accelerate framework (and llama.cpp on macOS) use AMX for FP16 GEMM, achieving ~3× the throughput of vanilla NEON. Not officially exposed but reverse-engineered and integrated where it matters.
Intel’s AMX (Advanced Matrix Extensions, separately) on Sapphire Rapids+ also adds matrix-tile instructions. Intel’s oneDNN uses them for INT8 GEMM at high throughput. The pattern of “add matrix instructions to the CPU” is universal in 2024+ silicon — the CPU is converging on what tensor cores started.
Numpy / SciPy under the hood
When you write np.dot(x, y), numpy calls into BLAS (OpenBLAS, MKL, Apple Accelerate). BLAS calls hand-optimized SIMD GEMM. The 1000× gap between for i in range: y += x[i]*w[i] (Python) and y = x @ w (numpy) is mostly SIMD plus parallel threads.
Same for PyTorch CPU kernels. The perception that “Python is slow” is mostly “Python is slow when it’s not calling SIMD-optimized C.”
llama.cpp’s SIMD strategy
ggml-cpu.c in llama.cpp has separate hot-loop implementations for:
- AVX-512 (newest Intel server / Zen 4+ AMD)
- AVX2 (most x86)
- NEON / Apple AMX (M-series Macs and ARM phones)
- Pure scalar fallback
Each is hand-tuned. The matmul kernels for K-quants do unpack-on-the-fly + multiply at SIMD speed. It’s the canonical reference for how to write fast SIMD AI kernels in C.
Run it in your browser — SIMD-style operations via numpy
The output is the canonical lesson: a “for i in range” loop in Python is hundreds of times slower than x @ y not because Python is slow, but because Python skips both SIMD and BLAS.
Key takeaways
- SIMD = parallelism within one core. AVX-512: 16 FP32 per cycle. NEON / SVE: similar.
- Compilers auto-vectorize simple loops. Use -O3 -march=native and check the vectorization report. Aliasing and branches block them.
- Hand-write intrinsics for hot loops the compiler can’t reach. Use Highway / xsimd / std::simd for portability.
- Apple AMX, Intel AMX, ARM SME are the matrix-tile instructions — CPU converging on tensor-core-shaped ops.
- Numpy / BLAS / llama.cpp internalize all this. When you can’t reach for those, write SIMD.
Go deeper
- Docs: Intel Intrinsics Guide. The reference; search by instruction name and copy idiomatic usage.
- Docs: ARM Neon Programmer's Guide. ARM's official guide; cleaner than Intel's docs, same model.
- Blog: Algorithmica HPC — SIMD. The best free SIMD intro on the web; examples plus measured speedups.
- Blog: Justine Tunney — Matrix Multiplication on CPU. Why llama.cpp's CPU matmul beats OpenBLAS on Apple Silicon; SIMD + microkernel design at production quality.
- Docs: Google Highway. Portable C++ SIMD library; the Lanes(d) abstraction is the future.
- Docs: std::simd (C++26 / experimental). Portable SIMD coming to the C++ standard; the cleanest API.
- Repo: llama.cpp — ggml-cpu.c. Production-quality SIMD AI kernels; read for the per-arch hot-loop discipline.
Why this matters
When the GPU isn’t an option (mobile, edge, CPU-only training, serverless inference), SIMD is the difference between “0.5 tok/s on this Raspberry Pi” and “5 tok/s.” llama.cpp’s CPU performance is essentially “carefully written SIMD”; PyTorch’s CPU kernels are “MKL-DNN compiled with AVX-512.” Knowing SIMD is the price of admission for any non-GPU AI work — and a useful debugging skill even for GPU work because the same locality discipline applies.