SIMD

There’s a layer of parallelism underneath threads that most code doesn’t use. Your CPU has been able to multiply 8 floats in a single instruction since AVX shipped in 2011, and 16 floats since AVX-512 arrived in 2017. Intel calls the family AVX (“Advanced Vector Extensions”); ARM calls its version NEON; Apple goes further and ships matrix-multiply tile instructions in its M-series chips. The general name for the technique is SIMD: Single Instruction, Multiple Data.

The reason most code doesn’t use it: the compiler has to prove a loop is safe to vectorize, and small details (pointer aliasing, conditional branches, floating-point ordering) trip it up. When auto-vectorization works, you get up to an 8–16× speedup essentially for free. When it doesn’t, you write the vector code by hand, and that’s how high-performance ML kernels (llama.cpp, OpenBLAS, Eigen) actually deliver their numbers on CPU.

This lesson is about the third axis of parallelism — neither threads nor processes, just more lanes per instruction — and why it’s the difference between “0.5 tokens per second on a Raspberry Pi” and “5 tokens per second.”

TL;DR

  • SIMD = Single Instruction, Multiple Data. One CPU instruction operates on a whole vector of values. AVX-512 does 16 FP32 ops per instruction; ARM SVE2 is similar; Apple AMX adds matrix-shaped ops.
  • It’s a third level of parallelism below threads and processes — same core, multiple lanes per instruction.
  • Compilers auto-vectorize simple loops with -O3 -march=native. They struggle with: pointer aliasing, conditional branches, non-trivial reductions. Anything they don’t vectorize, you write intrinsics for.
  • The tooling is layered. Numpy’s array operations dispatch to SIMD-optimized C; Eigen / xsimd / Highway / std::experimental::simd give portable C++ access; raw intrinsics (immintrin.h, __builtin_*) sit at the bottom.
  • For ML on CPU: llama.cpp’s NEON / AVX-512 paths are the production reference. SIMD is what makes “decode a 7B Q4_K_M on a phone” runnable.

Why this matters

When the GPU isn’t an option (mobile, edge, CPU-only training, serverless inference), SIMD is the difference between “0.5 tok/s on this Raspberry Pi” and “5 tok/s.” llama.cpp’s CPU performance is essentially “carefully written SIMD”; PyTorch’s CPU kernels are “oneDNN (formerly MKL-DNN) compiled with AVX-512.” Knowing SIMD is the price of admission for any non-GPU AI work, and a useful debugging skill even for GPU work because the same locality discipline applies.

Mental model

Wider registers = more lanes per instruction = more throughput per cycle. The progression: scalar → 128-bit → 256 → 512 → matrix tiles.

Concrete walkthrough

Auto-vectorization

void scale(float* x, float s, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        x[i] *= s;
    }
}

Compile with g++ -O3 -mavx2: the compiler emits AVX2 code processing 8 floats per iteration. With -mavx512f: 16 floats per iteration. Up to a free 8–16× speedup, if the loop is simple enough and not limited by memory bandwidth.

What blocks auto-vectorization:

  • Pointer aliasing: if x and y could overlap, the compiler can’t reorder reads/writes safely. Mark the pointers with __restrict__ to help (see the sketch after this list).
  • Conditional branches inside the loop: if (cond) y[i] = a; else y[i] = b; may force scalar fallback. Predication / select intrinsics fix this.
  • Reductions with floating-point ordering constraints: vectorizing sum += x[i] reorders the additions, which can change the result bit-for-bit, so the compiler won’t do it by default. Use -ffast-math (or -fassociative-math) if you accept that.
  • Indirect addressing: y[i] = x[idx[i]] is hard to vectorize without gather instructions (AVX2 has them but they’re slow).
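
A minimal sketch of the aliasing fix (a hypothetical axpy-style function, not from the lesson): with __restrict__ the compiler may assume the two arrays never overlap, so g++ -O3 vectorizes the loop.

#include <cstddef>

// Without __restrict__, every store to y[i] could alias a later x[j],
// so the compiler falls back to scalar code. With it, the loop vectorizes.
void axpy(float* __restrict__ y, const float* __restrict__ x,
          float a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}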

Always check what the compiler did:

g++ -O3 -march=native -fopt-info-vec-all scale.cpp 2>&1 | head -10
# Reports which loops vectorized and which didn't (with reasons).

Or look at the assembly directly: objdump -d and search for vmul, vfma, etc.
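
For example, assuming the object file is scale.o (ymm registers are 256-bit, zmm are 512-bit):

objdump -d scale.o | grep -E 'vmulps|vfmadd'
# vmulps with %ymm operands -> 256-bit AVX2 multiply
# vmulps with %zmm operands -> 512-bit AVX-512 multiply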

When you write intrinsics

Some loops the compiler won’t vectorize. Write the SIMD by hand:

#include <immintrin.h>

void scale_avx2(float* x, float s, size_t n) {
    __m256 vs = _mm256_set1_ps(s);            // broadcast s into 8 lanes
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);    // load 8 floats
        v = _mm256_mul_ps(v, vs);             // multiply 8 floats in parallel
        _mm256_storeu_ps(x + i, v);           // store back
    }
    for (; i < n; ++i) x[i] *= s;             // scalar tail for non-multiple-of-8
}

Same effect as the compiler-vectorized version, but you control:

  • Which registers / instructions get used.
  • How the loop is unrolled.
  • Cache prefetching (_mm_prefetch).
  • Tail handling.

In production-quality SIMD libraries (oneMKL, OpenBLAS, Eigen, llama.cpp), every hot loop is hand-vectorized and aggressively tuned.
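
To make the reduction case concrete, here is a hand-vectorized dot product (a sketch, not any library’s actual kernel; needs -mavx2 -mfma). It keeps 8 partial sums in one register and reduces them once at the end, which is exactly the reordering the compiler refuses to do on its own without -ffast-math:

#include <immintrin.h>
#include <cstddef>

float dot_avx2(const float* x, const float* y, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();           // 8 independent partial sums
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);
        __m256 vy = _mm256_loadu_ps(y + i);
        acc = _mm256_fmadd_ps(vx, vy, acc);     // acc[k] += x[i+k] * y[i+k]
    }
    // Horizontal reduction: 8 lanes -> 4 -> 2 -> 1.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float total = _mm_cvtss_f32(s);
    for (; i < n; ++i) total += x[i] * y[i];    // scalar tail
    return total;
}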

Portable SIMD — Highway, xsimd, std::experimental::simd

Intrinsics are non-portable: AVX-512 doesn’t run on ARM. Wrappers fix this:

#include <hwy/highway.h>  // Google's portable SIMD library
namespace hn = hwy::HWY_NAMESPACE;

template <typename D>
void scale(D d, float* x, float s, size_t n) {
    auto vs = hn::Set(d, s);
    size_t i = 0;
    for (; i + hn::Lanes(d) <= n; i += hn::Lanes(d)) {
        auto v = hn::LoadU(d, x + i);
        v = hn::Mul(v, vs);
        hn::StoreU(v, d, x + i);
    }
    for (; i < n; ++i) x[i] *= s;
}

Compiles to AVX-512 on x86, NEON on ARM, scalar on platforms without SIMD. C++26 standardizes this as std::simd.
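
For comparison, the same loop in std::experimental::simd, the interface C++26’s std::simd is based on (a sketch; GCC’s libstdc++ ships the header, and the lane count is chosen per target automatically):

#include <experimental/simd>
#include <cstddef>
namespace stdx = std::experimental;

void scale_stdsimd(float* x, float s, std::size_t n) {
    using V = stdx::native_simd<float>;          // widest vector type the target supports
    std::size_t i = 0;
    for (; i + V::size() <= n; i += V::size()) {
        V v(&x[i], stdx::element_aligned);       // load V::size() floats
        v *= s;                                  // scalar broadcasts across all lanes
        v.copy_to(&x[i], stdx::element_aligned); // store back
    }
    for (; i < n; ++i) x[i] *= s;                // scalar tail
}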

Apple AMX and matrix-shaped instructions

Apple’s M-series chips include AMX: undocumented matrix-multiply instructions that operate on 32×32 tile blocks. Apple’s Accelerate framework (and llama.cpp on macOS) uses AMX for FP16 GEMM, achieving roughly 3× the throughput of vanilla NEON. It isn’t officially exposed, but it has been reverse-engineered and integrated where it matters.

Intel’s AMX (Advanced Matrix Extensions, separately) on Sapphire Rapids+ also adds matrix-tile instructions. Intel’s oneDNN uses them for INT8 GEMM at high throughput. The pattern of “add matrix instructions to the CPU” is universal in 2024+ silicon — the CPU is converging on what tensor cores started.

Numpy / SciPy under the hood

When you write np.dot(x, y), numpy calls into BLAS (OpenBLAS, MKL, Apple Accelerate). BLAS calls hand-optimized SIMD GEMM. The 1000× gap between for i in range: y += x[i]*w[i] (Python) and y = x @ w (numpy) is mostly SIMD plus parallel threads.

Same for PyTorch CPU kernels. The perception that “Python is slow” is mostly “Python is slow when it’s not calling SIMD-optimized C.”

llama.cpp’s SIMD strategy

ggml-cpu.c in llama.cpp has separate hot-loop implementations for:

  • AVX-512 (newest Intel server / Zen 4+ AMD)
  • AVX2 (most x86)
  • NEON / Apple AMX (M-series Macs and ARM phones)
  • Pure scalar fallback

Each is hand-tuned. The matmul kernels for K-quants do unpack-on-the-fly + multiply at SIMD speed. It’s the canonical reference for how to write fast SIMD AI kernels in C.

Run it in your browser — SIMD-style operations via numpy

Compare a scalar Python loop against the vectorized numpy version of the same operation. Numpy dispatches to SIMD-optimized C internally.
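
A minimal sketch of the comparison the demo runs (the array size is illustrative; exact timings vary by machine):

import time
import numpy as np

n = 1_000_000
x = np.random.rand(n).astype(np.float32)
w = np.random.rand(n).astype(np.float32)

# Scalar: one interpreted multiply-add per iteration.
t0 = time.perf_counter()
acc = 0.0
for i in range(n):
    acc += x[i] * w[i]
t_loop = time.perf_counter() - t0

# Vectorized: a single call into SIMD-optimized BLAS.
t0 = time.perf_counter()
acc2 = x @ w
t_vec = time.perf_counter() - t0

print(f"python loop: {t_loop*1e3:9.2f} ms")
print(f"x @ w:       {t_vec*1e3:9.2f} ms  (~{t_loop/t_vec:.0f}x faster)")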

The output is the canonical lesson: a “for i in range” loop in Python is hundreds of times slower than x @ w, not because the arithmetic differs, but because the pure-Python loop never reaches SIMD or BLAS.

Quick check

Fill in the blank
The 512-bit-wide SIMD instruction set on Intel and AMD CPUs that processes 16 FP32 values per instruction:
Acronym + bit width.
Quick check
A team's CPU inference is at 5% of theoretical peak FP32. The most likely missing piece:

Key takeaways

  1. SIMD = parallelism within one core. AVX-512: 16 FP32 lanes per instruction. NEON / SVE: similar.
  2. Compilers auto-vectorize simple loops. -O3 -march=native and check the report. Aliasing and branches block them.
  3. Hand-write intrinsics for hot loops the compiler can’t reach. Use Highway / xsimd / std::simd for portability.
  4. Apple AMX, Intel AMX, ARM SME are the matrix-tile instructions — CPU converging on tensor-core-shaped ops.
  5. Numpy / BLAS / llama.cpp internalize all this. When you can’t reach for those, write SIMD.

Go deeper