Threading
A thread is the unit of execution the OS hands you when you say “run this in parallel.” Multiple threads inside the same process share one address space — they all see the same heap, the same global variables, the same files — but each has its own stack and its own slot in the CPU scheduler.
That sounds great. The catch is that “share memory” is a euphemism for “fight over memory.” Two threads writing to the same counter, or two threads using two counters that happen to live in the same 64-byte cache line, or one thread holding a lock while seven others queue up behind it — these are how you turn a multi-core machine into a slightly-slower single-core machine.
This lesson is about the cost of a thread (~10 microseconds to spawn), the cost of contention, and the patterns that make multi-threaded code actually use the cores you paid for.
TL;DR
- A thread is a kernel-scheduled unit of execution sharing the address space of a process. Creating one costs ~10 μs; switching between two costs ~1–10 μs depending on cache thrashing.
- Don’t fan out to one thread per task. Use a thread pool sized to the core count; queue tasks. The thread-per-task model dies under any real workload.
- Synchronization primitives: atomics (cheapest, lock-free), mutexes (general-purpose), condvars (wait/signal), semaphores (counters), barriers (group sync). Pick by the data structure, not the syntax.
- Lock contention is the silent killer. Two threads fighting over one mutex serialize each other; you’ve added overhead with no parallelism. The fix is usually: shard the data, use lock-free structures, or relax the consistency requirement.
- The Python GIL prevents true CPU parallelism in pure Python — but C extensions (NumPy, PyTorch, etc.) release it during heavy compute. For Python-heavy work use multiprocessing; for C-heavy work threading is fine.
Why this matters
Every modern ML system is multi-threaded: the dataloader uses workers, the optimizer uses kernel launches, the serving stack uses async runtimes, the framework’s allocator coordinates across threads. Knowing the cost of a thread, of a mutex, of a context switch — and the patterns that avoid each cost — is what separates “scales linearly to 32 cores” from “32 cores barely faster than 1.” Threading mistakes don’t blow up; they just make your code 10× slower than it should be.
Mental model
Threads share memory; coordinating that sharing is where threading complexity (and cost) lives.
Concrete walkthrough
Cost of a thread
auto t = std::thread([] { do_work(); });
// ... main does other work ...
t.join();
- Spawn: ~10 μs (kernel allocates stack, registers thread, schedules).
- Context switch: 1–10 μs depending on whether the new thread’s stack is in cache.
- Memory: ~8 MB stack per thread by default (just virtual memory; physical pages are lazy).
For tasks that take under 100 μs, the spawn cost dominates. Never spawn a thread per request, per token, per pixel. Use a thread pool.
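To see the spawn cost on your own machine, a minimal sketch: spawn and join an empty std::thread in a loop and average. Numbers vary with OS, scheduler, and load; treat the result as a ballpark, not a benchmark.
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    constexpr int kIters = 1000;
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < kIters; ++i) {
        std::thread t([] { /* empty body: measures pure spawn + join overhead */ });
        t.join();
    }
    auto end = std::chrono::steady_clock::now();
    double total_us = std::chrono::duration<double, std::micro>(end - start).count();
    std::printf("spawn + join: %.1f us per thread\n", total_us / kIters);
}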
Thread pool — the right pattern
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
std::vector<std::thread> workers;
std::queue<std::function<void()>> tasks;
std::mutex mu;
std::condition_variable cv;
bool stop = false;
public:
ThreadPool(size_t n) {
for (size_t i = 0; i < n; ++i) workers.emplace_back([this]{
while (true) {
std::function<void()> task;
{
std::unique_lock lk(mu);
cv.wait(lk, [this]{ return stop || !tasks.empty(); });
if (stop && tasks.empty()) return;
task = std::move(tasks.front()); tasks.pop();
}
task();
}
});
}
template<class F> void enqueue(F&& f) {
{ std::lock_guard lk(mu); tasks.push(std::forward<F>(f)); }
cv.notify_one();
}
~ThreadPool() {
{ std::lock_guard lk(mu); stop = true; }
cv.notify_all();
for (auto& t : workers) t.join();
}
};
Pool size: typically std::thread::hardware_concurrency() (logical cores) for compute-bound work, or 2–4× that for I/O-bound. Anything more is just context-switch overhead.
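A usage sketch for the pool above (the summing workload and names are illustrative, not part of the lesson):
#include <algorithm>
#include <atomic>
#include <cstdio>

int main() {
    std::atomic<long> sum{0};
    {
        const unsigned n = std::max(1u, std::thread::hardware_concurrency());
        ThreadPool pool(n);   // one worker per logical core: compute-bound sizing
        for (long i = 0; i < 1000; ++i)
            pool.enqueue([i, &sum] { sum.fetch_add(i, std::memory_order_relaxed); });
    }   // ~ThreadPool() sets stop, drains the queue, joins the workers
    std::printf("sum = %ld\n", sum.load());   // 499500
}
Because workers only exit once stop is set and the queue is empty, the destructor doubles as a wait-for-all; the scope block above relies on that.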
Production frameworks: oneTBB (formerly Intel TBB), libdispatch (Apple), Folly, Boost.Asio, Tokio (Rust), and the C++ standard executors (proposed, partially shipped). Don’t roll your own beyond a quick prototype.
Atomics
The cheapest synchronization primitive:
std::atomic<int> counter{0};
// Two threads can do this concurrently safely:
counter.fetch_add(1, std::memory_order_relaxed);
A relaxed atomic increment is ~2–10 cycles on x86 (a lock add instruction). Sequentially-consistent (the default) is more like 30 cycles. Use relaxed for counters and stats; sequentially-consistent for happens-before relationships.
When you need more than one atomic value, you usually need a mutex. Atomics compose poorly — counter1.fetch_add(1); counter2.fetch_add(1) doesn’t atomically update both.
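A sketch of that boundary (the requests/errors counters and the Balance struct are illustrative): independent counters can stay relaxed atomics, while an invariant spanning two fields needs the mutex.
#include <atomic>
#include <mutex>

// Independent counters: relaxed atomic increments are enough.
std::atomic<long> requests{0};
std::atomic<long> errors{0};

void record_error() {
    requests.fetch_add(1, std::memory_order_relaxed);
    errors.fetch_add(1, std::memory_order_relaxed);
    // A reader may briefly see requests bumped but errors not yet: fine for stats,
    // wrong if the two values must always move together.
}

// An invariant across two fields needs a lock so both updates appear together.
struct Balance {
    std::mutex mu;
    long checking = 0, savings = 0;   // invariant: transfer() keeps the sum constant
    void transfer(long amount) {
        std::lock_guard<std::mutex> lk(mu);
        checking -= amount;
        savings  += amount;
    }
};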
Mutexes
std::mutex mu;
{
std::lock_guard lk(mu);
shared_data.update();
}
Cost when uncontended: ~10–30 cycles. Cost when contended: much worse — at high contention, a mutex can be 10× slower than the protected work itself.
std::shared_mutex (reader-writer): cheaper for read-heavy workloads. Multiple concurrent readers; one writer.
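A minimal read-mostly sketch with std::shared_mutex (the Config class and its key/value store are illustrative):
#include <mutex>
#include <shared_mutex>
#include <string>
#include <unordered_map>

class Config {
    mutable std::shared_mutex mu;
    std::unordered_map<std::string, std::string> kv;
public:
    // Readers take the shared lock; any number can hold it concurrently.
    std::string get(const std::string& key) const {
        std::shared_lock lk(mu);
        auto it = kv.find(key);
        return it == kv.end() ? std::string{} : it->second;
    }
    // Writers take the exclusive lock and briefly block all readers.
    void set(const std::string& key, const std::string& value) {
        std::unique_lock lk(mu);
        kv[key] = value;
    }
};
This pays off only when reads heavily outnumber writes; with frequent writers, a plain std::mutex is often simpler and no slower.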
Lock contention — the silent killer
A mutex held by thread A blocks thread B and costs B a context switch: B is descheduled and sleeps until A releases, and the core runs something else in the meantime. On a hot mutex with high contention:
Thread 1: do_work_locked() → lock → critical_section_5us → unlock
Thread 2: → lock(blocks) → wakes 5us later → critical_section_5us → unlock
Thread 3: → blocks 10us → ...
8 threads serializing through one critical section ≈ 1 thread doing 8× the work. Your parallelism just disappeared.
Fixes:
- Shard the data: 16 buckets, 16 mutexes; a thread locks bucket_hash(key) & 15. Contention drops to roughly 1/16 (see the sketch after this list).
- Use lock-free structures: std::atomic-based queues, hazard pointers, RCU (the Linux kernel pattern). Hard to write; libraries (Folly, TBB's concurrent_queue) ship them.
- Use copy-on-write / per-thread accumulators: each thread updates its own copy; merge at the end.
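A sketch of the sharding fix (ShardedCounter is an illustrative name; std::hash stands in for the lesson's bucket_hash):
#include <array>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>

class ShardedCounter {
    struct Shard {
        std::mutex mu;
        std::unordered_map<std::string, long> counts;
    };
    std::array<Shard, 16> shards;
    Shard& shard_for(const std::string& key) {
        return shards[std::hash<std::string>{}(key) & 15];   // bucket_hash(key) & 15
    }
public:
    void add(const std::string& key, long n) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lk(s.mu);   // threads hitting different shards never contend
        s.counts[key] += n;
    }
    long get(const std::string& key) {
        Shard& s = shard_for(key);
        std::lock_guard<std::mutex> lk(s.mu);
        auto it = s.counts.find(key);
        return it == s.counts.end() ? 0 : it->second;
    }
};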
Condition variables
std::mutex mu;
std::condition_variable cv;
std::queue<Task> tasks;
// Producer
{ std::lock_guard lk(mu); tasks.push(t); }
cv.notify_one();
// Consumer
{
std::unique_lock lk(mu);
cv.wait(lk, [&]{ return !tasks.empty(); }); // sleeps until notified + predicate true
auto t = std::move(tasks.front()); tasks.pop();
}
The cv.wait(lk, pred) form handles spurious wakeups correctly. Always use the predicate form.
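The TL;DR also lists semaphores and barriers; C++20 ships both as std::counting_semaphore and std::barrier. A minimal sketch, assuming a C++20 compiler (the 4-slot limit and 8 workers are arbitrary):
#include <barrier>
#include <cstdio>
#include <semaphore>
#include <thread>
#include <vector>

int main() {
    std::counting_semaphore<4> slots(4);   // semaphore: at most 4 threads in the section at once
    std::barrier checkpoint(8);            // barrier: all 8 threads must arrive before any proceeds

    std::vector<std::thread> workers;
    for (int i = 0; i < 8; ++i) {
        workers.emplace_back([&, i] {
            slots.acquire();               // blocks if 4 threads are already inside
            std::printf("worker %d in rate-limited section\n", i);
            slots.release();

            checkpoint.arrive_and_wait();  // phase boundary: wait for the whole group
            std::printf("worker %d past the barrier\n", i);
        });
    }
    for (auto& t : workers) t.join();
}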
The Python GIL
In CPython, a global lock (the Global Interpreter Lock) prevents two threads from running Python bytecode simultaneously. This means:
- Pure Python threading is single-core. threading.Thread runs concurrently for I/O but not CPU.
- multiprocessing does run in parallel (separate processes, separate GILs).
- C extensions release the GIL during heavy compute. NumPy, PyTorch, llama.cpp all release it; their threads do real parallel work.
For ML: PyTorch’s CUDA / CPU kernels release the GIL, so threading.Thread + DataLoader workers do work. For pure-Python orchestration, multiprocessing is the right tool.
CPython 3.13 (2024) introduced an experimental free-threaded (“no-GIL”) build, enabled at compile time with --disable-gil; production rollout is expected around 2025–2026. Worth tracking as it changes the threading vs multiprocessing tradeoff for Python-side work.
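To make the C-extension point concrete, here is a hedged sketch of what releasing the GIL looks like inside a CPython extension. The busywork module and busy_sum function are illustrative; Py_BEGIN_ALLOW_THREADS / Py_END_ALLOW_THREADS are the real CPython macros that NumPy-style extensions wrap around their compute loops.
#include <Python.h>

// Illustrative extension function: sums 0..n-1 with the GIL released around the loop.
static PyObject* busy_sum(PyObject* self, PyObject* args) {
    long n;
    if (!PyArg_ParseTuple(args, "l", &n)) return NULL;

    long long total = 0;
    Py_BEGIN_ALLOW_THREADS              /* drop the GIL: other Python threads run in parallel */
    for (long i = 0; i < n; ++i) total += i;
    Py_END_ALLOW_THREADS                /* re-acquire it before touching any Python object */

    return PyLong_FromLongLong(total);
}

static PyMethodDef methods[] = {
    {"busy_sum", busy_sum, METH_VARARGS, "Sum 0..n-1 without holding the GIL."},
    {NULL, NULL, 0, NULL},
};

static struct PyModuleDef busywork_module = {
    PyModuleDef_HEAD_INIT, "busywork", NULL, -1, methods,
};

PyMODINIT_FUNC PyInit_busywork(void) { return PyModule_Create(&busywork_module); }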
Run it in your browser — thread pool sizing
The result is the canonical Python lesson: threads don’t speed up pure-Python CPU code (the GIL); they do speed up I/O-bound code dramatically. For ML, your compute is in C extensions that release the GIL, so threading still works for kernel launches and dataloading.
Key takeaways
- A thread costs ~10 μs to spawn and ~1–10 μs to switch. Use thread pools sized to core count.
- Atomics, mutexes, condvars, barriers — pick by data structure, not by intuition.
- Lock contention serializes parallelism. Shard data, use lock-free structures, or rethink the design.
- Python GIL prevents pure-Python parallelism, but C extensions (NumPy, PyTorch) release it. Use multiprocessing for Python-heavy work.
- Production frameworks ship thread pools (oneTBB, Folly, Boost.Asio). Don’t roll your own beyond a prototype.
Go deeper
- Docs: cppreference — Thread support library. Canonical C++ threading API. Covers std::thread, mutex, atomic, condvar, future.
- Video: Herb Sutter — “atomic<> Weapons” (CppCon). The talk that demystified C++ atomics. Memory orderings explained well.
- Docs: Python — threading module. Standard library reference. The GIL caveat in the intro is what matters.
- Paper: Linux Kernel — What is RCU? The lock-free pattern that makes the kernel scale. Read once; come back when you need it.
- Docs: oneTBB Documentation. Industrial-strength C++ thread pool + concurrent containers.
- Blog: LWN — Python without the GIL. Best technical writeup of the no-GIL Python work. Useful for tracking what's coming.