Inference-Time Architecture
A model file is just weights. Turning that into a serving system that returns tokens at low latency for many concurrent users requires a different kind of architecture — and that architecture is where 80% of real-world LLM engineering happens.
4 lessons · ~61 min total
Module capstone — build it
A continuous-batching inference server in 200 lines: the vLLM scheduler trick — packed forward passes that triple your throughput — built from scratch.
Advanced · One focused weekend (~9 h) · Free Colab T4
A FastAPI server wrapping Llama-3.2-1B with a custom scheduler. The artifact is the server source + a Locust load test report comparing single-stream vs continuous-batched throughput at 50 concurrent clients.
Build it — step by step
01 Single-stream baseline — sequential FastAPI 60 min
Wrap `model.generate()` in a FastAPI POST endpoint. One request at a time. Stream tokens as Server-Sent Events. Hit it with 50 sequential clients via Locust; measure tokens/s per client and total throughput. A minimal server sketch follows this step's notes.
checkpoint Server runs; Locust reports baseline throughput (will be embarrassingly low under concurrency).
watch out Don't use `model.generate()` directly under concurrency — it holds the GIL and blocks. The baseline's job is to *be slow*, then we beat it.
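A minimal sketch of what this baseline could look like, assuming the Hugging Face `meta-llama/Llama-3.2-1B` checkpoint and a hypothetical `/generate` route with a `prompt` field (not the course's reference code):

```python
import threading

import torch
import uvicorn
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "meta-llama/Llama-3.2-1B"  # assumed checkpoint; any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # generate() runs in a worker thread; the streamer yields text pieces as they are decoded
    threading.Thread(
        target=model.generate,
        kwargs=dict(**inputs, max_new_tokens=req.max_new_tokens, streamer=streamer),
    ).start()

    def sse():
        for piece in streamer:
            yield f"data: {piece}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Because each request monopolizes the model, 50 clients simply wait in line; that is the number the later steps beat.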
02 Build the request queue + per-request state 60 min
When a request arrives, create a Sequence object holding `prompt_tokens`, `generated_tokens=[]`, an `asyncio.Queue` of output tokens for the client, a status, and its position. Push it onto a global pending queue; the server returns an SSE response that drains the per-request queue (see the sketch after this step's checkpoint).
checkpoint Multiple requests in flight; each has its own queue; nothing has crossed wires.
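One way the per-request state could look. The names below (`Sequence`, `Status`, `pending`, `sse_stream`) are this sketch's own, and the position field is left out for brevity:

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum, auto

class Status(Enum):
    PENDING = auto()
    RUNNING = auto()
    FINISHED = auto()

@dataclass
class Sequence:
    prompt_tokens: list[int]
    generated_tokens: list[int] = field(default_factory=list)
    out_queue: asyncio.Queue = field(default_factory=asyncio.Queue)  # tokens headed to this client
    status: Status = Status.PENDING

pending: asyncio.Queue = asyncio.Queue()  # global queue the scheduler admits from

async def sse_stream(seq: Sequence):
    """Drain this request's own queue into an SSE body; None marks end-of-sequence."""
    while True:
        piece = await seq.out_queue.get()
        if piece is None:
            yield "data: [DONE]\n\n"
            return
        yield f"data: {piece}\n\n"
```

The POST handler then tokenizes the prompt, builds a `Sequence`, puts it on `pending`, and returns `StreamingResponse(sse_stream(seq), media_type="text/event-stream")`.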
03 The scheduler loop — one forward pass per step 120 min
In a background asyncio task, each step: pull the active sequences (capped at MAX_BATCH=16), pad them to the longest, run *one* `model(...)` forward pass over the batched input, and write each sequence's new token into its per-client queue. A sketch of the loop follows this step's notes.
checkpoint You can hit the server with 4 concurrent requests; tokens stream to all 4 simultaneously instead of serially.
watch out Padding tokens contaminate the next-token prediction. Left-pad and pass an attention mask that ignores the padding, so the last position is always a real token; verify that a padded sequence's output still matches its unpadded run.
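Here is one possible shape for the loop, reusing the step-02 names (`Status`, `Sequence`): left-padded batch, explicit attention mask and position ids, greedy decoding, and no KV cache (the full sequence is recomputed every step to keep the sketch short). Assume for now that the POST handler appends new sequences straight into `active`; step 04 replaces that with proper admission:

```python
import asyncio

import torch

MAX_BATCH = 16

async def scheduler_loop(model, tokenizer, active: list):
    pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    while True:
        batch = [s for s in active if s.status is not Status.FINISHED][:MAX_BATCH]
        if not batch:
            await asyncio.sleep(0.01)  # nothing to do; let request handlers run
            continue

        # Left-pad to the longest sequence so position -1 is always a real token
        ids = [s.prompt_tokens + s.generated_tokens for s in batch]
        longest = max(len(x) for x in ids)
        input_ids = torch.tensor(
            [[pad_id] * (longest - len(x)) + x for x in ids], device=model.device
        )
        attention_mask = torch.tensor(
            [[0] * (longest - len(x)) + [1] * len(x) for x in ids], device=model.device
        )
        # Positions must count only real tokens, or RoPE shifts under left-padding
        position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)

        with torch.no_grad():
            logits = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                position_ids=position_ids,
            ).logits

        next_tokens = logits[:, -1, :].argmax(dim=-1)  # greedy pick per sequence
        for seq, tok in zip(batch, next_tokens.tolist()):
            seq.generated_tokens.append(tok)
            await seq.out_queue.put(tokenizer.decode([tok]))
            if tok == tokenizer.eos_token_id:
                seq.status = Status.FINISHED
                await seq.out_queue.put(None)  # tell the SSE handler to close

        await asyncio.sleep(0)  # yield to the event loop between steps
```

Spawn it once at startup, e.g. `asyncio.create_task(scheduler_loop(model, tokenizer, active))` in a FastAPI startup hook.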
04 Continuous batching — admit new requests mid-batch 90 min
Modify the scheduler: every step, evict finished sequences and admit pending ones, capped at MAX_BATCH active sequences. The batch composition changes on every iteration — the GPU never idles between requests. A sketch of the admit/evict step follows this step's notes.
checkpoint Locust load test with 50 clients of varied output length: throughput stays high even as some sequences finish early. Compared to the sequential baseline, you see ~3–5× total tokens/s.
watch out When sequences enter mid-batch, their KV-cache must start fresh — no shared state with finished sequences. A shared KV tensor without proper indexing leaks tokens between users.
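A sketch of the admit/evict step, reusing the earlier names (`Status`, `pending`, `MAX_BATCH`); it runs at the top of every scheduler iteration:

```python
import asyncio

def refresh_batch(active: list, pending: asyncio.Queue, max_batch: int = MAX_BATCH) -> None:
    # Evict sequences that finished on the previous step
    active[:] = [s for s in active if s.status is not Status.FINISHED]
    # Admit pending requests until the batch is full; each Sequence carries its own
    # token lists and queue, so nothing is shared with sequences that already left
    while len(active) < max_batch and not pending.empty():
        seq = pending.get_nowait()
        seq.status = Status.RUNNING
        active.append(seq)
```

With this in place, the step-03 loop needs only one extra call, `refresh_batch(active, pending)`, before it builds the padded batch; a sequence admitted mid-batch brings its own state, so nothing survives from sequences that already finished.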
05 Measure the win with Locust 45 min
Run `locust -f load.py --users 50 --spawn-rate 5` against both servers (sequential and continuous-batched). Save the HTML reports. Note p50 and p99 latency, total throughput, and the GPU utilization difference. A possible `load.py` is sketched after the checkpoint.
checkpoint You have two side-by-side Locust HTML reports proving the win.
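A possible `load.py`, assuming the `/generate` route and JSON body from the earlier sketches and a server on `localhost:8000`; the randomized `max_new_tokens` gives the varied output lengths the checkpoint asks for:

```python
import random

from locust import HttpUser, between, task

PROMPTS = [
    "Explain what a KV cache is.",
    "Write a haiku about GPUs.",
    "Summarize continuous batching in three sentences.",
]

class ChatUser(HttpUser):
    host = "http://localhost:8000"  # assumed server address
    wait_time = between(0.5, 2.0)

    @task
    def generate(self):
        payload = {
            "prompt": random.choice(PROMPTS),
            "max_new_tokens": random.randint(16, 256),  # varied output lengths
        }
        with self.client.post("/generate", json=payload, stream=True, catch_response=True) as resp:
            # Drain the SSE stream so every token is actually generated before the next request fires
            for _ in resp.iter_lines():
                pass
            resp.success()
```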
06 README + push 45 min
Repo with `server.py` (200 lines), `load.py` (Locust), `README.md` linking to the vLLM continuous-batching post and the PagedAttention paper. Embed the throughput comparison.
checkpoint A reader can clone, `pip install -r requirements.txt`, run `python server.py & locust ...` and reproduce.
You walk away with
A working continuous-batching server — the core idea behind vLLM in 200 lines
A real load-testing workflow with Locust you can re-use
Fluency with asyncio + per-request streaming under concurrent load
A repo that demonstrates production-grade thinking about scheduler design
Tools you'll use
FastAPI
transformers
asyncio + asyncio.Queue
vLLM (reference architecture)
Locust for load testing