Inference-Time Architecture
A model file is just weights. Turning that into a serving system that returns tokens at low latency for many concurrent users requires a different kind of architecture — and that architecture is where 80% of real-world LLM engineering happens.
4 lessons · ~61 min total
Module capstone — build it
A continuous-batching inference server in 200 lines: the vLLM scheduler trick — packed forward passes that triple your throughput — built from scratch.
Advanced · One focused weekend (~9 h) · Free Colab T4
A FastAPI server wrapping Llama-3.2-1B with a custom scheduler. The artifact is the server source + a Locust load test report comparing single-stream vs continuous-batched throughput at 50 concurrent clients.
Build it — step by step
01 Single-stream baseline — sequential FastAPI 60 min
Wrap `model.generate()` in a FastAPI POST endpoint. One request at a time. Stream tokens as Server-Sent Events. Hit it with 50 sequential clients via Locust; measure tokens/s per client and total throughput. A minimal server sketch follows this step's notes.
checkpoint Server runs; Locust reports baseline throughput (will be embarrassingly low under concurrency).
watch out Don't use `model.generate()` directly under concurrency — it holds the GIL and blocks. The baseline's job is to *be slow*, then we beat it.
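A minimal sketch of what this baseline could look like, assuming the Hugging Face `meta-llama/Llama-3.2-1B` checkpoint and a hypothetical `/generate` route with a `prompt` field (not the course's reference code):

```python
import threading

import torch
import uvicorn
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

MODEL_ID = "meta-llama/Llama-3.2-1B"  # assumed checkpoint; any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16).to("cuda")

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 128

@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    # generate() runs in a worker thread; the streamer yields text pieces as they are decoded
    threading.Thread(
        target=model.generate,
        kwargs=dict(**inputs, max_new_tokens=req.max_new_tokens, streamer=streamer),
    ).start()

    def sse():
        for piece in streamer:
            yield f"data: {piece}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(sse(), media_type="text/event-stream")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

Because each request monopolizes the model, 50 clients simply wait in line; that is the number the later steps beat.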
02 Build the request queue + per-request state 60 min
When a request arrives, create a Sequence object holding `prompt_tokens`, `generated_tokens=[]`, an `asyncio.Queue` of output tokens for the client, a status, and its position. Push it onto a global pending queue; the server returns an SSE response that drains the per-request queue (see the sketch after this step's checkpoint).
checkpoint Multiple requests in flight; each has its own queue; nothing has crossed wires.
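One way the per-request state could look. The names below (`Sequence`, `Status`, `pending`, `sse_stream`) are this sketch's own, and the position field is left out for brevity:

```python
import asyncio
from dataclasses import dataclass, field
from enum import Enum, auto

class Status(Enum):
    PENDING = auto()
    RUNNING = auto()
    FINISHED = auto()

@dataclass
class Sequence:
    prompt_tokens: list[int]
    generated_tokens: list[int] = field(default_factory=list)
    out_queue: asyncio.Queue = field(default_factory=asyncio.Queue)  # tokens headed to this client
    status: Status = Status.PENDING

pending: asyncio.Queue = asyncio.Queue()  # global queue the scheduler admits from

async def sse_stream(seq: Sequence):
    """Drain this request's own queue into an SSE body; None marks end-of-sequence."""
    while True:
        piece = await seq.out_queue.get()
        if piece is None:
            yield "data: [DONE]\n\n"
            return
        yield f"data: {piece}\n\n"
```

The POST handler then tokenizes the prompt, builds a `Sequence`, puts it on `pending`, and returns `StreamingResponse(sse_stream(seq), media_type="text/event-stream")`.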
03 The scheduler loop — one forward pass per step 120 min
In a background asyncio task, each step: pull the active sequences (capped at MAX_BATCH=16), pad them to the longest, run *one* `model(...)` forward pass over the batched input, and write each sequence's new token into its per-client queue. A sketch of the loop follows this step's notes.
checkpoint You can hit the server with 4 concurrent requests; tokens stream to all 4 simultaneously instead of serially.
watch out Padding tokens contaminate the next-token prediction. Left-pad and pass an attention mask that ignores the padding, so the last position is always a real token; verify that a padded sequence's output still matches its unpadded run.
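Here is one possible shape for the loop, reusing the step-02 names (`Status`, `Sequence`): left-padded batch, explicit attention mask and position ids, greedy decoding, and no KV cache (the full sequence is recomputed every step to keep the sketch short). Assume for now that the POST handler appends new sequences straight into `active`; step 04 replaces that with proper admission:

```python
import asyncio

import torch

MAX_BATCH = 16

async def scheduler_loop(model, tokenizer, active: list):
    pad_id = tokenizer.pad_token_id or tokenizer.eos_token_id
    while True:
        batch = [s for s in active if s.status is not Status.FINISHED][:MAX_BATCH]
        if not batch:
            await asyncio.sleep(0.01)  # nothing to do; let request handlers run
            continue

        # Left-pad to the longest sequence so position -1 is always a real token
        ids = [s.prompt_tokens + s.generated_tokens for s in batch]
        longest = max(len(x) for x in ids)
        input_ids = torch.tensor(
            [[pad_id] * (longest - len(x)) + x for x in ids], device=model.device
        )
        attention_mask = torch.tensor(
            [[0] * (longest - len(x)) + [1] * len(x) for x in ids], device=model.device
        )
        # Positions must count only real tokens, or RoPE shifts under left-padding
        position_ids = (attention_mask.cumsum(-1) - 1).clamp(min=0)

        with torch.no_grad():
            logits = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                position_ids=position_ids,
            ).logits

        next_tokens = logits[:, -1, :].argmax(dim=-1)  # greedy pick per sequence
        for seq, tok in zip(batch, next_tokens.tolist()):
            seq.generated_tokens.append(tok)
            await seq.out_queue.put(tokenizer.decode([tok]))
            if tok == tokenizer.eos_token_id:
                seq.status = Status.FINISHED
                await seq.out_queue.put(None)  # tell the SSE handler to close

        await asyncio.sleep(0)  # yield to the event loop between steps
```

Spawn it once at startup, e.g. `asyncio.create_task(scheduler_loop(model, tokenizer, active))` in a FastAPI startup hook.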
04 Continuous batching — admit new requests mid-batch 90 min
Modify the scheduler: every step, evict finished sequences and admit pending ones, capped at MAX_BATCH active sequences. The batch composition changes on every iteration — the GPU never idles between requests. A sketch of the admit/evict step follows this step's notes.
checkpoint Locust load test with 50 clients of varied output length: throughput stays high even as some sequences finish early. Compared to the sequential baseline, you see ~3–5× total tokens/s.
watch out When sequences enter mid-batch, their KV-cache must start fresh — no shared state with finished sequences. A shared KV tensor without proper indexing leaks tokens between users.
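A sketch of the admit/evict step, reusing the earlier names (`Status`, `pending`, `MAX_BATCH`); it runs at the top of every scheduler iteration:

```python
import asyncio

def refresh_batch(active: list, pending: asyncio.Queue, max_batch: int = MAX_BATCH) -> None:
    # Evict sequences that finished on the previous step
    active[:] = [s for s in active if s.status is not Status.FINISHED]
    # Admit pending requests until the batch is full; each Sequence carries its own
    # token lists and queue, so nothing is shared with sequences that already left
    while len(active) < max_batch and not pending.empty():
        seq = pending.get_nowait()
        seq.status = Status.RUNNING
        active.append(seq)
```

With this in place, the step-03 loop needs only one extra call, `refresh_batch(active, pending)`, before it builds the padded batch; a sequence admitted mid-batch brings its own state, so nothing survives from sequences that already finished.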
05 Measure the win with Locust 45 min
Run `locust -f load.py --users 50 --spawn-rate 5` against both servers (sequential and continuous-batched). Save the HTML reports. Note p50 and p99 latency, total throughput, and the GPU utilization difference. A possible `load.py` is sketched after the checkpoint.
checkpoint You have two side-by-side Locust HTML reports proving the win.
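A possible `load.py`, assuming the `/generate` route and JSON body from the earlier sketches and a server on `localhost:8000`; the randomized `max_new_tokens` gives the varied output lengths the checkpoint asks for:

```python
import random

from locust import HttpUser, between, task

PROMPTS = [
    "Explain what a KV cache is.",
    "Write a haiku about GPUs.",
    "Summarize continuous batching in three sentences.",
]

class ChatUser(HttpUser):
    host = "http://localhost:8000"  # assumed server address
    wait_time = between(0.5, 2.0)

    @task
    def generate(self):
        payload = {
            "prompt": random.choice(PROMPTS),
            "max_new_tokens": random.randint(16, 256),  # varied output lengths
        }
        with self.client.post("/generate", json=payload, stream=True, catch_response=True) as resp:
            # Drain the SSE stream so every token is actually generated before the next request fires
            for _ in resp.iter_lines():
                pass
            resp.success()
```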
06 README + push 45 min
Repo with `server.py` (200 lines), `load.py` (Locust), `README.md` linking to the vLLM continuous-batching post and the PagedAttention paper. Embed the throughput comparison.
checkpoint A reader can clone, `pip install -r requirements.txt`, run `python server.py & locust ...` and reproduce.
You walk away with
A working continuous-batching server — the core idea behind vLLM in 200 lines
A real load-testing workflow with Locust you can re-use
Fluency with asyncio + per-request streaming under concurrent load
A repo that demonstrates production-grade thinking about scheduler design
Tools you'll use
FastAPI
transformers
asyncio + asyncio.Queue
vLLM (reference architecture)
Locust for load testing