Track 03 · LLM Architecture
Transformers from a systems lens.
A Transformer isn't just a math object — it's a memory allocator, a state machine, and a data pipeline. This track examines LLMs in terms of what data lives where and what computation happens at each step, because that is what determines whether your model fits in memory, whether it serves, and whether it's fast.
- Attention — MHA, GQA, MQA, RoPE, FlashAttention
- The KV cache — why we cache, how it grows, PagedAttention (sized concretely in the sketch after this list)
- Inference — sampling, tokenization, continuous batching, speculative decoding
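To make the memory tradeoff between attention variants concrete, here is a minimal sizing sketch. All configuration numbers (32 layers, 128-dim heads, 4096-token context, batch 8, fp16 cache) are illustrative assumptions at roughly 7B scale, and the function is just the cache-size formula, not any library's API:

```python
# KV-cache size across attention variants: a minimal sketch.
# Config numbers are illustrative assumptions, roughly 7B-scale.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    # 2x for K and V, cached at every layer for every token in the batch.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

cfg = dict(n_layers=32, head_dim=128, seq_len=4096, batch=8)
for name, n_kv_heads in [("MHA", 32), ("GQA (8 groups)", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_kv_heads=n_kv_heads, **cfg) / 2**30
    print(f"{name:>15}: {gib:5.1f} GiB")
# MHA: 16.0 GiB    GQA (8 groups): 4.0 GiB    MQA: 0.5 GiB
```

The only term that changes across variants is the number of KV heads, which is the whole point of GQA and MQA: shrink the cache without touching anything else in the block.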
By the end of this track, you should be able to:

- Read a Transformer implementation and trace where every byte of memory goes
- Reason about why a “small” model can OOM at long contexts (see the back-of-envelope sketch after this list)
- Pick the right attention variant for a given memory/quality tradeoff
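Here is the back-of-envelope version of that OOM question. Again, every number is an illustrative assumption: 7B fp16 weights, an MHA-style fp16 cache, a 24 GiB GPU, and activations and allocator fragmentation ignored.

```python
# Why a "small" model can OOM at long contexts: a back-of-envelope
# sketch. All numbers are illustrative assumptions, not measurements.

GIB = 2**30
gpu_mem = 24 * GIB                           # 24 GiB GPU
weights = 7_000_000_000 * 2                  # 7B params x 2 bytes (fp16)
n_layers, n_kv_heads, head_dim = 32, 32, 128 # MHA-style cache

def kv_bytes_per_token(batch: int) -> int:
    # K and V at every layer, fp16 (2 bytes per element).
    return 2 * n_layers * n_kv_heads * head_dim * 2 * batch

budget = gpu_mem - weights                   # what's left for the cache
print(f"weights {weights / GIB:.1f} GiB, "
      f"cache {kv_bytes_per_token(1) / 1024:.0f} KiB per token")
for batch in (1, 8, 32):
    max_ctx = budget // kv_bytes_per_token(batch)
    print(f"batch {batch:2d}: OOM past ~{max_ctx:,} tokens")
# batch  1: ~22,449 tokens   batch  8: ~2,806   batch 32: ~701
```

The weights are identical in every row; the KV cache alone sets the context ceiling, which is also why PagedAttention manages it in pages, like virtual memory, instead of as one contiguous allocation.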