Track 03 · LLM Architecture
Transformers from a systems lens.
A Transformer isn't just a math object — it's a memory allocator, a state machine, and a data pipeline. This track examines LLMs in terms of what data lives where and what computation happens at each step, because that is what determines whether your model fits in memory, whether it serves, and whether it's fast.
- Attention — MHA, GQA, MQA, RoPE, FlashAttention
- The KV cache — why we cache, how it grows, PagedAttention (sized concretely in the sketch after this list)
- Inference — sampling, tokenization, continuous batching, speculative decoding
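To make the memory tradeoff between attention variants concrete, here is a minimal sizing sketch. All configuration numbers (32 layers, 128-dim heads, 4096-token context, batch 8, fp16 cache) are illustrative assumptions at roughly 7B scale, and the function is just the cache-size formula, not any library's API:

```python
# KV-cache size across attention variants: a minimal sketch.
# Config numbers are illustrative assumptions, roughly 7B-scale.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    # 2x for K and V, cached at every layer for every token in the batch.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

cfg = dict(n_layers=32, head_dim=128, seq_len=4096, batch=8)
for name, n_kv_heads in [("MHA", 32), ("GQA (8 groups)", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_kv_heads=n_kv_heads, **cfg) / 2**30
    print(f"{name:>15}: {gib:5.1f} GiB")
# MHA: 16.0 GiB    GQA (8 groups): 4.0 GiB    MQA: 0.5 GiB
```

The only term that changes across variants is the number of KV heads, which is the whole point of GQA and MQA: shrink the cache without touching anything else in the block.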
By the end of this track, you should be able to:

- Read a Transformer implementation and trace where every byte of memory goes
- Reason about why a “small” model can OOM at long contexts (see the back-of-envelope sketch after this list)
- Pick the right attention variant for a given memory/quality tradeoff
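Here is the back-of-envelope version of that OOM question. Again, every number is an illustrative assumption: 7B fp16 weights, an MHA-style fp16 cache, a 24 GiB GPU, and activations and allocator fragmentation ignored.

```python
# Why a "small" model can OOM at long contexts: a back-of-envelope
# sketch. All numbers are illustrative assumptions, not measurements.

GIB = 2**30
gpu_mem = 24 * GIB                           # 24 GiB GPU
weights = 7_000_000_000 * 2                  # 7B params x 2 bytes (fp16)
n_layers, n_kv_heads, head_dim = 32, 32, 128 # MHA-style cache

def kv_bytes_per_token(batch: int) -> int:
    # K and V at every layer, fp16 (2 bytes per element).
    return 2 * n_layers * n_kv_heads * head_dim * 2 * batch

budget = gpu_mem - weights                   # what's left for the cache
print(f"weights {weights / GIB:.1f} GiB, "
      f"cache {kv_bytes_per_token(1) / 1024:.0f} KiB per token")
for batch in (1, 8, 32):
    max_ctx = budget // kv_bytes_per_token(batch)
    print(f"batch {batch:2d}: OOM past ~{max_ctx:,} tokens")
# batch  1: ~22,449 tokens   batch  8: ~2,806   batch 32: ~701
```

The weights are identical in every row; the KV cache alone sets the context ceiling, which is also why PagedAttention manages it in pages, like virtual memory, instead of as one contiguous allocation.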