RoPE + YaRN / LongRoPE
When you load Llama-3.1-8B, the config has one line: "rope_theta": 500000.0. When you load Qwen2.5-7B-1M, the config has "rope_scaling": { "type": "yarn", "factor": 4.0 }. Those numbers control how far the model can read before its output degrades. This lesson is the math behind that one line — and behind every “we extended context to 128K / 1M / 2M” announcement of the last two years.
The user-facing surface is “context window.” Underneath, every Q and K vector is silently rotated before the dot product, by an angle that grows with token position. That’s RoPE: attention sees not the raw q · k but the dot product of rotated vectors, where the rotation angle encodes “where in the sequence am I.” The clever part is that after rotating q by position m and k by position n, the dot product depends only on m - n — RoPE is a relative-position scheme dressed up as an absolute one.
The math is six lines of trigonometry. The interesting story is what happens when you try to use a model trained at 4K context on a 128K-token document — the low-frequency rotation dimensions reach phase angles the model never saw, attention scores collapse, and the output turns to gibberish. NTK-aware scaling, YaRN, and LongRoPE are three escalating fixes for that one failure mode.
TL;DR
- RoPE (Rotary Positional Embedding, Su et al., 2021) encodes position by rotating query and key vectors in 2D pairs by an angle proportional to position. The dot product then depends only on the relative position m - n.
- It’s the de facto standard in 2024–2026: Llama family, Mistral, Qwen, DeepSeek, Gemma all use it.
- The base wavelength (set by rope_theta) controls how far the model can extrapolate. The standard base θ = 10,000 caps out at ~4–8K tokens. Beyond that, attention scores collapse.
- NTK-aware / YaRN / LongRoPE are the three rescaling techniques that extend a model’s context to 32K → 128K → 1M tokens with light continued pretraining.
- For inference, PI (Position Interpolation), NTK-by-parts, and YaRN are the three you’ll see in real configs. LongRoPE (2024) is the current frontier for 2M+ token contexts.
Why position has to be injected somehow
Attention is permutation-invariant. Without a positional signal, “the cat sat on the mat” and “the mat sat on the cat” produce identical outputs — attention is a set operation, not a sequence operation. Every transformer needs some way to tell each token where it is.
For a long time we used learned absolute embeddings, then sinusoidal, then ALiBi, then RoPE. RoPE won because:
- It’s purely relative — attention(q_m, k_n) depends only on m - n, not absolute positions.
- It composes cleanly with attention’s existing math — no extra parameters, no separate embedding to add.
- It’s straightforward to scale to longer contexts than training saw, with the right tricks.
If you can’t read a RoPE config (theta = 500000, factor = 8.0, original_max_position = 8192), you can’t operate on a long-context model.
Mental model
The math is rotation in pairs of dimensions. For each pair (2i, 2i+1) of the head dim d, rotate by an angle m · θ_i, where θ_i = base^(-2i/d). Different dimension pairs rotate at very different frequencies — high-frequency pairs encode short-range distinctions, low-frequency pairs encode long-range.
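A few lines of NumPy make that frequency spread concrete. This is a sketch, not anyone's reference code; the head dim of 128 and base of 10,000 are purely illustrative (Llama-3.1 uses 500,000).

```python
import numpy as np

# Per-pair rotation frequencies for one attention head.
# head_dim and base are illustrative values, not taken from any specific model.
head_dim = 128
base = 10000.0

i = np.arange(head_dim // 2)              # pair index 0 .. d/2 - 1
theta = base ** (-2 * i / head_dim)       # radians advanced per token, per pair
wavelength = 2 * np.pi / theta            # tokens needed for one full revolution

print(f"fastest pair: 1 revolution every {wavelength[0]:6.1f} tokens")   # ~6.3 tokens
print(f"slowest pair: 1 revolution every {wavelength[-1]:,.0f} tokens")  # tens of thousands
```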
The math
For a vector x at position m, treat each dimension pair (x_{2i}, x_{2i+1}) as a complex number and rotate it by the angle m · θ_i:
(x_{2i}, x_{2i+1}) → (x_{2i} cos(m · θ_i) - x_{2i+1} sin(m · θ_i), x_{2i} sin(m · θ_i) + x_{2i+1} cos(m · θ_i))
with θ_i = base^(-2i/d) for i = 0, …, d/2 - 1.
The crucial property: after rotating q by position m and k by position n, the dot product is
(R_m q) · (R_n k) = (R_{m-n} q) · k
— purely relative.
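The property is easy to verify numerically. A minimal sketch (NumPy; the head dim and the positions m, n are arbitrary choices):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate each (even, odd) dimension pair of x by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
m, n = 1000, 950

lhs = rope_rotate(q, m) @ rope_rotate(k, n)   # rotate both, then take the dot product
rhs = rope_rotate(q, m - n) @ k               # rotate q by the relative offset only
print(np.allclose(lhs, rhs))                  # True: the score depends only on m - n
```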
Why it breaks at long context
At training time, the model sees positions up to the training length (say 4K). The high-frequency dimensions cycle many times — cos(4096 · θ_0) has gone around hundreds of revolutions — so those pairs have seen every phase. The lowest-frequency dimensions, whose wavelengths are longer than 4K, never complete even one revolution.
At inference, if you push to 32K or 128K, those low-frequency pairs land in phase regions the model never saw. Attention scores degrade. The model produces garbage past ~8K.
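You can count how many pairs fall in that never-wrapped regime. A small sketch, assuming a 4K training length, head dim 128, and base 10,000 (all illustrative):

```python
import numpy as np

head_dim, base, train_len = 128, 10000.0, 4096

theta = base ** (-2 * np.arange(head_dim // 2) / head_dim)
wavelength = 2 * np.pi / theta

# Pairs whose wavelength exceeds the training length never completed a full
# revolution during training; these are the ones that hit unseen angles
# when inference runs past train_len.
never_wrapped = wavelength > train_len
print(f"{never_wrapped.sum()} of {head_dim // 2} pairs never saw a full cycle "
      f"within {train_len} tokens")
```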
The four rescaling techniques
1. Position Interpolation (PI) — Chen et al., 2023. Linearly compress positions: replace position m with m / s, where s = L_target / L_train. Effectively makes all wavelengths longer. Quick, requires fine-tuning to recover quality. Now superseded.
2. NTK-aware — bloc97 (LocalLLaMA, 2023). Rescale the base: θ' = θ · s^(d/(d-2)), so that high-frequency pairs are barely touched and low-frequency pairs interpolate smoothly. Better than PI; minimal fine-tuning.
3. YaRN (Yet another RoPE extensioN) — Peng et al., 2023. The most common production choice in open models (Qwen2.5 long-context variants, DeepSeek-V3’s 128K extension, and others). YaRN is NTK-by-parts — different scaling for different frequency bands — plus an attention-temperature correction. Often used with ~100M tokens of continued pretraining. A simplified sketch of these frequency transforms follows this list.
4. LongRoPE — Ding et al., Microsoft, 2024. Search for non-uniform rescaling factors per dimension via evolutionary algorithms. Pushed Llama-2 7B from 4K → 2M tokens with minimal fine-tuning. The current frontier; reflected in 2025 model configs targeting >1M context.
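The first three are easy to state as transforms on the per-pair frequencies. A minimal sketch (NumPy); the head dim, scale factor, and the ramp bounds in the YaRN-style function are illustrative simplifications, not the papers’ exact parameterisations:

```python
import numpy as np

def rope_freqs(head_dim, base=10000.0):
    return base ** (-2 * np.arange(head_dim // 2) / head_dim)

def pi_freqs(head_dim, scale, base=10000.0):
    # Position Interpolation: equivalent to feeding position m / scale,
    # i.e. every frequency is divided by the same factor.
    return rope_freqs(head_dim, base) / scale

def ntk_aware_freqs(head_dim, scale, base=10000.0):
    # NTK-aware: rescale the base so high-frequency pairs barely move and
    # low-frequency pairs end up interpolated by roughly the full factor.
    new_base = base * scale ** (head_dim / (head_dim - 2))
    return rope_freqs(head_dim, new_base)

def yarn_like_freqs(head_dim, scale, train_len, base=10000.0):
    # YaRN-flavoured NTK-by-parts (simplified): leave pairs that wrapped many
    # times during training untouched, fully interpolate pairs that never
    # completed a cycle, and ramp linearly in between. The real YaRN ramp is
    # parameterised by beta_fast / beta_slow; this only shows the idea.
    theta = rope_freqs(head_dim, base)
    cycles_in_training = train_len * theta / (2 * np.pi)
    ramp = np.clip((cycles_in_training - 1) / (32 - 1), 0.0, 1.0)  # 1 = high freq, 0 = low freq
    return theta * (ramp + (1 - ramp) / scale)

d, s, L = 128, 8.0, 4096
for name, f in [("rope", rope_freqs(d)), ("pi", pi_freqs(d, s)),
                ("ntk", ntk_aware_freqs(d, s)), ("yarn-ish", yarn_like_freqs(d, s, L))]:
    print(f"{name:8s} fastest {f[0]:.4f}  slowest {f[-1]:.2e}")
```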
Real configs you’ll see in 2026
// Llama-3.1 70B (8K → 128K via post-training)
"rope_scaling": {
"type": "llama3",
"factor": 8.0,
"original_max_position_embeddings": 8192,
"low_freq_factor": 1.0,
"high_freq_factor": 4.0
}
// Qwen2.5-7B-Instruct-1M
"rope_scaling": { "type": "yarn", "factor": 4.0, "original_max_position_embeddings": 262144 }
// DeepSeek-V3 (4K base, 128K via YaRN)
"rope_scaling": { "type": "yarn", "factor": 40.0, "beta_fast": 32, "beta_slow": 1, "mscale": 0.707 }
// LongRoPE-tuned (custom per-dim factors)
"rope_scaling": { "type": "longrope", "long_factor": [2.5, 2.7, 3.1, 3.3, ...], "short_factor": [...] }You read a model card by checking rope_scaling. That tells you the maximum honest context length and which technique was used.
Run it in your browser — visualize the rotation
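A minimal NumPy sketch of the same experiment; the head dim of 64 and base of 10,000 are assumed here for illustration:

```python
import numpy as np

head_dim, base = 64, 10000.0
theta = base ** (-2 * np.arange(head_dim // 2) / head_dim)

for pos in (0, 1, 4096):
    revolutions = pos * theta / (2 * np.pi)   # full turns completed by each pair
    print(f"pos={pos:5d}  fastest pair: {revolutions[0]:8.1f} revs   "
          f"slowest pair: {revolutions[-1]:.4f} revs")
```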
You’ll see: at pos=0, identity. At pos=1, slight rotation. At pos=4096, the high-frequency pairs have wrapped many times; the low-frequency pairs barely budged. That’s why RoPE encodes both fine local and coarse global position.
Key takeaways
- RoPE = rotate in 2D pairs by an angle proportional to position. Every modern Transformer except a few outliers uses it.
- Different dimension pairs rotate at very different rates. This is what gives RoPE both fine and coarse positional information.
- rope_scaling in a model config tells you everything: type (yarn, llama3, longrope), factor, base. Read it before assuming a context length.
- Beyond 4× extrapolation requires fine-tuning. Pure inference-time scaling stops working past ~32K from a 4K base.
- LongRoPE pushed the boundary to 2M tokens. Expect this and successors to be the dominant long-context approach through 2026.
Go deeper
- Paper: RoFormer: Enhanced Transformer with Rotary Position Embedding. The original RoPE paper. Surprisingly readable.
- Paper: Extending Context Window of Large Language Models via Positional Interpolation. PI is the historical baseline, but the framing is foundational.
- Paper: YaRN: Efficient Context Window Extension of Large Language Models. NTK-by-parts plus attention temperature. The most-deployed extension technique through 2024–2026.
- Paper: LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens. Per-dim factors found by evolutionary search. The frontier of long context.
- Blog: YaRN, explained by EleutherAI. The author-adjacent walkthrough. Best non-paper YaRN explainer.
- Video: Eleuther RoPE deep dive. A whiteboard walk through the math.
- Repo: jzhang38/EasyContext. Open implementations of every major RoPE-extension technique. Read it for clean reference code.
- Docs: Hugging Face, RoPE scaling explained. How `rope_scaling` is interpreted across the major model families. Pragmatic.