Sampling
When you call model.generate(temperature=0.7, top_p=0.9), the model itself stops at one specific point: it produces a vector of logits, one per vocabulary token. Everything that happens between “here are the logits” and “here is the next token” is sampling, and it lives outside the model. It’s the cheapest knob in your stack — a few microseconds per token, no GPU work — and the one most likely to be set wrong.
A bad temperature on a code model means hallucinated APIs. A bad top-p on a creative model means slop. A 2024 study showed that switching default sampling from top_p=0.9 to min_p=0.05 cut hallucination on long-form generation by ~10% on the same model with no other changes. Sampling is free quality, and knowing which lever to pull, when, and in what order is one of the highest-leverage skills in applied LLM work. This lesson walks through every lever — temperature, top-k, top-p, min-p, repetition penalties — in the order a production stack applies them, and shows why min-p has quietly become the 2024–2026 default while top-p is on its way out.
TL;DR
- The model outputs logits — one number per vocab token. Sampling is everything that turns those logits into a single chosen token.
- Temperature rescales logits before softmax. T < 1 sharpens; T > 1 flattens; T = 0 is greedy argmax.
- Top-k keeps only the k highest-probability tokens, renormalizes, samples. Crude but cheap.
- Top-p (nucleus) keeps the smallest set whose cumulative probability ≥ p, renormalizes, samples. The default for Llama / Mistral / GPT for years.
- Min-p keeps tokens whose probability is ≥ p_min × max_prob. Adapts to the distribution’s confidence — wider when the model is uncertain, narrower when it’s sharp. The 2024–2026 frontier; default in DeepSeek-R1, Qwen-3, and llama.cpp.
- Order of operations matters. Logit penalties → temperature → top-k → top-p / min-p → sample. Doing it in the wrong order changes the result.
What “sampling” actually is
Whatever stack you use — Hugging Face transformers, OpenAI’s API, vLLM’s serving engine — the user just sees model.generate(...). Inside, the loop is dead simple: run the model forward, get logits over the vocab, pick a token, append it, repeat. The “pick a token” step is the only place you have any agency, and it’s a tiny pure function: logits → token.
A logit is just the unnormalized score for a token. Pass logits through softmax and you get a probability distribution over the whole vocab — typically 32K to 256K entries. Sampling is then “given a distribution over tokens, draw one.” The fight is over which distribution: the raw one, or one we’ve sharpened, truncated, or biased before drawing.
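That conversion is two lines of numpy. A minimal sketch (the softmax helper below is this lesson’s, not a library call; the later snippets reuse it):

```python
import numpy as np

def softmax(z):
    """Turn a (V,) vector of logits into a probability distribution."""
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()
```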
Mental model
Each filter is a “what to throw away” step. The output is always a smaller distribution than the input; the sample is drawn at the end.
Temperature
Logits z of shape (V,). Probabilities are p = softmax(z / T). The math:
- T = 1 leaves the distribution unchanged.
- T → 0 puts all mass on argmax(z) — pure greedy.
- T → ∞ flattens toward uniform.
- T = 0.7 (typical chat default) means “be a little more decisive than the raw T = 1 distribution.”
Temperature acts on logits, not probabilities — softmax(z / 2) is mathematically different from softmax(z) / 2 (the latter isn’t even a distribution). Engines that “scale probabilities” instead of logits are subtly broken.
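As code, reusing the softmax above and treating T = 0 as the greedy special case engines implement (apply_temperature is a name chosen here, not a library call):

```python
def apply_temperature(logits, T):
    """Rescale logits by T, then softmax. Lower T sharpens the distribution."""
    if T == 0:  # greedy: all mass on the single argmax token
        out = np.zeros(len(logits))
        out[np.argmax(logits)] = 1.0
        return out
    return softmax(logits / T)
```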
Top-k
Keep the k largest-probability tokens, set the rest to 0, renormalize, sample. Simple, fast, but k is workload-dependent: if the true distribution puts 99% of mass on 4 tokens, top_k=50 leaves 46 essentially-zero options that occasionally hit; if the distribution is spread over 200 reasonable tokens, top_k=50 truncates real information.
People usually set top_k=40 and leave it. It’s a safety net, not a primary lever.
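For completeness, a minimal top-k filter in the same numpy style as the top-p and min-p snippets below (a sketch, not any engine’s implementation):

```python
def top_k(probs, k=40):
    """Keep the k highest-probability tokens, zero the rest, renormalize."""
    idx = np.argsort(-probs)[:k]  # indices of the k largest probabilities
    out = np.zeros_like(probs)
    out[idx] = probs[idx]
    return out / out.sum()
```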
Top-p (nucleus sampling)
Sort tokens by probability. Sum from the top until you cross threshold p. Keep that prefix; drop the rest. Renormalize. Sample.
```python
import numpy as np

def top_p(probs, p):
    """Nucleus filter: keep the smallest high-probability prefix whose
    cumulative mass reaches p, zero the rest, renormalize."""
    idx = np.argsort(-probs)            # token ids, descending probability
    csum = np.cumsum(probs[idx])
    keep = idx[csum <= p]               # prefix strictly under the threshold
    if len(keep) == 0:
        keep = idx[:1]                  # always keep at least one token
    elif csum[len(keep) - 1] < p and len(keep) < len(idx):
        keep = idx[:len(keep) + 1]      # include the boundary token
    out = np.zeros_like(probs)
    out[keep] = probs[keep]
    return out / out.sum()
```

Top-p adapts to the shape of the distribution: if the model is confident (most mass on top-3), top-p keeps top-3; if it’s diffuse (mass spread over 50), top-p keeps 50. That’s why it dethroned top-k as the default around 2019 (Holtzman et al.).
But it has a known failure mode: when the distribution is bimodal (one good answer plus a fat tail of near-equal weak alternatives), top-p often dips into the tail. The “right” answer fits in p=0.5, but at p=0.9 you’ve also admitted 30 marginal tokens.
Min-p — the 2024 fix
Top-p uses cumulative mass as its threshold; min-p uses peak-relative mass.
```python
def min_p(probs, p_min=0.05):
    """Peak-relative filter: keep tokens whose probability is at least
    p_min times the top token's, zero the rest, renormalize."""
    threshold = p_min * probs.max()
    keep = probs >= threshold
    out = np.where(keep, probs, 0)
    return out / out.sum()
```

If the top token has probability 0.6, min-p with p_min=0.05 keeps any token with probability ≥ 0.03. If the top token has probability 0.1 (the model is uncertain), min-p keeps any token with probability ≥ 0.005 — the gate widens automatically. The result: sharp distributions stay sharp, diffuse distributions stay diverse, exactly the opposite of what top-p does in those edge cases.
Empirically, min-p produces less hallucination on long-form generation and more diversity on creative tasks, both relative to top-p. Adopted by llama.cpp, vLLM, DeepSeek-R1, Qwen-3, and most local-LLM frontends in 2024–2025.
Order of operations
The conventional order, in vLLM / llama.cpp / Hugging Face:
```
logits
→ repetition / presence / frequency penalties (modify logits in place)
→ temperature (divide logits)
→ softmax → probabilities
→ top_k (keep top k)
→ top_p (cumulative nucleus)
→ min_p (peak-relative cutoff)
→ renormalize
→ sample
```

The reason temperature comes before top-k/p: cooling logits with T < 1 concentrates mass on the top tokens, which changes which tokens survive top-k/p. Doing temperature after the filter would be a different operation; engines that get the order wrong produce subtly wrong samples.
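Wired together, the full chain looks roughly like this, reusing softmax, top_k, top_p, and min_p from above (sample_token and its defaults are this lesson’s sketch, not any engine’s API; penalties are assumed to have already been applied to the logits):

```python
def sample_token(logits, T=0.7, k=40, p=1.0, p_min=0.05, rng=None):
    """Temperature -> softmax -> top-k -> top-p -> min-p -> sample."""
    rng = rng or np.random.default_rng()
    if T == 0:
        return int(np.argmax(logits))   # greedy shortcut
    probs = softmax(logits / T)         # temperature first, then filters
    if k:
        probs = top_k(probs, k)
    if p < 1.0:
        probs = top_p(probs, p)
    if p_min:
        probs = min_p(probs, p_min)     # each filter renormalizes
    return int(rng.choice(len(probs), p=probs))
```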
Logit penalties
Three classics, all applied to logits before temperature:
| Penalty | What it does | Typical value | When |
|---|---|---|---|
| repetition_penalty | Divide positive logits of recently-seen tokens by r > 1 (multiply negative ones). | 1.05–1.2 | Most tasks. Prevents direct loops. |
| presence_penalty (OpenAI-style) | Subtract a constant from the logit of any token already used. | 0–1 | Encourages new topics. |
| frequency_penalty | Subtract f × count(token) from the logit. | 0–1 | Discourages over-use of a single word. |
Repetition penalty is the only one most production stacks turn on by default. Presence/frequency are knobs you reach for when generating divergent outputs (brainstorming, summarization variants).
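As a sketch, all three penalties in one pass over the logits (the sign handling for repetition penalty follows the common HF/CTRL convention; apply_penalties is a name chosen here):

```python
from collections import Counter

def apply_penalties(logits, generated_ids, rep_pen=1.1,
                    presence=0.0, frequency=0.0):
    """Modify logits in place for every token already generated."""
    for tok, n in Counter(generated_ids).items():
        # Repetition: push the logit toward "less likely", sign-aware.
        if logits[tok] > 0:
            logits[tok] /= rep_pen
        else:
            logits[tok] *= rep_pen
        logits[tok] -= presence        # flat cost for appearing at all
        logits[tok] -= frequency * n   # cost grows with each occurrence
    return logits
```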
Reasoning models flip the script
DeepSeek-R1, OpenAI o-series, Qwen-3 reasoning models all recommend temperature 0.6, top-p 0.95, min-p 0.0 for the reasoning trace, then often a second pass at lower temperature for the final answer. The reasoning trace benefits from diversity (find a path through the search space); the final answer benefits from determinism (don’t introduce a typo into the conclusion).
This is a real architectural thing, not a sampling trick: many production stacks now run two sampling configs per request, switching at a sentinel token like </think>.
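A sketch of that switch, using sample_token from above (model.forward, the config values, and the sentinel handling are illustrative assumptions, not a specific engine’s API):

```python
TRACE_CFG  = dict(T=0.6, k=None, p=0.95, p_min=0.0)  # diverse trace (R1-style)
ANSWER_CFG = dict(T=0.0)                             # deterministic answer

def generate(model, ids, sentinel_id, max_new=2048):
    """Decode with one config for the trace, another after the sentinel."""
    cfg = TRACE_CFG
    for _ in range(max_new):
        logits = model.forward(ids)    # assumed per-step model API
        ids.append(sample_token(logits, **cfg))
        if ids[-1] == sentinel_id:     # e.g. the </think> token id
            cfg = ANSWER_CFG           # switch sampler configs mid-stream
    return ids
```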
Recipes that work
| Task | T | top_p | min_p | top_k | rep_pen |
|---|---|---|---|---|---|
| Code completion (precise) | 0.2 | — | 0.05 | — | 1.0 |
| Chat (default) | 0.7 | — | 0.05 | 40 | 1.05 |
| Creative writing | 0.9 | — | 0.02 | — | 1.1 |
| Reasoning trace (R1-style) | 0.6 | 0.95 | — | — | 1.0 |
| Final answer extraction | 0.0 | — | — | — | 1.0 |
| Summarization | 0.4 | — | 0.1 | — | 1.05 |
These are starting points, not laws. Tune on your eval, not your gut.
Run it in your browser
Pick a fake logit distribution, watch each sampler’s probability mass shift.
The teaching moment: at top-p=0.9, the sampler lets in the entire fuzzy tail. At min-p=0.2, it cleanly keeps only the two genuinely-confident options. On the same logits.
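If you’d rather reproduce it in numpy, here is the same experiment with made-up logits (two confident options plus a fuzzy eight-token tail), reusing softmax, top_p, and min_p from above:

```python
logits = np.array([5.0, 4.8] + [2.0] * 8)  # two leaders, fuzzy tail
probs = softmax(logits)

print("top-p 0.9 survivors:", np.count_nonzero(top_p(probs, 0.9)))  # 6
print("min-p 0.2 survivors:", np.count_nonzero(min_p(probs, 0.2)))  # 2
```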
Key takeaways
- Temperature → softmax → top-k → top-p / min-p → sample. This order is not negotiable; reversing it produces a different (usually worse) distribution.
- Min-p is the modern default for high-quality generation. Top-p is a 2019-era tool that struggles in bimodal cases. Adopt min-p where your stack supports it.
- Reasoning models want a two-config setup: high-diversity for the trace, deterministic for the final answer. Pin a sampler-switch on the </think> (or equivalent) sentinel.
- Repetition penalty 1.05–1.10 is almost always on. Presence and frequency penalties are situational.
- Sampling is free quality. Re-tune your defaults every time you change models — what worked for Llama-3.1 is not optimal for DeepSeek-R1 or Qwen-3.
Go deeper
- Paper: The Curious Case of Neural Text Degeneration. The nucleus sampling (top-p) paper. Section 4 shows why pure top-k is broken on language.
- Paper: Min-p Sampling: Balancing Creativity and Coherence at High Temperature. The min-p paper. Read section 3 for the dynamic-cutoff motivation; section 5 has the head-to-head with top-p.
- Paper: DeepSeek-R1 Technical Report. The "Inference Recipe" section specifies T=0.6, top_p=0.95 for the reasoning trace. The two-config setup is real.
- Blog: Hugging Face — How to generate text. Best gentle introduction to all the samplers, with side-by-side examples in transformers.
- Blog: Sampling: an interactive demo. Drag a slider; watch the distribution. The single best intuition-builder for top-p vs min-p.
- Docs: vLLM — Sampling Parameters. Authoritative on order-of-operations and which params are supported in v1.
- Repo: ggerganov/llama.cpp. `common/sampling.cpp` is the canonical reference for sampler ordering. Read the `llama_sampler_chain` setup.