
SFT & Instruction Tuning

In a managed cloud environment, “train a model” is a YAML and a button. The orchestrator schedules workers, ingests data, runs your code, and writes checkpoints. You don’t see the schedulers, the GPU drivers, the gradient broadcasts. The whole stack is abstracted into one API call.

A modern PyTorch script looks the same way. Two lines of Python — trainer = SFTTrainer(...) then trainer.train(). Behind those two lines: every prompt token gets masked out of the loss, short conversations get packed into long sequences with a custom attention mask, weights replicate across 8 GPUs, NCCL all-reduces gradients every micro-batch, AdamW updates parameters, checkpoints stream to disk. None of it is in your code.

This lesson is about what those two lines actually do — and the three places the abstraction leaks: chat templates, prompt loss masking, and sample packing. Get those three right and SFT is boring. Get any of them wrong and the model that comes out won’t follow instructions.

TL;DR

  • Supervised Fine-Tuning (SFT) is the first post-training step: you train on (instruction, response) pairs so the model learns to follow directions instead of just completing text.
  • Chat templates (ChatML, Llama 3 format) wrap conversations with special tokens that delineate system/user/assistant turns. Using the wrong template at inference = broken model. Always match training and serving templates.
  • Prompt loss masking zeros the loss on instruction tokens (set labels to -100). The model only learns to generate responses, not to memorize your prompts.
  • Sample packing concatenates multiple short conversations into one sequence, eliminating padding waste. With proper attention masking, this gives 2–4× training throughput on typical chat datasets.
  • Quality > quantity. 10K carefully curated examples often beats 1M noisy ones. The LIMA paper (2023) showed 1K high-quality examples suffice for strong instruction-following.

What SFT actually is

A pretrained LLM is a next-token autocompleter trained on web text. Give it "The capital of France is" and it predicts "Paris". Give it "What is the capital of France?" and it might predict "Many of you may not know..." because the most likely continuation in its training data was a quiz, not an answer.

SFT fixes that by training the model on (instruction, response) pairs and computing loss only on the response tokens. The model learns one thing: when the input ends with the assistant turn marker, generate a helpful answer.
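
Concretely, a single training example in the common Hugging Face "messages" format looks like this (the content is illustrative):

example = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},        # loss masked
        {"role": "assistant", "content": "The capital of France is Paris."},  # loss computed here
    ]
}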

It is the mandatory first step. Every chatbot, coding assistant, and reasoning model goes through SFT before any preference alignment (DPO, RLHF) or RL (GRPO). Skip it and there is no policy to align — the base model just keeps autocompleting. Get SFT wrong — wrong template, no loss masking, bad data — and every subsequent training stage inherits the damage.

Mental model

SFT is the bridge. Without it, preference alignment has nothing to align and RL has no base policy to improve.

Chat templates — the hidden contract

The first place the abstraction leaks. Different model families wrap conversations in different special-token formats. Llama 3:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

The capital of France is Paris.<|eot_id|>

ChatML (used by Qwen, among other model families):

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is the capital of France?<|im_end|>
<|im_start|>assistant
The capital of France is Paris.<|im_end|>

The model learns one of these formats during pretraining or initial post-training. The cardinal rule: the template you train with must exactly match the template at inference. SFT with ChatML, serve with Llama 3 format, and the model sees gibberish where its turn markers should be.

Always use the tokenizer’s built-in helper:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False)

apply_chat_template reads the template that ships with the tokenizer configuration, so it always matches the model. Every working SFT pipeline calls it. Every broken one rolled its own and got a typo.
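
The same helper covers the serving side. Continuing the snippet above, a minimal sketch of inference-time use; add_generation_prompt is part of the transformers API and appends the empty assistant header the model is trained to complete:

# At inference, drop the assistant message and request the generation prompt,
# so the served prompt ends exactly where training prompts ended.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
    ],
    tokenize=False,
    add_generation_prompt=True,   # appends the assistant turn marker
)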

Prompt loss masking

The second place. Cross-entropy over a templated conversation computes, by default, a gradient for every token — system prompt, user question, chat-template scaffolding, assistant response. Three of those four are tokens the model will never have to generate at inference. The gradient updates flowing through them are wasted at best and actively harmful at worst (the model learns to parrot prompt patterns instead of answering them).

The fix is one line:

labels = input_ids.clone()
for i, token_id in enumerate(input_ids):
    if in_prompt_region(i):   # system + user turns
        labels[i] = -100      # CrossEntropyLoss ignores -100

PyTorch’s CrossEntropyLoss treats -100 as “skip this position.” The loss is now computed only on assistant tokens.
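
A quick sanity check of that behavior with toy tensors (not real model outputs):

import torch
import torch.nn.functional as F

logits = torch.randn(4, 32)                 # 4 positions, vocabulary of 32
labels = torch.tensor([-100, -100, 7, 12])  # first two positions are prompt
loss = F.cross_entropy(logits, labels, ignore_index=-100)
# The loss averages over positions 2 and 3 only; the -100 positions
# contribute neither loss nor gradient.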

In trl, the standard SFT library, prompt masking is handled automatically when you pass structured messages:

from trl import SFTTrainer, SFTConfig

config = SFTConfig(
    dataset_text_field="text",
    max_seq_length=4096,
    packing=True,                      # enable sample packing
    dataset_kwargs={
        "add_special_tokens": False,   # template already has them
    },
)

If you build a custom training loop, you have to mask yourself. The number of “my SFT didn’t work” debugging sessions that come down to forgetting this one step is non-trivial.
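
A minimal sketch of that masking for a single-turn conversation, assuming you have already tokenized the assistant turn marker for your template (the mask_prompt helper below is illustrative, not a library API):

import torch

def mask_prompt(input_ids: torch.Tensor, assistant_marker: list) -> torch.Tensor:
    """Set labels to -100 up to and including the assistant turn marker,
    so loss is computed only on the response tokens."""
    labels = input_ids.clone()
    ids = input_ids.tolist()
    n = len(assistant_marker)
    start = next(
        (i for i in range(len(ids) - n + 1) if ids[i:i + n] == assistant_marker),
        None,
    )
    if start is None:
        labels[:] = -100              # marker missing: exclude the whole example
    else:
        labels[: start + n] = -100    # mask system + user turns + scaffolding
    return labels

# marker = tokenizer.encode("<|start_header_id|>assistant<|end_header_id|>",
#                           add_special_tokens=False)  # Llama 3-style marker
# labels = mask_prompt(input_ids, marker)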

Sample packing

The third place. Real chat datasets have wildly variable conversation lengths — 50 tokens, 4000 tokens, every length in between. Naive padding to max_seq_length=4096 looks like this:

Seq 1: [tokens tokens tokens PAD PAD PAD PAD PAD PAD PAD]    ← 70% padding
Seq 2: [tokens tokens PAD PAD PAD PAD PAD PAD PAD PAD PAD]   ← 80% padding
Seq 3: [tokens tokens tokens tokens tokens PAD PAD PAD PAD]  ← 40% padding

You’re paying the GPU to multiply zeros against zeros. Packing concatenates short conversations into one long sequence:

Packed: [seq1_tokens SEP seq2_tokens SEP seq3_tokens PAD] ← 5% padding

The subtlety: you must block cross-attention between packed samples using cu_seqlens (FlashAttention’s “cumulative sequence lengths”) or a block-diagonal attention mask. Without it, sample 2 attends to sample 1’s tokens — a silent data leak that you only notice when eval scores look weirdly good but the model behaves strangely on real inputs.
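
A sketch of what cu_seqlens looks like for one packed row: it is just the cumulative lengths of the packed conversations with a leading zero, which variable-length attention kernels treat as segment boundaries.

import torch

seq_lens = [50, 4000, 42]   # lengths of the conversations packed into one row
cu_seqlens = torch.cumsum(torch.tensor([0] + seq_lens), dim=0)
# tensor([   0,   50, 4050, 4092])
# A varlen attention kernel (e.g. FlashAttention's flash_attn_varlen_func)
# takes these boundaries so tokens in one segment never attend across them.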

Done correctly, packing gives 2–4× training throughput on typical chat data.

The data quality ladder

| Dataset size | Quality | Typical result |
| --- | --- | --- |
| 1K curated (LIMA-style) | Expert-written, diverse | Strong instruction-following, sometimes brittle on edge cases |
| 10K–50K (OpenHermes, Capybara) | Mix of synthetic + human | Good general-purpose chatbot |
| 100K–500K (UltraChat, SlimOrca) | Mostly synthetic, filtered | Robust but can be generic |
| 1M+ (WildChat, ShareGPT dumps) | Noisy, diverse | Volume helps coverage, but quality per example is low |

2025–2026 consensus: start with 10K–50K high-quality examples. Scale up only if evaluation shows coverage gaps. Synthetic data from frontier models (GPT-4o, Claude 3.5) is the most cost-effective source.

The training recipe

Here’s what those two SFTTrainer lines actually expand to:

# Full SFT recipe for an 8B model on a single A100/H100
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from peft import LoraConfig

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    torch_dtype="bfloat16",
    attn_implementation="flash_attention_2",
)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

config = SFTConfig(
    output_dir="./sft-output",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective bs = 16
    learning_rate=2e-4,              # typical LoRA LR
    num_train_epochs=3,
    max_seq_length=4096,
    packing=True,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model=model,
    args=config,
    train_dataset=dataset,   # HF dataset with "messages" column
    peft_config=lora_config,
)
trainer.train()

Notice what’s not in your code: the data collator that builds the masked-label tensors, the gradient accumulation that stages 4 micro-batches before each optimizer step, the bf16 autocast that decides which ops run in FP32 for stability, the PEFT wrapping that swaps each targeted nn.Linear for a low-rank adapter, the FSDP/DDP shard-or-replicate decision (handled by accelerate if you launch with accelerate launch), the FlashAttention kernel that replaces the attention layer’s softmax(QKᵀ)·V with a tiled, SMEM-staged version. All of that hangs off one method call.

The Python you wrote is the slowest, most boring 1% of what’s running on the GPUs. The other 99% is FlashAttention kernels, NCCL all-reduces, AdamW fused-update kernels — the things in the rest of this module.


Quick check

You SFT a Llama 3 model but forget to apply prompt loss masking. What's the most likely consequence?

Key takeaways

  1. SFT = teach the model to follow instructions. It’s the mandatory bridge between pretraining and alignment.
  2. Chat template mismatch is the #1 SFT bug. Always use tokenizer.apply_chat_template() — never format manually.
  3. Mask prompt tokens (labels[prompt_indices] = -100). The model should only learn to generate responses.
  4. Sample packing gives 2–4× throughput on typical chat data. Use it with proper attention masking to prevent cross-contamination.
  5. Quality > quantity. 10K expert examples beats 1M noisy ones. Start small, evaluate, scale if needed.

Go deeper
