
vLLM & SGLang

When you type vllm serve meta-llama/Llama-3.3-70B-Instruct --tensor-parallel-size 8 into a terminal and 90 seconds later curl localhost:8000/v1/chat/completions returns a streaming OpenAI-shaped response, an enormous amount of machinery has just started running on your behalf. PagedAttention is managing the KV cache in fixed-size blocks so memory doesn’t fragment. Continuous batching is packing every decode step with as many concurrent sequences as fit. Prefix caching is reusing the prefill of every shared system prompt across requests. One CLI command and you’ve inherited three years of inference-systems research.

That’s vLLM — the boring-and-correct default, the option to pick when you don’t have a reason to pick something else. Its main competitor in 2026, SGLang, makes a different bet: build the system around a tree of shared prefixes (RadixAttention) and you can reuse KV cache far more aggressively than vLLM’s flat prefix cache, especially on agent and chat workloads where 1,000 concurrent users share a 4K-token system prompt. On those workloads SGLang reliably pulls 1.5–3× ahead. The serving stack landscape settles into a flowchart by 2026: vLLM for general traffic, SGLang for shared-prefix-heavy workloads, TensorRT-LLM if you need NVIDIA-peak throughput and have the build budget, TGI if you live in the HuggingFace ecosystem.

TL;DR

  • vLLM = the default inference server. PagedAttention, continuous batching, broad model support, well-documented. Easy to deploy.
  • SGLang = the throughput-focused alternative. RadixAttention for shared-prefix workloads, structured-output via constrained decoding (XGrammar), often 1.5–3× faster on agent / chat workloads where prompts share long preambles.
  • TGI (HuggingFace) = the convenient choice if you’re already in the HF ecosystem; ships less aggressive perf, more polish.
  • TensorRT-LLM = peak NVIDIA performance; pay in build complexity and Hopper-only optimizations.
  • Pick by workload: general traffic → vLLM. Many shared-prefix / agentic requests → SGLang. NVIDIA-native peak throughput → TensorRT-LLM. Already living in the HF ecosystem → TGI.

Why this matters

You can’t write production LLM apps without picking a serving stack. The difference between picking right and picking wrong is often 2–4× cost and 30–60% latency, on the same hardware running the same model.

By April 2026, the four-way decision is settled enough that you can make it from a flowchart. This lesson is that flowchart.

Mental model

These aren’t mutually exclusive — many production stacks run vLLM for the bulk of traffic and SGLang for the agentic-prefix subset. But picking the primary is the question this lesson answers.

What each one actually is

vLLM

The reference implementation of PagedAttention + continuous batching. The defaults are good. Engine API is stable. Supports almost every popular open model within days of release.

pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --dtype bfloat16 \
    --gpu-memory-utilization 0.9 \
    --enable-prefix-caching

That gives you an OpenAI-compatible API on port 8000. Test it with curl localhost:8000/v1/chat/completions.
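
If you prefer to script the check, a minimal client against the same endpoint looks like this (plain Python with the requests library; the model name must match whatever you passed to --model):

# Minimal smoke test of the OpenAI-compatible endpoint started above.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])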

vLLM v1 (late 2024) rewrote the engine around chunked prefill + PagedAttention as the unified primitive — same scheduling for prefill and decode steps. ~20–30% throughput improvement over v0; the default since 2025.

Strengths: ubiquitous model support, broad community, OpenAI-compatible API, prefix caching, multi-LoRA serving (--enable-lora), good docs.

Weaknesses: less aggressive on the absolute throughput frontier than SGLang/TRT-LLM for specific workloads.

SGLang

The throughput-focused alternative. Built around RadixAttention — a generalization of vLLM’s prefix caching to a radix tree of shared prefixes across requests, with explicit structured-prompt support.

pip install sglang

python -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --port 30000

RadixAttention’s win: if your traffic has a tree of overlapping prefixes (a system prompt + branching agent paths, or a long document + many follow-up questions), SGLang shares the KV cache across that tree automatically. Hit rate climbs, KV memory drops, throughput soars.
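
To see the bookkeeping in miniature, here is a toy sketch of radix-style prefix reuse (illustrative only, not SGLang's internals): tokens already covered by a cached prefix never need a second prefill.

# Toy illustration of radix-style prefix reuse (not SGLang internals).
# Each served request leaves its token sequence in the tree; a later request
# only needs prefill for the tokens past its longest cached prefix.
class ToyPrefixTree:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def cached_prefix_len(self, tokens):
        node, hit = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node, hit = node[t], hit + 1
        return hit

tree = ToyPrefixTree()
system_prompt = list(range(4096))              # stand-in for a 4K-token shared system prompt
tree.insert(system_prompt + [10001, 10002])    # first request, fully prefilled

new_request = system_prompt + [20001, 20002, 20003]
hit = tree.cached_prefix_len(new_request)
print(f"prefill needed: {len(new_request) - hit} of {len(new_request)} tokens")
# prefill needed: 3 of 4099 tokens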

Structured output: SGLang ships native support for grammar-constrained decoding via XGrammar. JSON mode, regex mode, FSM mode are all native and fast. (vLLM also supports these now via guided-decoding plugins; SGLang’s path is more polished.)
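
From the client side that looks roughly like the sketch below, assuming the SGLang server accepts the OpenAI-style response_format field with a JSON schema (field names can differ between versions, so check the structured-output docs for the release you deploy):

# Hedged sketch: JSON-schema-constrained output through SGLang's
# OpenAI-compatible API (server from the launch command above, port 30000).
import requests

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
}

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Classify: 'The latency is amazing.'"}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": "sentiment", "schema": schema},
        },
        "max_tokens": 64,
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])  # should parse as schema-valid JSON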

Strengths: highest throughput on prefix-shared workloads, best structured-output story, fast iteration on frontier models.

Weaknesses: smaller community than vLLM, occasional rough edges, more involved deployment.

TGI (Text Generation Inference)

HuggingFace’s serving framework. Polished, well-integrated with HF Hub. Ships multi-LoRA, quantization support, and a clean Rust-based gRPC server.

Strengths: seamless HF Hub integration, official quantization support (GPTQ, AWQ), production polish.

Weaknesses: historically slower than vLLM/SGLang on raw throughput; closed the gap in 2024 but still trails on some workloads.

TensorRT-LLM

NVIDIA’s own. Compiles models to highly-tuned CUDA kernels specific to the target GPU. Peak throughput on H100/B200; expect 20–40% over vLLM on the same hardware on apples-to-apples workloads.

Strengths: raw throughput, FP8 first-class, integrated with Triton Inference Server.

Weaknesses: model-specific build process (you compile the model into a TensorRT engine, often 30 min on H100), Hopper-only feature flags, less open community.

Real numbers — Llama-3.1 8B, single H100, in-distribution chat workload

Server                            Throughput (tok/s, batch 32)   p50 latency   p99 latency   Setup time
HuggingFace generate              1,200                          80 ms         250 ms        2 min
TGI                               5,800                          28 ms         110 ms        5 min
vLLM v1                           7,400                          22 ms         90 ms         5 min
SGLang (no shared prefix)         8,100                          20 ms         80 ms         10 min
SGLang (shared system prompt)     14,000                         14 ms         62 ms         10 min
TensorRT-LLM (FP8)                9,500                          18 ms         75 ms         30 min build

(Numbers approximate; vary with model, hardware, and benchmark.)

The killer SGLang result is the shared-prefix case — when many requests start with the same long system prompt, RadixAttention reuses the KV cache and throughput jumps almost 2× over the no-share baseline.
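
Back-of-envelope on what that means for cost: at a fixed hourly GPU price, cost per token is just price divided by throughput, so the table translates directly into dollars (the $3/hour H100 figure below is an illustrative assumption, not a quote):

# Rough cost per million output tokens, using the throughput column above
# and an assumed $3/hour H100 price.
GPU_PRICE_PER_HOUR = 3.00
throughput = {                                  # tok/s at batch 32, from the table
    "vLLM v1": 7_400,
    "SGLang (no shared prefix)": 8_100,
    "SGLang (shared system prompt)": 14_000,
    "TensorRT-LLM (FP8)": 9_500,
}
for name, tps in throughput.items():
    usd_per_mtok = GPU_PRICE_PER_HOUR / (tps * 3600) * 1_000_000
    print(f"{name:32s} ${usd_per_mtok:.3f} per 1M tokens")
# The shared-prefix SGLang case comes out ~1.9x cheaper per token than vLLM v1.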

When to pick each

Default to vLLM unless you have a specific reason. It’s the most boring choice and the most likely to keep working as you scale. Broad model support, big community, predictable.

Pick SGLang if:

  • You’re building agents (long system prompt + many short queries)
  • You’re doing RAG with templated prompts
  • You need structured output as a first-class feature
  • You measured the gain on your workload and it’s worth the rougher edges

Pick TensorRT-LLM if:

  • You’re on H100/B200 and willing to invest in builds
  • You need the absolute throughput ceiling
  • You have an internal team comfortable with NVIDIA-specific tooling

Pick TGI if:

  • You’re already living in the HuggingFace stack
  • Smooth integration matters more than peak throughput

Run it in your browser — pick the right server from a workload spec

A toy decision engine that mirrors the flowchart above: plug in your traffic shape and it picks a server.
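
A minimal sketch of what such a helper might look like (the workload fields and thresholds are illustrative, not measured cutoffs):

# Illustrative decision helper mirroring the flowchart in this lesson.
from dataclasses import dataclass

@dataclass
class Workload:
    shared_prefix_tokens: int        # tokens common across concurrent requests
    avg_prompt_tokens: int
    needs_structured_output: bool
    nvidia_build_budget: bool        # team willing to maintain TensorRT engines
    hf_ecosystem_only: bool

def pick_server(w: Workload) -> str:
    prefix_share = w.shared_prefix_tokens / max(w.avg_prompt_tokens, 1)
    if prefix_share > 0.5 or w.needs_structured_output:
        return "SGLang"
    if w.nvidia_build_budget:
        return "TensorRT-LLM"
    if w.hf_ecosystem_only:
        return "TGI"
    return "vLLM"

# The support-agent workload from the quick check below: a 4K-token shared
# system prompt dominates each ~4.3K-token prompt, so prefix share is high.
agent = Workload(4096, 4096 + 200, False, False, False)
print(pick_server(agent))  # SGLang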

The decision is rarely a flowchart in real life — usually you run vLLM, measure, and then decide whether a parallel SGLang deployment is worth carrying. But the shape of the question is the lesson.

Quick check

You're building a customer-support agent that uses a 4K-token system prompt + tool definitions, then answers a typical question with a 200-token user query and a 300-token reply. You expect ~1,000 concurrent users. Which server is the strongest pick?

Key takeaways

  1. vLLM is the default. Broad support, mature, OpenAI-compatible API. Pick it unless you have a reason not to.
  2. SGLang wins on shared prefixes. Agents, chat, RAG with templated prompts — RadixAttention is genuinely better than prefix caching at this.
  3. TensorRT-LLM is the absolute peak on NVIDIA hardware, with the build cost.
  4. TGI is the HuggingFace-ecosystem choice. Less aggressive perf, smoother integration.
  5. Production stacks often run multiple servers — vLLM for general traffic, SGLang for agentic. Routing layer in front.

Go deeper
