vLLM / SGLang integration sketch¶
A draft architectural proposal for letting vLLM and SGLang accept
scree.Array as a canonical batch type. This is an external-facing
design doc — not yet an upstream PR.
What changes for vLLM¶
vLLM today uses its own packed-batch representation internally (SamplingMetadata + per-tier flat tensors). The conversion at the framework boundary is:
HF model output --> torch.Tensor (B, S, D) + attention_mask
|
v
vLLM internal layout --> packed_qkv: (total_tokens, ...) + cu_seqlens: (B+1,)
|
v
FlashAttention kernel --> cu_seqlens varlen path
vLLM already converged on the same packed layout scree uses. The only new code path needed is one type conversion at the boundary.
Proposed change¶
Add a single converter to vLLM's worker (Python-only, no kernel changes):
# vllm/worker/scree_bridge.py
from scree.bridges import from_torch_nested
def from_vllm_batch(vllm_metadata) -> scree.Array:
"""Wrap vLLM's packed_qkv + cu_seqlens as a scree.Array. Zero-copy."""
return scree.from_cu_seqlens(
values=vllm_metadata.packed_qkv,
cu_seqlens=vllm_metadata.cu_seqlens,
)
And the inverse for outputs:
def to_vllm_batch(arr: scree.Array, vllm_metadata) -> None:
"""In-place write of arr.values into vllm's output slot."""
# cu_seqlens is already shared (it IS arr.offsets); only values move.
vllm_metadata.output_buffer.copy_(arr.values)
That's the entire kernel-level integration: two functions, ~10 lines each, both zero-copy.
What this unlocks for vLLM¶
-
Cross-engine prefix sharing. Two vLLM workers running different models but with shared system prompts could share the prefix slice of their KV caches by sharing
scree.Arrayhandles. (Beyond v0.1 scope but a clean v0.2 direction.) -
Decoupling from FlashAttention. vLLM today calls
flash_attn_varlen_funcdirectly. With scree as the typed batch, that call becomesvarlen_attention_triton(arr.values, arr.values, arr.values, arr.offsets, causal=True)— same speed (we benchmark 1.21× of FA-2 on the forward path), with the option of switching to other backends if the platform doesn't have CUDA. -
Multimodal batching. vLLM's multimodal path currently has separate code for image-token interleaving. With
scree.Arrayit's the same type as text-only — no special multimodal kernel needed (seeexamples/05_multimodal_interleaved.py).
What changes for SGLang¶
SGLang's RadixAttention already uses packed cu_seqlens internally. The integration is the same one-converter pattern:
# sglang/srt/scree_bridge.py
def from_sglang_batch(sg_state) -> scree.Array:
return scree.from_cu_seqlens(sg_state.packed_qkv, sg_state.cu_seqlens)
Additional benefit specific to SGLang: their radix-tree prefix cache could be reformulated to operate on scree.Array handles instead of raw tensor slices. This is a v0.2 conversation, not a v0.1 ask.
What scree does NOT change in either engine¶
- Kernel selection (vLLM keeps choosing FlashAttention or its own; scree is just the type the args travel in)
- Memory management (paged KV cache, block tables — all stays the same)
- Sampling, scheduling, sequence management — all unchanged
- API surface to the application — unchanged
The proposal is purely type-level. Both engines already use packed values + cu_seqlens; we're proposing they reuse a typed name for it across engines.
What we want from the vLLM / SGLang teams¶
-
Validation of the type shape. Does
scree.Array(values, offsets, ragged_dim=0)capture every variant of varlen batching you do? Speak now if not — we want the primitive to fit your real workloads. -
Permission to draft a PR. A small (~50 LOC) PR adding the two converter functions per engine, gated behind an optional
--scree-batchflag so we can test without disturbing existing users. -
A test workload. Ideally a CI-runnable workload (small Llama model, one short prompt) that both your engine and scree-via-your-engine can run, so we can lock in numerical equivalence in your test suite.
Timeline (proposal)¶
If both teams are open to this:
- Week 0 (now): this document, soliciting feedback
- Week 2: draft PR for vLLM, gated behind a flag
- Week 4: same for SGLang
- Week 8: numerical equivalence tests landed in both repos
- Week 12: optional
--scree-batchflag exposed to users; gather feedback - Week 24 (post-v0.1 of scree): consider making it the default if there are no perf regressions
Open questions¶
- How does vLLM's paged KV cache interact with the
scree.Arrayview? Today scree treatsvaluesas a contiguous flat buffer; vLLM has it in HBM blocks. The bridge needs to handle non-contiguous values either by: - forcing a contiguous copy at the boundary (cheap, common case), or
- teaching
scree.Arrayabout non-contiguous storage (more general, more complex)
Open for discussion.
- Should scree gain a "view of view" type for slicing into a paged KV without copying? Useful for inference engines but adds complexity to the primitive.
Contact¶
This document is docs/vllm-integration-sketch.md in the scree repo. Issues, comments, and counter-proposals welcome at https://github.com/jvoltci/scree/discussions.
— scree maintainer