API reference¶
Auto-generated from the docstrings in src/scree/. If you find a discrepancy between
this page and the code, the code is right — please file an issue.
The library is intentionally small: one data class, five core operations, six bridges, four reference kernels, three GPU Triton kernels.
Core type and operations — scree.*¶
scree.Array¶
A packed values+offsets array with one variable-length dimension.
Variable-length sequences stored as a flat values buffer plus offsets pointing at row boundaries.
Example
Three sequences of lengths [4, 2, 5], each with feature dim 8:
values: shape (11, 8) # 4+2+5 along ragged_dim=0
offsets: [0, 4, 6, 11] # length B+1
ragged_dim: 0
Construct with scree.pack([seq1, seq2, seq3]).
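The layout above can be sketched without any dependencies. This is only an illustration of the values+offsets idea using plain Python lists; `pack_rows` and `get_row` are hypothetical helper names, not part of scree, and the real `scree.pack` operates on numpy or torch arrays.

```python
from itertools import accumulate

def pack_rows(rows):
    """Concatenate rows into one flat buffer and record the row boundaries."""
    values = [item for row in rows for item in row]
    # offsets has len(rows) + 1 entries and always starts at 0.
    offsets = [0, *accumulate(len(row) for row in rows)]
    return values, offsets

def get_row(values, offsets, i):
    """Row i occupies the half-open range offsets[i]:offsets[i + 1]."""
    return values[offsets[i]:offsets[i + 1]]

rows = [list("abcd"), list("ef"), list("ghijk")]  # lengths [4, 2, 5]
values, offsets = pack_rows(rows)
second = get_row(values, offsets, 1)
```

With lengths [4, 2, 5] the offsets come out as [0, 4, 6, 11], matching the example above, and every row is recovered by a single slice.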
__post_init__¶
Source code in src/scree/_core.py
scree.pack¶
Pack a list of arrays into a single scree.Array.
All arrays must share dtype and all non-ragged dims. The first array determines the backend (numpy or torch).
Source code in src/scree/_core.py
scree.unpack¶
Unpack a scree.Array into a list of arrays.
Returned slices are views into the original values where possible.
Source code in src/scree/_core.py
scree.to_padded¶
Convert a scree.Array to a padded dense array + mask.
Returns (padded, mask) where:
- padded.shape == (batch_size, max_len, *feature_dims)
- mask.shape == (batch_size, max_len) — True for valid positions
Source code in src/scree/_core.py
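A minimal sketch of the padded + mask output described above, using plain lists with scalar items. `to_padded_sketch` is a hypothetical name, and the pad value of 0 is an assumption; this page does not state what scree fills padding with.

```python
def to_padded_sketch(values, offsets, pad=0):
    """Expand a flat buffer into equal-length rows plus a validity mask."""
    lengths = [end - start for start, end in zip(offsets, offsets[1:])]
    max_len = max(lengths, default=0)
    padded, mask = [], []
    for start, n in zip(offsets, lengths):
        # Valid items first, then padding on the right.
        padded.append(values[start:start + n] + [pad] * (max_len - n))
        mask.append([True] * n + [False] * (max_len - n))
    return padded, mask

padded, mask = to_padded_sketch([1, 2, 3, 4, 5, 6], [0, 3, 4, 6])
```

Here three rows of lengths 3, 1, and 2 become a 3 x 3 grid, and the mask marks which cells hold real data.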
scree.from_padded¶
Convert (padded, mask) to a scree.Array.
Assumes right-padding: in each row the valid positions (mask True) come first and the padding (mask False) follows.
Source code in src/scree/_core.py
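The right-padding assumption can be illustrated with a dependency-free sketch. `from_padded_sketch` is a hypothetical name, and the explicit check that rejects a non-prefix mask is an addition for illustration; whether scree validates the mask is not stated here.

```python
def from_padded_sketch(padded, mask):
    """Collapse right-padded rows back into a flat buffer plus offsets."""
    values, offsets = [], [0]
    for row, row_mask in zip(padded, mask):
        n = sum(row_mask)  # number of valid positions in this row
        # With right-padding, the n valid positions are exactly the first n.
        if not all(row_mask[:n]):
            raise ValueError("mask is not right-padded")
        values.extend(row[:n])
        offsets.append(offsets[-1] + n)
    return values, offsets

padded = [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
mask = [[True, True, True], [True, False, False], [True, True, False]]
values, offsets = from_padded_sketch(padded, mask)
```

A left-padded row such as `[False, True, True]` fails the prefix check, which is why the padding side matters.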
scree.from_cu_seqlens¶
Construct a scree.Array from FlashAttention's cu_seqlens convention.
FlashAttention's cu_seqlens is exactly scree's offsets. Zero-copy.
Source code in src/scree/_core.py
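To make the equivalence concrete: cu_seqlens is the running sum of the per-sequence lengths with a leading 0, which is the same quantity as the offsets described for scree.Array. The sketch below uses plain Python lists and is not scree code.

```python
from itertools import accumulate

seqlens = [4, 2, 5]

# Cumulative sequence lengths with a leading zero: batch_size + 1 entries.
cu_seqlens = [0, *accumulate(seqlens)]

# Sequence i spans tokens cu_seqlens[i] up to (not including) cu_seqlens[i + 1].
spans = list(zip(cu_seqlens, cu_seqlens[1:]))

# Per-sequence lengths and the longest sequence can be recovered by differencing.
recovered = [end - start for start, end in spans]
max_seqlen = max(recovered)
```

Because the two conventions coincide, no data has to be moved when adopting a cu_seqlens tensor as offsets.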
Bridges — scree.bridges¶
Migration helpers between scree and existing ecosystem objects. Each bridge is zero-copy where the underlying memory layout allows.
to_torch_nested / from_torch_nested¶
Convert a scree.Array to a torch.NestedTensor (jagged layout).
The conversion materializes per-row views from the packed buffer
and hands them to torch.nested.nested_tensor. Internally torch
may share the underlying storage; we don't promise zero-copy.
Source code in src/scree/bridges/_torch_nested.py
Convert a torch.nested.NestedTensor (jagged) to a scree.Array.
Uses the jagged NestedTensor's underlying values + offsets directly (zero-copy when supported by the torch version).
Source code in src/scree/bridges/_torch_nested.py
to_hf_padded / from_hf_padded¶
Convert a scree.Array to HF (hidden_states, attention_mask).
Returns (hidden_states, attention_mask) where attention_mask is
int64 with 1 for valid positions, 0 for padding (HF convention).
Source code in src/scree/bridges/_hf_padded.py
Convert HF (hidden_states, attention_mask) to a scree.Array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | array-like, shape (batch, seq_len, *features) | | required |
| attention_mask | array-like, shape (batch, seq_len) | 1 for real tokens, 0 for padding (HF convention). | required |
Source code in src/scree/bridges/_hf_padded.py
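The relationship between the HF-style integer mask and scree's offsets can be sketched as follows. Both helper names are hypothetical and plain lists stand in for tensors; the real bridges also carry hidden_states, which this sketch omits.

```python
def mask_from_offsets(offsets):
    """Build a right-padded 1/0 attention mask from row boundaries."""
    lengths = [end - start for start, end in zip(offsets, offsets[1:])]
    max_len = max(lengths, default=0)
    return [[1] * n + [0] * (max_len - n) for n in lengths]

def offsets_from_mask(attention_mask):
    """Recover row boundaries by summing the 1s in each mask row."""
    offsets = [0]
    for row in attention_mask:
        offsets.append(offsets[-1] + sum(row))
    return offsets

attention_mask = mask_from_offsets([0, 4, 6, 11])
round_trip = offsets_from_mask(attention_mask)
```

The round trip returns the original offsets, which is what makes the two bridges inverses of each other for right-padded input.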
to_torch / to_numpy¶
Re-export a scree.Array with its values/offsets as torch tensors.
Zero-copy on CPU via torch.from_numpy; zero-copy on GPU via DLPack.
Source code in src/scree/bridges/_dlpack.py
Re-export a scree.Array with its values/offsets as numpy arrays.
Zero-copy from CPU torch tensors; for GPU torch tensors, copies to host.
Source code in src/scree/bridges/_dlpack.py
Reference kernels — scree.kernels.reference¶
Pure-Python (or PyTorch / MLX / JAX) implementations of the four varlen kernels, used as ground truth in CI tests of the optimized Triton kernels. They are not built for production speed: they loop over sequences at the Python level.
varlen_attention¶
Reference (slow but correct) implementation of varlen self-attention.
varlen_attention¶
Variable-length self-attention.
Each sequence in the batch attends only to itself — no cross-sequence attention. This is the operation that powers FlashAttention-varlen and the packed inference path of vLLM/SGLang; here we ship the obviously correct slow reference for use as a ground truth in CI.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| q | Array | Each with shape | required |
| k | Array | Each with shape | required |
| v | Array | Each with shape | required |
| causal | bool | If True, apply a lower-triangular mask within each sequence. | False |

Returns:
| Type | Description |
|---|---|
| Array | Same offsets as |
Source code in src/scree/kernels/reference/varlen_attention.py
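The per-sequence behaviour can be sketched in plain Python. This is not the reference kernel itself: `varlen_attention_sketch` is a hypothetical name, the sketch assumes a single head with q, k, v given as lists of per-token vectors sharing one offsets list, and it assumes the usual 1/sqrt(d) score scaling.

```python
import math

def varlen_attention_sketch(q, k, v, offsets, causal=False):
    """Scaled dot-product attention where each sequence attends only to itself."""
    out = []
    for start, end in zip(offsets, offsets[1:]):
        for i in range(start, end):
            # Causal: token i sees keys start..i. Otherwise: the whole sequence.
            last = i + 1 if causal else end
            scale = 1.0 / math.sqrt(len(q[i]))
            scores = [scale * sum(a * b for a, b in zip(q[i], k[j]))
                      for j in range(start, last)]
            top = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            out.append([
                sum(w * v[start + t][c] for t, w in enumerate(weights)) / total
                for c in range(len(v[start]))
            ])
    return out

# Two sequences of lengths 2 and 1; zero queries give uniform attention weights.
q = k = [[0.0], [0.0], [0.0]]
v = [[1.0], [3.0], [10.0]]
full = varlen_attention_sketch(q, k, v, [0, 2, 3])
masked = varlen_attention_sketch(q, k, v, [0, 2, 3], causal=True)
```

In the non-causal case both tokens of the first sequence average its two values and never see the third token; with `causal=True` the first token sees only itself.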
varlen_layernorm¶
Reference (slow but correct) implementation of varlen layernorm.
Layernorm is per-token, so for variable-length data it's just elementwise
normalization over the last (feature) dim. No cross-row interaction —
the only reason this needs a varlen implementation is to operate directly
on a packed scree.Array without unpacking.
varlen_layernorm¶
varlen_layernorm(arr: Array, weight: object | None = None, bias: object | None = None, eps: float = 1e-05) -> Array
LayerNorm over the last dim of a packed scree.Array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| arr | Array | Packed values of shape | required |
| weight | optional | Scale and shift parameters of shape | None |
| bias | optional | Scale and shift parameters of shape | None |
| eps | float | Numerical stability epsilon. | 1e-05 |
Source code in src/scree/kernels/reference/varlen_layernorm.py
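Because the normalization is per token, a sketch never needs the offsets. The helper names below are hypothetical, plain lists stand in for the packed values, and the use of the biased (divide by d) variance is an assumption borrowed from the common LayerNorm definition.

```python
import math

def layernorm_token(x, weight=None, bias=None, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    d = len(x)
    mean = sum(x) / d
    var = sum((xi - mean) ** 2 for xi in x) / d  # biased variance (assumed)
    inv_std = 1.0 / math.sqrt(var + eps)
    y = [(xi - mean) * inv_std for xi in x]
    if weight is not None:
        y = [yi * w for yi, w in zip(y, weight)]
    if bias is not None:
        y = [yi + b for yi, b in zip(y, bias)]
    return y

def varlen_layernorm_sketch(values, weight=None, bias=None, eps=1e-5):
    """Apply the per-token norm to every row of the packed buffer."""
    return [layernorm_token(x, weight, bias, eps) for x in values]

out = varlen_layernorm_sketch([[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]])
```

A constant token such as `[4.0, 4.0, 4.0]` maps to all zeros, since eps keeps the division finite when the variance is zero.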
varlen_rmsnorm¶
Reference (slow but correct) implementation of varlen RMSNorm.
RMSNorm (Zhang & Sennrich, 2019) drops the mean-subtraction step from LayerNorm — it normalizes by the root-mean-square only. It is the norm used by LLaMA, Mistral, Mixtral, DeepSeek, Qwen, and most modern open transformers, replacing LayerNorm in nearly every architecture released since 2023.
Like LayerNorm, RMSNorm is per-token (no cross-row interaction), so for variable-length data it's just elementwise on the packed buffer.
varlen_rmsnorm¶
RMSNorm over the last dim of a packed scree.Array.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| arr | Array | Packed values of shape | required |
| weight | optional | Scale parameter of shape | None |
| eps | float | Numerical stability epsilon (typical: 1e-6 for LLaMA-family). | 1e-06 |
Source code in src/scree/kernels/reference/varlen_rmsnorm.py
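A dependency-free sketch of the computation, dividing each token by its root-mean-square with no mean subtraction. The helper names are hypothetical, and placing eps inside the square root is an assumption based on the common formulation.

```python
import math

def rmsnorm_token(x, weight=None, eps=1e-6):
    """Scale one feature vector by the inverse of its root-mean-square."""
    mean_sq = sum(xi * xi for xi in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)  # eps inside the sqrt (assumed)
    y = [xi * inv_rms for xi in x]
    if weight is not None:
        y = [yi * w for yi, w in zip(y, weight)]
    return y

def varlen_rmsnorm_sketch(values, weight=None, eps=1e-6):
    """Per-token RMSNorm over a packed buffer; offsets are not needed."""
    return [rmsnorm_token(x, weight, eps) for x in values]

out = varlen_rmsnorm_sketch([[3.0, 4.0]], weight=[1.0, 2.0])
```

For the token `[3.0, 4.0]` the mean square is 12.5, so the unscaled outputs are roughly 0.8485 and 1.1314, and the weight doubles the second one.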
varlen_softmax¶
Reference (slow but correct) implementation of varlen softmax.
Softmax along the ragged dimension. Unlike layernorm, this is non-trivial for packed data because softmax must be computed within each sequence separately — not across the full concatenated buffer.
varlen_softmax¶
Softmax along the ragged dimension, per-sequence.
Each row (sequence) is softmaxed independently. The output has the same shape and offsets as the input.
Source code in src/scree/kernels/reference/varlen_softmax.py
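The difference from a plain softmax over the whole buffer is easiest to see in a small sketch. `varlen_softmax_sketch` is a hypothetical name and the values here are scalars in a plain list; real arrays may carry feature dims.

```python
import math

def varlen_softmax_sketch(values, offsets):
    """Softmax each segment of the flat buffer independently."""
    out = []
    for start, end in zip(offsets, offsets[1:]):
        segment = values[start:end]
        if not segment:
            continue  # an empty sequence contributes no outputs
        top = max(segment)  # subtract the max for numerical stability
        exps = [math.exp(x - top) for x in segment]
        total = sum(exps)
        out.extend(e / total for e in exps)
    return out

values = [0.0, 0.0, 5.0, 1.0, 1.0, 1.0]
offsets = [0, 2, 3, 6]
probs = varlen_softmax_sketch(values, offsets)
```

Each of the three segments sums to 1 on its own, so the large value 5.0 in the second sequence does not suppress the probabilities in the other two.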
Triton kernels — scree.kernels.triton¶
CUDA-only. Importing scree.kernels.triton is safe on non-CUDA platforms
(TRITON_AVAILABLE is False and no kernel symbols are exported), but calling the
kernels without CUDA raises an informative error.
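One way a caller might guard against missing Triton support is sketched below. It assumes `TRITON_AVAILABLE` is readable as an attribute of `scree.kernels.triton`, which the text above suggests but does not spell out; `triton_kernels_usable` is a hypothetical helper, and the sketch also tolerates scree not being installed at all.

```python
import importlib

def triton_kernels_usable():
    """Return True only if the Triton kernel module imports and reports itself available."""
    try:
        module = importlib.import_module("scree.kernels.triton")
    except ImportError:
        return False  # scree (or the submodule) is not installed
    # Assumption: the module exposes a TRITON_AVAILABLE flag.
    return bool(getattr(module, "TRITON_AVAILABLE", False))

# Choose a code path up front instead of catching the runtime error later.
backend = "triton" if triton_kernels_usable() else "reference"
```

Checking once at startup keeps the fallback decision in one place rather than scattering error handling around every kernel call.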
varlen_attention_triton¶
The forward kernel — 1.30× of FA-2 on H100 for the headline workload.
varlen_attention_triton_autograd¶
Autograd-aware wrapper. Forward + backward both run on Triton kernels (FA-2 style:
preprocess + dKV + dQ). Use this when you need gradients to flow through q, k,
v. Full training step at 1.61× of FA-2.
varlen_rmsnorm_triton¶
13.97× speedup vs PyTorch reference on H100 (PyTorch had no native RMSNorm before version 2.4, which added torch.nn.RMSNorm).
varlen_layernorm_triton¶
1.31× speedup vs torch.nn.functional.layer_norm on H100.
What's NOT in the public API¶
Names with a leading underscore in any module are private and subject to change without notice. In particular:
- scree._core._is_torch, _is_mlx, _is_jax: backend dispatch predicates
- scree.kernels.triton._varlen_attn_fwd_kernel: the raw Triton kernel
- scree.kernels.triton._varlen_attn_bwd_*_kernel: raw backward kernels
- scree.kernels.triton._backward.varlen_attention_triton_backward: the host-side backward orchestrator (used by the autograd wrapper)