Strides & Layout
In a managed-runtime numerical library, you tend to imagine a tensor as a 2D box of numbers — the matrix. The shape is what defines it. Reshape it and a new matrix appears. Transpose it and a new matrix appears.
That mental model is wrong, and the wrongness costs you. Underneath, a tensor is one flat 1D buffer of bytes plus a tiny pile of integers — a shape, a stride table, and an offset — that describe how to walk through that buffer. Transpose is O(1) because it just swaps two stride numbers. Slicing is O(1) because it shifts an offset. Most “operations” are not operations on data; they’re metadata changes that describe a different walk over the same bytes.
This is liberating until it bites. A non-contiguous tensor read in the wrong order can be 10× slower than a contiguous one, even though the math is identical, because the cache hates non-sequential access. Until you can read a tensor’s strides, you cannot reason about whether your code is fast or slow.
TL;DR
- A tensor is `(buffer, shape, strides, dtype, offset)`. The buffer holds the bytes; shape and strides decide how to walk them.
- `strides[i]` = how many elements to skip to advance by 1 along axis i. For a contiguous row-major `(M, N)` tensor, strides are `(N, 1)`. Column-major: `(1, M)`.
- Most tensor ops can be done without copying by changing strides: `transpose`, `view`, `permute`, `expand`, slicing along an axis. That’s why `T.transpose(0, 1)` is O(1) — it just swaps strides.
- The cost shows up later: a non-contiguous tensor read in the wrong order can be 10× slower than a contiguous one because it ruins cache locality and breaks coalesced GPU loads.
- Almost every “why is my model slow?” mystery in PyTorch traces back to strides. A `transpose` followed by a `view` blows up with a stride error; an `as_strided` on a temporary creates a footgun. Reading strides is reading the cost model.
Why this matters
Every modern tensor framework — PyTorch, NumPy, JAX, TensorFlow — represents tensors as a buffer + a stride table. Operations that seem expensive (transpose a 1B-element tensor) are O(1); operations that seem free (printing a tensor) sometimes copy gigabytes. The mismatch between “what looks free” and “what is free” is exactly what strides encode.
Mental model
One buffer, three different “tensors” — same bytes, different walking instructions. Transpose, slice, broadcast, view: all are stride manipulations.
A tensor is (buffer, shape, strides, offset)
The minimal tensor type:
```python
class Tensor:
    def __init__(self, buf, shape, strides, offset=0):
        self.buf = buf          # 1D array of dtype-sized elements
        self.shape = shape      # tuple of ints
        self.strides = strides  # tuple of ints (in elements, not bytes)
        self.offset = offset    # starting position in the buffer

    def get(self, *indices):
        i = self.offset
        for idx, stride in zip(indices, self.strides):
            i += idx * stride
        return self.buf[i]
```
Element at logical index (i, j) lives at buffer position `offset + i*strides[0] + j*strides[1]`. That’s the entire data model. Every other operation is a derived view that adjusts shape, strides, or offset.
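To see the addressing formula in action, here is the same math on a bare Python list, with two stride tables walking one shared buffer (a standalone sketch, independent of the class above):

```python
# One flat buffer holding a 2x3 matrix [[0, 1, 2], [3, 4, 5]] in row-major order
buf = [0, 1, 2, 3, 4, 5]

def get(buf, strides, offset, *indices):
    # offset + sum(index * stride): the entire addressing model
    return buf[offset + sum(i * s for i, s in zip(indices, strides))]

row_major = (3, 1)   # shape (2, 3): step 3 to move down a row, 1 to move across
transposed = (1, 3)  # shape (3, 2): same buffer, swapped strides

print(get(buf, row_major, 0, 1, 2))   # element (1, 2) -> 5
print(get(buf, transposed, 0, 2, 1))  # transposed element (2, 1) is the same slot -> 5
```

Both calls land on the same buffer position: swapping the stride tuple *is* the transpose.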
The standard layouts
For a 2D tensor of shape (M, N):
| Layout | Strides | Memory order |
|---|---|---|
| Row-major (C) | (N, 1) | (0,0), (0,1), ..., (0,N-1), (1,0), ... |
| Column-major (Fortran) | (1, M) | (0,0), (1,0), ..., (M-1,0), (0,1), ... |
Both store the same M×N elements; the layout is just “which axis varies fastest as we walk through memory.”
For higher rank, the same idea: for a shape (d_0, d_1, ..., d_{n-1}), the canonical row-major strides are (d_1*d_2*...*d_{n-1}, ..., d_{n-2}*d_{n-1}, d_{n-1}, 1). PyTorch defaults to row-major; many BLAS libraries default to column-major; both work, but mixing them by accident is a constant source of bugs.
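NumPy makes the two layouts directly observable. One caveat worth knowing before you read its output: NumPy reports strides in *bytes*, not elements.

```python
import numpy as np

a = np.zeros((4, 5), order="C")  # row-major
b = np.zeros((4, 5), order="F")  # column-major

# float64 = 8 bytes per element, so divide by 8 to get element strides
print(a.strides)  # (40, 8)  -> (5, 1) in elements: row-major (N, 1)
print(b.strides)  # (8, 32)  -> (1, 4) in elements: column-major (1, M)
```

Same shape, same zeros, different walking order — exactly the table above, scaled by the element size.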
Operations as stride changes
Transpose swaps two strides:
```python
def transpose(t, axis_a, axis_b):
    new_shape = list(t.shape)
    new_strides = list(t.strides)
    new_shape[axis_a], new_shape[axis_b] = new_shape[axis_b], new_shape[axis_a]
    new_strides[axis_a], new_strides[axis_b] = new_strides[axis_b], new_strides[axis_a]
    return Tensor(t.buf, tuple(new_shape), tuple(new_strides), t.offset)
```
No copy. The transposed tensor shares the buffer with its original.
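NumPy's `.T` does exactly this, and the buffer sharing is easy to demonstrate — write through the original and the transposed view sees it (a quick check, not part of the toy class above):

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
t = a.T                      # transpose: no copy, just a new stride tuple
print(a.strides, t.strides)  # the two byte-stride tuples are reverses of each other

a[0, 1] = 99                 # write through the original...
print(t[1, 0])               # ...and the transposed view sees it: 99
```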
Slicing along an axis changes shape, strides, offset:
```python
# t[2:5, :]
new_offset = t.offset + 2 * t.strides[0]
new_shape = (3, t.shape[1])
new_strides = t.strides
```
Broadcasting sets a stride to 0 (the same element is “read” repeatedly):
```python
# A (3,) tensor "broadcast" to shape (5, 3) — without copying
new_shape = (5, 3)
new_strides = (0, 1)  # advancing along axis 0 doesn't move in memory
```
Reshape is the expensive one — it sometimes can’t be done without copying:
- If the new shape is compatible with current strides (the new walk is monotonic in the buffer), it’s O(1) — no copy.
- Otherwise PyTorch raises (`.view` fails) or silently copies (`.reshape` falls back).
The classic gotcha: `t.transpose(0, 1).view(-1)` fails because transpose made strides non-contiguous, and `view` requires contiguity. You either call `t.transpose(0, 1).contiguous().view(-1)` (explicit copy) or `t.transpose(0, 1).reshape(-1)` (implicit copy if needed).
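The same distinction exists in NumPy, where `np.shares_memory` lets you catch the silent copy directly (NumPy shown here so the snippet runs anywhere; the PyTorch behavior is analogous):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
t = a.T                        # non-contiguous view

print(np.shares_memory(a, t))     # True — transpose is only a stride change
flat = t.reshape(-1)              # this walk can't be expressed with strides alone
print(np.shares_memory(a, flat))  # False — reshape quietly fell back to a copy
```

When a "free" view op silently becomes a copy, this is the two-line test that reveals it.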
Why non-contiguous = slow
Modern CPUs and GPUs love sequential memory access:
- CPU L1/L2 caches load a 64-byte cache line on every access. If your next read is the next address, it’s free; if it’s 10 KB away, you pay another cache-line load.
- GPU coalesced loads require the 32 threads of a warp to read consecutive addresses (e.g. 32 × 4-byte words merged into one 128-byte transaction). Non-coalesced loads serialize into many separate transactions.
- Address-translation page walks (TLB misses) add latency every time you cross into a new page (typically 4 KB).
A tensor traversed in stride order is sequential in memory → fast. A transposed tensor traversed in the original order is jumping by N elements per step → cache thrashing. Same bytes, same op count, ~10× slower.
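A quick way to feel this is timing a copy of the same bytes in two orders (a rough benchmark sketch; absolute numbers vary by machine, but the strided copy is typically several times slower):

```python
import time
import numpy as np

a = np.random.rand(2048, 2048)      # contiguous row-major buffer

def bench(f, repeats=3):
    """Best-of-N wall-clock time for a callable."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        f()
        best = min(best, time.perf_counter() - t0)
    return best

seq = bench(lambda: a.copy())       # sequential read and write
gather = bench(lambda: a.T.copy())  # reads jump 16 KB (2048 * 8 bytes) per step

print(f"sequential copy: {seq:.4f}s, strided copy: {gather:.4f}s")
```

Same bytes, same element count — the only variable is the walking order.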
Reading strides in production code
Every PyTorch tensor exposes .stride():
```python
import torch

t = torch.randn(1024, 1024)
print(t.shape, t.stride())    # torch.Size([1024, 1024]) (1024, 1)
print(t.is_contiguous())      # True

t2 = t.transpose(0, 1)
print(t2.shape, t2.stride())  # torch.Size([1024, 1024]) (1, 1024)
print(t2.is_contiguous())     # False — strides not in descending order

# This will copy:
t3 = t2.contiguous()
print(t3.stride())            # (1024, 1) — back to canonical
```
The single most useful debugging move when a kernel is mysteriously slow: print `.stride()` and `.is_contiguous()` for every tensor it touches. Half the time, the answer is “I forgot to `.contiguous()` after a `.transpose()`.”
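It helps to wrap that debugging move in a one-line helper. Here is a sketch using NumPy's equivalents so it runs anywhere (`report` is a hypothetical name; for torch tensors you would swap in `.stride()` and `.is_contiguous()`):

```python
import numpy as np

def report(name, x):
    """Print the layout facts that explain most performance mysteries."""
    print(f"{name}: shape={x.shape} strides={x.strides} "
          f"C={x.flags['C_CONTIGUOUS']} F={x.flags['F_CONTIGUOUS']}")

a = np.zeros((512, 256))
report("a", a)         # C-contiguous
report("a.T", a.T)     # F-contiguous: transposed strides
report("a[::2]", a[::2])  # strided slice: neither
```

Drop a call like this before any suspect kernel launch and the stride story is immediately visible.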
Strides matter for kernels too
Almost every Triton or CUTLASS kernel takes stride_* arguments:
```python
@triton.jit
def matmul_kernel(A, B, C, M, N, K, stride_am, stride_ak, stride_bk, stride_bn, ...):
    ...
    a_ptrs = A + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    ...
```
This is exactly the offset-from-strides math from above. Pass the wrong strides, your kernel reads garbage. Strides are the universal cost model; kernel APIs surface them explicitly because a kernel is a program over a strided buffer.
The transpose returns instantly regardless of buffer size — it never touches the data. That’s the productive surprise of strided layouts.
Key takeaways
- A tensor = (buffer, shape, strides, offset). Strides decide how to walk the buffer.
- Transpose, slice, broadcast, expand are all O(1) — they just change the stride/offset metadata.
- Reshape sometimes requires a copy. `view` errors when it can’t avoid one; `reshape` copies silently.
- Non-contiguous → cache misses → ~10× slowdown. This is the single most common source of “mysteriously slow” tensor code.
- Read `.stride()` and `.is_contiguous()` first when debugging perf. Kernel APIs take strides explicitly because strides are the cost model.
Go deeper
- Docs: PyTorch — Tensor Views. Authoritative list of view-producing ops vs ones that may copy. Bookmark this.
- Docs: NumPy — Internal Memory Layout. Same model, slightly different vocabulary. Worth reading for the C-vs-F-contiguous nuance.
- Blog: PyTorch Internals — Edward Z. Yang. Best free essay on PyTorch's tensor representation. The "Stride visualizer" section is gold.
- Blog: A guide to NumPy strides — Alex Riley. Walks through every common stride trick with diagrams. The intuition-builder.
- Video: Tinygrad — A Tensor Library from Scratch (livestream). Watch tinygrad's tensor type get built from nothing — the [module capstone](./index) is a structured weekend version of this.
- Repo: tinygrad/tinygrad. Reference for "what does a 200-line tensor library look like?" — `tinygrad/shape/` is the strided view layer.
- Repo: pytorch/pytorch. `aten/src/ATen/core/TensorImpl.h` is the canonical strided-tensor implementation in modern code.