
Strides & Layout

Working in a high-level numerical library, you tend to imagine a tensor as a 2D box of numbers — the matrix. The shape is what defines it. Reshape it and a new matrix appears; transpose it and another appears.

That mental model is wrong, and the wrongness costs you. Underneath, a tensor is one flat 1D buffer of bytes plus a tiny pile of integers — a shape, a stride table, and an offset — that describe how to walk through that buffer. Transpose is O(1) because it just swaps two stride numbers. Slicing is O(1) because it shifts an offset. Most “operations” are not operations on data; they’re metadata changes that describe a different walk over the same bytes.

This is liberating until it bites. A non-contiguous tensor read in the wrong order can be 10× slower than a contiguous one, even though the math is identical, because the cache hates non-sequential access. Until you can read a tensor’s strides, you cannot reason about whether your code is fast or slow.

TL;DR

  • A tensor is (buffer, shape, strides, dtype, offset). The buffer holds the bytes; shape and strides decide how to walk them.
  • Stride[i] = how many elements to skip to advance by 1 along axis i. For a contiguous row-major (M, N) tensor: strides are (N, 1). Column-major: (1, M).
  • Most tensor ops can be done without copying by changing strides: transpose, view, permute, expand, slicing along an axis. That’s why T.transpose(0, 1) is O(1) — it just swaps strides.
  • The cost shows up later: a non-contiguous tensor read in the wrong order can be 10× slower than a contiguous one because it ruins cache locality and breaks coalesced GPU loads.
  • Almost every “why is my model slow?” mystery in PyTorch traces back to strides. A transpose followed by a view blows up with a stride error; an as_strided on a temporary creates a footgun. Reading strides is reading the cost model.

Why this matters

Every modern tensor framework — PyTorch, NumPy, JAX, TensorFlow — represents tensors as a buffer + a stride table. Operations that seem expensive (transpose a 1B-element tensor) are O(1); operations that seem free (printing a tensor) sometimes copy gigabytes. The mismatch between “what looks free” and “what is free” is exactly what strides encode.

Mental model

One buffer, three different “tensors” — same bytes, different walking instructions. Transpose, slice, broadcast, view: all are stride manipulations.

A tensor is (buffer, shape, strides, offset)

The minimal tensor type:

class Tensor:
    def __init__(self, buf, shape, strides, offset=0):
        self.buf = buf          # 1D array of dtype-sized elements
        self.shape = shape      # tuple of ints
        self.strides = strides  # tuple of ints (in elements, not bytes)
        self.offset = offset    # starting position in the buffer

    def get(self, *indices):
        i = self.offset
        for idx, stride in zip(indices, self.strides):
            i += idx * stride
        return self.buf[i]

Element at logical index (i, j) lives at buffer position offset + i*stride[0] + j*stride[1]. That’s the entire data model. Every other operation is a derived view that adjusts shape, strides, or offset.
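That address formula can be exercised directly. A minimal sketch (the `position` helper is hypothetical, mirroring `Tensor.get` above) for a (3, 4) row-major buffer:

```python
# Buffer position of a logical index: offset + sum(index * stride) per axis.
def position(offset, strides, indices):
    return offset + sum(i * s for i, s in zip(indices, strides))

buf = list(range(12))        # flat storage for a (3, 4) tensor
strides = (4, 1)             # row-major: skip 4 per row step, 1 per column step

# Logical (1, 2) -> 0 + 1*4 + 2*1 = position 6
assert position(0, strides, (1, 2)) == 6
assert buf[position(0, strides, (1, 2))] == 6
```

Swapping in column-major strides (1, 3) makes the same call walk the buffer in Fortran order, with no change to the data.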

The standard layouts

For a 2D tensor of shape (M, N):

Layout                   Strides   Memory order
Row-major (C)            (N, 1)    (0,0), (0,1), ..., (0,N-1), (1,0), ...
Column-major (Fortran)   (1, M)    (0,0), (1,0), ..., (M-1,0), (0,1), ...

Both store the same M×N elements; the layout is just “which axis varies fastest as we walk through memory.”

For higher rank, the same idea: for shape (d_0, ..., d_{n-1}), canonical row-major strides are (d_1*d_2*...*d_{n-1}, ..., d_{n-2}*d_{n-1}, d_{n-1}, 1) — each stride is the product of all dimensions to its right. PyTorch defaults to row-major; many BLAS libraries default to column-major; both work, but mixing them by accident is a constant source of bugs.
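The canonical rule can be written as a short helper (a sketch, not any library’s API): accumulate the product of dimensions from the right.

```python
def row_major_strides(shape):
    # Walk dimensions right-to-left, accumulating the running product.
    strides, acc = [], 1
    for dim in reversed(shape):
        strides.append(acc)
        acc *= dim
    return tuple(reversed(strides))

assert row_major_strides((1024, 1024)) == (1024, 1)
assert row_major_strides((2, 3, 4)) == (12, 4, 1)   # stride[0] = 3*4, stride[1] = 4
```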

Operations as stride changes

Transpose swaps two strides:

def transpose(t, axis_a, axis_b):
    new_shape = list(t.shape)
    new_strides = list(t.strides)
    new_shape[axis_a], new_shape[axis_b] = new_shape[axis_b], new_shape[axis_a]
    new_strides[axis_a], new_strides[axis_b] = new_strides[axis_b], new_strides[axis_a]
    return Tensor(t.buf, tuple(new_shape), tuple(new_strides), t.offset)

No copy. The transposed tensor shares the buffer with its original.
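NumPy makes the sharing easy to observe: a 2D transpose reverses the strides tuple, and a write through either view lands in the one shared buffer.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # contiguous (2, 3)
b = a.T                           # (3, 2) view: strides reversed, buffer shared

assert b.strides == a.strides[::-1]
a[0, 1] = 99                      # write through the original...
assert b[1, 0] == 99              # ...is visible through the transpose
assert np.shares_memory(a, b)
```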

Slicing along an axis changes shape, strides, offset:

# t[2:5, :]
new_offset = t.offset + 2 * t.strides[0]
new_shape = (3, t.shape[1])
new_strides = t.strides

Broadcasting sets a stride to 0 (the same element is “read” repeatedly):

# A (3,) tensor "broadcast" to shape (5, 3) — without copying
new_shape = (5, 3)
new_strides = (0, 1)   # advancing along axis 0 doesn't move in memory
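NumPy exposes exactly this via np.broadcast_to (NumPy reports strides in bytes, but a zero stride is zero in any unit):

```python
import numpy as np

v = np.arange(3.0)                  # shape (3,)
m = np.broadcast_to(v, (5, 3))      # shape (5, 3) view, no copy

assert m.strides[0] == 0            # stepping along axis 0 doesn't move in memory
assert np.shares_memory(v, m)
assert m[4, 2] == v[2]              # every "row" is the same three elements
```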

Reshape is the expensive one — it sometimes can’t be done without copying:

  • If the new shape is compatible with current strides (the new walk is monotonic in the buffer), it’s O(1) — no copy.
  • Otherwise PyTorch raises (.view fails) or silently copies (.reshape falls back).

The classic gotcha: t.transpose(0, 1).view(-1) fails because transpose made strides non-contiguous, and view requires contiguity. You either call t.transpose(0, 1).contiguous().view(-1) (explicit copy) or t.transpose(0, 1).reshape(-1) (implicit copy if needed).
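NumPy behaves like torch’s .reshape here, silently copying when no view is possible, which makes the copy easy to detect:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)

flat = a.reshape(-1)                 # strides-compatible: stays a view
assert np.shares_memory(a, flat)

t_flat = a.T.reshape(-1)             # transpose broke contiguity: silent copy
assert not np.shares_memory(a, t_flat)
```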

Why non-contiguous = slow

Modern CPUs and GPUs love sequential memory access:

  • CPU L1/L2 caches fetch a full 64-byte cache line on every miss. If your next read is the next address, it’s effectively free; if it’s 10 KB away, you pay another cache-line load.
  • GPU coalesced loads require the 32 threads of a warp to read consecutive addresses (e.g., 32 consecutive 4-byte words, one 128-byte transaction). Non-coalesced loads serialize into many separate transactions.
  • Crossing a page boundary (typically 4 KB) can add a TLB miss and page-table walk on top of the cache miss.

A tensor traversed in stride order is sequential in memory → fast. A transposed tensor traversed in the original order is jumping by N elements per step → cache thrashing. Same bytes, same op count, ~10× slower.
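A quick experiment to feel this on your own machine (timings vary by hardware, so no specific ratio is guaranteed): sum the same matrix along the stride-1 axis and along the strided axis.

```python
import time
import numpy as np

a = np.random.rand(2048, 2048)

t0 = time.perf_counter()
row_sum = sum(a[i, :].sum() for i in range(a.shape[0]))   # inner walk is stride-1
t_rows = time.perf_counter() - t0

t0 = time.perf_counter()
col_sum = sum(a[:, j].sum() for j in range(a.shape[1]))   # inner walk jumps 2048 elements
t_cols = time.perf_counter() - t0

assert np.isclose(row_sum, col_sum)                       # same math either way
print(f"row-order {t_rows:.4f}s, col-order {t_cols:.4f}s")
```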

Reading strides in production code

Every PyTorch tensor exposes .stride():

import torch

t = torch.randn(1024, 1024)
print(t.shape, t.stride())    # torch.Size([1024, 1024]) (1024, 1)
print(t.is_contiguous())      # True

t2 = t.transpose(0, 1)
print(t2.shape, t2.stride())  # torch.Size([1024, 1024]) (1, 1024)
print(t2.is_contiguous())     # False — strides not in descending order

# This will copy:
t3 = t2.contiguous()
print(t3.stride())            # (1024, 1) — back to canonical

The single most useful debugging move when a kernel is mysteriously slow: print .stride() and .is_contiguous() for every tensor it touches. Half the time, the answer is “I forgot to .contiguous() after a .transpose().”

Strides matter for kernels too

Almost every Triton or CUTLASS kernel takes stride_* arguments:

@triton.jit
def matmul_kernel(A, B, C, M, N, K,
                  stride_am, stride_ak,
                  stride_bk, stride_bn, ...):
    ...
    a_ptrs = A + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    ...

This is exactly the offset-from-strides math from above. Pass the wrong strides, your kernel reads garbage. Strides are the universal cost model; kernel APIs surface them explicitly because a kernel is a program over a strided buffer.
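The addressing can be emulated in a few lines of plain Python (a hypothetical load_tile helper, not Triton): one gather routine serves every layout because the strides are arguments.

```python
def load_tile(buf, offs_m, offs_k, stride_m, stride_k):
    # Mirror of: a_ptrs = A + offs_m[:, None]*stride_am + offs_k[None, :]*stride_ak
    return [[buf[m * stride_m + k * stride_k] for k in offs_k] for m in offs_m]

buf = [0, 1, 2, 3, 4, 5]            # a (2, 3) row-major matrix: [[0,1,2],[3,4,5]]

# Read it as stored...
assert load_tile(buf, [0, 1], [0, 1, 2], stride_m=3, stride_k=1) == [[0, 1, 2], [3, 4, 5]]
# ...or transposed, just by swapping the strides; no data movement
assert load_tile(buf, [0, 1, 2], [0, 1], stride_m=1, stride_k=3) == [[0, 3], [1, 4], [2, 5]]
```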

Run it in your browser

Python (editable): implement a strided tensor view, transpose without copying, and watch the buffer stay shared.

The transpose returns instantly regardless of buffer size — it never touches the data. That’s the productive surprise of strided layouts.

Quick check

Fill in the blank
A tensor of shape (M, N) with row-major (C-order) layout has strides:
The slow axis (rows) skips N elements per step; the fast axis (columns) skips 1.
Quick check
A user does `y = x.transpose(0, 1).view(-1)` on a 2D tensor and gets a runtime error about view not being supported on non-contiguous tensors. The cleanest fix:

Key takeaways

  1. A tensor = (buffer, shape, strides, offset). Strides decide how to walk the buffer.
  2. Transpose, slice, broadcast, expand are all O(1) — they just change the stride/offset metadata.
  3. Reshape sometimes requires a copy. view errors when it can’t avoid one; reshape copies silently.
  4. Non-contiguous → cache misses → ~10× slowdown. This is the single most common source of “mysteriously slow” tensor code.
  5. Read .stride() and .is_contiguous() first when debugging perf. Kernel APIs take strides explicitly because strides are the cost model.

Go deeper
