
EXO & Swarm Inference

In a cloud-serving stack, “run a 70B” means: rent an H100 (or part of one), pay $2/hour, get 50+ tok/s, scale up if you need more. The hardware is rented; the network between layers is internal to a single GPU; the bottleneck is somebody else’s problem.

The phone-runnable LLM era hits a hard ceiling at ~7B per device. A laptop fits a 30B at 4-bit. Nothing user-grade fits a 70B at quality. But four devices on a wifi network combined have ~50 GB of usable memory — easily enough. So the question becomes: can you shard the layers of a 70B across a phone, two laptops, and a desktop on the same SSID, and stream activations between them as each token decodes? The answer turns out to be yes, with a real but instructive throughput penalty.

The pattern is pipeline-parallel inference applied at the device-fleet level. Same algorithm data centers use to train trillion-parameter models, applied at the smallest possible scale. The orchestration is Python (mDNS discovery, partition planning, RPC), the inference kernels are whatever each device’s runtime ships (MLX on a Mac, llama.cpp on an Android, tinygrad on a Linux box). Python is the SDK; the per-device runtime is what runs the layers; the LAN is the new HBM bus.

The interesting twist: the network usually dominates. With wifi-6 you get ~5 ms per hop and 3–5 tok/s on a 70B; on contended wifi-5 you slip to 1–2 tok/s or outright stalls. That changes the kind of optimization that matters — not “make my CUDA kernel faster” but “wire the laptop via ethernet and put the phone on the same SSID.”

TL;DR

  • Swarm inference is pipeline-parallel inference applied at the device-fleet level: shard a model’s layers across N devices on a LAN, send activations between them, run a model none of them could run alone.
  • EXO (exo-explore/exo) is the open-source reference: Python framework, mDNS discovery, dynamic partitioning across iOS / Android / macOS / Linux. Uses tinygrad and mlx as backends. Released early 2024, ~12K GitHub stars.
  • Petals (bigscience/petals) is the BitTorrent-of-LLMs cousin — a public swarm where strangers share unused compute. Slower and less private than EXO but useful for genuinely large models (Llama-3.1-405B).
  • The math: a 4-device LAN running pipeline-parallel inference on a 70B model averages 3–5 tok/s — readable but not snappy. The bottleneck is per-hop network latency, not compute.
  • Wifi-6 or wired ethernet keeps median per-hop latency in the single-digit milliseconds (tails of ~12–30 ms). Wifi-5 (.11ac) hits 50–100 ms tail latency under contention and tanks tok/s.

Why this matters

Single-device LLM inference is bounded by what fits in one device’s memory. A phone fits a 7B; a laptop fits a 30B. No single consumer device fits a 70B at any quantization that preserves quality. Yet 4 devices on a wifi network combined have ~50 GB of usable memory — easily enough.

The 2026 reality: small startups (and engineers in countries with patchy or expensive cloud access) are running 70B and 405B inference across LANs of cheap phones and laptops. The OSS tools made it real. Knowing how the network determines the throughput is the load-bearing skill — the rest is software engineering.

This is also the answer to “what’s the future of distributed inference at the edge?” — sharded across the devices users already own, not on a cloud GPU. Privacy by default (the model weights are local; the activations cross your own LAN). Cost by default (hardware you bought; no per-token fees).

Mental model

The pipeline-parallel pattern at fleet scale:

  1. Partition by memory budget: each device’s available memory determines how many layers it gets. Memory-rich devices get more layers.
  2. One forward pass = one full pipeline traversal. Token N requires activation hand-off across all devices.
  3. The slowest hop dominates. If one device or one network link is slow, the whole pipeline is slow.

Concrete walkthrough

How EXO partitions a model

The orchestrator is Python (the part you interact with); each peer runs its assigned layers on whichever local runtime is available:

# Sketch of EXO's partitioning step (simplified from exo/topology/partitioning_strategy.py).
from dataclasses import dataclass

@dataclass
class Peer:
    name: str
    mem_usable_gb: float
    tflops: float

def partition(peers: list[Peer], n_layers: int) -> dict[str, list[int]]:
    total_mem = sum(p.mem_usable_gb for p in peers)
    assignments: dict[str, list[int]] = {}
    cursor = 0
    for p in sorted(peers, key=lambda x: -x.mem_usable_gb):
        # Proportional share of the layers, clamped so rounding never
        # assigns a layer index past the end of the model.
        share = min(round(n_layers * p.mem_usable_gb / total_mem), n_layers - cursor)
        assignments[p.name] = list(range(cursor, cursor + share))
        cursor += share
    # Stuff any leftover layers into the largest device.
    while cursor < n_layers:
        biggest = max(peers, key=lambda x: x.mem_usable_gb).name
        assignments[biggest].append(cursor)
        cursor += 1
    return assignments
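Running the sketch on the four devices from the table below (usable-memory figures from that table; the tflops values are placeholders, since this simplified version ignores them) gives a split in the same spirit, though not the exact layer counts EXO's real planner produces:

peers = [
    Peer("macbook-m2", mem_usable_gb=18, tflops=10),     # tflops values are placeholder guesses
    Peer("linux-laptop", mem_usable_gb=24, tflops=8),
    Peer("pixel-8", mem_usable_gb=8, tflops=2),
    Peer("iphone-15-pro", mem_usable_gb=5, tflops=1.5),
]
split = partition(peers, n_layers=80)
print({name: len(layers) for name, layers in split.items()})
# {'linux-laptop': 35, 'macbook-m2': 26, 'pixel-8': 12, 'iphone-15-pro': 7}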

EXO inspects each peer at startup and gets:

  • Total memory.
  • Estimated TFLOPS (heuristic from chip identifier).
  • Network position (mDNS-derived hop estimate).

Then it solves a partition problem: assign layers L0 … L(n-1) to devices D0 … D(m-1) so that the most-loaded device is as light as possible. The straightforward greedy above (sort devices by usable memory descending, assign layers proportionally) works well enough in practice.

A typical 4-device partition for Llama-3.3-70B Q4 (~40 GB):

Device               | Memory       | Layers | Layer count
MacBook M2 (24 GB)   | 18 GB usable | 0–28   | 29
Linux laptop (32 GB) | 24 GB usable | 51–79  | 29
Pixel 8 (12 GB)      | 8 GB usable  | 35–50  | 16
iPhone 15 Pro (8 GB) | 5 GB usable  | 29–34  | 6

Total: 80 layers across 4 devices. The phone’s 6 layers consume ~3 GB; the bigger devices each take ~14 GB.

EXO supports re-partitioning when a device joins or leaves the cluster. Re-partition takes 10–60 s (the displaced layers stream from disk to the new device).

Per-token latency math

For pipeline-parallel inference, the per-token latency is the sum of:

  • Per-device compute on the assigned layers.
  • Per-hop network transfer of the activations.

The activation tensor passed between layers is small for transformers: (batch, seq, hidden_dim) in FP16. At hidden_dim = 8192 (Llama-3.3-70B), seq = 1, batch = 1, that’s 16 KB per layer boundary. Across 4 device-boundaries: ~64 KB total over the network per token.

At wifi-6 (~1 Gbps with 5–10 ms RTT under no contention):

  • Network: ~5 ms × 4 hops = ~20 ms per token of network overhead.
  • Compute: ~20 layers per device × ~5 ms per layer (M2-class) ≈ 100 ms per device share. For a single in-flight token the stages run back-to-back rather than in parallel, so the per-device shares add up along the pipeline (the faster devices finish theirs in well under 100 ms).

Per-token: ~20 ms of network plus a few hundred milliseconds of summed compute, which puts the theoretical ceiling at a few tokens per second rather than tens.

Reality: 3–5 tok/s in EXO’s published benchmarks, right around that ceiling; the remaining losses are overhead: Python serialization, mDNS bookkeeping, scheduling jitter, the slowest phone hitting thermal throttle.

Still: a 70B running at 3–5 tok/s on $2K of consumer hardware fully offline is qualitatively a different product than “cloud-only”.
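The back-of-envelope above in runnable form. Every number here (per-layer milliseconds, RTTs, the layer split) is an assumed input to play with, not a value read out of EXO:

# Back-of-envelope per-token latency for pipeline-parallel decoding.
# All numbers below are assumptions to experiment with, not EXO internals.
HIDDEN_DIM = 8192      # Llama-3.3-70B
BYTES_FP16 = 2

def activation_kb(batch: int = 1, seq: int = 1) -> float:
    """Size of the tensor handed across each device boundary."""
    return batch * seq * HIDDEN_DIM * BYTES_FP16 / 1024

def per_token_ms(layer_shares, ms_per_layer, hop_rtt_ms, n_hops=4):
    """Single decode stream: device shares run back-to-back, so compute
    is the sum of the shares; the network adds one RTT per hop."""
    compute = sum(n * ms for n, ms in zip(layer_shares, ms_per_layer))
    return compute + hop_rtt_ms * n_hops

shares = [29, 6, 16, 29]        # MacBook, iPhone, Pixel, Linux laptop (table above)
speeds = [2.5, 8.0, 6.0, 2.0]   # assumed ms per layer on each device

print(f"activation per hop: {activation_kb():.0f} KB")
for median, p99, label in [(5, 12, "wifi-6, quiet"),
                           (15, 60, "wifi-5, contended"),
                           (0.5, 1, "wired ethernet")]:
    typical = per_token_ms(shares, speeds, median)
    worst = per_token_ms(shares, speeds, p99)
    print(f"{label:>18}: {1000/typical:.1f} tok/s typical, {1000/worst:.1f} tok/s at p99 RTT")

With these assumed speeds the cluster is compute-bound on a quiet network; it is the p99 tails on contended wifi (and outright stalls) that drag it toward the 1–2 tok/s regime.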

Network is the bottleneck — and the failure mode

Wifi performance under contention:

Network                       | Median RTT | Tail (p99) | EXO tok/s on 70B
Wifi-6 (.11ax), no contention | 5 ms       | 12 ms      | 4–5
Wifi-6, 5 active devices      | 8 ms       | 30 ms      | 3–4
Wifi-5 (.11ac), no contention | 10 ms      | 25 ms      | 3–4
Wifi-5, contended apartment   | 15 ms      | 60 ms      | 1–2 (often stalls)
Wired ethernet                | 0.5 ms     | 1 ms       | 5–6
Tailscale over WAN            | 30–80 ms   | 200 ms     | under 1

If you can wire one device, do it: wiring just the host coordinator (the device facing the user) over ethernet bumps tok/s by ~30%.

The phone problem

The phone in the cluster is usually the bottleneck. Reasons:

  1. Smaller compute: phone = 1 TFLOPS-class, laptop = 5–15 TFLOPS-class.
  2. Aggressive thermal throttling: phones throttle hard after 60 s of compute.
  3. Background app suspension: iOS may suspend the EXO process if the user switches apps. Pin EXO with UIApplication.shared.beginBackgroundTask.

The pragmatic pattern: assign the phone the minimum viable layer set (embedding + a few decoder blocks). Don’t put hot layers (e.g., the layer with the heavy MoE expert) on the phone.
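One way to encode that policy on top of the partition() sketch from earlier (the cap size and the is_phone flag are illustrative assumptions, not EXO options):

# Hypothetical variant: give phones a fixed small slice of early layers,
# then split the rest proportionally across the remaining devices.
PHONE_CAP = 6   # assumed "minimum viable" share

def partition_with_phone_cap(peers: list[Peer], n_layers: int,
                             is_phone: set[str]) -> dict[str, list[int]]:
    phones = [p for p in peers if p.name in is_phone]
    others = [p for p in peers if p.name not in is_phone]
    assignments: dict[str, list[int]] = {}
    cursor = 0
    for p in phones:
        assignments[p.name] = list(range(cursor, cursor + PHONE_CAP))
        cursor += PHONE_CAP
    # Reuse the proportional split from the earlier sketch for everything else,
    # shifting its indices past the slices already handed to the phones.
    for name, layers in partition(others, n_layers - cursor).items():
        assignments[name] = [cursor + i for i in layers]
    return assignments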

Petals — the public-swarm cousin

Petals takes EXO’s pattern public: a global swarm where anyone can host part of a model and anyone can use the swarm. Run by BigScience; primarily hosts Llama-3.1-405B and a few other very-large models.
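From the client side, Petals looks like an ordinary transformers model. Roughly, following the petals README (the exact model id available on the public swarm at any given moment is an assumption):

# Minimal Petals client sketch: the heavy transformer blocks run on remote
# swarm GPUs; only the embeddings and the final head run locally.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_id = "meta-llama/Meta-Llama-3.1-405B-Instruct"   # assumed swarm-hosted id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoDistributedModelForCausalLM.from_pretrained(model_id)  # connects to the swarm

inputs = tokenizer("A LAN swarm versus a public swarm:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))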

Trade-offs vs EXO:

  • Petals can run 405B, EXO usually can’t (you need 50+ devices on your own LAN, which nobody has).
  • Petals is slower — hops cross the public internet (50–200 ms RTT vs 5–10 ms LAN).
  • Petals is less private — your activations transit through someone else’s GPU. Use only for non-sensitive prompts.
  • Petals is free — no compute cost.

Use Petals when the model is genuinely too large for your LAN and you don’t mind the slower / less-private trade-off. Use EXO for everything else.

Hybrid edge-to-cloud

EXO also supports a “cloud peer” mode: one peer is a rented GPU (Modal, Lambda) handling the heaviest layers, while the rest of the model runs on local devices. This recovers most of the privacy benefit (the activation tensors crossing the cloud are derived from your prompt, but they are not the prompt itself) while letting you run models the LAN alone can’t fit.

Pattern: local devices handle the embedding, the first few layers, and the last few layers (the ends where activations and output logits carry the most directly recoverable information about the prompt); the cloud peer handles the middle “bulk” layers, which account for most of the compute but whose activations are harder to map back to the input. Tune the split based on your privacy-vs-cost preferences.
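A hypothetical helper for that split; the prefix/suffix sizes are arbitrary and the peer names are placeholders rather than EXO configuration:

# Keep the prompt-adjacent ends local; push the middle bulk to the cloud peer.
def hybrid_split(n_layers: int, local_prefix: int, local_suffix: int) -> dict[str, list[int]]:
    return {
        "local-head": list(range(0, local_prefix)),
        "cloud-peer": list(range(local_prefix, n_layers - local_suffix)),
        "local-tail": list(range(n_layers - local_suffix, n_layers)),
    }

counts = {name: len(layers) for name, layers in hybrid_split(80, 8, 8).items()}
print(counts)   # {'local-head': 8, 'cloud-peer': 64, 'local-tail': 8}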

Verifying the cluster is actually distributed (not silently centralized)

A common failure mode: one peer fails to load its assigned layers, and EXO falls back to running the entire model on the remaining peer with enough memory. The cluster is “running” but it’s actually a single-device fallback. Always verify with exo --trace:

$ exo --trace
[peer macos-1] forward layer 0..28 took 12 ms
[peer macos-1] sent activations to ios-1 (4096 bytes, 8 ms)
[peer ios-1] forward layer 29..34 took 18 ms
[peer ios-1] sent activations to pixel-1 (4096 bytes, 11 ms)
...

The trace should show all peers receiving activations on every token. If only one peer’s name appears, you’ve silently centralized.
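A quick sanity check you could run over a saved trace, assuming the line format shown above (this parser is not part of EXO):

# Every peer should both run forward passes and hand activations off.
import re
import sys
from collections import Counter

forwards = Counter()
sends = Counter()
for line in open(sys.argv[1]):
    m = re.match(r"\[peer (\S+)\] (forward|sent)", line)
    if m:
        (forwards if m.group(2) == "forward" else sends)[m.group(1)] += 1

peers = set(forwards) | set(sends)
if len(peers) < 2:
    print(f"WARNING: only {sorted(peers)} appears; the cluster has silently centralized")
else:
    for p in sorted(peers):
        print(f"{p}: {forwards[p]} forward passes, {sends[p]} hand-offs")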

Run it in your browser

A useful demo: run the partitioning math, see how device choices change tok/s. The lessons jump out fast.

Editable Python demo: partition a 70B across hypothetical devices and predict throughput. The phone vs no-phone difference is dramatic.
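A self-contained stand-in for that demo. The per-layer speeds are assumptions; the throttled-phone figure in particular is a guess meant to show the shape of the effect, not a measurement:

# What-if: same 70B, same wifi-6, with and without the phone in the pipeline.
def tok_per_s(shares, rtt_ms=5.0):
    """shares: list of (layer_count, assumed ms per layer) per device."""
    compute = sum(n * ms for n, ms in shares)
    network = rtt_ms * len(shares)          # one hop per device
    return 1000 / (compute + network)

# Phone holds 6 layers but throttles to ~25 ms/layer after a minute of decoding.
with_phone = [(29, 2.5), (6, 25.0), (16, 6.0), (29, 2.0)]
# Drop the phone; its 6 layers move onto the two laptops.
without_phone = [(32, 2.5), (16, 6.0), (32, 2.0)]

print(f"with throttled phone: {tok_per_s(with_phone):.1f} tok/s")
print(f"without the phone:    {tok_per_s(without_phone):.1f} tok/s")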

The math says: dropping the phone often helps despite removing 5 GB of memory budget — the phone is on the critical path. And bad wifi can erase the entire benefit of the cluster.

Quick check

You set up an EXO cluster on a home wifi-6 network with a MacBook M2 (24 GB), an iPhone 15 (8 GB), and a Linux desktop (32 GB) — running Llama-3.3-70B Q4 (~40 GB). Cluster reports peer-discovery success but tok/s is 0.8 — barely usable. What's the most likely cause?

Key takeaways

  1. Swarm inference is pipeline-parallel inference applied to a device fleet — same algorithm as data-center training, applied across phones and laptops on a LAN.
  2. EXO is the open-source reference, supports iOS/Android/macOS/Linux, dynamic partitioning, mDNS discovery.
  3. Petals is the public-swarm variant — slower, less private, but runs models that don’t fit any single LAN.
  4. The network is the bottleneck — wifi-6 with no contention is the floor for usable performance; wifi-5 contended is unusable.
  5. The phone is usually the bottleneck device — assign minimum viable layers; pin to background-task entitlement.
  6. 3–5 tok/s on 70B across a 4-device LAN is the realistic performance envelope. Not snappy, but viable for non-real-time workloads.
  7. Hybrid edge-cloud (one rented GPU + your devices) recovers most privacy while running models the LAN alone can’t fit.

Go deeper
