TFLite & LiteRT
In a cloud Python world, “ship to Android” sounds straightforward — bundle a model, call it from Java. The reality on the device is messier: there are at least four chips that could run your matmul (CPU, GPU, NPU, DSP), each from a different vendor, each with a different op-coverage table. There is no Python on the user’s phone. There is no pip to install a runtime at first launch. The runtime that runs your model has to be ~1 MB, has to know about every op you used, and has to gracefully fall back to the CPU when one of the chips says “sorry, I don’t do LayerNorm.”
That runtime is TFLite (renamed LiteRT in 2024). It’s the reason live captions, Pixel Camera Night Sight, on-device translation, and the non-cloud half of Galaxy AI ship at billion-device scale. The Python API is just the SDK; what runs on the device is a ~1 MB C++ runtime that walks a flatbuffer and dispatches op by op.
The interesting design choice is the delegate model: pluggable backends (XNNPACK, GPU, NNAPI, Hexagon) that the runtime tries in order, with the CPU as a safety net. If the new NPU doesn’t support an op, the model doesn’t crash — it falls back to the next delegate, then to plain CPU. That’s why TFLite ships across billions of devices that the Android team can’t possibly QA exhaustively.
TL;DR
- TFLite is the runtime that ships in every Android device with Google Play Services — billions of installs. As of 2024 it’s been renamed to LiteRT (“Lite Runtime”) and the Python/Android packages migrated to `litert` namespaces. Same code, new branding; existing TFLite code still works.
- The runtime is tiny (~1 MB binary), reads `.tflite` flatbuffer files, and exposes a uniform tensor I/O API across C++, Java/Kotlin, Swift, Python, and JS.
- Performance comes from delegates — pluggable backends that move ops off the CPU. The four that matter: XNNPACK (CPU SIMD), GPU delegate (OpenCL/OpenGL/Metal), NNAPI (Android’s NPU/DSP routing layer), Hexagon delegate (Qualcomm direct).
- Quantization is the same INT8 / INT4 story as elsewhere; the TFLite-specific bits are representative-dataset post-training quantization (PTQ) and selective op fallback when a delegate doesn’t support an op.
- Use TFLite/LiteRT when: you’re shipping to Android, you want NNAPI’s automatic NPU routing, your model is a CNN/Transformer that fits ONNX-style ops cleanly. Use llama.cpp instead for chat-style LLMs (the GGUF + KV-cache stack is purpose-built for that).
Why this matters
ExecuTorch and Core ML get the LLM-on-phone press. TFLite/LiteRT ships the actual production AI in your phone right now. Live captions, on-device translation, smart compose, the Pixel Camera’s Night Sight neural processing, the Galaxy AI features that don’t talk to a cloud — all running through TFLite delegates routing to the device’s NPU.
Three reasons to know it cold:
- Android coverage: NNAPI can route to every Android NPU automatically (Hexagon, Mali, MediaTek APU, Samsung NPU) without per-vendor code. Nothing else does this.
- The non-LLM 90%: most production mobile ML is still CNN-shaped — ASR, OCR, image classification, on-device translation, super-resolution. TFLite is the right runtime for that whole class of models.
- Maturity: ten years of production hardening. Predictable failure modes, well-known op-coverage tables, the Android ML team actively ships fixes.
The 2024 LiteRT rebrand signals Google’s intent to keep investing — the new name comes with first-class JAX support and a cleaner litert.aot ahead-of-time compile path.
Mental model
The flatbuffer file is the artifact — frozen graph + quantization metadata. The runtime walks it op by op, asks each delegate “can you handle this?”, and falls back to the next delegate (down to plain CPU) for any op no delegate accepts. That selective fallback is why TFLite ships everywhere: an op the new NPU doesn’t support runs on CPU instead of crashing the model.
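One way to see this concretely on your dev machine: TensorFlow ships a small analyzer that lists the ops inside a `.tflite` and flags which ones the GPU delegate would accept. A minimal sketch, assuming you already have the my_model.tflite produced in the next section:

import tensorflow as tf

# List the ops stored in the flatbuffer and flag GPU-delegate compatibility;
# anything the delegate rejects is exactly what falls back toward the CPU path.
tf.lite.experimental.Analyzer.analyze(
    model_path="my_model.tflite",
    gpu_compatibility=True,
)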
Converting a model to .tflite
The conversion is offline Python — author surface, runs on your dev machine.
import numpy as np
import tensorflow as tf

# Convert a Keras / TF SavedModel to a fully int8 .tflite artifact
converter = tf.lite.TFLiteConverter.from_saved_model("my_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable int8 PTQ

def representative_data_gen():
    # 100–500 samples drawn from your real input distribution;
    # `calibration_samples` is a placeholder for your own data
    for sample in calibration_samples:
        yield [np.asarray(sample, dtype=np.float32)]

converter.representative_dataset = representative_data_gen
converter.target_spec.supported_ops = [
    tf.lite.OpsSet.TFLITE_BUILTINS_INT8,  # require int8 everywhere
]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
tflite_bytes = converter.convert()
with open("my_model.tflite", "wb") as f:
    f.write(tflite_bytes)

The representative_dataset is the killer feature. PTQ needs to see real input distributions to pick activation ranges; pass 100–500 samples drawn from your real data and the converter calibrates per-tensor scales. Skip it and you get default ranges that often clip badly.
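Before shipping the artifact, it’s worth a quick smoke test with the Python interpreter on your dev machine. A minimal sketch; the zero-filled input is just a placeholder, so swap in a real sample when checking accuracy:

import numpy as np
import tensorflow as tf

# Load the converted model and push one sample through it on the desktop.
interpreter = tf.lite.Interpreter(model_path="my_model.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Placeholder input: matches the model's declared shape and (int8) dtype.
sample = np.zeros(inp["shape"], dtype=inp["dtype"])
interpreter.set_tensor(inp["index"], sample)
interpreter.invoke()
print(interpreter.get_tensor(out["index"]))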
Loading on Android (Kotlin)
The Kotlin API is the user surface — what your Android app actually calls. Underneath it’s the C++ TFLite runtime; Kotlin is the SDK shell.
// build.gradle: implementation("com.google.ai.edge.litert:litert:1.0.1")
val options = Interpreter.Options().apply {
addDelegate(GpuDelegate()) // try GPU first
addDelegate(NnApiDelegate()) // then NPU via NNAPI
setNumThreads(4) // CPU fallback
}
val interpreter = Interpreter(loadModelFile("my_model.tflite"), options)
val inputBuffer = ByteBuffer.allocateDirect(...)
val outputBuffer = ByteBuffer.allocateDirect(...)
interpreter.run(inputBuffer, outputBuffer)

The delegate ordering matters — the runtime asks each delegate in order. GPU delegate first means GPU-friendly ops (matmul, conv) run on GPU; NNAPI catches the rest if the device has a compatible NPU; CPU is the safety net.
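If you want to poke at delegate behaviour from Python rather than on a device, the interpreter can also load an external delegate library. A sketch, where the .so path is a placeholder for whatever delegate build you have locally:

import tensorflow as tf

# Attach an external delegate shared library to the Python interpreter.
# "libexample_delegate.so" is a placeholder path, not a real artifact name.
delegate = tf.lite.experimental.load_delegate("libexample_delegate.so")
interpreter = tf.lite.Interpreter(
    model_path="my_model.tflite",
    experimental_delegates=[delegate],
)
interpreter.allocate_tensors()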
What XNNPACK actually does
XNNPACK is the default CPU delegate for TFLite — and it’s the reason “the CPU path” isn’t slow. It ships hand-written ARM NEON / x86 AVX kernels for every op and picks the best variant at startup via runtime CPU-feature detection. On a Pixel 8 (Cortex-A715), XNNPACK is roughly 2–3× faster than naive C++ for the same op.
Critically, XNNPACK supports dynamic shape ops that NNAPI / GPU delegate often refuse. So XNNPACK is what catches the long tail: the embedding lookup, the layer norm with non-power-of-2 hidden dim, the custom activation. XNNPACK is what makes the rest of the system shippable.
NNAPI: the routing layer
NNAPI is Android’s neural-network HAL. Each vendor writes an NNAPI driver that exposes its NPU’s capabilities; the runtime (TFLite via the NnApiDelegate) decides at load time which subgraphs the device’s accelerators can run.
This is huge: you don’t need vendor-specific code paths in your app. The same .tflite file routes to:
- Samsung NPU on Galaxy S24
- MediaTek APU on Dimensity SoCs
- Hexagon DSP on Snapdragon
- Mali GPU on devices without a dedicated NPU
- Just CPU on devices with broken NNAPI drivers
Not free though: NNAPI op coverage varies by vendor. Some OEMs ship buggy NNAPI drivers (Samsung’s Galaxy S20 era was infamous). The pragmatic pattern: enable NNAPI behind a runtime switch. If NNAPI fails (op not supported, driver bug), fall back to GPU then CPU.
Hexagon delegate: bypass NNAPI for Snapdragon
For Snapdragon-only deployment (a lot of robotics / embedded use cases), the Hexagon delegate is faster than NNAPI because it speaks directly to QNN/cDSP without the HAL middleware. Use this when you’re shipping a fixed-hardware device and you know it’s Snapdragon — for general Android apps stick with NNAPI.
TFLite vs. ExecuTorch — when to pick which
| If… | Use |
|---|---|
| You’re starting from a TF/Keras/JAX model | TFLite (no conversion friction) |
| You’re starting from a PyTorch model + need maximum mobile perf | ExecuTorch (shorter PyTorch → mobile path) |
| You want NNAPI to route to every Android NPU automatically | TFLite (NNAPI is its native delegate; ExecuTorch’s vendor backends are explicit) |
| Your model is a chat-style LLM | Neither — use llama.cpp or MLC |
| Your model is a CNN/CV/ASR pipeline | TFLite (mature op coverage; XNNPACK is excellent) |
| You’re shipping to robots/drones with fixed Snapdragon | TFLite + Hexagon delegate (or QNN direct) |
The LiteRT-specific 2024 changes
- Package rename: `tflite_runtime` → `ai-edge-litert` (Python), `org.tensorflow:tensorflow-lite` → `com.google.ai.edge.litert` (Android); see the import sketch after this list.
- JAX support: convert directly from JAX without going through TF first.
- `litert.aot`: ahead-of-time compile a `.tflite` to a target’s native binary, skipping the runtime dispatcher overhead. Useful for very-low-latency embedded targets.
- API is otherwise identical — old TFLite code runs unchanged.
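A minimal sketch of what the Python side of the rename looks like, assuming the ai-edge-litert wheel is installed (the interpreter class is API-compatible with the old tflite_runtime one):

# pip install ai-edge-litert
from ai_edge_litert.interpreter import Interpreter

# Same API as the old tflite_runtime.interpreter.Interpreter.
interpreter = Interpreter(model_path="my_model.tflite")
interpreter.allocate_tensors()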
Run it in your browser
A useful demo: simulate the delegate-fallback decision tree TFLite walks for each op. This is the heart of why TFLite “just works” across so many devices.
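Here’s a toy version in Python. The op-coverage sets are invented for illustration (real coverage varies by device and driver), and real TFLite hands whole subgraphs to a delegate rather than assigning ops one at a time, but the priority-order fallback is the same idea:

# Toy simulation of per-op delegate fallback; coverage tables are made up.
DELEGATES = {
    "GPU":     {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED", "ADD"},
    "NNAPI":   {"CONV_2D", "FULLY_CONNECTED", "SOFTMAX", "ADD"},
    "XNNPACK": {"CONV_2D", "FULLY_CONNECTED", "SOFTMAX", "ADD", "LAYER_NORM"},
}
FALLBACK_ORDER = ["GPU", "NNAPI", "XNNPACK", "CPU reference"]

def assign(op):
    # Return the first backend in priority order that claims the op;
    # the reference CPU kernels take anything nobody else wants.
    for name in FALLBACK_ORDER:
        if name == "CPU reference" or op in DELEGATES[name]:
            return name

model_ops = ["CONV_2D", "LAYER_NORM", "FULLY_CONNECTED", "CUSTOM_ACTIVATION", "SOFTMAX"]
for op in model_ops:
    print(f"{op:20s} -> {assign(op)}")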
The selective-fallback model is what makes TFLite actually ship in production. ExecuTorch has the same pattern but you have to wire each backend explicitly; TFLite gets it for free.
Key takeaways
- TFLite (now LiteRT) is Android’s standard ML runtime — billions of devices, the actual production path for non-LLM mobile AI.
- Delegates are the perf story: XNNPACK (CPU), GPU delegate, NNAPI (auto-route to any Android NPU), Hexagon delegate (Snapdragon direct).
- NNAPI is the secret weapon — one binary routes to every vendor’s NPU. Vendor-quality varies; gate behind try/fallback.
- Selective op fallback means a graph never fails because of one unsupported op — the runtime degrades gracefully to CPU.
- XNNPACK is the reason the CPU path is fast — hand-tuned ARM/x86 SIMD kernels, 2–3× over naive C++.
- For LLMs use llama.cpp or MLC. TFLite is the right call for CNN/CV/ASR models.
Go deeper
- Docs: LiteRT Documentation. The current canonical docs (post-2024 rebrand). Start here.
- Docs: LiteRT Delegates Guide. XNNPACK / GPU / NNAPI / Hexagon side-by-side.
- Docs: Android NNAPI Guide. The HAL contract every Android NPU driver implements. Worth skimming so you understand what TFLite is calling.
- Paper: TensorFlow Lite Micro: Embedded ML on Constrained Hardware. The TFLite-Micro paper — the embedded sibling. Shows the runtime's minimum-footprint path for $5 microcontrollers.
- Repo: google/XNNPACK. The CPU-SIMD kernel library. Read `src/qs8-gemm/` for hand-written INT8 kernel families.
- Blog: Introducing LiteRT. The 2024 rebrand announcement; what changed and why.
- Video: TFLite Internals: From Flatbuffer to Token. Walks the runtime's dispatch loop op by op — useful if you're debugging a delegate fallback.