On-Device AI 2026: Developer Guide to NPUs and Edge Inference
Why On-Device AI Matters More Than Ever in 2026
Two years ago, "on-device AI" meant running a tiny sentiment classifier or waking up a voice assistant. In 2026, it means running 70-billion-parameter language models portably on a laptop, processing a 30B MoE model's long prompt in under three seconds, and shipping production inference to end users without a cloud API bill.
The shift happened fast. Apple M5 Pro and Max chips landed with 307–614 GB/s of memory bandwidth, making them genuinely competitive with cloud inference endpoints for certain workloads. Qualcomm's Snapdragon X Elite shipped in Copilot+ PCs with a 45 TOPS NPU and an established developer SDK. And Apple previewed Core AI — a full replacement for Core ML expected at WWDC 2026 in June — that unifies on-device and Apple-hosted models under a single Swift API for iOS 27 and macOS 27.
The developer tooling caught up too. Ollama crossed 90,000 GitHub stars and became the de facto local inference layer. MLX, Apple's framework built specifically for unified-memory Silicon, now outperforms llama.cpp by 20–87% on models under 14B parameters. ONNX Runtime with Qualcomm's QNN Execution Provider makes Windows NPU development straightforward.
This guide gives you a practical map: what hardware does what, which framework to use, and how to go from model download to production-grade on-device inference.
Understanding the NPU vs CPU vs GPU Split
Before picking tools, you need to understand what the compute triangle actually looks like for LLM inference. The three processors have fundamentally different strengths.
NPU (Neural Processing Unit) — NPUs are optimized for matrix-vector multiplication, which is exactly what happens during the decode phase of LLM inference: generating each new token means multiplying a single activation vector (the current token's hidden state) against massive weight matrices. An independent 2026 study found the NPU reduces latency on matrix-vector multiplication by 58.54% compared to the GPU. The catch: NPUs are memory-bandwidth bound, not compute bound, so raw TOPS numbers can be misleading.
GPU — GPUs excel at the prefill (prompt processing) phase, where you run matrix-matrix multiplication across the entire input sequence in parallel. For a long-context prompt being processed in one shot, the GPU wins by 22.6% lower latency and 2x higher throughput versus NPU. GPU also handles batching multiple requests efficiently — critical for server-side deployments.
CPU — Slowest for large models but universal and power-efficient for small models (3B parameters and under). llama.cpp's CPU path remains relevant for embedded devices, microcontrollers, and scenarios where GPU/NPU aren't available.
The practical implication: for interactive inference of 7B+ models today, GPU via Metal (Apple) or CUDA (NVIDIA) is still the recommended execution path. NPU shines for smaller, app-embedded models where you want low power draw and deterministic latency — not for the 70B-class Llama you're running interactively in a terminal.
The more interesting insight is that for Apple Silicon, the distinction between "GPU" and "NPU" is collapsing. M5 introduced Neural Accelerators embedded in every GPU core, so the GPU and Neural Engine now cooperate on inference tasks rather than competing. That's part of why M5 shows 4x the LLM performance of M4 — not TOPS, but architecture.
The NPU Landscape: Comparing Apple, Qualcomm, and Intel
Every major chip now ships with an NPU, but the numbers vary enormously.
| Chip | NPU TOPS | Memory Bandwidth | Best For |
|---|---|---|---|
| Apple M5 Max (40-core GPU) | ~38 TOPS NE + Neural Accel. | 614 GB/s | Large models (70B+) |
| Apple M5 Pro (20-core GPU) | ~38 TOPS NE + Neural Accel. | 307 GB/s | Mid-range (14B–30B) |
| Apple M5 (base) | 16-core Neural Engine | 153.6 GB/s | Small models (≤7B) |
| Qualcomm Snapdragon X Elite | 45 TOPS | ~135 GB/s | Windows app inference |
| Intel Lunar Lake (Core Ultra 200V) | ~48 TOPS NPU 4 | ~120 GB/s | Copilot+ thin-and-light |
| Intel Arrow Lake (Core Ultra 200S) | 13 TOPS | varies | Desktop, limited AI tasks |
| AMD Ryzen AI 300 (Strix Point) | 50 TOPS | ~100 GB/s | Copilot+ laptops |
The TOPS column is misleading on its own — raw TOPS don't determine real-world LLM speed. Memory bandwidth does. The Apple M5 Max's 614 GB/s is the number that lets it run 70B models comfortably. The Qualcomm X Elite's 135 GB/s puts it in the 7B–13B sweet spot for interactive inference.
A useful rule: at Q4_K_M quantization (the most practical precision for local inference), budget 0.6–0.7 GB per billion parameters. A 7B model needs ~4.5 GB, a 14B needs ~9 GB, a 70B needs ~42 GB. Cross-reference that against your chip's RAM ceiling and you know what's feasible.
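To make the rule concrete, here is a rough back-of-envelope helper in Python (an illustrative sketch, not a benchmark: it assumes a dense model at Q4_K_M, ignores KV-cache and activation overhead, and treats memory bandwidth as the hard ceiling on decode speed, since every generated token has to read all weights once):

def estimate(params_b: float, bandwidth_gb_s: float, gb_per_b: float = 0.65):
    # gb_per_b: the 0.6–0.7 GB-per-billion-parameters rule at Q4_K_M
    model_gb = params_b * gb_per_b
    max_decode_tps = bandwidth_gb_s / model_gb  # upper bound on tokens/sec during decode
    return model_gb, max_decode_tps

for params in (7, 14, 70):
    gb, tps = estimate(params, bandwidth_gb_s=614)  # M5 Max-class bandwidth
    print(f"{params}B @ Q4_K_M ≈ {gb:.1f} GB, decode ceiling ≈ {tps:.0f} tok/s")

Real throughput lands below that ceiling, but the ratio is a quick sanity check for whether a model/hardware pairing can be interactive at all.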
Apple M5 and the MLX Framework
If you're developing on Apple Silicon in 2026, MLX is the first tool to reach for — not llama.cpp or Ollama's default backend. Apple built MLX from scratch for unified memory architecture: the CPU, GPU, and Neural Engine all share the same physical memory pool, so a tensor never has to be copied between memory banks.
The performance difference is meaningful. In benchmarks across M5 series chips, MLX-LM delivers 20–87% higher token throughput than llama.cpp for models under 14B parameters. For larger models, the gap narrows (Ollama with Metal is competitive), but MLX remains the fastest option for anything you can fit in VRAM.
M5 Pro and Max make 70B portable. The M5 Max with 128 GB unified memory can load a 70B model at Q4 quantization (roughly 42 GB) with room to spare. In practice, a dense 14B model loads its full context in under 10 seconds, and a 30B MoE model processes a long prompt in under 3 seconds. These are interactive, not batch, numbers.
Setting Up MLX on Apple Silicon
# Install mlx-lm
pip install mlx-lm
# Run inference on a downloaded model
mlx_lm.generate \
  --model mlx-community/Llama-3.2-3B-Instruct-4bit \
  --prompt "Explain NPU vs GPU for LLM inference in 3 sentences."
# Convert a Hugging Face model to MLX format
mlx_lm.convert \
  --hf-path meta-llama/Llama-3.2-3B-Instruct \
  --mlx-path ./llama-3.2-3b-mlx \
  --quantize \
  --q-bits 4
The mlx-community organization on Hugging Face maintains pre-converted 4-bit MLX versions of most popular models, so you usually don't need to convert manually.
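If you'd rather call MLX from Python than from the CLI, mlx-lm exposes the same functionality through load and generate (a minimal sketch using a pre-converted mlx-community model; the prompt and max_tokens value are just examples):

from mlx_lm import load, generate

# Load a pre-quantized 4-bit model straight from Hugging Face
model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

# Generate a response; max_tokens caps the output length
text = generate(
    model,
    tokenizer,
    prompt="Explain NPU vs GPU for LLM inference in 3 sentences.",
    max_tokens=200,
)
print(text)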
For higher-level access with a REST API, Ollama with the default Metal backend is the pragmatic choice: one command install, OpenAI-compatible /v1/chat/completions endpoint, automatic model management.
# Install and run a model in two commands
brew install ollama
ollama run llama3.2:3b
Apple Core AI and iOS 27: What's Coming in June
Apple confirmed it is replacing Core ML with Core AI, expected at WWDC 2026 (June 8–12). This is a significant architectural change for iOS and macOS developers.
Core ML was powerful but narrow: you compiled a model into an .mlpackage, called it synchronously, and managed the model lifecycle yourself. Core AI extends this with three capabilities that matter for developers:
1. Unified on-device and hosted inference. Core AI gives you a single Swift API to call models running locally on the Neural Engine or remotely on Apple's Private Cloud Compute. You don't change your calling code — the framework picks the best execution path based on model size, privacy settings, and network availability.
2. Foundation Models access. Apple's own Foundation Models (the same ones powering the upgraded Siri in iOS 27) are available to third-party developers through Core AI. These are production-quality models Apple maintains and updates — no download required, no model management overhead.
3. Third-party model integration. Early reports suggest Core AI will support third-party models, with MCP (Model Context Protocol) being a possibility for tool-calling integration. This would let developers ship apps that use on-device Llama, Gemma, or Phi models with the same API as Apple's own Foundation Models.
For developers targeting iOS 27 and macOS 27, Core AI is the forward path. Core ML will be deprecated gradually. The practical implication now: don't invest heavily in Core ML pipeline tooling if you're building new apps — wait for WWDC to see the Core AI API surface before committing.
Qualcomm X Elite: The Windows NPU Path
On Windows, Qualcomm's Snapdragon X Elite is the primary NPU target for Copilot+ PCs. The developer path is more fragmented than Apple's, but workable.
The recommended stack: ONNX Runtime + QNN Execution Provider. You export your model to ONNX format, then ONNX Runtime routes it through the QNN SDK to the Hexagon NPU. Qualcomm AI Hub provides pre-validated models that have already been optimized for Snapdragon X Elite, so you don't need to tune every model from scratch.
# Install ONNX Runtime with QNN backend (Windows)
pip install onnxruntime-qnn
# Create an inference session that targets the Hexagon NPU via the QNN HTP backend
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["QNNExecutionProvider"],
    provider_options=[{"backend_path": "QnnHtp.dll"}],
)
Supported precisions are FP16, INT16, and INT8. For Llama 3.2 3B, Qualcomm's reference implementation hits approximately 100ms time-to-first-token on a 128-token prompt — fast enough for responsive chatbot UX.
For developers building cross-platform Windows + macOS apps with a shared inference layer, ONNX Runtime is the practical common ground: use the QNN provider on Windows and the CoreML execution provider on macOS. The model conversion is the same; only the provider string changes.
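A minimal sketch of that pattern (the provider names are real ONNX Runtime identifiers; the model path and fallback order are illustrative):

import platform
import onnxruntime as ort

# Prefer the platform-native provider, fall back to CPU if it isn't available
if platform.system() == "Windows":
    providers = ["QNNExecutionProvider", "CPUExecutionProvider"]  # may also need backend_path, as shown above
elif platform.system() == "Darwin":
    providers = ["CoreMLExecutionProvider", "CPUExecutionProvider"]
else:
    providers = ["CPUExecutionProvider"]

session = ort.InferenceSession("model.onnx", providers=providers)
print("Active providers:", session.get_providers())  # confirms which backend ONNX Runtime actually picked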
Practical Deployment: Choosing Your Stack
The right stack depends on where your model runs and who uses it.
App-embedded, mobile/laptop: CoreML (today) or Core AI (iOS 27+) on Apple. ONNX Runtime + QNN on Qualcomm Windows. These are for smaller models (1B–7B range) embedded in shipped apps where the user never manages models.
Developer machine, interactive: Ollama. One command, OpenAI-compatible API, automatic backend selection (Metal on Mac, CUDA on Linux, CPU fallback). Use MLX-LM directly when you want maximum throughput on Apple Silicon.
On-premises server: vLLM. PagedAttention, continuous batching, speculative decoding — it handles multi-user load that Ollama wasn't designed for. vLLM remains GPU-first rather than NPU-aware, so this path is for machines with discrete GPUs or Apple Silicon servers (Mac Studio / Mac Pro).
# Typical Ollama setup for development
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.2:3b-instruct-q4_K_M
ollama serve # starts OpenAI-compatible endpoint at localhost:11434
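Because the endpoint is OpenAI-compatible, any OpenAI client library can talk to it by pointing base_url at the local server. For example, with the official openai Python package (the api_key value is a placeholder; Ollama ignores it):

from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2:3b-instruct-q4_K_M",
    messages=[{"role": "user", "content": "Why does memory bandwidth matter for decode speed?"}],
)
print(response.choices[0].message.content)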
Quantization: The Right Precision for the Right Hardware
Q4_K_M is the practical default for 2026. It cuts memory to roughly 0.65 GB per billion parameters with minimal quality loss — independent research puts the performance degradation at under 2% on most coding and reasoning benchmarks for models above 7B parameters.
| Quantization | GB per B params | Quality vs FP16 | Best for |
|---|---|---|---|
| FP16 | ~2.0 GB | Baseline | GPU with large VRAM |
| Q8_0 | ~1.0 GB | ~99% | Quality-first edge deployments |
| Q4_K_M | ~0.65 GB | ~97-98% | General on-device use |
| Q3_K_M | ~0.48 GB | ~93-95% | Memory-constrained devices |
| Q2_K | ~0.32 GB | ~85-90% | Experimental only |
For Apple Silicon, MLX handles quantization automatically when you use --quantize during conversion. For GGUF-based tools like llama.cpp and Ollama, the quantization level is baked into the model file name (e.g., llama3.2:3b-instruct-q4_K_M).
Common Mistakes When Going On-Device
Using TOPS as the primary benchmark. A 50 TOPS AMD NPU sounds better than Apple's 16-core Neural Engine on paper. In practice, Apple M5 Max runs 70B models and that AMD chip doesn't, because memory bandwidth is the bottleneck for LLM decode, not TOPS. Always check memory bandwidth and RAM ceiling alongside TOPS.
Ignoring the model format. CoreML takes .mlpackage. ONNX Runtime takes .onnx. Ollama takes GGUF. MLX takes its own format. These are not interchangeable. Plan your conversion pipeline before you're two weeks into model fine-tuning.
Running inference on battery without power profiling. Continuous inference on NPU/GPU at full load drains a MacBook Pro M5's battery in under 2 hours. For deployed apps, implement inference throttling and take advantage of the framework's energy-efficiency modes; both CoreML and MLX expose quality/performance tradeoffs.
Assuming all models quantize well. Mixture-of-experts models (Qwen3.5, Llama 4 Maverick) have sparse activations that interact differently with quantization than dense models. Test your target model explicitly at your target quantization — don't assume the Q4_K_M results for Llama 3.2 transfer directly to a MoE model.
Not planning for Core AI migration. If you're building iOS or macOS apps using Core ML today, start following Core AI developments closely. Apple will announce it at WWDC in June 2026. Apps using low-level Core ML calls will need migration. Starting that work before WWDC's API is finalized is risky; starting it after WWDC is smart.
FAQ
Q: Can I run a 70B model on a standard MacBook Pro M5?
A 70B model at Q4_K_M needs roughly 42 GB, so the M5 Max with 128 GB unified memory is the only configuration that handles it comfortably. The base M5 (up to 32 GB) is comfortable through 14B models at Q4. The M5 Pro (64 GB) handles 30B models well and can technically squeeze in a 70B if no other applications are competing for RAM. For standard M5 MacBook Pros (16–32 GB configurations), 14B is the practical ceiling.
Q: Is TOPS a useful number when comparing NPUs?
Partially. TOPS (Tera Operations Per Second) tells you compute throughput, which matters for the prompt-processing (prefill) phase of inference. For the decode phase — generating each token — memory bandwidth is more important. For LLM workloads specifically, look at both TOPS and memory bandwidth, and compare them to the parameter count of your target model. An NPU with 48 TOPS but 32 GB/s bandwidth is less useful for a 7B model decode than one with 30 TOPS and 135 GB/s bandwidth.
Q: Should I wait for Core AI or build with Core ML now?
If you're targeting release before WWDC 2026 (June 8), finish your Core ML integration — it's stable and well-documented. If your target release is Q3 2026 or later, plan around Core AI. Build your inference abstraction layer now so the Core AI migration is a provider swap, not an architectural rewrite. Apple typically offers migration guides and deprecation timelines; Core ML will remain functional for at least two major OS versions after Core AI ships.
Q: How do I choose between Ollama, MLX-LM, and llama.cpp on a Mac?
- Ollama: Best for serving a REST API locally, sharing models across multiple projects, or integrating with tools that expect an OpenAI-compatible endpoint. Default for most developer workflows.
- MLX-LM: Best for raw throughput when you're running inference inside Python scripts directly, fine-tuning, or benchmarking. 20–87% faster than llama.cpp for models under 14B.
- llama.cpp: Best for cross-platform code that also needs to run on Linux/Windows CPU, or when you need GGUF's flexibility (custom architectures, very small models, LoRA adapters at runtime).
Q: What quantization level should I use for a production iOS app?
For an app-embedded model via CoreML or Core AI: Q8_0 if you have the memory budget (larger models with minimal quality loss), Q4_K_M otherwise. With INT4 weights, setting MLComputeUnits.all lets CoreML route eligible work to the Neural Engine automatically on A-series and M-series chips. Avoid anything below Q3 in production — the quality degradation at that level is noticeable to end users in open-ended generation tasks.
Key Takeaways
The hardware is finally there. Apple M5 Max and Qualcomm X Elite aren't edge devices compromising on quality — they're running production-scale models at interactive speeds. The developer tooling in 2026 (MLX, Ollama, ONNX Runtime + QNN) is mature enough to ship on.
Memory bandwidth beats TOPS. When evaluating NPU hardware for LLM inference, look at GB/s first, TOPS second. The Apple M5 Max's 614 GB/s is the reason it handles 70B models, not the number of Neural Engine cores.
Core AI is the most important on-device development event of 2026. Apple replacing Core ML with a framework that unifies on-device and hosted inference, includes access to Foundation Models, and possibly supports MCP — that changes what's possible in iOS apps. Follow WWDC closely.
Build your inference abstraction now. Whether you're targeting Apple or Qualcomm, the right move is an inference interface layer in your app that can swap backends. The frameworks are evolving fast enough that hard-coding calls to CoreML, QNN, or MLX today will require refactoring within 12 months.
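As an illustration of what that interface layer can look like, here is a compact Python sketch (the backend class, method signature, and helper are hypothetical; the same shape translates to Swift or Kotlin for app code):

from typing import Protocol

class InferenceBackend(Protocol):
    """The narrow surface your app codes against; backends stay swappable."""
    def generate(self, prompt: str, max_tokens: int = 256) -> str: ...

class OllamaBackend:
    """Hypothetical adapter that fulfills the protocol via Ollama's OpenAI-compatible API."""
    def __init__(self, model: str, base_url: str = "http://localhost:11434/v1"):
        from openai import OpenAI
        self._client = OpenAI(base_url=base_url, api_key="ollama")
        self._model = model

    def generate(self, prompt: str, max_tokens: int = 256) -> str:
        r = self._client.chat.completions.create(
            model=self._model,
            messages=[{"role": "user", "content": prompt}],
            max_tokens=max_tokens,
        )
        return r.choices[0].message.content

# App code depends only on the protocol; swapping in an MLX, QNN, or Core AI
# backend later is a constructor change, not a refactor.
def summarize(backend: InferenceBackend, text: str) -> str:
    return backend.generate(f"Summarize in two sentences: {text}", max_tokens=120)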
Quantization is a first-class decision. Q4_K_M is your default. Know the memory requirements before you pick a model — there's nothing worse than shipping a feature that runs on your M5 Max test machine but OOMs on a user's 16 GB device.
On-device AI in 2026 is a real production option, not a demo. Apple M5 with MLX sets the benchmark; Qualcomm X Elite with ONNX Runtime is the Windows path; and Apple's Core AI framework at WWDC will define the next two years of iOS/macOS inference. Use memory bandwidth to evaluate hardware, Q4_K_M as your quantization default, and Ollama as your development inference layer — then swap in the platform-native SDK when you're ready to ship.
Prefer a deep-dive walkthrough? Watch the full video on YouTube.