Disclosure: StrideNote Studio is a research and media production studio that evaluates local AI tools as part of its workflow.
Run the same 4 GB quantized 7B model through three different inference engines on the same Mac, and text generation speed can differ by 50% or more. On an M2 Ultra with 192 GB of unified memory, MLX sustained about 230 tokens per second on a Qwen-2.5 7B model with 4-bit quantization, while llama.cpp delivered about 150 tok/s under the same conditions, according to a systematic comparison by Shrivastava et al. (2025) published on arXiv. That is a gap of roughly 1.5x on a single metric. But the full picture is more complex because each engine targets a different bottleneck, and the best choice depends on your Apple Silicon generation and your workload.
Why does MLX have an M5 Neural Accelerator advantage?
MLX is Apple’s first-party machine learning framework, built from the ground up for the Metal graphics architecture. It launched in late 2023 and has since become the default inference path for users inside Apple’s ecosystem who want direct GPU access without third-party abstraction layers, as we covered in our Ollama vs MLX vs Jan comparison.
The M5 generation, released in late 2025, added dedicated Neural Accelerators: hardware units designed for the matrix-multiplication operations that dominate transformer inference. MLX reaches them through the Metal 4 Tensor API. In benchmarks published by Apple Machine Learning Research (2025), a Qwen-2.5 7B model running on an M5 MacBook Pro showed up to 4x faster time-to-first-token compared to the same model on an M4 machine. Generation throughput improved 19% to 27%. That smaller gain tracks the M5’s memory bandwidth increase from 120 GB/s to 153 GB/s, which is about 27.5%.
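The link between bandwidth and generation speed can be sketched with a first-order model. The snippet below assumes that every generated token streams the full set of weights from memory once, so decode speed is capped at bandwidth divided by model size. That is a common simplification, not Apple’s methodology, and the 4 GB model size is an illustrative figure for a 4-bit 7B model. The function and constant names are hypothetical.

```python
# Rough first-order model (an assumption, not a published methodology):
# each generated token reads all model weights from memory once,
# so decode speed cannot exceed bandwidth / model size.

def decode_ceiling_tok_s(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode tokens/s if weights are read once per token."""
    return bandwidth_gb_s / model_size_gb

M4_BANDWIDTH_GB_S = 120.0  # figure quoted in the article
M5_BANDWIDTH_GB_S = 153.0  # figure quoted in the article
MODEL_SIZE_GB = 4.0        # illustrative size of a 4-bit 7B model

m4_ceiling = decode_ceiling_tok_s(M4_BANDWIDTH_GB_S, MODEL_SIZE_GB)
m5_ceiling = decode_ceiling_tok_s(M5_BANDWIDTH_GB_S, MODEL_SIZE_GB)

# The ratio of ceilings depends only on the bandwidth ratio.
bandwidth_gain_pct = (M5_BANDWIDTH_GB_S / M4_BANDWIDTH_GB_S - 1.0) * 100.0

print(f"M4 ceiling: {m4_ceiling:.2f} tok/s")
print(f"M5 ceiling: {m5_ceiling:.2f} tok/s")
print(f"Bandwidth gain: {bandwidth_gain_pct:.1f}%")
```

Under this model the ceiling rises by the same 27.5% as the bandwidth, which sits at the top of the 19% to 27% generation gain reported above. Time-to-first-token is compute-heavy rather than bandwidth-heavy, which is why it can improve by a much larger factor.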
The Shrivastava et al. comparison, which ran on an M2 Ultra, found MLX sustained the highest generation throughput across the models tested. The study is notable because it is a systematic head-to-head of these engines, posted on arXiv with full reproduction scripts. On the Qwen-2.5 7B 4-bit test, MLX reached approximately 230 tok/s and llama.cpp approximately 150 tok/s under identical settings.
The MLX advantage is clearest on Apple’s own hardware, especially M5 and later. The framework gets first access to new Metal features, and the model conversion tools are maintained by the same team that builds the inference engine. That vertical integration matters when a new chip ships and developers want accelerated inference on day one.
What did llama.cpp gain from the Metal Tensor API?
llama.cpp is the most widely deployed open-source inference engine for local LLMs. It supports more model formats, more quantization schemes, and more hardware backends than any alternative. On Apple Silicon, it has historically lagged MLX on raw throughput because it must map operations through the Metal Performance Primitives layer rather than calling Apple’s Metal APIs directly.
That gap narrowed significantly between late 2025 and early 2026. Georgi Gerganov, the creator of llama.cpp, merged support for the Metal 4 Tensor API in PR #16634. The change restricts the Neural Accelerator path to M5 and later chips. It reworked how llama.cpp performs matrix-matrix multiplication on Apple GPUs and unlocked hardware paths that were previously unavailable to it.
The impact showed up in PR #20962, merged in March 2026. A benchmark on an M5 Max with 40 GPU cores and 64 GB of memory measured LLaMA 7B Q4_0 prompt processing at 3246 tokens per second. That is a 229% improvement over the same build before the Tensor API optimizations. Text generation, or decode, improved more modestly from 102 to 110 tok/s, a gain of about 8%. That pattern is consistent with decode remaining memory-bandwidth bound rather than compute bound on this engine.
The same pull request showed what the M5 upgrade buys. Prompt processing for Mistral 8B Q8_0 on an M4 Max ran at 631 tok/s. On an M5 Max, it ran at 2695 tok/s, a 4.27x improvement. A separate change, PR #19369, improved CPU and GPU interleaving during graph encoding and delivered 1% to 6% decode throughput gains across models from DeepSeek MoE to Gemma 3 to Qwen 3.
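The figures above mix multipliers and percentages, so it helps to put them on one scale. The sketch below recomputes the quoted gains from the raw tok/s numbers. The helper names are hypothetical, and the implied pre-optimization baseline assumes that "229% improvement" means the new rate is 3.29 times the old one, which the source does not state explicitly.

```python
def speedup(before_tok_s: float, after_tok_s: float) -> float:
    """Ratio of new throughput to old throughput (e.g. 4.27 means 4.27x)."""
    return after_tok_s / before_tok_s

def percent_gain(before_tok_s: float, after_tok_s: float) -> float:
    """Relative improvement expressed as a percentage of the old value."""
    return (after_tok_s - before_tok_s) / before_tok_s * 100.0

# Decode on the M5 Max, LLaMA 7B Q4_0, before and after the optimizations.
decode_gain_pct = percent_gain(102.0, 110.0)

# Prompt processing for the Q8_0 model, M4 Max versus M5 Max.
m4_to_m5_speedup = speedup(631.0, 2695.0)

# Assumption: a "229% improvement" means new = old * (1 + 2.29).
implied_prefill_baseline = 3246.0 / (1.0 + 2.29)

print(f"Decode gain: {decode_gain_pct:.1f}%")
print(f"M4 Max to M5 Max prompt speedup: {m4_to_m5_speedup:.2f}x")
print(f"Implied prompt baseline: {implied_prefill_baseline:.0f} tok/s")
```

Seen side by side, prompt processing moves by whole multiples while decode moves by single-digit percentages, which matches the compute-bound versus bandwidth-bound split described above.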
llama.cpp does not lead on raw decode speed against MLX or MetalRT. But its ecosystem reach, format support, and the speed of its recent Metal optimizations make it the engine to watch when the next round of Apple hardware ships. No other engine can run as many different model formats out of the box.
What does MetalRT do differently?
MetalRT is a newer entrant developed by RunAnywhere and released in early 2026. It is a dedicated Metal inference engine that skips the abstraction layers both MLX and llama.cpp sit on. It is not a fork of either project. It is a ground-up implementation that calls Metal directly for matrix operations and memory management.
The company published benchmarks in April 2026 comparing MetalRT against mlx-lm, llama.cpp, Ollama, and uzudil on an M4 Max with 64 GB of memory. The test used the same MLX 4-bit model files for MetalRT and mlx-lm. The shared model format gave a direct engine-to-engine comparison without a format variable. MetalRT’s peak decode speed reached 658 tok/s on Qwen3-0.6B. Against mlx-lm, it was 1.10x to 1.19x faster on decode across the model range. Against llama.cpp, the gap was 1.35x to 2.14x.
These numbers are strongest on decode, which is the bottleneck users feel most directly during chat and text generation. Each token comes out slightly faster, and over thousands of tokens the difference adds up to seconds of wall-clock time. MetalRT does not currently match MLX on time-to-first-token for large prompt sizes on M5 hardware. Its ecosystem is small. The engine supports a narrower range of model formats and has fewer community tools. But for decode-bound workloads on M4 and M5 hardware, it is the fastest open-source option measured so far.
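To make the wall-clock claim concrete, here is a minimal sketch that converts decode rates into generation time. It uses the Qwen3-0.6B decode figures reported for the M4 Max (658 tok/s for MetalRT, roughly 553 for mlx-lm, roughly 394 for llama.cpp). The 5,000-token output length is an arbitrary illustrative choice, and the model assumes a constant decode rate, which real runs will not hold exactly.

```python
# Decode rates in tokens/s for Qwen3-0.6B on an M4 Max, as reported in
# the published benchmarks; the mlx-lm and llama.cpp values are approximate.
DECODE_TOK_S = {
    "MetalRT": 658.0,
    "mlx-lm": 553.0,
    "llama.cpp": 394.0,
}

def generation_seconds(output_tokens: int, tok_s: float) -> float:
    """Time to decode output_tokens at a constant rate (a simplification)."""
    return output_tokens / tok_s

OUTPUT_TOKENS = 5000  # hypothetical long-form generation

times = {
    engine: generation_seconds(OUTPUT_TOKENS, rate)
    for engine, rate in DECODE_TOK_S.items()
}

for engine, seconds in times.items():
    saved = seconds - times["MetalRT"]
    print(f"{engine:10s} {seconds:6.2f} s  (+{saved:.2f} s vs MetalRT)")
```

For a small model the absolute difference is a few seconds per 5,000 tokens. Larger models decode more slowly, so the same ratios translate into larger absolute gaps.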
How do the three engines compare by the numbers?
The table below consolidates available data across the three engines from published benchmarks.
| Metric | MLX | llama.cpp | MetalRT |
|---|---|---|---|
| Peak decode, small model (M4 Max) | ~553 tok/s (Qwen3-0.6B) | ~394 tok/s (Qwen3-0.6B) | 658 tok/s (Qwen3-0.6B) |
| Sustained throughput, 7B Q4 (M2 Ultra) | ~230 tok/s | ~150 tok/s | Not tested on M2 Ultra |
| Prompt processing, 7B Q4_0 (M5 Max) | Not separately benchmarked | 3246 tok/s (LLaMA 7B) | Not separately benchmarked |
| TTFT speedup M5 vs M4 | Up to 4x | Not separately benchmarked | Not separately benchmarked |
| Ecosystem and format support | Apple-backed, MLX-native | Largest, universal GGUF | Small, MLX-format models |
| Active development pace | Apple research team | 30+ contributors per month | RunAnywhere team |
The pattern across these sources is consistent. MLX leads on prompt processing and time-to-first-token when running on M5 hardware, thanks to direct Neural Accelerator access. MetalRT leads on decode throughput in the published M4 Max tests, by 1.10x to 1.19x over mlx-lm and 1.35x to 2.14x over llama.cpp. llama.cpp has closed the gap substantially in the last six months, especially on prompt processing, but still trails on decode.
How do you choose an engine for your workload?
For interactive chat on an M5 MacBook Pro, MLX gives the fastest first-token response. The TTFT improvement from the Neural Accelerators, up to 4x over M4 in Apple’s benchmarks, shortens the wait before the model starts answering, especially after a long prompt. For batch generation and long-form text output on the same hardware, MetalRT’s decode advantage matters more, because total time is dominated by the number of tokens generated.
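One way to see which phase a workload stresses is to split request latency into prompt processing and decode. The sketch below is a simplified model under stated assumptions: both rates are treated as constant, and it borrows the llama.cpp M5 Max figures quoted earlier (3246 tok/s prompt processing, 110 tok/s decode) purely as sample inputs. The two workloads and all function names are hypothetical.

```python
def phase_seconds(prompt_tokens: int, output_tokens: int,
                  prefill_tok_s: float, decode_tok_s: float) -> tuple[float, float]:
    """Return (prefill_seconds, decode_seconds) assuming constant rates."""
    prefill_s = prompt_tokens / prefill_tok_s
    decode_s = output_tokens / decode_tok_s
    return prefill_s, decode_s

def bottleneck(prompt_tokens: int, output_tokens: int,
               prefill_tok_s: float, decode_tok_s: float) -> str:
    """Name the phase that takes longer for this request shape."""
    prefill_s, decode_s = phase_seconds(
        prompt_tokens, output_tokens, prefill_tok_s, decode_tok_s
    )
    return "prefill" if prefill_s > decode_s else "decode"

# Sample rates: llama.cpp on an M5 Max, LLaMA 7B Q4_0, from the article.
PREFILL_TOK_S = 3246.0
DECODE_TOK_S = 110.0

# Hypothetical request shapes: (prompt tokens, output tokens).
workloads = {
    "long-context question": (8000, 100),
    "long-form drafting": (200, 2000),
}

for name, (prompt_tokens, output_tokens) in workloads.items():
    prefill_s, decode_s = phase_seconds(
        prompt_tokens, output_tokens, PREFILL_TOK_S, DECODE_TOK_S
    )
    phase = bottleneck(prompt_tokens, output_tokens, PREFILL_TOK_S, DECODE_TOK_S)
    print(f"{name}: prefill {prefill_s:.2f} s, decode {decode_s:.2f} s -> {phase}")
```

Plugging in your own prompt and output lengths shows whether a faster time-to-first-token or a faster decode rate would save more time for your typical request.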
For users on M4 or earlier hardware, the picture shifts. MLX still benefits from Apple’s framework optimizations, but its Neural Accelerator advantage disappears because those chips lack the hardware. MetalRT’s published decode lead was measured on an M4 Max. Large parts of the install base are on M4 or M3 machines, and for those systems MetalRT is the strongest option for decode-bound work.
For multi-user serving scenarios, the vllm-mlx project developed by Wayner Barrios et al. (2026) achieved 21% to 87% higher throughput than llama.cpp on an M4 Max, with continuous batching scaling to 4.3x aggregate throughput at 16 concurrent requests. That makes MLX the current leader for server-like workloads on Apple Silicon.
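Aggregate throughput and per-user speed pull in different directions, which a short calculation makes visible. The sketch assumes the 4.3x figure is measured against a single-request baseline on the same engine and that the batch shares throughput evenly. Neither assumption is confirmed by the source, and the names are hypothetical.

```python
def per_request_share(aggregate_scaling: float, concurrent_requests: int) -> float:
    """Fraction of single-stream speed each request sees under even sharing."""
    return aggregate_scaling / concurrent_requests

AGGREGATE_SCALING = 4.3   # aggregate throughput vs one request (from the article)
CONCURRENT_REQUESTS = 16  # concurrency level quoted in the article

share = per_request_share(AGGREGATE_SCALING, CONCURRENT_REQUESTS)

print(f"Each request runs at about {share:.0%} of single-stream speed")
print(f"The server as a whole delivers {AGGREGATE_SCALING}x the tokens")
```

Under these assumptions each user sees slower generation than they would alone, while the machine serves far more total tokens, which is the trade that continuous batching makes.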
All three engines are free and open source, and all three are under active development. The gap between them is narrower at the end of 2026 than it was at the start of the year, and it will keep shrinking as M5 adoption grows and optimizations propagate across all three codebases.
The best engine is not the one with the highest headline number. It is the one that fixes the bottleneck your workload hits first. For most users on M5 hardware doing interactive work, that means MLX for prompt processing and MetalRT for decode. For users on older hardware or with format compatibility needs, llama.cpp remains the most practical choice. Once the engine question is settled, our guide to which local LLM to run in 2026 helps you choose the model to feed it.
Sources: Shrivastava et al. “Production-Grade Local LLM Inference on Apple Silicon” (arXiv, 2025), Apple Machine Learning Research “Exploring LLMs with MLX and M5” (2025), RunAnywhere “MetalRT Fastest LLM Decode Engine” (2026)