Local LLM benchmarks on M4 Pro: Gemma, Qwen, Llama speeds

MacBook Pro fans spin up to an audible whir. The chassis under the palm rest gets warm. A cursor blinks, and then text starts appearing at a pace you can read along with. This is what running a 31-billion-parameter language model on a 48GB M4 Pro looks like in mid-2026: it works, it is local, and it is slower than cloud APIs but fast enough to be useful. According to CloudInsight’s Apple Silicon deployment guide, Gemma 4 31B on an M4 Pro at Q4 quantization delivers roughly 8-12 tokens per second, while the M4 Max with double the memory bandwidth hits 15-25 tok/s on the same model.

The M4 Pro sits in an awkward spot for local LLM work. It has enough unified memory (24GB or 48GB) to load models that no consumer GPU with 16-24GB VRAM can touch, but its 273 GB/s memory bandwidth constrains how fast those models actually generate text. Memory bandwidth is the binding constraint for transformer inference, as ModelPiper’s Apple Silicon benchmarks explain: every token generated requires reading the full set of active weights from memory, so bandwidth directly sets the ceiling on tokens per second. This article lays out measured and estimated token speeds for three major model families (Gemma 4, Qwen 3.5, and Llama 4) on M4 Pro hardware, drawn from published benchmarks and community testing in Q2 2026.

M4 Pro token speed comparison across model families
Measured token speeds on M4 Pro 48GB at Q4 quantization with 2K prompt context. MLX backend in blue, llama.cpp in orange. Higher is better.

What makes the M4 Pro different for local LLM inference?

Apple’s unified memory architecture is the reason the M4 Pro can run models that would need a multi-GPU workstation on the PC side. The CPU and GPU share a single pool of LPDDR5X memory over a 256-bit bus, which means a model’s weights never cross a PCIe bus and no host-to-device copy is needed. On an RTX 4090 with 24GB VRAM, a 31B model at Q4 quantization simply does not fit without CPU offloading that destroys throughput. On a 48GB M4 Pro, that same model loads entirely into unified memory and runs.

The tradeoff is bandwidth. The M4 Pro hits 273 GB/s, which is roughly half the M4 Max’s 546 GB/s and about a quarter of an RTX 4090’s 1,008 GB/s. Since decode speed in LLMs is memory-bandwidth-bound, not compute-bound, the M4 Pro’s narrower pipe is the bottleneck you feel in every response. A 14-core CPU and 20-core GPU provide enough compute for prefill (prompt processing), but token generation settles into a steady rhythm set by how fast weights can be fed through that 273 GB/s channel. The M4 Max with the same 48GB of memory is 60-70% faster on the same model for this single reason, as CloudInsight’s testing showed.
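The bandwidth argument can be turned into a back-of-the-envelope ceiling. The sketch below is a first-order model only: it assumes each generated token reads every active weight exactly once and ignores KV cache reads, activations, and runtime overhead, so real decode speeds land below it. The function name and the 18 GB weight size are illustrative assumptions, not figures from any of the cited benchmarks.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_weights_gb: float) -> float:
    """Upper bound on decode speed if each token reads all active weights once."""
    if bandwidth_gb_s <= 0 or active_weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / active_weights_gb


M4_PRO_GB_S = 273.0  # bandwidth figure quoted in the article
M4_MAX_GB_S = 546.0  # bandwidth figure quoted in the article

# Assumption: about 18 GB of weights, a rough size for a 31B dense model
# at 4 bits plus quantization-format overhead.
ASSUMED_WEIGHTS_GB = 18.0

for chip, bandwidth in [("M4 Pro", M4_PRO_GB_S), ("M4 Max", M4_MAX_GB_S)]:
    ceiling = decode_ceiling_tok_s(bandwidth, ASSUMED_WEIGHTS_GB)
    print(f"{chip}: at most {ceiling:.1f} tok/s")
```

Under these assumptions the ceiling doubles when the bandwidth doubles, which is why the M4 Max pulls ahead on the same model even with the same amount of memory.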

The 48GB M4 Pro configuration is the one that matters for large models. The 24GB variant can run 7B to 14B models comfortably but runs out of headroom for the 30B+ class models that define the current generation of open-weight releases.

How fast does Gemma 4 31B run at Q4 on M4 Pro 48GB?

Google’s Gemma 4 family, released April 2 2026 under Apache 2.0, includes a 31-billion-parameter dense flagship model that ranks among the top open-weight models on the Chatbot Arena leaderboard. Running it locally requires at least 24GB of available memory at Q4 quantization, making the 48GB M4 Pro the minimum practical configuration.

CloudInsight’s deployment guide shows the M4 Pro 48GB delivering 8-12 tokens per second on Gemma 4 31B at Q4_K_M quantization using Ollama’s llama.cpp backend. That is roughly reading pace: comfortable for interactive chat, noticeably slow for long-form generation. The same model on an M4 Max 48GB reaches 15-25 tok/s, a 60-70% improvement from double the bandwidth.

NordicSilicon’s benchmark estimates for Gemma 4 31B with Multi-Token Prediction (MTP) drafters suggest higher potential throughput on an M4 Pro 48GB: 20-28 tok/s without MTP and 50-70 tok/s with MTP enabled. Even the no-MTP estimate sits well above the 8-12 tok/s CloudInsight measured, so the two sources are not directly comparable. MTP is Google’s speculative decoding technique where a small drafter model predicts multiple tokens per forward pass. Google reported up to 2.8x speedups on Gemma 4 31B with MTP, though real-world gains depend on workload structure. Structured outputs like code and JSON see the largest improvements, while free-form generation sees smaller gains. The NordicSilicon figures are estimates based on applying Google’s published MTP speedups to Apple Silicon throughput, not direct measurements, so treat them as upper bounds rather than guarantees.
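Why the gain depends on workload can be illustrated with the standard analysis of speculative decoding. This is a generic sketch, not Google’s MTP implementation: it assumes each drafted token is accepted independently with probability `alpha`, that the drafter proposes `gamma` tokens per step, and that one drafter pass costs `cost_ratio` times a pass of the main model. All three parameter names and the sample values are assumptions chosen for illustration.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass of the main model."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus one bonus token
    # Geometric series: 1 + alpha + alpha^2 + ... + alpha^gamma
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Speedup over plain decoding, charging for the gamma drafter passes."""
    step_cost = gamma * cost_ratio + 1.0  # in units of one main-model pass
    return expected_tokens_per_step(alpha, gamma) / step_cost


# Hypothetical acceptance rates: predictable output (code, JSON) vs free-form prose.
for label, alpha in [("structured", 0.8), ("free-form", 0.5)]:
    print(f"{label}: {estimated_speedup(alpha, gamma=4, cost_ratio=0.05):.2f}x")
```

Predictable text raises the acceptance rate, and the speedup rises with it; when most drafts are rejected, the drafter’s cost eats into the gain.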

The 26B MoE variant of Gemma 4 is a more practical choice for M4 Pro users. It activates only 3.8 billion of its 26 billion parameters per token, which means inference speed is closer to that of a 4B dense model than a 26B one, even though all 26 billion parameters must still fit in memory. NordicSilicon estimates the 26B MoE at 40-55 tok/s without MTP and 90-140 tok/s with MTP on the same hardware.

What are Qwen 3.5 token speeds from 3B to 32B?

Alibaba’s Qwen 3.5 family, released in February 2026, is the most thoroughly benchmarked model line on Apple Silicon. TechPlained published measured token speeds across the full Qwen 3.5 range on M4 Pro, running MLX 0.22 and llama.cpp’s Metal backend at Q4_K_M quantization. The numbers represent decode speed at 2K prompt length with 512-token generation and 8K context window.
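Decode speed is usually reported separately from prompt processing. The helper below sketches one way to measure it with any runtime that yields tokens one at a time. It is not the harness TechPlained used; the function name, the `skip_first` parameter, and the injectable `clock` are assumptions made for this example. Timing starts when the last skipped token arrives, so the prefill time folded into the first token is excluded.

```python
import time
from typing import Callable, Iterable


def measure_decode_tok_s(
    tokens: Iterable[object],
    skip_first: int = 1,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return decode tokens per second, excluding the first `skip_first` tokens."""
    if skip_first < 1:
        raise ValueError("skip_first must be at least 1")
    start = None
    end = None
    counted = 0
    for index, _ in enumerate(tokens, start=1):
        now = clock()
        if index == skip_first:
            start = now  # prefill (and any warmup tokens) finished here
        elif index > skip_first:
            counted += 1
            end = now
    if start is None or end is None or end <= start:
        raise ValueError("not enough tokens to measure decode speed")
    return counted / (end - start)
```

To use it, pass the token iterator from whatever streaming API the runtime exposes, for example `measure_decode_tok_s(stream)`, and repeat the run a few times, since thermal state and background load shift the result.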

The 3B model (1.9 GB at Q4) hits 88 tok/s on an M4 Pro 24GB using MLX, making it effectively instant for interactive use. The 7B model (4.3 GB) reaches 58 tok/s on the 48GB M4 Pro with MLX. The 9B model (5.5 GB) runs at 48 tok/s on the same config. These speeds are faster than human reading rate and make the models feel responsive for chat and code completion.

The 14B model (8.7 GB) slows to 32 tok/s on M4 Pro 48GB with MLX, and 26 tok/s with llama.cpp. This is the point where the bandwidth ceiling becomes visible: the M4 Max runs the same model at 48 tok/s, a 50% improvement. The 27B dense model (16.8 GB at Q4) does not cleanly fit on the 24GB M4 Pro and needs the 48GB config. The nearest speed reference for that size class is Contra Collective’s benchmark of Qwen 2.5 32B, a previous-generation model, which shows 32-38 tok/s at 8K context and 24-30 tok/s at 32K. Those figures are higher than the measured 14B result above, so treat them as a loose reference rather than a like-for-like comparison.

The Qwen 3.5 35B-A3B MoE variant, which activates only 3 billion parameters per token despite having 35 billion total, is the standout for M4 Pro users. TechPlained did not include M4 Pro figures for this model in the published table, but the MoE architecture means it behaves like a small model on speed while delivering large-model quality. On an M4 Max 64GB it reaches 68 tok/s with MLX, and the pattern across the family suggests the M4 Pro 48GB would land in the 35-50 tok/s range.

MLX consistently outperforms llama.cpp across all model sizes on Apple Silicon. The gap ranges from 15% on smaller models to 25% on larger ones, as TechPlained’s side-by-side measurements show, a tradeoff we dig into in our Ollama vs MLX vs Jan comparison. Ollama 0.19 now includes an MLX preview backend that narrows this gap for users who want the convenience of Ollama’s model management.

How fast is Llama 4 Scout on unified memory?

Meta’s Llama 4 family, released in April 2025 and updated through early 2026, uses Mixture of Experts architecture at scale. The two models relevant to local inference are Llama 4 Scout (109 billion total parameters, 17B active across 16 experts) and Llama 4 Maverick (approximately 400 billion total, 17B active across 128 experts). Maverick’s total footprint makes it impractical for M4 Pro — SiliconScore estimates it needs 128GB+ systems — but Scout fits on the 48GB M4 Pro with careful quantization.

Scout’s MoE design is key to its viability. Only 17 billion parameters activate per token, so inference speed tracks the 17B active set, not the 109B total. The remaining expert weights sit in memory as cold data but do not slow generation. At 3-bit quantization, SiliconScore’s Mac rankings estimate Scout at approximately 26 tok/s on the M4 Pro 48GB using MLX, with 7.9 GB of headroom remaining on the 48GB config for context and system overhead.

Published benchmarks from the local LLM hardware community report Scout at about 22 tok/s on an M2 Pro 16GB at Q4_K_M, which is in the same range as the SiliconScore estimate for the faster M4 Pro. Read that figure with caution: 109 billion parameters at 4 bits come to roughly 55 GB of weights, far more than 16GB of memory can hold. Scout generates quickly because its active parameter count is small, but it fits on the 48GB M4 Pro only because 3-bit quantization brings total memory usage down to around 40 GB, leaving roughly 8 GB for the KV cache and system overhead.
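The fit question comes down to simple arithmetic on the total parameter count, not the active one. The sketch below uses nominal bit widths and decimal gigabytes; real quantization formats carry extra metadata, so actual files differ somewhat from these numbers. The function names are illustrative assumptions.

```python
def weights_gb(total_params_billions: float, bits_per_weight: float) -> float:
    """Nominal weight size in GB: parameters times bits, divided by 8 bits per byte."""
    if total_params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("parameter count and bit width must be positive")
    return total_params_billions * bits_per_weight / 8.0


def headroom_gb(memory_gb: float, total_params_billions: float, bits_per_weight: float) -> float:
    """Memory left for KV cache and the OS; negative means the model does not fit."""
    return memory_gb - weights_gb(total_params_billions, bits_per_weight)


SCOUT_TOTAL_B = 109.0  # total parameters, from the article

for bits in (3, 4):
    size = weights_gb(SCOUT_TOTAL_B, bits)
    spare = headroom_gb(48.0, SCOUT_TOTAL_B, bits)
    print(f"{bits}-bit: {size:.1f} GB of weights, {spare:+.1f} GB headroom on 48GB")
```

At a nominal 3 bits the weights come to about 41 GB, close to the roughly 40 GB cited above, while a nominal 4 bits already exceeds 48GB.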

M4 Pro vs M4 Max token speed comparison
M4 Pro vs M4 Max on the same models with MLX backend. The M4 Max’s 546 GB/s bandwidth delivers 60-100% higher throughput.

Llama 4 Maverick is a different story. Its 128 experts and approximately 400 billion total parameters require 100 GB or more even at aggressive quantization. No M4 Pro configuration can load it. Users who want Maverick need an M4 Max with 128GB or an M3 Ultra with 256GB or more, where it runs at approximately 18-26 tok/s depending on quantization and runtime.

How do you pick a model by M4 Pro memory tier?

The M4 Pro comes in two memory configurations, and the choice between them determines which models you can run effectively. If you are still deciding which model to commit to, our guide to which local LLM to run in 2026 and our walkthrough on how to pick a local model for your RAM cover the wider tradeoffs.

24GB M4 Pro: Your practical ceiling is 14B dense models at Q4 quantization. Qwen 3.5 7B, measured at 58 tok/s, is the best all-rounder for chat, coding, and agentic tasks. Qwen 3.5 9B fits at 48 tok/s but leaves less headroom for context. Both figures were measured on the 48GB config; memory bandwidth is the same on the 24GB model, so decode speed should be similar. Gemma 4 31B and Llama 4 Scout do not fit. The 24GB config is good for lightweight local LLM work but hits its ceiling quickly as model sizes increase.

48GB M4 Pro: This is the configuration that unlocks the current generation of frontier open-weight models. Gemma 4 26B MoE runs at an estimated 40-55 tok/s, making it the strongest reasoning model available on this hardware class. Qwen 3.5 14B runs at 32 tok/s with MLX. For the 27B-32B dense class, the nearest reference is Contra Collective’s Qwen 2.5 32B figure of 32-38 tok/s at 8K context. Llama 4 Scout fits at an estimated 26 tok/s, though only at 3-bit quantization, which carries a larger quality tradeoff. Gemma 4 31B dense is usable at 8-12 tok/s but feels slow, so the 26B MoE variant is a better fit for interactive work.

For users who primarily run 7B-14B models, the 24GB M4 Pro is sufficient and the extra memory rarely helps. For users who want to run 30B+ models, the 48GB M4 Pro is mandatory, and the M4 Max is worth the premium. The M4 Max’s 546 GB/s bandwidth turns Gemma 4 31B from barely usable to genuinely fluid, and it adds headroom for the 70B-class models that the M4 Pro cannot load at all.
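The tier guidance can be written down as a small lookup. The sketch below encodes only figures quoted in this article, with estimates flagged as such; the `Entry` type, `CATALOG` list, and `models_for` function are hypothetical names introduced for illustration, and the speeds are the article’s cited numbers, not new measurements.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    name: str
    min_memory_gb: int  # smallest M4 Pro tier the article lists it under
    tok_s_low: float
    tok_s_high: float
    estimated: bool     # True when the article cites an estimate, not a measurement


CATALOG = [
    Entry("Qwen 3.5 3B", 24, 88, 88, False),
    Entry("Qwen 3.5 7B", 24, 58, 58, False),
    Entry("Qwen 3.5 9B", 24, 48, 48, False),
    Entry("Qwen 3.5 14B", 24, 32, 32, False),
    Entry("Gemma 4 26B MoE", 48, 40, 55, True),
    Entry("Llama 4 Scout (3-bit)", 48, 26, 26, True),
    Entry("Gemma 4 31B", 48, 8, 12, False),
]


def models_for(memory_gb: int, min_tok_s: float = 0.0) -> List[Entry]:
    """Models that fit the given tier and meet a minimum speed, fastest first."""
    fits = [
        e for e in CATALOG
        if e.min_memory_gb <= memory_gb and e.tok_s_low >= min_tok_s
    ]
    return sorted(fits, key=lambda e: e.tok_s_low, reverse=True)


for entry in models_for(48, min_tok_s=30):
    tag = " (estimate)" if entry.estimated else ""
    print(f"{entry.name}: {entry.tok_s_low:g}-{entry.tok_s_high:g} tok/s{tag}")
```

Filtering on the low end of each range is a deliberately conservative choice: a model only qualifies if even its slowest cited figure clears the threshold.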

Disclosure: StrideNote Studio is a research and media production studio that evaluates local AI tools and hardware as part of its workflow. Benchmark figures are drawn from published third-party sources linked inline. Individual results vary by workload, thermal conditions, and software versions.
