Gemma 4 26B MoE local: quality per gigabyte on unified memory

A single ollama pull gemma4:26b command downloads 16 gigabytes of weights. Two minutes later, you are talking to a model that scores 82.7% on MMLU and generates text at 61 tokens per second on a desktop that draws 49 watts under load. That combination of speed and quality changes the calculus for anyone running local AI on unified memory hardware. The 26B MoE from Google DeepMind, released April 2, 2026 under the Apache 2.0 license, is the first open model where the efficiency argument is stronger than the raw size argument. On a system with shared CPU-GPU memory, that efficiency matters more than anything a dense alternative offers at any quantization.

What does the 26B MoE architecture do for memory efficiency?

The model has 26 billion total parameters arranged as a mixture of 128 expert sub-networks. A learned router activates only 8 experts per token, plus one shared expert, for a total of roughly 3.8 billion active parameters per forward pass. The remaining 22 billion weights sit in memory as a static lookup pool. They occupy space. They do not consume compute.
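The routing step described above can be sketched in a few lines. This is a generic top-k gating illustration, not Gemma 4's confirmed router: the random logits stand in for a learned router's output, and renormalizing the softmax over only the selected experts is an assumption. The shared expert runs for every token and is therefore not part of the selection.

```python
import math
import random

NUM_EXPERTS = 128        # routed experts, per the article
TOP_K = 8                # routed experts activated per token, per the article
TOTAL_PARAMS_B = 26.0    # total parameters in billions, per the article
ACTIVE_PARAMS_B = 3.8    # active parameters per token in billions, per the article


def route(router_logits, k=TOP_K):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    # (Assumption: real routers may normalize differently.)
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Stand-in for one token's router output (hypothetical values).
rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route(logits)
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B  # share of weights used per token
```

Only the eight selected experts, plus the shared one, do any arithmetic for this token. The other 120 stay resident in memory, which is why roughly 15% of the weights account for all of the per-token compute.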

This is the distinction that matters for unified memory. On a discrete GPU system, VRAM is the scarce resource and the active-parameter count determines speed. On an Apple Silicon Mac or an NVIDIA GB10-class machine, the same memory pool serves both storage and computation. The 26B MoE needs all 26B weights in that pool at Q4 precision, which is about 16 to 18 gigabytes depending on the quantization variant. But the per-token compute is only about 4 billion parameters' worth of matrix operations. That is why the model generates text at speeds that feel like a 4B dense model while answering questions at a quality level near the 31B dense flagship.

The Gemma 4 26B MoE guide at gemma4-ai.com provides the per-quantization breakdown: Q4_K_M sits at roughly 16 gigabytes, Q5_K_M at 19, Q8_0 at 28, and FP16 at 52 gigabytes. The 256K context window adds KV cache overhead that grows with prompt length, but the MoE architecture keeps the compute side light even when the context is large. For a 4K-token session on a 32GB M1 Pro, oMLX benchmarks show peak memory of 14.9 gigabytes with 30.9 tokens per second generation speed. The model fits with room to spare.
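As a rough cross-check on those file sizes, the sketch below converts each one into an implied number of bits per weight. It assumes 1 gigabyte means 10^9 bytes and spreads the whole file evenly across all 26 billion parameters, which ignores metadata and any tensors stored at a different precision.

```python
TOTAL_PARAMS_B = 26.0  # total parameters in billions, per the article

# Approximate artifact sizes in gigabytes, as quoted in the article.
ARTIFACT_GB = {"Q4_K_M": 16, "Q5_K_M": 19, "Q8_0": 28, "FP16": 52}


def implied_bits_per_weight(size_gb, params_b=TOTAL_PARAMS_B):
    """Average bits stored per parameter, assuming 1 GB = 1e9 bytes."""
    return size_gb * 8 / params_b


bits = {name: implied_bits_per_weight(gb) for name, gb in ARTIFACT_GB.items()}

for name, value in bits.items():
    print(f"{name}: {value:.2f} bits per weight")
```

FP16 comes out at exactly 16 bits, as expected, while Q4_K_M lands a little under 5 bits per weight. That is a reminder that the "4" in the name is a nominal figure, not the average storage cost.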

What speed can you expect at each unified memory budget?

The practical question for anyone buying a machine today is what speed to expect at each memory tier. The data from published benchmarks and community tests maps cleanly to three budget ranges, and it pairs well with our guide on how to pick a local model for your RAM.

At the 24 gigabyte tier, which covers the M4 Pro 24GB and any Windows laptop with 24GB of unified or shared memory, Q4_K_M is the realistic choice. The model artifact alone takes 16 to 18 gigabytes, leaving 6 to 8 gigabytes for the OS, the KV cache, and other applications. Context length needs discipline. Keep it at 4K to 8K unless you close every other application. Generation speed on an M4 Pro at this tier is estimated at 25 to 35 tokens per second, based on Nordic Silicon’s Apple Silicon benchmarks, which rate the 26B MoE as the best quality-per-tok-s configuration on Apple hardware.
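The budget arithmetic for this tier is simple enough to sketch. The artifact sizes are the article's approximate figures. The 6 gigabyte reserve for the OS, KV cache, and other applications is a hypothetical threshold taken from the low end of the range above, so adjust it for your own workload.

```python
# Approximate artifact sizes in gigabytes, as quoted in the article.
ARTIFACT_GB = {"Q4_K_M": 16, "Q5_K_M": 19, "Q8_0": 28, "FP16": 52}


def headroom_gb(total_memory_gb, quant):
    """Memory left for the OS, KV cache, and other apps after loading weights."""
    return total_memory_gb - ARTIFACT_GB[quant]


def fits(total_memory_gb, quant, reserve_gb=6):
    """True if the weights load and at least reserve_gb stays free.

    reserve_gb is a hypothetical threshold, not a measured requirement.
    """
    return headroom_gb(total_memory_gb, quant) >= reserve_gb


for memory in (24, 48, 128):
    usable = [q for q in ARTIFACT_GB if fits(memory, q)]
    print(f"{memory} GB: {', '.join(usable)}")
```

With these numbers only Q4_K_M clears the bar on a 24 gigabyte machine, and long contexts will eat into the remaining headroom quickly.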

At the 48 gigabyte tier, the most common high-end MacBook Pro configuration, the experience changes. The model loads at Q5_K_M or even Q8_0 with full 128K context headroom. A 48GB M4 Max runs the 26B MoE at roughly 30 to 40 tokens per second in Q5, with enough memory left for the browser, the terminal, and a coding assistant running in the background. This is the tier where the MoE advantage becomes obvious: the same machine cannot run the 31B dense model at any usable speed without aggressive quantization and context limits.

At the 128 gigabyte tier, represented by the NVIDIA GB10 platform in the DGX Spark (formerly Project DIGITS), the 26B MoE runs at its full potential. Subterra Technologies benchmarked five Gemma 4 variants on a single GB10 box and found the 26B MoE at Q4_K_M achieving 61.1 tokens per second average generation and 616 tokens per second prompt processing, with only 17 gigabytes of observed memory. The 31B dense model at Q4_K_M on the same hardware managed 10.3 tokens per second while consuming 68 gigabytes. The roughly 6x speed gap is not a measurement artifact. It is the architectural difference between a model that activates about 4 billion parameters per token and one that activates all 31 billion.

How does quality per GB compare to the 31B dense model?

The headline benchmark numbers tell a clear story. Gemma 4 benchmark comparisons published on April 18, 2026 show the 26B MoE at 82.7% on MMLU versus 87.1% for the 31B dense, 73.2% versus 76.8% on HumanEval, and 88.4% versus 91.2% on GSM8K. The 31B wins every row, by margins of 2.8 to 4.4 points. That is a real quality difference. The question is what it costs to get those extra points.

The 31B dense at Q4_K_M needs roughly 20 gigabytes of VRAM according to Oflight’s hardware requirements guide. But as the Subterra benchmarks show, the runtime footprint on unified memory systems is much higher. The 31B Q4_K_M actually consumed 68 gigabytes during inference on the GB10. KV cache growth and runtime buffers added nearly 50 gigabytes of overhead that the model size alone did not predict. The 26B MoE Q4_K_M consumed 17 gigabytes with no such divergence. In practice, the 26B MoE delivers 82.7% MMLU at 17 gigabytes, or 4.86 MMLU points per gigabyte. The 31B dense delivers 87.1% MMLU at 68 gigabytes, or 1.28 MMLU points per gigabyte. Measured this way, the MoE model is nearly 4 times more memory-efficient in real-world conditions.
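The per-gigabyte figures are plain division, reproduced here so the comparison can be rerun with other benchmarks or memory measurements. The metric itself is a blunt instrument: it treats benchmark points as linear and uses the observed GB10 footprints quoted above.

```python
def points_per_gb(score_pct, observed_gb):
    """Benchmark points delivered per gigabyte of observed runtime memory."""
    return score_pct / observed_gb


# MMLU scores and observed GB10 memory, as quoted in the article.
moe = points_per_gb(82.7, 17)
dense = points_per_gb(87.1, 68)
ratio = moe / dense

print(f"26B MoE:   {moe:.2f} MMLU points per GB")
print(f"31B dense: {dense:.2f} MMLU points per GB")
print(f"Ratio:     {ratio:.1f}x")
```

Swapping in the HumanEval or GSM8K scores changes the absolute values but not the conclusion, because the fourfold memory gap dominates a score gap of a few points.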

The gap narrows on coding and reasoning. On HumanEval, the 26B scores 73.2% to the 31B’s 76.8%, a difference of 3.6 points that matters less in daily use than the speed advantage. A developer waiting 10 seconds for a code suggestion from the 31B at 7.8 tokens per second on a 24GB RTX 4090, documented by n1n.ai’s local benchmark, is having a different experience than one getting the same 78-token suggestion in under two seconds from the 26B MoE at 45 tokens per second. The quality difference between the two models is smaller than the quality difference between getting an answer quickly and getting it slowly enough to break your flow.

Which quantization should you choose for your memory budget?

The 26B MoE gives back some of the memory advantage at higher quantization levels. Q8_0 uses 28 gigabytes and Q5_K_M uses roughly 19. The practical question is whether the extra precision matters for the work you do.

Subterra’s benchmark results on this point are worth reading carefully. Across seven workload categories, including reasoning, math, coding, JSON extraction, creative writing, and long-form throughput, there was no observable quality difference between Q4_K_M and any higher quantization. All five variants they tested answered the same math problems correctly, produced the same valid JSON, and generated the same working code. The only difference was speed, and it favored Q4. That matches the broader experience in the local AI community: Q4_K_M is a reliable default for MoE architectures because the quantization noise distributes across a large parameter pool and the router network remains reliable at 4-bit precision.

For users with 32 to 48 gigabytes of unified memory, Q5_K_M is a reasonable upgrade if you have the headroom and want the extra safety margin for long-context reasoning. The Nordic Silicon guide recommends Q5 as the “comfortable” tier for 48GB Macs, with generation speed in the 25 to 35 tokens per second range. For anyone on 24GB machines, Q4_K_M is the realistic and correct choice. Dropping to Q3 or lower is not recommended by any of the sources surveyed. The quality degradation at 3-bit on MoE models is larger than on dense models because the router itself becomes less reliable.

What do the benchmarks miss about daily use?

Published benchmarks measure tokens per second and MMLU points at fixed context lengths. They do not capture three factors that determine whether a model feels good to use on a unified memory machine.

The first is prompt processing speed. On the GB10, the 26B MoE processed prompts at 616 tokens per second at Q4_K_M. That means a 4,000-token document or chat history is ingested in about 6.5 seconds before generation begins. The 31B dense processed the same prompt at 286 tokens per second, taking about 14 seconds. For retrieval-augmented generation workflows, where the user’s question plus retrieved context can easily reach 4,000 tokens, the MoE variant halves the waiting time before the first word appears. The benefit compounds across every interaction in a session.
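The waiting times above follow directly from the prompt processing rates. A minimal sketch, assuming throughput stays constant across the whole prompt, which real runtimes only approximate:

```python
def prefill_seconds(prompt_tokens, prompt_tok_per_s):
    """Time spent ingesting the prompt before the first output token."""
    return prompt_tokens / prompt_tok_per_s


PROMPT_TOKENS = 4000  # document or chat history size used in the article

# Prompt processing rates on the GB10 at Q4_K_M, as quoted in the article.
moe_wait = prefill_seconds(PROMPT_TOKENS, 616)
dense_wait = prefill_seconds(PROMPT_TOKENS, 286)

print(f"26B MoE:   {moe_wait:.1f} s before the first token")
print(f"31B dense: {dense_wait:.1f} s before the first token")
```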

The second is latency under thinking mode. The GB10 benchmark analysis found that thinking mode barely changes per-token speed. The cost is in extra tokens emitted during internal reasoning. For straightforward questions, thinking mode tripled or quadrupled completion time without improving the answer. For hard math and multi-step reasoning, it added value. The practical takeaway: leave thinking mode off by default and toggle it only for genuinely difficult problems. On the 26B MoE, even with thinking mode on, the total time is still lower than the 31B dense without it.
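Because thinking mode leaves per-token speed roughly unchanged, its cost can be modeled as extra tokens at the same rate. In the sketch below, the 300-token answer and the 900 reasoning tokens are hypothetical values chosen to reproduce a fourfold slowdown. The generation rates are the GB10 figures quoted earlier.

```python
def completion_seconds(answer_tokens, thinking_tokens, tok_per_s):
    """Generation time when reasoning tokens are emitted at the same rate."""
    return (answer_tokens + thinking_tokens) / tok_per_s


ANSWER_TOKENS = 300    # hypothetical answer length
THINKING_TOKENS = 900  # hypothetical reasoning length (quadruples the total)

MOE_TOK_PER_S = 61.1    # 26B MoE generation rate, per the article
DENSE_TOK_PER_S = 10.3  # 31B dense generation rate, per the article

moe_plain = completion_seconds(ANSWER_TOKENS, 0, MOE_TOK_PER_S)
moe_thinking = completion_seconds(ANSWER_TOKENS, THINKING_TOKENS, MOE_TOK_PER_S)
dense_plain = completion_seconds(ANSWER_TOKENS, 0, DENSE_TOK_PER_S)

print(f"26B MoE, thinking off:   {moe_plain:.1f} s")
print(f"26B MoE, thinking on:    {moe_thinking:.1f} s")
print(f"31B dense, thinking off: {dense_plain:.1f} s")
```

Under these assumptions the MoE model with thinking on still finishes before the dense model with thinking off, because a fourfold token count is smaller than the roughly sixfold gap in generation speed.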

The third factor is the compound latency of agentic workflows. If a coding agent makes five tool calls, each requiring a model invocation, a 61 tok/s model finishes in roughly the time a 10 tok/s model takes for a single call. The gap widens with every additional step. For anyone building local agents, autocorrect systems, or iterative code assistants on unified memory, the 26B MoE is not just a good fit. It is the only practical option until the 31B dense finds an efficiency breakthrough or unified memory capacities cross 192 gigabytes on consumer hardware. We reach a similar conclusion in our pick for the best local model for OpenCode.
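The five-call comparison can be checked with the GB10 generation rates quoted earlier. The sketch counts generation time only and assumes a hypothetical 200 output tokens per tool call. Prompt processing and tool execution time are left out.

```python
MOE_TOK_PER_S = 61.1    # 26B MoE generation rate, per the article
DENSE_TOK_PER_S = 10.3  # 31B dense generation rate, per the article
TOKENS_PER_CALL = 200   # hypothetical output length per model invocation


def agent_seconds(calls, tokens_per_call, tok_per_s):
    """Total generation time for a sequence of model invocations."""
    return calls * tokens_per_call / tok_per_s


moe_five = agent_seconds(5, TOKENS_PER_CALL, MOE_TOK_PER_S)
dense_one = agent_seconds(1, TOKENS_PER_CALL, DENSE_TOK_PER_S)
dense_five = agent_seconds(5, TOKENS_PER_CALL, DENSE_TOK_PER_S)

# How many MoE calls fit in the time of one dense call (independent of length).
calls_per_dense_call = MOE_TOK_PER_S / DENSE_TOK_PER_S

print(f"Five MoE calls:   {moe_five:.1f} s")
print(f"One dense call:   {dense_one:.1f} s")
print(f"Five dense calls: {dense_five:.1f} s")
```

The ratio does not depend on the assumed call length: at these rates, just under six MoE invocations fit in the time of one dense invocation.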

The next generation of Apple Silicon, with the M4 Ultra reportedly supporting up to 192 gigabytes of unified memory, will make the 31B dense more comfortable at FP16. But the 26B MoE will still be the faster choice for interactive work, and the quality gap between them will remain small. The model that wins on quality per gigabyte and quality per second is the one that makes local inference feel like a product, not a science project. Right now, that model is the 26B MoE.
