Gemma 4 E4B on edge hardware: small models catch up

Disclosure: StrideNote Studio is a research and media production studio that evaluates local AI tools as part of its workflow.

A developer on a MacBook Air with 8 GB of unified memory pulled the Gemma 4 E4B model in early April, ran a few prompts, and found something that would have been impossible twelve months earlier. The model running on her laptop, a 4.5 billion parameter edge model, was outscoring Google’s previous-generation flagship on math, coding, and agentic tool use. The 27 billion parameter Gemma 3 that needed a powerful GPU was suddenly trailing a model one-sixth its size that could run on a Raspberry Pi.

That data point comes from Google DeepMind’s official benchmark table. Gemma 4 E4B scores 42.5% on AIME 2026 mathematics compared to Gemma 3 27B’s 20.8%. It scores 52.0% on LiveCodeBench v6 versus 29.1%. On the tau-2 bench agentic tool use task, it reaches 57.5% against Gemma 3’s 6.6%. A model that fits in 5 GB of RAM at Q4 quantization beats its predecessor’s 27-billion-parameter dense model on each of these benchmarks.

What do the Gemma 4 E4B benchmark numbers actually show?

The comparison between Gemma 4 E4B and Gemma 3 27B is the cleanest illustration of how far small models have come. Both are dense transformer architectures from the same research team at Google DeepMind, separated by roughly a year of development. The parameter counts are not close. E4B uses 4.5 billion effective parameters. Gemma 3 uses 27 billion. But the results tell a different story.

On MMMLU, the multilingual question-answering benchmark, E4B scores 69.4% against Gemma 3’s 67.6%. On MMMU Pro, which tests multimodal reasoning, E4B reaches 52.6% versus 49.7%. These are narrow margins, but they are the margins you would expect if the models were the same size. The gaps widen sharply in math reasoning and code generation. AIME 2026, a rigorous competition math test, shows E4B more than doubling its predecessor’s score, 42.5% versus 20.8%. LiveCodeBench v6 shows 52.0% versus 29.1%.

The biggest gap in the entire table is the tau-2 bench score for agentic tool use. Gemma 3 27B scored 6.6%. Gemma 4 E4B scored 57.5%. That is not an incremental improvement. It points to a fundamental change in how the model handles multi-step, tool-driven tasks. Google DeepMind’s own data shows the model succeeding on more than half of complex retail agent scenarios, where its predecessor failed on more than nine out of ten.
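To make the comparison easier to scan, the scores quoted above can be put side by side. The snippet below is a minimal sketch that uses only the figures cited in this article; the `SCORES` table and `compare` helper are illustrative names, not part of any Google tooling.

```python
# Scores in percent, as quoted in this article: (Gemma 4 E4B, Gemma 3 27B).
SCORES = {
    "MMMLU": (69.4, 67.6),
    "MMMU Pro": (52.6, 49.7),
    "AIME 2026": (42.5, 20.8),
    "LiveCodeBench v6": (52.0, 29.1),
    "tau-2 bench": (57.5, 6.6),
}


def compare(scores):
    """Return (benchmark, point gap, ratio) rows, largest point gap first."""
    rows = []
    for name, (e4b, gemma3) in scores.items():
        rows.append((name, round(e4b - gemma3, 1), round(e4b / gemma3, 2)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, gap, ratio in compare(SCORES):
    print(f"{name:18s} +{gap:4.1f} pts  {ratio:.2f}x")
```

Sorting by point gap puts tau-2 first, followed by LiveCodeBench v6 and AIME 2026, with the two knowledge benchmarks at the bottom.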

How does a 4.5 billion parameter model run on a phone?

The hardware data from Google’s LiteRT-LM team confirms that E4B is not just benchmark-competitive in theory. It runs on shipping consumer devices today. The Google AI Edge performance table lists measured throughput across Android, iOS, macOS, Windows, and IoT hardware.

On an iPhone 17 Pro using the GPU backend, E4B delivers 25 tokens per second decode with a 0.9 second time to first token. On a MacBook Pro M4 Max, it hits 101 tokens per second on GPU and 27 on CPU. The model file is 3.65 GB, small enough to install alongside other applications.

The most impressive data point is the Raspberry Pi 5 with 16 GB of RAM. E4B runs at 3 tokens per second decode on CPU with 3069 MB peak memory. Three tokens per second is slow for interactive chat but usable for batch inference, local summarization, and embedded automation. A model that answers competition math questions runs on a single-board computer that costs USD 80.
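A rough way to read these throughput figures is to convert them into wait time for a reply. The sketch below assumes a simple model, time to first token plus tokens divided by decode rate, and a hypothetical 300-token reply. Time to first token is only quoted for the iPhone, so the other devices are shown with decode time alone, and real time to first token will vary with prompt length.

```python
def generation_seconds(n_tokens, decode_tps, ttft_s=0.0):
    """Estimate wall-clock time: time to first token plus n_tokens / decode rate."""
    if decode_tps <= 0:
        raise ValueError("decode_tps must be positive")
    return ttft_s + n_tokens / decode_tps


# (decode tokens per second, time to first token in seconds) as quoted in the text.
# A TTFT of 0.0 means the article gives no figure, not that it is instant.
DEVICES = {
    "iPhone 17 Pro (GPU)": (25, 0.9),
    "MacBook Pro M4 Max (GPU)": (101, 0.0),
    "MacBook Pro M4 Max (CPU)": (27, 0.0),
    "Raspberry Pi 5 (CPU)": (3, 0.0),
}

N_TOKENS = 300  # hypothetical reply length, chosen for illustration

for name, (tps, ttft) in DEVICES.items():
    print(f"{name:26s} {generation_seconds(N_TOKENS, tps, ttft):6.1f} s")
```

Under these assumptions the same reply takes seconds on the laptop GPU and well over a minute on the Raspberry Pi, which is why the Pi suits batch work better than chat.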

Google also introduced Multi-Token Prediction for Gemma 4, which the LiteRT-LM documentation describes as delivering up to 2.2x decode speedup on mobile GPUs and up to 1.5x on mobile CPUs with zero quality degradation. The feature is enabled by default for the E4B model on GPU backends. With it enabled, real-world decode speed on mobile devices can be considerably higher than the base throughput numbers suggest.
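The "up to" wording means the speedup is a ceiling, not a guarantee. As a minimal sketch, the helper below turns a base decode rate and a maximum speedup factor into a range. It assumes the 25 tokens per second iPhone figure is a base number measured without Multi-Token Prediction, which the text does not confirm; `mtp_decode_range` is a hypothetical name.

```python
def mtp_decode_range(base_tps, max_speedup):
    """Return (lower, upper) decode rates for an 'up to max_speedup' claim."""
    if base_tps <= 0:
        raise ValueError("base_tps must be positive")
    if max_speedup < 1.0:
        raise ValueError("max_speedup must be at least 1.0")
    return float(base_tps), round(base_tps * max_speedup, 1)


# Assumption: 25 tok/s is a base rate without Multi-Token Prediction.
low, high = mtp_decode_range(25, 2.2)
print(f"mobile GPU decode: between {low} and {high} tokens per second")
```

Actual gains depend on the device and the workload, so the upper bound is best treated as a best case.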

Why does agentic capability matter more than raw accuracy?

The tau-2 benchmark jump from 6.6% to 57.5% is the most consequential number in the entire Gemma 4 suite, and it is worth understanding what it measures. Tau-2 evaluates a model’s ability to complete multi-step, tool-mediated tasks in a retail environment. It tests whether a model can hold a plan across multiple tool calls, recover from errors, and return structured outputs.

Gemma 3 was essentially unable to do this. A score of 6.6% means the model succeeded on roughly one task out of fifteen. Gemma 4 E4B at 57.5% means the same class of model succeeds on more than half of those tasks, running on edge hardware with no cloud dependency. For teams building on-device agents like voice assistants that book appointments or field service tools that update inventory, this is the number that changes the architecture decision.

The implication is direct. An edge model that can handle agentic workflows removes the round-trip latency, connectivity requirement, and per-token cost of sending every agent step to a cloud API. The full agent loop runs locally. The device does not need to be online.
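The overhead that a local loop avoids scales with the number of agent steps. The sketch below is a back-of-the-envelope estimate only: every input (step count, round-trip time, tokens per step, price) is a hypothetical placeholder, not a measurement or a quoted price.

```python
def cloud_overhead(steps, rtt_s, tokens_per_step, usd_per_1k_tokens):
    """Return (network wait in seconds, API cost in USD) for one cloud agent run."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    latency = steps * rtt_s
    cost = steps * tokens_per_step * usd_per_1k_tokens / 1000
    return latency, cost


# All values below are hypothetical placeholders chosen for illustration.
latency, cost = cloud_overhead(
    steps=8, rtt_s=0.3, tokens_per_step=500, usd_per_1k_tokens=0.002
)
print(f"network wait: {latency:.1f} s, API cost: ${cost:.4f} per run")
```

Running the loop on the device sets both terms to zero, at the price of the slower local decode rates discussed earlier.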

What does the Apache 2.0 license mean for edge teams?

Previous Gemma releases shipped under a custom license that included a Prohibited Use Policy, which enterprise legal teams consistently flagged during procurement review. That friction kept Gemma out of many production stacks. Gemma 4 ships under Apache 2.0, removing those usage carve-outs entirely.

For edge teams, this matters because edge deployments often fall into gray areas in permissive-but-custom licenses. A model running on a device in a regulated industry, or in a product that may later cross usage thresholds, needs a license that does not create legal debt. Apache 2.0 grants commercial use, modification, redistribution, and derivative work rights with no monthly active user thresholds and no field-of-use restrictions. An organization can fine-tune E4B on proprietary data, ship it on 10,000 devices, and the legal surface area is the same as shipping any open-source library.

The license change also affects deployment speed. Teams that previously needed legal sign-off for Gemma can now treat it like any other Apache 2.0 dependency. That removes a gate that slowed edge AI pilots at many organizations.

Where does E4B fit in the 2026 model selection?

The edge model tiers in 2026 have settled into a clear pattern. Google’s E2B at 2.3 billion parameters is the starting point for phones, IoT, and low-power hardware. E4B at 4.5 billion is the step-up for laptops, higher-end phones, and scenarios where E2B’s quality gap becomes noticeable. Above E4B, the 26B A4B MoE model targets consumer GPUs, and the 31B dense model targets workstations.

The memory requirements table on Gemma4.org recommends E4B as the safest local starting point for laptop-class and quick-evaluation setups. At 5 GB in Q4 quantization, it fits in the memory budget of any modern laptop, most tablets, and increasingly capable IoT hardware. The 128K context window covers full document processing, long chat histories, and code file analysis.

The practical choice for most teams in mid-2026 comes down to which edge model fits their workload. If the use case involves math, coding, or tool use, the E4B’s benchmark advantage over the E2B on AIME and LiveCodeBench justifies the extra memory. If the task is straightforward classification, summarization, or simple Q&A, the E2B remains the more efficient choice.
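The rule of thumb above can be written down as a small lookup. This is a minimal sketch of the article's guidance, not an official selection tool; the workload labels and the `pick_edge_model` function are hypothetical names chosen for illustration.

```python
# Workload labels are illustrative; they mirror the categories named in the text.
REASONING_HEAVY = {"math", "coding", "tool_use"}
LIGHTWEIGHT = {"classification", "summarization", "simple_qa"}


def pick_edge_model(workload):
    """Map a workload label to the edge tier suggested in the text."""
    if workload in REASONING_HEAVY:
        return "E4B"
    if workload in LIGHTWEIGHT:
        return "E2B"
    raise ValueError(f"no guidance for workload: {workload!r}")


for task in ("coding", "summarization"):
    print(task, "->", pick_edge_model(task))
```

Workloads outside these two groups raise an error on purpose, since the text gives no guidance for them and a silent default would hide that.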

Six months ago, the answer to “which model runs on a phone and produces reliable output” was a list of compromises. The model that fit was not smart enough. The model that was smart enough did not fit. Gemma 4 E4B does not end every compromise. At 3 tokens per second, the Raspberry Pi experience is still patience-testing. But it closes the gap that kept edge AI in the pilot phase. The teams building on-device products in the second half of 2026 face a deployment question, not a capability question. The model can do the work. The remaining variable is which runtime and quantization strategy gets them there fastest.
