What Ollama NVFP4 means for local model quality

Ollama users who pull a Gemma 4 12B model today get two 4-bit options: q4_K_M at roughly 7.5 GB or nvfp4 at roughly 7.8 GB. The file sizes are nearly identical. The quality is not. According to Ollama’s June 11, 2026 benchmark, NVFP4 roughly halves the perplexity gap between the q4_K_M quant and the unquantized BF16 baseline on the Gemma 4 12B model.

NVFP4 is a 4-bit floating-point format introduced with NVIDIA Blackwell GPUs. Its purpose is to shrink models by roughly 4x relative to FP16 while preserving more accuracy than other 4-bit methods. Ollama added NVFP4 support through its MLX engine in a preview release in March 2026, then shipped production-ready performance in the June 2026 update of the MLX backend. The result is a local 4-bit model that sounds closer to its full-precision source than anything available through standard GGUF quantization.

Here is what the format does, how it compares to what most Ollama users run today, and where it still falls short.

What does NVFP4 do differently from standard 4-bit?

Standard 4-bit quantization converts each weight into a small integer, then stores a shared scale factor for every block of values. The block size and the precision of that scale factor determine how much information survives the conversion. NVIDIA’s NVFP4 format stores each weight as a 4-bit float and uses a block size of 16 values, half the 32-value block used by the earlier MXFP4 format. The scale factor itself is stored as an FP8 (E4M3) value, which lets it represent fractional scales, whereas MXFP4 scales are restricted to powers of two.

The result is a two-level scaling system. Each micro-block of 16 values gets an FP8 local scale. A second FP32 scale applies per tensor. Together, these layers recover dynamic range that uniform integer quantization loses. For a model weight distribution where values cluster near zero with occasional outliers, which is how most LLM weights actually behave, this two-level approach maps the distribution more accurately than a single integer scale over a larger block.
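The two-level scheme can be sketched in a few lines of Python. This is a simplified simulation under stated assumptions, not Ollama's or NVIDIA's implementation: the function names are made up for illustration, the E4M3 rounding ignores some edge cases, and the per-tensor scale is chosen so the largest block scale lands on the E4M3 maximum, which is one common convention rather than something the benchmark post specifies.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (the sign bit is separate).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
BLOCK = 16        # NVFP4 micro-block size


def round_to_e4m3(x: float) -> float:
    """Round a non-negative float to a nearby FP8 E4M3 value (simplified)."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    exp = max(math.floor(math.log2(x)), -6)  # below 2**-6 the format is subnormal
    step = 2.0 ** (exp - 3)                  # 3 mantissa bits
    return min(round(x / step) * step, E4M3_MAX)


def quantize_block(block, tensor_scale):
    """Return (codes, stored_scale) for one block of up to 16 weights."""
    amax = max(abs(v) for v in block)
    # The local scale maps the block's largest magnitude onto the E2M1 maximum (6.0).
    stored_scale = round_to_e4m3(amax / 6.0 / tensor_scale)
    scale = stored_scale * tensor_scale
    codes = []
    for v in block:
        if scale == 0.0:
            codes.append(0.0)  # the whole block underflows to zero
            continue
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(math.copysign(nearest, v))
    return codes, stored_scale


def nvfp4_roundtrip(weights):
    """Quantize, then dequantize, a flat list of weights."""
    global_amax = max(abs(v) for v in weights) or 1.0
    # Assumed convention: one FP32 scale per tensor, sized so that the
    # largest block scale equals E4M3_MAX.
    tensor_scale = global_amax / (6.0 * E4M3_MAX)
    out = []
    for i in range(0, len(weights), BLOCK):
        codes, stored = quantize_block(weights[i:i + BLOCK], tensor_scale)
        out.extend(c * stored * tensor_scale for c in codes)
    return out
```

Calling `nvfp4_roundtrip` on a list whose first 16 values are small and whose next 16 contain one outlier shows the point of the small block: the outlier only coarsens the scale of its own block, not of its neighbors.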

NVIDIA’s Blackwell Tensor Cores support NVFP4 natively in hardware, with mixed-precision execution that accumulates in FP16. On Apple Silicon, Ollama’s MLX backend emulates the same arithmetic using Apple’s Metal framework. The weight encoding is the same in both cases, which means a model quantized for datacenter inference on Blackwell hardware can run on a MacBook from the same quantized weights, with numerical behavior that should closely match.

How does NVFP4 quality compare to q4_K_M?

The headline number from Ollama’s benchmark is straightforward. On perplexity for Gemma 4 12B, the gap between q4_K_M and unquantized BF16 is roughly twice the gap between NVFP4 and BF16. By that measure, NVFP4 cuts the quality loss of 4-bit quantization in half.

Perplexity is a coarse metric. It measures how well the model predicts the next token in a held-out test set, which correlates with general output quality but does not capture every dimension of model behavior. On coding, reasoning, and instruction-following tasks, the gap may be narrower or wider depending on the model and the prompt. Red Hat’s NVFP4 evaluation across models from 8B to over 400B parameters found that accuracy recovery was strongest at larger model scales, where the redundancy in the weights gives the format more room to work.
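For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-likelihood per token. The sketch below shows that definition and the gap comparison used in the headline claim. The helper names are illustrative, and no benchmark figures are reproduced here.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token).

    token_logprobs holds the natural-log probability the model assigned
    to each actual next token in the held-out text.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def gap_ratio(ppl_baseline, ppl_quant_a, ppl_quant_b):
    """How large quant A's perplexity gap is relative to quant B's."""
    return (ppl_quant_a - ppl_baseline) / (ppl_quant_b - ppl_baseline)


# A model that assigns probability 1/4 to every correct token
# has a perplexity of approximately 4.
uniform = perplexity([math.log(0.25)] * 8)
```

A `gap_ratio` near 2 for q4_K_M against NVFP4, with BF16 as the baseline, is what "halving the gap" means in this article.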

The practical difference for an Ollama user is that a 4-bit NVFP4 model on a 16 GB MacBook produces outputs closer to what the same model would produce on a server running BF16. For chat and agentic workflows, the difference is often noticeable in the coherence of longer responses and the quality of tool-call formatting. For simple Q&A, many users will not see a difference at all.

How much faster is NVFP4 on the MLX engine?

NVFP4 does not only improve quality. It also runs faster than q4_K_M on Ollama’s updated MLX backend. According to Ollama’s June 2026 tests, NVFP4 generates roughly 20 percent more tokens per second than the equivalent q4_K_M model. The speedup comes from two sources. MLX’s just-in-time compiler fuses multiple operations into single Metal kernels. This reduces GPU dispatch overhead. Ollama also reworked its GPU-backed sampling to run more efficiently.

Average output speed over ten runs with an 8,300-token input prompt showed NVFP4 consistently ahead. The exact throughput depends on the model size, the Mac model, and the tokenizer overhead. On an M4 Max or M5 Max, the combination of higher quality and higher speed makes NVFP4 the better choice for any model that supports it.
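Readers who want to repeat this kind of measurement on their own machine can derive output speed from the timing fields Ollama returns with each generation. The sketch below assumes responses from the `/api/generate` endpoint, which reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The function names are illustrative, and this is not the harness Ollama used for its published numbers.

```python
from statistics import mean


def tokens_per_second(response: dict) -> float:
    """Output speed for one Ollama generation response."""
    # eval_count: number of generated tokens.
    # eval_duration: time spent generating them, in nanoseconds.
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds


def average_speed(responses) -> float:
    """Mean output speed across several runs of the same prompt."""
    return mean(tokens_per_second(r) for r in responses)
```

Running the same long prompt ten times against each quant and comparing `average_speed` for the two sets of responses gives a like-for-like comparison, provided nothing else is competing for the GPU.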

Ollama’s June update also introduced a snapshot system for agent workloads. The system saves model state at branching points in multi-turn conversations. When an agent hands off to a subagent or a user retries a response, the engine resumes from the saved state instead of reprocessing the entire context. This feature works with any model, not just NVFP4, but it compounds the benefit of running a faster quant on the MLX backend.

Which models support NVFP4 today?

The NVFP4 model catalog within Ollama is still small but growing. As of June 2026, the most notable options are the Gemma 4 family at 12B and 26B parameter sizes, and multiple Qwen 3.5 variants including the 9B, 27B, and 35B-A3B models. Ollama’s library also carries coding-specific variants such as Qwen 3.5 35B-A3B Coding NVFP4.

The Gemma 4 26B NVFP4 model has seen over 4.3 million downloads as of mid-June 2026. That number reflects interest in the format more than the model itself. Gemma 4 is a strong general-purpose model with vision and tool-use capabilities, and the NVFP4 variant makes it runnable on a 48 GB Mac Studio that would struggle to fit the BF16 version.

Users can also convert their own models using Ollama’s import pipeline. The --quantize nvfp4 option, which initially sat behind the --experimental flag, converts custom imports into NVFP4. The feature graduated out of experimental status in April 2026, making it a supported path for users who want to quantize their own fine-tuned models.

One caveat surfaced in an Ollama GitHub issue in April 2026. The Qwen 3.6 35B-A3B model’s K-projection weights collapsed to zero during NVFP4 quantization when every value in a row fell below the smallest representable codepoint. The fix involved a mixed-precision recipe that keeps certain attention projections in BF16 while quantizing only the MoE expert weights. Ollama’s packaging tool was updated to mirror Red Hat’s mixed-precision approach. Users who pulled the model after the fix got the corrected version.
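The failure mode is easy to check for before shipping a quantized tensor. The sketch below flags rows in which every value would round to zero at a given effective scale. It is a diagnostic illustration only: the function names are hypothetical, and the exact code path that failed in Ollama's packaging tool is not described in this article.

```python
SMALLEST_CODEPOINT = 0.5  # smallest nonzero E2M1 magnitude


def row_collapses(row, scale):
    """True if every value in the row rounds to zero at this scale."""
    if scale <= 0.0:
        return True
    # Anything at or below half the smallest codepoint rounds to zero
    # (an exact tie rounds to the even code, which is zero).
    threshold = SMALLEST_CODEPOINT / 2
    return all(abs(v) / scale <= threshold for v in row)


def collapsed_rows(matrix, scale):
    """Indices of rows that would be wiped out by quantization."""
    return [i for i, row in enumerate(matrix) if row_collapses(row, scale)]
```

A tensor that returns a non-empty list here is a candidate for the mixed-precision treatment described above, where it stays in BF16 instead of being quantized.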

What are the platform limits and what comes next?

NVFP4 in Ollama currently runs only on macOS via the MLX backend. The backend is built on Apple’s MLX framework and Metal Performance Shaders. Windows and Linux support is listed as “in progress” by the Ollama team. Users who try to pull an NVFP4 model on Linux get a 412 error from the registry: “this model requires macOS.”

The platform lock caused confusion in the Ollama community when NVFP4 model tags first appeared in March 2026. Users with NVIDIA RTX 6000 Blackwell cards could not run a format designed for Blackwell hardware because Ollama’s NVFP4 implementation was tied to MLX, not CUDA. The format name itself, NVFP4, implies NVIDIA compatibility, which added to the confusion. Ollama’s GitHub issue tracker has an open thread about adding platform annotations to model tags in the catalog.

For users on Apple Silicon, the situation is clear. NVFP4 offers better quality and faster inference than q4_K_M on any model that provides an NVFP4 tag. The format delivers on the promise of 4-bit quantization that does not feel like 4-bit quantization. For users on Windows or Linux, the GGUF family of quants (q4_K_M, q4_K_S, q3_K_M) remains the only option until the CUDA and Vulkan backends add NVFP4 support.

NVIDIA’s broader ecosystem has already moved. NVFP4 quantized models are available on Hugging Face from Red Hat and NVIDIA. vLLM, TensorRT-LLM, and SGLang all support NVFP4 inference. The gap is in Ollama’s non-MLX backends, not in the format itself.

The next milestone for Ollama NVFP4 support will be a CUDA backend that can run the same format on Blackwell GPUs. When that ships, the platform confusion disappears and NVFP4 becomes the default 4-bit recommendation across all hardware. For now, it is a macOS advantage that delivers genuinely better local model quality than anything else at the same memory footprint.
