How to run Gemma 4 12B locally on a Mac with Ollama

How to run Gemma 4 12B locally on a Mac with Ollama

The June 3 announcement landed in the middle of a workday, and by the afternoon developers were already pulling the model in large numbers. Google DeepMind released Gemma 4 12B as an open multimodal model that runs on consumer laptops with 16 GB of memory, and the community response was immediate. It was the first mid-sized Gemma to drop separate vision and audio encoders entirely, processing images and audio directly through the language backbone instead of routing them through dedicated encoder towers. That architectural choice made it fast enough to run on hardware people already own.

This guide covers everything needed to get Gemma 4 12B running locally on a Mac with Ollama: the hardware requirements, the exact commands, multimodal setup, and the configuration changes that make the difference between a slow demo and a usable daily driver.

What makes Gemma 4 12B different from other models?

Traditional multimodal models bolt on a separate vision encoder (roughly 550 million parameters for a mid-sized model) and an audio encoder, then pass those representations to the language model. Gemma 4 12B replaces that approach with a lightweight 35-million-parameter vision embedder that projects image patches straight into the shared decoder using a single matrix multiplication, as Google DeepMind product managers Olivier Lacombe and Gus Martins explained in the release post. Audio follows the same path: the raw 16 kHz signal is cut into 40-millisecond frames and linearly projected into the same dimensional space as text tokens, with no separate encoder at all.

The result is a model that processes text, images, audio, and video through a single decoder-only transformer while using less than half the memory of the 26B MoE variant. On the MMMU Pro benchmark, Gemma 4 12B scores 76.9 percent, nearly matching the 26B’s 73.8 percent, according to Google’s published benchmarks on the Ollama library page. The model also supports a 256,000-token context window and understands 140 languages.

The Apache 2.0 license is the other piece that matters. Commercial use, modification, and redistribution carry no MAU caps or acceptable-use restrictions, which makes Gemma 4 12B a practical choice for internal tools, product integrations, and any deployment where licensing terms shape the decision.

What do you need to run it on a Mac?

Apple Silicon Macs have an advantage for local AI workloads that most Windows and Linux laptops cannot match. The unified memory architecture lets the GPU and CPU share the same pool of RAM, so a Mac with 16 GB of memory has the full 16 GB available to load the model. On a discrete GPU laptop, the model is limited by the GPU’s dedicated VRAM, which is often lower than the total system memory.

The default Ollama quantization for Gemma 4 12B is Q4_K_M, which brings the download to about 7.6 GB on disk and fits comfortably within 16 GB of unified memory with room for a working context. In benchmarks collected across the Apple Silicon lineup, a MacBook Pro with an M4 chip and 16 GB of unified memory delivers roughly 25 tokens per second at Q4_K_M. An M4 Pro with 24 GB pushes that to about 35 tokens per second, and an M4 Max with 48 GB can reach approximately 40 tokens per second at higher quantization levels, as documented by community benchmarks.

Hardware requirements by quantization, based on Ollama model manifests and Google’s published model card:

Quantization Download size Memory needed Best Mac config
Q4_K_M (default) 7.6 GB 16 GB M4 / M4 Pro 16-24 GB
Q5_K_M ~10 GB 20-24 GB M4 Pro / M4 Max 24 GB+
Q8_0 ~13.5 GB 24-32 GB M4 Max 32 GB+
BF16 (full) 26.7 GB 48-64 GB M4 Max 48 GB / M4 Ultra

Models below 16 GB of unified memory can still run the E2B or E4B variants, but the 12B model needs that floor for practical use. The KV cache scales with context length, so users targeting the full 256K window should plan for at least 24 GB of unified memory.

How do you install Ollama and pull Gemma 4 12B?

Ollama is the fastest path from a clean machine to a running model. The runtime handles GPU detection, quantization, and the chat template automatically. Installation on macOS is a single terminal command.

Open a terminal and run:

curl -fsSL https://ollama.com/install.sh | sh

The installer places Ollama in /usr/local/bin and sets up a launch agent so the service starts automatically in the background. After installation, verify the version:

ollama --version

Versions 0.23 or newer are recommended for proper Gemma 4 support, including the thinking mode and tool calling features, per the Ollama GitHub release notes.

Pull the 12B model:

ollama pull gemma4:12b

The download is about 7.6 GB and takes a few minutes on a typical broadband connection. Ollama streams the model in chunks and shows progress. Once complete, start an interactive session:

ollama run gemma4:12b

The first load processes and caches the model, which in our testing on a 16 GB M4 MacBook Pro took roughly 45 seconds. Subsequent loads are faster because the OS keeps the model in its unified memory cache.

For a single prompt without entering interactive mode:

ollama run gemma4:12b "Write a Python function that merges two sorted lists"

The model responds in the terminal and exits. This is useful for scripting and automation.

To see all available Gemma 4 12B variants:

ollama pull gemma4:12b-q4_0
ollama pull gemma4:12b-q5_k_m
ollama pull gemma4:12b-q8_0

Each quantization tag trades memory for output quality. The default Q4_K_M is the recommended starting point for 16 GB Macs. Q5_K_M improves fidelity slightly and works on machines with 24 GB or more.

How do you run Gemma 4 12B with images and audio?

Gemma 4 12B processes images natively without any additional setup. The Ollama CLI accepts image paths directly:

ollama run gemma4:12b "What does this diagram show?" --image ./architecture.png

Multiple images work in the same prompt:

ollama run gemma4:12b "Compare these two UI mockups" --image ./v1.png --image ./v2.png

The model supports variable image resolution through a configurable visual token budget. Lower budgets (70 or 140 tokens) are faster and work for classification or captioning. Higher budgets (560 or 1120 tokens) preserve fine detail for OCR, document parsing, and reading small text. From the Python client:

import ollama

response = ollama.chat(
    model='gemma4:12b',
    messages=[{
        'role': 'user',
        'content': 'Extract the text from this invoice',
        'images': ['invoice.png']
    }],
    options={'image_token_budget': 560}
)
print(response['message']['content'])

Audio support follows the same pattern. The model accepts raw audio files (16 kHz WAV or MP3) and handles transcription, summarization, and question answering in a single pass. This is where the encoder-free architecture pulls ahead of older multimodal models. A task like “transcribe this meeting recording and list the action items” runs as one prompt instead of two separate pipelines.

The Python Ollama library also exposes an OpenAI-compatible API, which means any application that supports an OpenAI endpoint can point at a local Gemma 4 12B instance instead. The server runs by default when Ollama is active on http://localhost:11434. The endpoint structure mirrors OpenAI’s chat completions format, making it a drop-in replacement for development and prototyping.

How do you optimize context, thinking mode and performance?

Ollama defaults to a 4,096-token context for Gemma 4 12B, per the tool’s default configuration for new models, but the model supports up to 256,000 tokens. The default wastes the model’s most distinctive capability. Override it for any task that involves long documents, code repositories, or conversation history.

Set the context length per session:

ollama run gemma4:12b --num-ctx 32768

For the Python API:

response = ollama.chat(
    model='gemma4:12b',
    messages=[{'role': 'user', 'content': 'Summarize this 50-page document'}],
    options={'num_ctx': 65536}
)

The recommended context lengths for different workloads are: 4,096 for general chat and quick Q&A, 8,192 to 16,384 for code analysis and document review, 32,768 to 65,536 for book-length analysis, and 128,000 to 256,000 only when processing very long documents. Each step up in context uses more KV cache memory and slightly slows generation. On a 16 GB Mac, staying under 32,768 tokens keeps performance stable.

Gemma 4 12B includes a configurable thinking mode that outputs its internal reasoning before the final answer. Enable it by including the <|think|> token at the start of the system prompt. Without it, the model still generates the thought tags but leaves them empty, which wastes tokens. In Ollama, set up a Modelfile to bake this in:

FROM gemma4:12b
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
SYSTEM "<|think|> You are a helpful assistant."

Save this as Modelfile and run:

ollama create my-gemma4-12b -f Modelfile
ollama run my-gemma4-12b

The sampling defaults from Google’s model card are temperature 1.0, top_p 0.95, and top_k 64. These work well as a starting point. Lowering temperature to 0.7 produces more deterministic outputs for code and analysis tasks.

For users running the model through multiple rounds of conversation, the thinking content from previous turns should be stripped before sending the history back. Only the final responses should feed into the next turn, not the internal reasoning blocks.

The MTP (Multi-Token Prediction) drafter variant is also available in Ollama for users who need lower latency. It uses a small speculative decoding model to predict multiple tokens at once, which the main model then verifies. In testing on comparable hardware, the throughput improvement reached roughly 2x to 3x, as documented in community benchmarks, though the drafters add about 2 GB to the memory footprint.

Local AI models at this capability level did not exist on consumer hardware a year ago. Gemma 4 12B changes the calculation for anyone who needs multimodal reasoning without a per-token bill or a cloud dependency. The model fits a 16 GB Mac, processes images and audio natively, and delivers performance that approaches models twice its size. The tools to run it locally are mature, the commands are straightforward, and the license allows commercial use without restrictions. That combination is rare, and it is likely to define the shape of local AI through the rest of 2026.

Share this
X Facebook LinkedIn Email