A developer opens a terminal on a 24 GB MacBook Pro, runs ollama run qwen3.6:27b, and starts refactoring a Python module with 300 lines of context. No API key. No cloud bill. No internet connection required. According to aggregated benchmarks from LLMCheck, the best local coding models in 2026 score within 4 percentage points of frontier cloud assistants on real-world software engineering tasks.
The shift happened fast. In early 2025, running a useful coding model on a Mac meant settling for 7B-parameter models that could handle single-function completions but fell apart on multi-file refactors. The Qwen 3.6 release family, starting with the 35B-A3B MoE on April 16 and the 27B dense variant on April 22, changed the math. Both are Apache 2.0 licensed and fit on consumer Mac hardware.
What changed in 2026 for local coding models?
Three developments made local coding on Mac viable this year. First, Mixture-of-Experts architectures hit their stride. Models like Qwen 3.6-35B-A3B and Gemma 4 26B-A4B activate only 3-4 billion parameters per token despite having 26-35 billion total. All of the weights still have to sit in memory, but each token reads and computes over only the active slice, which keeps generation fast on Apple Silicon’s unified memory architecture.
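To make the "3-4 billion of 26-35 billion" figure concrete, here is a minimal sketch that computes the share of weights active per token. The parameter counts are read off the model names in the paragraph above; the helper name is illustrative only.

```python
# Parameter counts in billions, taken from the model names above.
MODELS = {
    "Qwen 3.6-35B-A3B": {"total_b": 35, "active_b": 3},
    "Gemma 4 26B-A4B": {"total_b": 26, "active_b": 4},
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of the weights that take part in computing one token."""
    return active_b / total_b


for name, p in MODELS.items():
    share = active_fraction(p["total_b"], p["active_b"])
    print(f"{name}: {share:.1%} of weights active per token")
```

Roughly 9% and 15% of the weights do the work for any given token, which is why these models generate at speeds closer to a small dense model while still needing memory for the full parameter count.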
Second, MLX, Apple’s native machine learning framework, reached production maturity. Running a model through MLX delivers roughly 2x the generation speed and up to 5x faster prompt processing compared to llama.cpp on the same hardware, based on community benchmarks from Will It Run AI. A Qwen 3.6-35B-A3B that generates 30-38 tok/s via Ollama reaches roughly twice that via MLX on an M4 Max.
Third, speculative decoding techniques like Multi-Token Prediction reached Apple Silicon. Gemma 4’s MTP drafters deliver a 1.86x mean speedup on the 26B-A4B variant via the LiteRT-LM runtime. They push real-world coding throughput past 80 tok/s on a 24 GB M4 Pro.
Which local coding model scores highest on SWE-bench Verified?
SWE-bench Verified is the industry standard for measuring real-world coding ability. It tests whether a model can fix actual GitHub issues in real repositories. Here is how the locally runnable field stacks up as of June 2026:
| Model | Architecture | SWE-bench Verified | HumanEval | Min RAM | tok/s (M4 Pro, Q4) |
|---|---|---|---|---|---|
| Qwen 3.6-27B | 27B dense | 77.2% | 91.4% | 24 GB | 25-35 |
| Qwen 3.6-35B-A3B | 35B MoE (3B active) | 73.4% | 92.1% | 24 GB | 40-55 |
| Qwen3-Coder-Next | 80B MoE (3B active) | 70.6% | 94.2% | 64 GB | 18 |
| Qwen 3.5-27B | 27B dense | 72.4% | 84.8% | 24 GB | 30-40 |
| DeepSeek-Coder V2 | MoE | 62.1% | 90.8% | 32 GB | 30 |
| Gemma 4 26B-A4B | 26B MoE (4B active) | 52.1% | 72.0% | 24 GB | 55-70 |
| Qwen 3.5-9B | 9B dense | 47.2% | 84.1% | 16 GB | 100 |
| Phi-4 Mini | 3.8B dense | 38.9% | 79.6% | 8 GB | 135 |
The standout number is Qwen 3.6-27B at 77.2% SWE-bench Verified, within striking distance of Claude Opus 4.6 at 80.9% and ahead of the 397B-parameter Qwen 3.5 MoE at 75.0%, according to Alibaba’s benchmark data. A 27B model that fits in 16.8 GB at Q4 quantization is trading blows with models 10x its size.
Qwen 3.6-35B-A3B is the better choice for interactive coding because of its MoE speed advantage. Its 73.4% SWE-bench score trails the 27B dense by 4 points, but it generates output roughly 60% faster on the same hardware.
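The trade-off between score and speed can be explored with a small lookup over the table. This is a sketch only: the numbers are copied from the table above, while the `Model` class and `best_for_ram` helper are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Model:
    name: str
    swe_bench: float   # SWE-bench Verified, percent
    min_ram_gb: int    # "Min RAM" column
    tok_s_low: int     # lower bound of tok/s on M4 Pro at Q4
    tok_s_high: int    # upper bound (same as low when one value is given)


# Rows copied from the table above.
TABLE = [
    Model("Qwen 3.6-27B", 77.2, 24, 25, 35),
    Model("Qwen 3.6-35B-A3B", 73.4, 24, 40, 55),
    Model("Qwen3-Coder-Next", 70.6, 64, 18, 18),
    Model("Qwen 3.5-27B", 72.4, 24, 30, 40),
    Model("DeepSeek-Coder V2", 62.1, 32, 30, 30),
    Model("Gemma 4 26B-A4B", 52.1, 24, 55, 70),
    Model("Qwen 3.5-9B", 47.2, 16, 100, 100),
    Model("Phi-4 Mini", 38.9, 8, 135, 135),
]


def best_for_ram(ram_gb: int, min_tok_s: float = 0.0) -> Optional[Model]:
    """Highest SWE-bench model that fits in RAM and meets a speed floor."""
    fits = [m for m in TABLE
            if m.min_ram_gb <= ram_gb and m.tok_s_low >= min_tok_s]
    return max(fits, key=lambda m: m.swe_bench, default=None)


dense = best_for_ram(24)                 # no speed requirement
fast = best_for_ram(24, min_tok_s=40)    # require at least 40 tok/s
print(dense.name, fast.name)
print(f"speed ratio: {fast.tok_s_low / dense.tok_s_low:.2f}x "
      f"to {fast.tok_s_high / dense.tok_s_high:.2f}x")
```

With no speed floor the 27B dense model wins on a 24 GB machine; requiring 40 tok/s or more selects the 35B-A3B MoE instead, at about 1.6x the throughput.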
How much RAM does each model need on Apple Silicon?
Unified memory is the binding constraint on Mac. Here is what each tier of hardware can run:
8 GB Macs (MacBook Air M1/M2/M3): Phi-4 Mini at Q4 is the only practical coding model. It scores 38.9% on SWE-bench and runs at 120-140 tok/s. Good for autocomplete and simple function generation. Not suitable for debugging or refactoring.
16 GB Macs (MacBook Pro base, Mac Mini): Qwen 3.5-9B at Q4 is the best option at 47.2% SWE-bench and roughly 100 tok/s. For lighter work, Gemma 4 E4B at 80+ tok/s handles quick Q&A and single-file tasks. The 9B can handle most single-file edits and code explanations competently.
24 GB Macs (for example, the base M4 Pro): This is the sweet spot in 2026. Qwen 3.6-27B at Q4 lands at roughly 16.8 GB of memory, leaving room for KV cache and system overhead. It scores 77.2% on SWE-bench. The 35B-A3B MoE fits at Q4 with similar headroom and runs faster. Edward Chalupa’s benchmarks on an M3 Ultra found that the 27B dense model at Q4 produces “near-frontier” coding output at interactive speeds.
32-48 GB Macs (M4 Max, M3 Max): Either Qwen variant runs with generous context budgets. The 35B-A3B can stretch to 64K-128K context for repo-level analysis. Gemma 4 26B-A4B becomes a viable secondary model for agent tasks and function calling.
64 GB+ Macs (M4 Max 64 GB, M3 Ultra): Qwen3-Coder-Next becomes practical at approximately 46 GB at Q4. This model is purpose-built for agentic coding workflows like multi-file refactors, autonomous debugging, and tool use. Qwen 3.6-35B-A3B, which scores 73.4% on SWE-bench, can also run at Q6 or Q8 quantization for near-lossless quality.
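The memory figures in the tiers above follow from simple arithmetic: weight size is roughly parameter count times bits per weight. The sketch below works backwards from the sizes quoted in this article to the bits per weight they imply, and computes the headroom left on a given machine. It covers weights only; KV cache and system overhead come out of the headroom and are not modeled here. The function names are illustrative.

```python
def implied_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Bits per weight implied by a quoted model size (1 GB = 1e9 bytes)."""
    return size_gb * 8 / params_billion


def headroom_gb(total_ram_gb: float, weights_gb: float) -> float:
    """Memory left for KV cache, the OS, and other apps."""
    return total_ram_gb - weights_gb


# Sizes quoted in the article: 27B at ~16.8 GB, 80B at ~46 GB.
print(f"Qwen 3.6-27B at Q4: {implied_bits_per_weight(16.8, 27):.2f} bits/weight")
print(f"Qwen3-Coder-Next at Q4: {implied_bits_per_weight(46, 80):.2f} bits/weight")
print(f"Headroom on 24 GB: {headroom_gb(24, 16.8):.1f} GB")
print(f"Headroom on 64 GB: {headroom_gb(64, 46):.1f} GB")
```

Both quoted sizes work out to a little under 5 bits per weight, which is plausible for a "Q4" file because such quantizations typically store scaling metadata alongside the 4-bit values.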
Which inference backend delivers the fastest tokens on Mac?
The inference runtime matters as much as the model. On Apple Silicon, three options dominate.
MLX delivers the highest throughput. Apple’s framework is optimized for the Metal GPU and unified memory architecture. On a 64 GB M4 Max, Qwen 3.6-35B-A3B at MLX 4-bit generates 55-70 tok/s versus 30-38 tok/s via Ollama. The gap on prompt processing is larger: MLX finishes prompt evaluation 3-5x faster.
Ollama with the MLX preview backend bridges usability and speed. Ollama handles model management, chat templates, and integration with tools like Continue.dev and Cursor. The MLX preview in Ollama 0.19+ brings most of the speed advantage without the command-line complexity.
llama.cpp remains the compatibility king. It supports GGUF quantizations for every model, works on Linux and Windows alongside macOS, and has the most mature KV cache management. Its generation speed is roughly 50-70% of MLX’s, but its prefill and multi-turn handling are more consistent across long sessions.
The practical recommendation: start with Ollama for easy setup. If you are building a coding agent that makes dozens of calls per session, switch to MLX for the throughput gain.
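To check throughput on your own machine before deciding whether to switch backends, a minimal sketch like the following can call a locally running Ollama server over its HTTP API and compute tok/s from the timing fields in the response. It assumes Ollama is listening on its default port and that the model tag from the opening example (`qwen3.6:27b`) has already been pulled; the helper names are illustrative.

```python
import json
import urllib.request

# Default address of a local Ollama server (assumption: not reconfigured).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the Ollama HTTP API."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )


def tokens_per_second(result: dict) -> float:
    """Generation speed from Ollama's response metadata.

    eval_count is the number of generated tokens and eval_duration is
    the generation time in nanoseconds.
    """
    return result["eval_count"] / (result["eval_duration"] / 1e9)


def generate(model: str, prompt: str, timeout: float = 600) -> dict:
    """Send the prompt and return the parsed JSON response."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())


# Example usage (requires a running Ollama server):
#   result = generate("qwen3.6:27b", "Write a function that reverses a string.")
#   print(result["response"])
#   print(f"{tokens_per_second(result):.1f} tok/s")
```

Running the same prompt several times and averaging gives a steadier number than a single call, since the first request also pays the cost of loading the model into memory.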
How does Gemma 4 compare to Qwen for real coding tasks?
Gemma 4 26B-A4B and Qwen 3.6-35B-A3B are the two most compared models on Mac in 2026. They are not direct substitutes.
Qwen dominates on coding benchmarks by a wide margin. LLMCheck’s head-to-head comparison found a 21 percentage point gap on SWE-bench Verified (73.4% versus 52.1%) and a 20-point gap on HumanEval (92.1% versus 72.0%). For code generation, debugging, and refactoring, Qwen is the clear winner.
Gemma 4 wins on everything around coding. It scores higher on general chat (Arena #3 versus unranked), delivers more reliable function calling (~95% versus ~85% accuracy), handles vision and audio natively, and uses roughly 2 GB less memory at Q4 quantization. For agent workflows that need tool use or multimodal input alongside code, Gemma 4 is the better daily driver.
The ideal power-user setup is both models: Qwen 3.6-35B-A3B or 27B for actual coding work, and Gemma 4 26B-A4B for agent orchestration, function calling, and general assistance. Both swap in seconds through Ollama or LM Studio.
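The two-model setup amounts to routing each request by task type. Here is a minimal sketch of that idea. The task categories mirror the split described above; the Gemma tag is a hypothetical placeholder (check your model library for the real name), and `pick_model` is an illustrative helper, not part of any tool.

```python
# Task types, following the split described above.
CODING_TASKS = {"generate", "debug", "refactor"}
AGENT_TASKS = {"orchestration", "function_calling", "vision", "audio", "chat"}

MODEL_TAGS = {
    "coding": "qwen3.6:27b",      # tag used in the opening example
    "agent": "gemma4:26b-a4b",    # hypothetical tag, for illustration only
}


def pick_model(task: str) -> str:
    """Return the model tag to use for a given task type."""
    if task in CODING_TASKS:
        return MODEL_TAGS["coding"]
    if task in AGENT_TASKS:
        return MODEL_TAGS["agent"]
    raise ValueError(f"unknown task type: {task!r}")


for task in ("refactor", "function_calling"):
    print(f"{task} -> {pick_model(task)}")
```

Raising on unknown task types, rather than falling back silently to one model, makes it obvious when a new kind of request needs a routing decision.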
What should you run on your Mac right now?
For a 16 GB MacBook Air or Pro, pull Qwen 3.5-9B as your primary coding model and Phi-4 Mini as a fast fallback for simple completions. Neither will handle complex refactors, but both cover day-to-day coding comfortably.
For a 24 GB MacBook Pro, Qwen 3.6-27B is the single best model to run. It fits at Q4 with room for 32K+ context, scores 77.2% on SWE-bench, and generates output at 25-35 tok/s on an M4 Pro. If you want higher throughput for interactive work, the 35B-A3B MoE trades 4 percentage points of SWE-bench for roughly 60% faster generation.
For a 64 GB+ Mac Studio, Qwen3-Coder-Next at Q4 is the most capable local coding model available. It is specialized for agentic coding workflows including autonomous debugging, multi-file refactoring, and tool orchestration. Pair it with Qwen 3.6-35B-A3B at Q6 or Q8 for general reasoning tasks.
The models that make Apple Silicon competitive for local coding all arrived in a short window between February and April 2026. The Qwen 3.6-27B dense variant at 77.2% SWE-bench is the clearest signal yet that local models are no longer a compromise. They are a viable alternative to cloud APIs for daily coding work, with the added advantages of no network latency, no data leaving your machine, and no per-token cost. The two-month sprint between Qwen 3.5 in February and Qwen 3.6 in April reset expectations for what a locally runnable model can do on a laptop. The next six months will likely narrow the remaining gap to cloud APIs even further.