A developer on a flight to Tokyo opened a terminal, typed a feature request in plain English, and watched an AI agent write the code. There was no internet connection. The agent was OpenCode, the model was running on a laptop through Ollama, and the entire interaction happened without a single byte reaching the cloud. In Q1 2026, Ollama hit 52 million monthly downloads, a 520x increase from its 100,000 monthly downloads in Q1 2023, according to Vucense. The demand for local AI coding tools has moved from niche to mainstream.
What is OpenCode and how does it work?
OpenCode is an open-source AI coding agent that runs in the terminal. It is written in Go and provides a text-based interface for interacting with codebases through natural language. Unlike autocomplete tools, OpenCode functions as an agent: it reads files, runs shell commands, searches code, and applies edits autonomously.
As of June 2026, OpenCode has 172,198 GitHub stars under the MIT license, according to Morph LLM. It supports over 75 AI providers through a unified configuration system. That makes it provider-agnostic by design. The same configuration that points at Anthropic or OpenAI in the cloud can point at a local model running on your machine.
OpenCode installs on macOS, Linux, and Windows. The recommended method on macOS is Homebrew with brew install opencode, or the universal install script at curl -fsSL https://opencode.ai/install | bash. The binary is about 157 MB with its dependencies.
Why run a coding agent offline?
The obvious reason is privacy. Every line of code you send to a cloud provider leaves your machine. For proprietary software, regulated industries, or air-gapped environments, that is a non-starter. A local model means your code never reaches a third-party server.
The second reason is cost. Cloud AI providers charge per token. A heavy coding session with frequent file reads and large context windows can cost several dollars per hour. Local models run on hardware you already own. There is no per-token charge, no rate limit, and no monthly subscription beyond the electricity bill.
The third reason is latency. A local model responds as fast as your GPU can generate tokens. There is no network round-trip, no queue, and no API throttling. On Apple Silicon with a model that fits in memory, responses start in milliseconds rather than the seconds-long wait for a cloud endpoint under load.
These advantages are not theoretical. They are the reason Ollama went from 100,000 monthly downloads in Q1 2023 to 52 million in Q1 2026, a 520x increase reported by Vucense. The local AI ecosystem has crossed into mainstream developer adoption.
What do you need before you start?
A local coding agent requires four components: a model runtime, a model file, the agent itself, and compatible hardware.
The runtime is the software that loads the model into memory and exposes it over an API. Ollama is the most popular choice. It runs as a background service on macOS, Linux, and Windows, and provides an OpenAI-compatible endpoint at http://localhost:11434/v1. Install it with brew install ollama on macOS or the installer from ollama.com.
The model is the actual neural network that processes your requests. Open-weight coding models come in a wide range of sizes. The right size depends on your hardware budget and your tolerance for latency.
The agent is OpenCode itself. Once installed and configured, it routes all prompts through the local endpoint rather than a cloud API. No code leaves the machine.
For hardware, the minimum viable setup is a machine with at least 16 GB of unified memory or VRAM. Apple Silicon Macs with 32 GB or more are the sweet spot. The Gemma 4 31B model, for example, requires about 19 GB of RAM in its Q4 quantized form and runs comfortably on a 48 GB M4 Pro, as documented in the agileguy.ca setup guide. NVIDIA GPUs with 12 GB or more VRAM also work, and the selection of CUDA-optimized models is larger.
How do you connect OpenCode to a local model?
The setup process takes about 15 minutes from a clean machine.
Step one: install Ollama and start the service.
brew install ollama
brew services start ollama
Step two: pull a coding model. The most reliable option as of mid-2026 is Qwen 2.5 Coder 14B. It is dense, not a mixture of experts, which means it responds faster and avoids the latency spikes that MoE models introduce with their gating layers.
ollama pull qwen2.5-coder:14b
Step three: increase the context length. Ollama defaults to 4,096 tokens of context, which is far too small for OpenCode. The system prompt and tool definitions alone consume most of that window. There is almost no room left for actual code. Create a Modelfile to override the parameter:
FROM qwen2.5-coder:14b
PARAMETER num_ctx 32768
PARAMETER temperature 0.2
Then build and register the custom model:
ollama create qwen2.5-coder:14b-opencode -f Modelfile
Step four: install OpenCode.
brew install opencode
Step five: configure OpenCode to use the local Ollama endpoint. Edit ~/.config/opencode/opencode.json to add the Ollama provider:
{
"model": "ollama/qwen2.5-coder:14b-opencode",
"provider": {
"ollama": {
"npm": "@ai-sdk/openai-compatible",
"name": "Ollama (local)",
"options": {
"baseURL": "http://localhost:11434/v1"
},
"models": {
"qwen2.5-coder:14b-opencode": {
"name": "Qwen 2.5 Coder (local)"
}
}
}
}
}
Restart OpenCode, run the /model command, and select the local model. Every prompt from that point runs entirely on your machine.
One common mistake: setting a different model for the small_model field. OpenCode uses a secondary model for compaction and title generation. If that model loads separately, Ollama keeps both in GPU memory. That causes contention and can trigger sporadic 500 errors from Ollama’s server. The fix is to set small_model to the same value as model, as noted in the agileguy.ca guide.
Another mistake: leaving CLAUDE.md files in your project directory. If you also use Claude Code, its instructions file tricks local models into calling skills and using response formats they do not support. Set the environment variable OPENCODE_DISABLE_CLAUDE_CODE=1 to prevent that.
Which models work best for local coding?
Not every model is good at tool calling. A coding agent needs to understand function signatures, produce valid syntax, and follow structured output formats. Many open-weight models fail at one or more of these.
The Qwen 2.5 Coder family is the most reliable local option as of June 2026. The 14B variant offers the best balance of speed and accuracy on consumer hardware. The 32B variant scores 83.2% on MMLU, according to Vucense, but requires 24 GB or more of memory.
DeepSeek Coder V3 is a strong alternative at 236B parameters using a mixture-of-experts architecture. It scores competitively on coding benchmarks but requires at least 48 GB of memory. That puts it out of reach for most laptops.
Llama 4 Scout is another MoE model at 109B parameters. It scores about 65% on SWE-bench and needs 32 GB of memory. It is usable for code generation but feels slower than the dense Qwen variants in interactive sessions.
Gemma 4 31B runs at about 19 GB in its Q4 quantized form and scores competitively on coding benchmarks. It is available through Ollama and benefits from Apple Silicon’s unified memory architecture. The trade-off is resource intensity: at 31B parameters, it leaves roughly 30 GB of free memory on a 48 GB machine, which is enough for development but leaves less headroom for running large IDEs or containers alongside.
The most important distinction is between dense models and MoE models. Dense models like Qwen 2.5 Coder use all their parameters for every token. MoE models like DeepSeek Coder and Llama 4 activate only a subset of parameters per token, which saves memory but introduces latency from the routing mechanism. For interactive coding, dense models feel faster even when their total parameter count matches an MoE equivalent.
What trade-offs should you know about?
Local coding agents are not a drop-in replacement for cloud models in every scenario. The trade-offs are real.
Speed is the main limitation. A cloud model like GPT-5 or Claude Opus generates several hundred tokens per second through optimized inference data centers. A local model on consumer hardware generates 10 to 40 tokens per second, depending on size and quantization. For small edits, the difference is barely noticeable. For large refactors that generate hundreds of lines, the wait becomes significant.
Model quality is another gap. The best open-weight models are roughly six to twelve months behind the frontier. The Qwen 2.5 Coder family is excellent for boilerplate, test generation, and straightforward refactors. For novel architecture decisions or deep debugging, a cloud model still wins.
Memory is a hard constraint. Running a 31B model leaves about 30 GB of free memory on a 48 GB machine. Running multiple models simultaneously or keeping a large IDE open alongside the model can push the system into swap, which kills performance to single-digit tokens per second.
Then there is the question of which tasks to delegate. Local models handle code generation, test writing, file refactoring, and documentation well. They struggle with long-context reasoning, multi-file architectural changes, and nuanced debugging. Knowing which task to assign to the local agent versus doing it yourself is a skill that develops with practice.
The air-gapped use case is where local agents justify themselves regardless of these trade-offs. A developer who works on classified systems, proprietary trading algorithms, or unreleased products does not have the option to send code to a cloud API. For that developer, a local model running at 15 tokens per second is not a compromise. It is the only option.
The number of developers in that position is growing. With Ollama at 52 million monthly downloads and open-weight models improving every quarter, the local coding agent is becoming a standard tool in the engineering workflow. Setting one up is the starting point. Picking which model to load is the next decision.
Disclosure: StrideNote Studio is a research and media production studio that evaluates local AI tools as part of its workflow.