This is not a survey of what is possible. It is the actual stack we run in the studio every day, on our own machines, with nothing renting our attention by the month. We built it in layers, and the order matters: compute first, then a model, then the tools that sit on top.
Here is each layer, why we chose what we chose, and what we would tell you to skip.
The shape: three layers
Every useful local AI setup is the same three layers.
- Compute. The machine the model runs on.
- The model engine. The thing that downloads, loads, and serves the model.
- The tools. The apps you actually touch: a chat, a coding agent, an editor plugin, a documents reader.
Most people start at layer three, pick a flashy app, and get stuck because the foundation underneath is wrong. We built bottom-up.
Layer 1: Compute
We run Apple Silicon Macs. Not because of brand loyalty, but because unified memory is genuinely good at this. The model and the work share one fast memory pool, with no copying back and forth between separate CPU and GPU memory. For local AI on a single machine, it is hard to beat for the money.
The number that matters more than the chip is RAM. 16GB runs a useful coding model with headroom. 32GB lets you run larger models and keep other work open at the same time. If you are buying for this, spend on memory before anything else.
You do not need a GPU server, a rack, or a cloud account. The point of the stack is that the machine on your desk is enough.
Layer 2: The model engine
The foundation is Ollama. It is the first thing we install on any new machine. One command pulls a model, and it quietly runs a server on http://localhost:11434 that speaks the same API shape as the big cloud providers. That last part is the whole reason it is the foundation: every tool in layer three plugs into that one local endpoint.
For the model itself, our daily driver is Nemotron, which holds up well on agentic coding work. We keep a smaller general model around for quick drafting and a coding-specific model for editor work. Right-sizing to RAM is the rule: a 7B model on 16GB, larger models on 32GB or more.
When a job runs long enough that speed becomes the bottleneck, batch transcription, evaluation runs, anything that grinds, we switch that job to MLX, Apple’s framework, which runs noticeably faster on the same Mac. Ollama when convenience wins, MLX when speed wins. That is the only place we run two engines on purpose.
Layer 3: The tools
This is where the stack becomes daily work.
- Coding agent: OpenCode. A normal desktop app, pointed at our local Ollama model. It reads files, edits them, runs commands, and never sends our code anywhere. It is the layer-three tool we touch most.
- Editor autocomplete. The same Ollama model wired into the editor for completions, so the model we already pulled does double duty.
- Chat. A local chat UI on top of Ollama for everyday questions and drafting.
- Documents. A retrieval tool pointed at the same endpoint, so we can ask questions of our own files without uploading them.
Notice the pattern: every tool reuses the model from layer two. We did not pull a separate model per app. One engine, one endpoint, many tools.
What we rejected, and why
- A cloud subscription per tool. The math stopped working once we counted the tools. Five subscriptions is a real monthly number for work a local stack does for the price of electricity.
- A separate model store per app. Some friendly apps bundle their own runtime and keep their own models. Convenient for a first taste, wasteful once you are running several tools. We centralized on Ollama and pointed everything at it.
- Oversized models on undersized machines. A 70B model on a 16GB Mac downloads and then crawls. We size the model to the RAM, not to the leaderboard.
- Docker for everything. Some agents want a container. For a single-operator studio, the app-on-your-Dock path was simpler and we never missed the sandbox.
What we would tell you to skip
If you are building your own version of this, skip the urge to start at the top. Do not pick the app first. Install Ollama, pull one model, confirm it answers with the Wi-Fi off, and only then add tools. Each tool you add should plug into the engine you already have, not bring its own.
And skip the leaderboard anxiety. The best model in the world is not the question. The question is whether the model on your machine is good enough for the work in front of you, and for most work it is.
What this stack gives us
Nothing leaves the building. There is no monthly bill that grows as we add tools. It works on a plane and during an outage. And when we want to bring someone along, the whole thing is reproducible from open parts, which is the point: you can build the same stack this afternoon.
That is the studio’s local AI stack, on our own machines, layer by layer. Curious about these things. You should be too.
Harness your curiosity.
— Stridenote · № 001