How we build our local AI stack

This is not a survey of what is possible. It is the actual stack we run every day, with nothing renting our attention by the month. We built it in layers, and the order matters: compute first, then a model, then the tools that sit on top.

Here is each layer, why we chose what we chose, and what we would tell you to skip.

What are the three layers of a local AI stack?

Every useful local AI setup is the same three layers.

Compute. The machine the model runs on.
The model engine. The thing that downloads, loads, and serves the model.
The tools. The apps you actually touch: a chat, a coding agent, an editor plugin, a documents reader.

Most people start at layer three, pick a flashy app, and get stuck because the foundation underneath is wrong. We built bottom-up.

How did we choose each layer?

Layer 1: compute

We run Apple Silicon Macs. Not because of brand loyalty, but because unified memory is genuinely good at this. The model and the work share one fast memory pool, with no copying back and forth between separate CPU and GPU memory. For local AI on a single machine, it is hard to beat for the money.

The number that matters more than the chip is RAM. 16GB runs a useful coding model with headroom. 32GB lets you run larger models and keep other work open at the same time. If you are buying for this, spend on memory before anything else.

You do not need a GPU server, a rack, or a cloud account. The point of the stack is that the machine on your desk is enough.

Layer 2: the model engine

The foundation is Ollama. It is the first thing to install on a new setup. One command pulls a model, and it quietly runs a server on http://localhost:11434 that speaks the same API shape as the big cloud providers. That last part is the whole reason it is the foundation: every tool in layer three plugs into that one local endpoint.

For the models, our coding model is glm-4.7-flash, which holds up well on agentic coding work, and our writing model is gemma4:31b. We keep a smaller general model around for quick drafting too. Right-sizing to RAM is the rule: a 7B model on 16GB, larger models on 32GB or more.

When a job runs long enough that speed becomes the bottleneck, batch transcription, evaluation runs, anything that grinds, we switch that job to MLX, Apple’s framework, which runs noticeably faster on the same Mac. Ollama when convenience wins, MLX when speed wins. That is the only place we run two engines on purpose.

Layer 3: the tools

This is where the stack becomes daily work.

Coding agent: OpenCode. A normal desktop app, pointed at our local Ollama model. It reads files, edits them, runs commands, and never sends our code anywhere. It is the layer-three tool we touch most.
Editor autocomplete. The same Ollama model wired into the editor for completions, so the model we already pulled does double duty.
Chat. A local chat UI on top of Ollama for everyday questions and drafting.
Documents. A retrieval tool pointed at the same endpoint, so we can ask questions of our own files without uploading them.

Notice the pattern: every tool reuses the model from layer two. We did not pull a separate model per app. One engine, one endpoint, many tools.

In practice the layers disappear into the work. We open the coding agent and it just answers, because the engine is already running and the model is already pulled. We ask the chat a question and it lands in seconds. None of it announces that it is local; it simply never asks us to sign in, never meters us, and never reaches for the network. That invisibility is the sign the stack is built right.

What did we reject, and why?

A cloud subscription per tool. The math stopped working once we counted the tools. Five subscriptions is a real monthly number for work a local stack does for the price of electricity.
A separate model store per app. Some friendly apps bundle their own runtime and keep their own models. Convenient for a first taste, wasteful once you are running several tools. We centralized on Ollama and pointed everything at it.
Oversized models on undersized machines. A 70B model on a 16GB Mac downloads and then crawls. We size the model to the RAM, not to the leaderboard.
Docker for everything. Some agents want a container. For a single-operator studio, the app-on-your-Dock path was simpler and we never missed the sandbox.

What should you skip when building a local AI stack?

If you are building your own version of this, skip the urge to start at the top. Do not pick the app first. Install Ollama, pull one model, confirm it answers with the Wi-Fi off, and only then add tools. Each tool you add should plug into the engine you already have, not bring its own.

And skip the leaderboard anxiety. The best model in the world is not the question. The question is whether the model on your machine is good enough for the work in front of you, and for most work it is.

What this stack gives us

Nothing leaves the building. There is no monthly bill that grows as we add tools. It works on a plane and during an outage. And when we want to bring someone along, the whole thing is reproducible from open parts, which is the point: you can build the same stack this afternoon.

The stack also earns its keep by being boring. It does not change under us. The model we pulled keeps working, the engine keeps serving, and the only maintenance is an occasional update we choose to run. A rented stack of five subscriptions changes on five different vendors’ schedules; ours changes only when we decide it should.

That is the studio’s local AI stack, layer by layer.

When one machine is not enough, the same layers spread across several. We looked at how exo clusters Macs to run a 671B model locally.

What are the three layers of a local AI stack?

How did we choose each layer?

Layer 1: compute

Layer 2: the model engine

Layer 3: the tools

What did we reject, and why?

What should you skip when building a local AI stack?

What this stack gives us

More from studio.

Sovereign AI Stack: Ownership vs Cognitive Rental

How we built our local AI stack at Stridenote

LTX-Video on ComfyUI: Local AI Video on Apple Silicon