Playbooks Process Jun 07, 2026

Build Your Own Local AI Stack (Compute, Model, Tools)

A local AI setup is three choices stacked in order: compute, a model engine, then the tools you actually touch. Make them bottom-up and each one plugs into the last. Here is how to decide, step by step.

Build Your Own Local AI Stack (Compute, Model, Tools)

A local AI stack sounds like a project. It is really just three decisions, made in the right order: what machine the model runs on, what serves the model, and what apps you point at it. Get the order right and each decision narrows the next one. Get it wrong, start with a flashy app, and you spend a weekend fighting a foundation that was never going to hold.

This is a build guide for your stack, not ours. At each layer you will make a choice that fits your hardware and your work. By the end you will have a private chat, a coding agent, and a documents tool, all answered by one model on your own disk, with nothing renting your attention by the month.

Before you start

You need one computer with at least 16GB of RAM. 8GB will run a small model, but the work wants headroom, so 16GB is the honest floor and 32GB is comfortable. Apple Silicon (M1 through M4) is faster than Intel for this because the chip and the work share one fast memory pool, though everything here runs on Windows and Linux too.

You do not need a GPU server, a cloud account, or a rack in a closet. The whole premise is that the machine on your desk is enough. Set aside an hour and a few gigabytes of disk for model downloads.

Step 1: Choose your compute

This is the layer you mostly already own, so the decision is small: confirm the machine can carry the work, then size everything else to it.

  • RAM is the number that matters. It sets the size of model you can run, which in turn sets how good the answers are. Spend on memory before chips or storage if you are buying.
  • 16GB comfortably runs a 7B-class model with room to keep other apps open.
  • 32GB or more opens the door to larger models and running a couple of tools at once.
  • Apple Silicon is the easy recommendation for a single machine because of that shared memory pool, but a recent Windows or Linux box with a supported GPU works fine.

You are not buying anything new for this guide. You are deciding what the machine can hold so the next two steps stay realistic.

Step 2: Install the model engine

The engine downloads, loads, and serves the model. It is the foundation because every tool in step four plugs into the one local address it exposes. Choose one engine and point everything at it.

The default pick is Ollama. It installs as a single binary, runs quietly as a background service, and serves a model on http://localhost:11434 that speaks the same API shape as the big cloud providers. That compatibility is the whole point: any tool that “speaks ChatGPT” can talk to it.

Install it:

# macOS
brew install ollama

# Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows
winget install Ollama.Ollama

Or download the installer from https://ollama.com/download for any platform. After install it runs in the background. There is nothing to keep open.

If you are on Apple Silicon and a job ever gets slow enough that speed becomes the bottleneck, you can add MLX, Apple’s framework, for that one job and keep Ollama for everything else. That is an optimization, not a starting choice. Begin with one engine.

Step 3: Pull one model, sized to your RAM

Resist the urge to pull a different model for every app. One engine, one model to start, many tools on top.

# a capable general model, comfortable on 16GB
ollama pull llama3.2

# a coding-specific model if you plan to do code work
ollama pull qwen2.5-coder:7b

Size the model to the machine, not to the leaderboard. A rough guide: a 7B model is fine on 16GB, larger models want 32GB or more. A 70B model on a 16GB machine will download and then crawl. Browse the full list of tags at https://ollama.com/library and pick what fits.

Confirm it landed:

ollama list

You should see your model. That is the foundation done.

Step 4: Add the tools, one at a time

Now the stack becomes daily work. Each tool below points at the same Ollama endpoint from step two. Add the one you need first, prove it, then add the next.

  • A chat UI. For everyday questions and drafting, install Open WebUI, a ChatGPT-style interface that runs locally and auto-detects your Ollama models. The quick path is Docker: docker run -d -p 3000:8080 -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:main, then open http://localhost:3000 and create the first account. If Ollama runs on the host while Open WebUI runs in Docker, set the connection URL to http://host.docker.internal:11434 in its settings.
  • A coding agent. For agentic code work in a normal window, install opencode, a desktop app from https://opencode.ai/download. In its provider settings choose Ollama and confirm http://localhost:11434, then pick the coding model you pulled. No terminal required once it is set up.
  • A documents tool. To ask questions of your own files without uploading them, point a retrieval tool at the same endpoint. Open WebUI’s built-in file upload covers “chat with this PDF” already; reach for a dedicated tool only when the document work gets heavier.

Notice the pattern. Every tool reuses the model from step three. You never pulled a second model per app, and nothing here phoned home.

Prove it works

The real test is offline. Turn your Wi-Fi off, then run each layer in turn:

  1. In a terminal: ollama run llama3.2 and ask it a question. It answers with no connection.
  2. In Open WebUI: send a prompt and watch it stream. Switch models in the dropdown if you pulled more than one.
  3. In opencode: type “Open this folder and tell me what is in it,” and approve the file-read step. Watch the network indicator stay at zero.

If all three answer with the Wi-Fi off, the stack is real. The work is happening on your machine, against a model on your disk.

Trade-offs and gotchas

  • The model is the ceiling. A weak local model is a weak everything. The best cloud models are still ahead on the hardest problems. For most day-to-day work a good local model is enough, and the privacy and the absent bill change the math. You can always switch one tool to a cloud provider for one hard job and switch back.
  • Memory is sticky. Once a model loads it stays in RAM until it is unloaded or bumped. Run ollama ps to see what is loaded, and pick a smaller model if the machine feels tight.
  • Models do not appear in an app. Almost always the engine is not running, or the connection URL is wrong. Confirm with ollama list, then check the tool points at http://localhost:11434 (or http://host.docker.internal:11434 from inside Docker).
  • Disk fills up. On macOS, models live in ~/.ollama/models/ and can grow to tens of GB. Run ollama rm on models you no longer use.
  • Do not start at the top. The most common mistake is picking the app first. Build bottom-up so every tool plugs into an engine that already works.

Where to go next

Once one tool runs, adding the next is fast because the foundation is shared. Wire the same coding model into your editor for autocomplete. Point a retrieval tool at your notes. Keep a smaller model around for quick drafting and a coding model for code work. Each addition is a step-four move on the same step-two engine, never a new foundation.

You now have a private chat, a coding agent, and a documents tool, all answered by one model on your own machine, reproducible from open parts this afternoon. Curious about these things. You should be too.

Harness your curiosity.

— Stridenote · № 006