We had a choice, and most people make it without noticing it is a choice. You can reach a capable model by calling an API, or you can run one on a machine you already own. We run ours on the machine, and after months of doing it across real client work, we have not wanted to switch back.
The exact models matter less than the habit, but for the record: at the studio we run gemma4:31b for writing and glm-4.7-flash for coding, both on a single Apple M4 Pro with 48 GB of unified memory, served through Ollama. We started on a single generalist model and moved to these two specialists as the local options got better. The brand on the file is not the point of this note. The point is the four reasons the file sits on our own disk instead of behind someone else’s endpoint.
In practice the split is simple. The writing model drafts, edits, and answers questions against our own notes all day. The coding model runs the build loop. Neither has ever been unavailable, rate-limited, or down for maintenance, because there is no service to be down.
The move to two specialists was deliberate. A single generalist model is convenient but average at everything; a writing model tuned for prose and a separate coding model each clear a higher bar on the work we actually do. Swapping one file for two cost nothing and raised the floor on both jobs.
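For anyone who wants to see the split in concrete terms, here is a minimal sketch in Python. It assumes Ollama is serving on its default local port (11434) and that both models have already been pulled. The `MODEL_FOR_TASK` table and the `build_request` and `ask` helpers are illustrative names of our own, not part of Ollama, and this is not a copy of our actual tooling.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if your server listens elsewhere.
OLLAMA_URL = "http://localhost:11434/api/generate"

# Hypothetical routing table: one specialist model per kind of work.
MODEL_FOR_TASK = {
    "writing": "gemma4:31b",
    "coding": "glm-4.7-flash",
}


def build_request(task: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the model assigned to `task`."""
    payload = {
        "model": MODEL_FOR_TASK[task],
        "prompt": prompt,
        "stream": False,  # ask for a single JSON object rather than streamed chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(task: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(task, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, `ask("writing", "Tighten this paragraph: ...")` returns the draft as a string. Adding a third specialist later would be one more entry in the table.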
Does running an LLM locally keep your data private?
The first reason is the one that is hard to feel until you have it. When the model runs locally, the work you give it does not travel. The client transcript, the half-finished draft, the file full of things you would never paste into a web form: none of it goes anywhere, because there is nowhere for it to go.
With an API, every prompt is a request to a server you do not control, logged under a policy you did not write. With a local model, the prompt and the answer both live and die on your own machine. That is not a privacy promise; it is simply where the computation happens, and it is the difference between a claim you assert and one you can actually prove.
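One small way to make that checkable is to confirm that the address your tool is configured to call is the machine itself. The sketch below is an illustration under assumptions: `is_local_endpoint` is a hypothetical helper of ours, it treats the name `localhost` as loopback by convention, and it inspects only the configured URL, not the traffic actually leaving the machine.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """Return True if the URL's host is a loopback address on this machine."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        # By convention this name resolves to the loopback interface.
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Any other DNS name is treated as remote.
        return False
```

A check like this can sit at the top of a script and refuse to send a prompt anywhere that is not the local machine. Note that an address on the local network, such as another computer in the office, counts as remote here.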
How much does running an LLM locally cost?
A model behind an API charges per call. The bill scales with how much you use it, which means the more useful it becomes, the more it costs to keep using. A local model charges nothing per call. We paid for the hardware once, and now the marginal cost of one more prompt is the electricity to run it.
Stretch that over a year. A per-call tool at a few dollars a day is a four-figure annual line item that grows with use (at $3 a day, roughly $1,100 a year), while the same work on owned hardware is a one-time purchase that gets cheaper per prompt the more you lean on it. That is the arithmetic we walk through in "Stop paying monthly for AI."
Rate limits cut the same way: an API decides how many requests you get and how fast, and hitting the ceiling on a busy afternoon stops your tool until the vendor lets you back in. On the M4 Pro the models answer in seconds for ordinary drafting and coding, which is all routine work needs, and the only limit is the machine, which answers to us.
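The cost arithmetic above can be sketched in a few lines of Python. The figures in the usage note are placeholders for illustration, not our actual hardware price or power bill, and both function names are our own.

```python
import math


def annual_api_cost(dollars_per_day: float, days: int = 365) -> float:
    """Yearly spend on a metered tool at a steady daily rate."""
    return dollars_per_day * days


def break_even_days(
    hardware_cost: float,
    dollars_per_day: float,
    electricity_per_day: float = 0.0,
) -> int:
    """Days of use until a one-time hardware purchase beats per-call billing."""
    daily_saving = dollars_per_day - electricity_per_day
    if daily_saving <= 0:
        raise ValueError("local running costs must be lower than the API spend")
    # Round up: the purchase is only paid off once the full day has passed.
    return math.ceil(hardware_cost / daily_saving)
```

For example, `annual_api_cost(3.0)` gives 1095.0, and `break_even_days(2400.0, 3.0, 0.25)` gives 873, so under those assumed numbers the machine pays for itself in a little under two and a half years. Heavier use raises the daily API figure and shortens the payback.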
Does a local LLM work offline?
The quiet reason is the one we notice least until the day it matters. A local model does not need the internet. The Wi-Fi can drop, the vendor can have an outage, the airport can have terrible signal, and the tool keeps answering. It runs in a basement with no signal and on a plane at altitude, because it is not reaching out to anything.
A tool that works only when the network works is only as reliable as the network. We would rather not depend on that. This is also part of why we own the stack rather than rent it: a local model cannot be rate-limited, deprecated, or taken offline by anyone but us.
When should you still call a cloud API?
This is not a claim that local always wins. It does not.
The frontier of model quality still lives behind the big APIs, and for the genuinely hard problems a top cloud model is ahead of what we run locally. We keep one cloud subscription for exactly those jobs, the dense reasoning tasks that come up a few times a month rather than a few times an hour. Running your own model also means the upkeep is yours: the updates, the disk space, the occasional afternoon spent fixing something nobody else will fix for you. A managed API hands all of that off, and for some teams that handoff is worth the per-call price.
The upkeep is real but smaller than people fear. In practice it is an occasional model update and keeping enough disk free, not a second job. The trade is a few maintenance afternoons a year against never being metered, throttled, or cut off mid-task, and that trade has paid for itself many times over.
So we are honest about it. We call an API when the work genuinely needs the top of the curve. For everything else, which is most things, the model on our own machine is enough, and the four reasons above are why we leave it there.
The direction of travel helps too. Each new release of the open models narrows the gap with the frontier, so the slice of work that still needs the cloud keeps shrinking, and the case for keeping the everyday work at home keeps getting stronger.
Pick one task you currently send to an API. Run it once on a local model instead. Then decide for yourself whether the endpoint was ever earning its keep, or whether that work was always something you could keep at home.