Notes · Opinion · Jun 07, 2026

Why We Run Nemotron Locally Instead of Calling an API

We picked a model and ran it on our own machine instead of behind an API. The reasons are not exotic: nothing leaves the building, the cost is fixed, and no rate limit decides what we can do today.

Written by Hillary

Read time: 4 min

Updated Jun 12, 2026

Filed under Notes · Opinion

We had a choice, and most people make it without noticing it is a choice. You can reach a capable model by calling an API, or you can run one on a machine you already own. We run ours on the machine.

The model we settled on is Nemotron. [CONFIRM: which Nemotron size and quantization we run, and on what hardware]. The brand of the model is not the point of this note. The point is the four reasons we put it on our own disk instead of behind someone else’s endpoint.

Nothing leaves the building

The first reason is the one that is hard to feel until you have it. When the model runs locally, the work you give it does not travel. The client transcript, the half-finished draft, the file full of things you would never paste into a web form. None of it goes anywhere, because there is nowhere for it to go.

With an API, every prompt is a request to a server you do not control, logged under a policy you did not write. With a local model, the prompt and the answer both live and die on your own machine. That is not a privacy promise. It is just where the computation happens.

The cost is fixed, and the limits are ours

A model behind an API charges for usage, per call or per token. The bill scales with how much you use it, which means the more useful it becomes, the more it costs to keep using. A local model charges nothing per call. We paid for the hardware once, and now the marginal cost of one more prompt is the electricity to run it.
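
The arithmetic behind that trade can be sketched in a few lines of Python. This is a minimal sketch, not our actual accounting: the function name break_even_calls and every figure passed to it are hypothetical, chosen only to show the shape of the calculation. It asks how many calls it takes before a one-time hardware purchase costs no more than paying per call.

```python
import math


def break_even_calls(hardware_cost, api_cost_per_call, local_cost_per_call):
    """Smallest number of calls at which the local total is no higher
    than the API total.

    All inputs share one currency. Returns None when a local call is not
    cheaper than an API call, because the hardware then never pays for itself.
    """
    saving_per_call = api_cost_per_call - local_cost_per_call
    if saving_per_call <= 0:
        return None
    # API total:   n * api_cost_per_call
    # Local total: hardware_cost + n * local_cost_per_call
    # Solve for the first whole n where the local total catches up.
    return math.ceil(hardware_cost / saving_per_call)


# Hypothetical figures for illustration only, not our real numbers.
calls = break_even_calls(
    hardware_cost=2000.00,      # one-time purchase
    api_cost_per_call=0.01,     # assumed average price of one API call
    local_cost_per_call=0.001,  # assumed electricity for one local call
)
print(f"Hardware pays for itself after about {calls:,} calls")
```

Below that number of calls the API is the cheaper option. Above it, every additional prompt widens the gap in favor of the local machine. The sketch leaves out upkeep time and hardware depreciation, both of which push the break-even point further out.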

Rate limits work the same way. An API decides how many requests you get and how fast. Hit the ceiling on a busy afternoon and your tool simply stops until the vendor lets you back in. [CONFIRM: throughput we actually get from Nemotron on our hardware for routine work]. The local version has no ceiling we did not set ourselves. The only limit is the machine, and the machine answers to us.

It keeps working when the connection does not

The quiet reason is the one we notice least until the day it matters. A local model does not need the internet. The Wi-Fi can drop, the vendor can have an outage, the airport can have terrible signal, and the tool keeps answering. It runs in a basement with no signal and on a plane at altitude, because it is not reaching out to anything.

A tool that works only when the network works is only as reliable as the network. We would rather not depend on it.

The honest counterpoint

This is not a claim that local always wins. It does not.

The frontier of model quality still lives behind the big APIs, and for the genuinely hard problems, a top cloud model is ahead of what we run locally. [CONFIRM: the kinds of tasks where we still reach for a cloud model over Nemotron]. Running your own model also means the upkeep is yours: the updates, the disk space, the occasional afternoon spent fixing something nobody else will fix for you. A managed API hands all of that off, and for some teams that handoff is worth the per-call price.

So we are honest about it. We call an API when the work genuinely needs the top of the curve. For everything else, which is most things, the model on our own machine is enough, and the four reasons above are why we leave it there.

Pick one task you currently send to an API. Run it once on a local model instead. Then decide. Curious about these things. You should be too.

Harness your curiosity.

— Stridenote · № 006
