The fastest way to a bad first impression of local AI is to pull a model too big for your machine. It downloads fine. Then it loads, your fans spin up, the cursor blinks for a long time, and a sentence trickles out one word every few seconds. You conclude local models are slow. They are not. You just picked the wrong size.
This guide fixes that. We will read your RAM, match a model to it, and explain quantization at the one level that actually matters in practice: smaller quant means you fit more, at a small cost to quality. No invented benchmark numbers, just the sizing rules we use in the studio and the reasoning behind them.
Why RAM is the number that matters
When a model runs, the whole thing has to sit in memory at once. On a Mac with Apple Silicon, that memory is unified: the same pool feeds both the CPU and the GPU, which is exactly why these machines are good at this work. On a PC with a discrete GPU, the model wants to fit in VRAM. Either way, the question is the same: is there room to hold the model plus the rest of what your computer is doing?
If the answer is no, the system spills over, starts swapping to disk, and everything crawls. That crawl is the symptom people mistake for “local AI is slow.” It is really “this model did not fit.”
So the model size you can run is set by your RAM, not your patience.
Before you start
You need two facts about your own machine:
- How much RAM you have. On a Mac: Apple menu, About This Mac. On Windows: Task Manager, Performance, Memory. On Linux: free -h.
- What else is competing for it. A browser with forty tabs, a video call, and a design app are all holding memory. The model gets what is left, not the whole number on the box.
The Atlas baseline we work from, drawn from the Ollama note: 16GB of RAM is the recommended starting point, and 8GB is the practical minimum where only small models will run.
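If you would rather read those two facts from a script, here is a minimal sketch. It assumes a POSIX system (Linux or macOS), where Python's os.sysconf can report physical memory, and returns None elsewhere. The helper names total_ram_bytes and headroom_gb are ours, for illustration only, and the 6 GB "in use" figure in the example is a made-up stand-in for whatever your own machine reports.

```python
import os


def total_ram_bytes():
    """Return installed physical RAM in bytes, or None if it cannot be read.

    Relies on POSIX sysconf values, which exist on Linux and macOS
    but not on Windows.
    """
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None
    if page_size <= 0 or page_count <= 0:
        return None
    return page_size * page_count


def headroom_gb(total_gb, in_use_gb):
    """Memory left for a model after everything else takes its share."""
    return max(total_gb - in_use_gb, 0.0)


total = total_ram_bytes()
if total is not None:
    print(f"Installed RAM: {total / 2**30:.1f} GiB")

# Hypothetical example: a 16GB machine with 6GB already in use.
print(f"Left for the model: {headroom_gb(16.0, 6.0):.1f} GB")
```

The second number is the one to size against. The first is only the number on the box.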
The tiers
8GB: small models, single tasks
8GB is the floor. Ollama lists it as the minimum, and the word minimum is doing real work there. A small model will run. You can chat, draft, summarize, and answer everyday questions, and it will feel fine for those.
What it will not do gracefully is hold a large model or juggle a big model alongside heavy apps. Keep your expectations to one small model at a time, close what you are not using, and you will have a genuinely useful local assistant. Push past that and you will meet the swap-to-disk crawl.
A reasonable starter on 8GB is a small general model in the 1B to 3B range, the kind of thing ollama run llama3.2 gives you out of the box.
16GB: the comfortable middle, where a 7B model lives
16GB is the sweet spot, and it is the number we recommend to most people setting up for the first time. Both the Ollama and MLX notes name 16GB as the recommended baseline.
The practical rule from the Atlas: a 7B model is comfortable on 16GB. That class of model is where local AI stops feeling like a toy. A good 7B coding model handles real refactors and bug fixes. A good 7B general model writes, reasons, and follows instructions well enough for daily work. You have headroom to keep a browser and an editor open while the model runs.
This is the tier most of our day-to-day local work happens on. If you are buying a machine specifically to run local models and want one number to aim at, 16GB is the honest floor for a good experience.
32GB and up: room for the bigger models
Larger models want 32GB or more. Both notes draw the same line: the Ollama note flags that larger models want 32GB or more, and the MLX note lists 32GB-plus for larger models. This is the tier where you stop choosing small and start choosing capable.
With 32GB you can run a meatier model, or keep a 7B loaded while comfortably running everything else. With 64GB and beyond you reach into the genuinely large open models. The Ollama note puts it bluntly: pulling a 70B model onto a 16GB Mac will technically download and then crawl. That crawl is the tier mismatch again. The model is not broken; the machine cannot hold it.
If your work is occasional, 32GB is plenty. If you let models grind on long jobs, more memory buys you bigger models and more parallel headroom.
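The tiers above collapse into a short lookup. This is a sketch, not a rule engine: the function name tier_for_ram is ours, and treating each tier as running up to the next one (so a 24GB machine counts as the 16GB tier) is our assumption, not something the Ollama or MLX notes state.

```python
def tier_for_ram(ram_gb):
    """Map installed RAM in GB to the sizing guidance from the tiers above."""
    if ram_gb < 8:
        return "below the 8GB practical minimum"
    if ram_gb < 16:
        return "small models, roughly 1B to 3B, one at a time"
    if ram_gb < 32:
        return "a 7B model is comfortable"
    if ram_gb < 64:
        return "room for larger models, or a 7B alongside everything else"
    return "the genuinely large open models come into reach"


for ram in (8, 16, 32, 64):
    print(f"{ram}GB: {tier_for_ram(ram)}")
```

Remember that the input should be honest about what else is running. A 16GB machine with a heavy workload open behaves more like the tier below it.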
Quantization, at the level that matters
Here is the lever that lets a given machine punch above its tier.
Quantization shrinks a model by storing its numbers at lower precision. A model comes in different quant levels, and the practical effect is simple:
- A smaller quant makes the file smaller, so it fits in less RAM and leaves more headroom.
- A smaller quant costs a little quality. Usually a little. The heavier the squeeze, the more you may notice it on hard tasks.
That is the whole working model. You do not need to memorize precision formats to use this well. The move is: if a model is just over what your RAM can hold, reach for a smaller quant of the same model before you give up and drop to a weaker one. A more capable model at a tighter quant often beats a smaller model at full precision, and it fits.
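If you want the arithmetic behind that move, the rough rule is that the weights take about one byte for every eight bits of precision per parameter. The symbols here are ours: N is the parameter count and b is the nominal bits per weight. This counts the weights only. Real quantized files carry a little extra, and a running model needs more memory on top for its working context, so treat the result as a lower bound and not a measurement.

```latex
\[
\text{weight size (bytes)} \approx \frac{N \times b}{8}
\]
\[
N = 7 \times 10^{9}: \qquad
b = 16 \Rightarrow 14\ \text{GB}, \qquad
b = 8 \Rightarrow 7\ \text{GB}, \qquad
b = 4 \Rightarrow 3.5\ \text{GB}
\]
```

Halving the bits halves the footprint, which is why a model that is just over your limit at one quant can sit comfortably at the next one down.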
In Ollama, quant levels show up as tags on a model in the library. Pull the tag that fits your tier. If a model loads and runs smoothly, you chose well. If it crawls, step down a quant or step down a size.
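The step-down move can be sketched as a small search: try the highest precision first, and walk down until the weights fit your budget. Everything here is illustrative. The names weight_size_gb and pick_quant are ours, the bit widths 16, 8, and 4 are nominal stand-ins and not a list of tags Ollama publishes, and the estimate covers weights only (parameters times bits, divided by eight), so leave yourself margin.

```python
def weight_size_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB: parameters x bits / 8."""
    return params_billions * bits_per_weight / 8


def pick_quant(params_billions, budget_gb, bit_options=(16, 8, 4)):
    """Return the highest-precision option whose weights fit the budget.

    Returns None when even the tightest option does not fit, which is
    the signal to step down a model size instead.
    """
    for bits in sorted(bit_options, reverse=True):
        if weight_size_gb(params_billions, bits) <= budget_gb:
            return bits
    return None


# Hypothetical budget: 10 GB left over for the model.
print(pick_quant(7, 10))   # a 7B model: first option that fits
print(pick_quant(70, 10))  # a 70B model: nothing fits, so step down a size
```

A result of None for the larger model is the tier mismatch in numeric form: no quant rescues a model that is several times bigger than the memory you have left.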
Prove it fits
The test takes one command. With a model pulled, see what is actually loaded in memory:
ollama ps
This lists the models currently held in memory. If your machine feels tight, this is usually the reason: memory is sticky, so once a model loads it stays resident until it times out from inactivity, something bumps it, or you stop it.
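The same information is available from a script through the local Ollama server. This is a sketch under assumptions: that the server is listening on its default local address, and that the /api/ps response lists each loaded model with a size and a size_vram field in bytes, as we understand the API documentation. The helper names are ours. A GPU share below 1.0 suggests part of the model did not fit in GPU memory.

```python
import json
import urllib.request

# Assumed default address of a local Ollama server.
OLLAMA_PS_URL = "http://localhost:11434/api/ps"


def fetch_running_models(url=OLLAMA_PS_URL):
    """Ask a local Ollama server which models are loaded right now."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def summarize_loaded(ps_payload):
    """Return (name, total GB, share held in GPU memory) per loaded model."""
    rows = []
    for model in ps_payload.get("models", []):
        size = model.get("size", 0)
        in_vram = model.get("size_vram", 0)
        share = in_vram / size if size else 0.0
        rows.append((model.get("name", "?"), size / 1e9, share))
    return rows
```

With the server running, summarize_loaded(fetch_running_models()) gives you one row per resident model. An empty list means nothing is loaded and the memory pressure is coming from somewhere else.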
Then run a real prompt and watch the pace. Smooth, steady output means the model fit. Long pauses and a single word at a time mean it did not. That difference, felt directly, is more reliable than any number on a chart.
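If you want a number to go with the feeling, the Ollama API reports how many tokens it generated and how long that took. The sketch below assumes the default local address and a non-streaming call to /api/generate, and it reads the eval_count and eval_duration fields (the latter in nanoseconds) from the response. The function names are ours. We deliberately give no threshold for "smooth": compare models against each other on your own machine.

```python
import json
import urllib.request

# Assumed default address of a local Ollama server.
GENERATE_URL = "http://localhost:11434/api/generate"


def run_prompt(model, prompt, url=GENERATE_URL):
    """Send one non-streaming prompt and return the parsed final response."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=600) as response:
        return json.load(response)


def tokens_per_second(final_response):
    """Generation speed from the timing fields of a finished response.

    eval_count is the number of generated tokens; eval_duration is the
    time spent generating them, in nanoseconds.
    """
    duration_ns = final_response.get("eval_duration", 0)
    if duration_ns <= 0:
        return None
    return final_response.get("eval_count", 0) / (duration_ns / 1e9)
```

Run the same prompt against two candidate models with tokens_per_second(run_prompt(name, prompt)) and the one that fits will usually make itself obvious.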
To free memory, unload a model you are done with. To reclaim the disk space as well, remove it:
ollama stop <model>
ollama rm <model>
Trade-offs and gotchas
- The box number is not the model number. Your 16GB Mac is not handing all 16GB to the model. The browser, the OS, and everything else take their share first. Size for what is left.
- Memory is sticky. A loaded model stays in RAM. If you switch models often, your machine can feel slow simply because an old one never unloaded. ollama ps tells the truth; ollama rm or a restart clears it.
- Bigger is not automatically better for you. A larger model that barely fits and crawls is worse, in daily use, than a 7B that answers instantly. Speed is part of quality when you are working in real time.
- Apple Silicon changes the math slightly. Unified memory means the same RAM serves the GPU, so a Mac's GPU can draw on most of the machine's memory instead of a separate, smaller pool of VRAM. The MLX note is the reference if you want maximum speed on the same hardware.
- Disk fills up quietly. Models are large and accumulate fast. Check what you have kept and remove what you no longer use.
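For the disk gotcha, ollama list shows what you have pulled. If you want a total, here is a small sketch that adds up sizes from the payload the local server returns at /api/tags. It assumes each entry carries a name and a size in bytes, as we understand the API documentation, and the function name disk_usage_report is ours.

```python
def disk_usage_report(tags_payload):
    """Return (total GB, [(name, GB), ...] largest first) for pulled models."""
    sizes = [
        (model.get("name", "?"), model.get("size", 0) / 1e9)
        for model in tags_payload.get("models", [])
    ]
    sizes.sort(key=lambda item: item[1], reverse=True)
    total = sum(gb for _, gb in sizes)
    return total, sizes
```

Sorting largest first puts the best candidates for removal at the top of the list.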
Where to go next
Once you can size a model to your machine by eye, two follow-ons reuse this skill directly. The first is speed: on Apple Silicon, MLX runs the same class of model noticeably faster than Ollama, which matters most on long jobs. The second is picking the right model for a specific task, coding versus general chat versus documents, now that you know which sizes your RAM can actually hold.
You now have a way to read your own machine and choose a model that fits it, instead of pulling the biggest one and hoping. We are curious about these things. You should be too.
Harness your curiosity.
— Stridenote · № 011