How to choose between Ollama, LM Studio, and MLX for local models

How to choose between Ollama, LM Studio, and MLX for local models

You download a model, open a terminal, and stare at three different commands you could run to serve it. One starts a Go service with a single word. Another opens a polished chat window with a model browser built in. The third requires a Python virtualenv and a few lines of configuration. All three can run the same 35-billion-parameter model on your Mac, but they do not produce the same speed. A benchmark published by engineer Kyle Dean Reinford in April 2026 tested all three runners on a Mac M2 Studio with 64 GB of RAM running Qwen3.6-35B-A3B and found a 50-token-per-second gap between the slowest and fastest option. The choice between Ollama, LM Studio, and Apple’s MLX framework is not a matter of taste. It determines how much performance you leave on the table.

How do Ollama, LM Studio, and MLX differ?

Ollama is a Go-based CLI tool and background service that downloads, manages, and runs local LLMs with a single command. It wraps llama.cpp for GGUF models and, since version 0.19 released in March 2026, uses Apple’s MLX as its Apple Silicon backend. The project has over 174,000 stars on GitHub and supports macOS, Windows, Linux, and Docker. Ollama has no built-in graphical interface. You interact with it through the terminal or its OpenAI-compatible REST API at localhost:11434.

LM Studio is a desktop GUI application that wraps model management and inference in an Electron-based chat interface with a model browser, split-view parallel inference, and a local server. It runs both GGUF and MLX formats by swapping between its llama.cpp and MLX engines. Version 0.4.0, released January 2026, introduced llmster, a headless daemon that runs on servers without the GUI. Version 0.4.16 from June 2026 added a mobile companion app called Locally for iPhone and iPad. The desktop app is closed source, but the lms CLI and the MLX engine are open source.

Apple’s MLX is not an end-user application. It is a NumPy-compatible array framework and machine learning library built specifically for Apple Silicon. The mlx-lm Python library sits on top and provides LLM inference and fine-tuning capabilities. MLX processes arrays in unified memory. Data does not copy between the CPU and GPU during computation. This architectural difference is the source of its speed advantage on Macs. MLX runs only on Apple Silicon. It does not work on Intel Macs, Windows, or Linux GPU servers.

Which runner is fastest on Apple Silicon?

The performance gap on Apple Silicon is the most important difference between these tools. Reinford’s benchmark ran the same Qwen3.6-35B-A3B model on an M2 Studio with 64 GB of unified memory. The results were clear. Ollama version 0.20.3, using the GGUF format through llama.cpp and Metal, delivered 33.1 tokens per second. LM Studio, running the same model in MLX 4-bit format, delivered 80.9 tokens per second. The mlx_lm.server from Apple’s MLX framework delivered 85.0 tokens per second.

That is a 2.5x difference between Ollama’s stable release and the native MLX path. Time to first token followed the same pattern. Ollama took 360 milliseconds. mlx_lm.server took 157 milliseconds. For any workflow where a user waits for a model to start generating, that difference is immediate.

Ollama’s own testing shows that its newer MLX backend closes this gap materially. On a Mac M5 Max running Qwen 3.5-35B-A3B, Ollama 0.19 achieved 1,810 tokens per second during prefill and 112 tokens per second during decode. The previous version managed 1,154 and 58 respectively. That is a 57% improvement in prefill and a 93% improvement in decode. The caveat is that this MLX-powered preview was limited to specific models and required a Mac with more than 32 GB of memory.

The practical takeaway is straightforward. If you run models on Apple Silicon and want maximum throughput, use an MLX-native runner. LM Studio and mlx_lm.server deliver roughly equal speed because they use the same engine. Ollama with llama.cpp leaves a material amount of performance unused today.

How does developer experience compare?

Ollama wins on simplicity. You install it, run ollama pull <model>, and type ollama run <model>. The built-in model registry means you do not search Hugging Face for the right file. The OpenAI-compatible API at localhost:11434 works out of the box with Claude Code, OpenCode, Cursor, and other agent tools. The ollama launch command, introduced in 2026, starts agent integrations with a single line. For a developer who wants to get a model running in under a minute without reading documentation, Ollama is the first choice.

LM Studio offers a richer interface at the cost of a heavier footprint. The Electron app uses more memory at idle than a terminal-based tool. The trade-off buys you a model browser with hardware-aware quantization recommendations, split-view chat for comparing outputs side by side, and fine-grained load options like context length and concurrent predictions. The lms CLI, added in version 0.4.0, gives you headless operation on a server with no GUI. For a team that wants a shared model endpoint on a Linux box with an NVIDIA GPU, llmster is a legitimate alternative to Ollama.

MLX direct usage requires Python comfort. You install mlx-lm via pip, download weights from the mlx-community namespace on Hugging Face, and run mlx_lm.server from the command line. It has no model registry, no GUI, and no preset configuration. What it offers is the fastest inference possible, direct access to the framework for fine-tuning with LoRA, and the ability to train custom models. If your work includes both inference and training, MLX is the only option in this group that handles both without a second tool.

How do platform and hardware needs affect the choice?

Ollama runs on macOS, Windows, and Linux. It supports both Apple Silicon and x86 hardware, and models come in GGUF format which works across all three operating systems without conversion. For a team with mixed hardware, Ollama is the default choice because every machine runs the same tool.

LM Studio runs on macOS, Windows, and Linux through its headless daemon. The GUI version is macOS and Windows only. The mobile companion, Locally, connects to a remote LM Studio instance via LM Link, which uses Tailscale for secure tunneling. For a team that wants remote access to a desktop-grade GPU from a laptop or phone, LM Studio provides the smoothest path.

MLX runs on macOS with Apple Silicon only. There is no Windows or Linux support in the main framework. A CUDA backend for Linux is in beta, but it is not the primary development target. If your stack includes any non-Apple hardware, MLX as a serving layer creates fragmentation.

When should you skip the wrapper and use MLX directly?

MLX direct usage makes sense in three scenarios. The first is when performance is the only metric that matters. If you are building a coding agent or a real-time application where every millisecond counts, the 2.5x throughput advantage over llama.cpp is material. The second is when you need to fine-tune or train a model. Ollama and LM Studio handle inference only. MLX comes with optimizers, LoRA training scripts, and the full gradient computation pipeline. The third is when you work in Python and want model inference as part of a larger data pipeline. MLX arrays behave like NumPy arrays, so they integrate with existing scientific computing workflows without a translation layer.

MLX is the wrong choice when you need cross-platform portability, rapid setup, or support for models that exist only in GGUF format. Converting from GGUF to MLX safetensors is possible but requires a conversion step, and some niche models have no mlx-community version.

How do you decide which local model runner to use?

The decision depends on your hardware, your use case, and your tolerance for configuration. If you run on Apple Silicon and need maximum inference speed, use LM Studio or mlx_lm.server. The difference between 33 and 80 tokens per second changes whether a model feels responsive or sluggish. Pick LM Studio if you want a GUI and model browser. Pick mlx_lm.server if you prefer the terminal and might want to fine-tune later.

If you use Windows or Linux, or if you share a model across a team with mixed hardware, Ollama is the practical default. It runs everywhere, supports GGUF models from every provider, and the one-command setup eliminates configuration drift. The performance gap on Apple Silicon is real, but it narrows as Ollama’s MLX backend expands to more models.

If you run on Linux with NVIDIA GPUs, Ollama and LM Studio both support CUDA, and the GGUF-versus-MLX performance question does not apply in the same way. On that hardware, the default GGUF path is already the fastest option.

The local model tooling space moves quickly. Ollama adopted MLX as its Apple Silicon backend in March 2026, closing a gap that had existed for two years. LM Studio went from GUI-only to a headless server daemon in a single release. Apple continues to ship MLX updates that improve both performance and model support. These tools are converging on the same engineering substrate. The differences between them are real but shrinking. The question is not which one is objectively best. It is which one matches your hardware, your workflow, and your willingness to reach for a second tool when your needs change.


Disclosure: StrideNote Studio is a research and media production studio that evaluates local AI tools as part of its workflow.

Share this
X Facebook LinkedIn Email