ComfyUI with local LLMs: a practical Mac workflow

You sit down to generate an image in ComfyUI, type “cyberpunk street market at night” into a text encoder, and get back something generic. A row of neon signs. A wet road. A figure in a trench coat. The model did what you asked. The problem is that “cyberpunk street market at night” is about 200 words short of what the model needs to produce something distinctive.

Local LLMs fix that. They sit between you and the image generator, taking a short description and expanding it into the kind of dense, specific prompt that makes Flux or SDXL produce something worth keeping. And on a Mac with Apple Silicon, the whole pipeline can run locally. According to a study published on arXiv in November 2025, MLX, llama.cpp, and Ollama all deliver viable inference speeds on Apple Silicon hardware, with MLX achieving the highest sustained generation throughput on an M2 Ultra system with 192 GB of unified memory.

How do local LLMs connect to ComfyUI?

ComfyUI has no built-in LLM capability. It is a node-based image generation engine that processes tensors through a directed graph. To add LLM reasoning, you need a custom node that makes HTTP calls to an LLM server or loads a GGUF model directly.

The cleanest separation pattern is to run the LLM as a separate server and let ComfyUI send requests to it. This is the approach recommended by the community guides at ComfyUI Nomadoor. Ollama runs as a background service on the Mac and exposes an OpenAI-compatible API on http://localhost:11434/v1. A ComfyUI custom node sends a chat completion request to that endpoint, gets back text, and passes it into a CLIP text encode node. The image generation model never sees the LLM. It only sees the expanded prompt.

The alternative is to load the LLM directly inside ComfyUI using a custom node pack that bundles llama.cpp bindings. The ComfyUI-LLM-Session node pack by kantan-kanto does exactly this. It supports Llama, Mistral, Qwen, DeepSeek, Gemma, and other popular GGUF models. It runs fully inside ComfyUI with no external daemon, manages persistent multi-turn conversations through a file-based session system, and supports model-to-model dialogue for experimental workflows. The tradeoff is that loading a 7B or 8B parameter model inside ComfyUI shares GPU memory with the image model, which matters more on a Mac with 16 GB or 32 GB than on a workstation with 48 GB.

What do you need to install for a Mac workflow?

A working Mac ComfyUI plus LLM setup involves four pieces.

Ollama is the simplest LLM server for macOS. It runs as a native Mac app, sits in the menu bar, and exposes the OpenAI-compatible API. Pull a small model like qwen3.5:4b or gemma4:e2b for lightweight prompt expansion, or qwen3-vl:8b if you want vision support. The arXiv study found that Ollama prioritises convenience over peak throughput, but for the prompt-enhancement use case where each call is a few hundred tokens, that tradeoff is sensible.

On the ComfyUI side, you need a custom node that talks to Ollama. Multiple options exist. The comfyui-ollama-image-to-prompt node by jluo-github supports both vision mode (image-to-prompt with VLMs) and text mode (keyword-to-prompt with regular LLMs). It ships with presets for different output formats: dense descriptive prose for Flux, Danbooru tags for NoobAI, JSON extract for structured workflows. The ComfyUI-LLM-text-processor node by fxd0h is built specifically for Apple Silicon and runs a Gemma 4 GGUF model via an auto-downloaded macOS llama.cpp binary. The Civitai workflow for Ideogram 4 on Mac combines this node with a patched GGUF loader to run the entire Ideogram 4 pipeline on Apple Silicon.

The critical Mac-specific piece is the GGUF loader. At the time of writing, the standard ComfyUI GGUF loader has issues with Apple’s MPS backend. The fxd0h fork of ComfyUI-GGUF patches the dtype handling so GGUF models load correctly on MPS. Without this patch, you get an “unknown model architecture” error. The patches have been submitted upstream to the city96/ComfyUI-GGUF repository.

How does prompt enhancement change image quality?

The value of a local LLM in the ComfyUI pipeline is not convenience. It is specificity.

A raw prompt like “abandoned library, overgrown” produces a usable image but a predictable one. The LLM takes that prompt and expands it into something a diffusion model can work with: “A grand Victorian-era library with collapsed oak shelves, vines creeping through broken stained-glass windows, dust motes suspended in shafts of amber light, scattered leather-bound books with gold foil peeling, moss on the marble floor, a single candle flickering on a reading desk.” That level of detail is what separates a generic output from one that looks intentional.

The prompt-expander nodes achieve this through system prompt engineering. You set a system prompt like “You are an expert at creating detailed Stable Diffusion prompts. Output only the expanded prompt, no explanation.” The node sends the user’s short description with that instruction, and the LLM returns the expanded version. The output connects directly to a CLIP text encode node. The workflow runs without any manual editing between the LLM response and the image generation step.

Vision-capable LLMs like Qwen3-VL or Gemma 4 E4B add a second capability. You can feed the LLM a reference image and ask it to describe the scene, the lighting, the composition, and the mood, then use that description as the seed for a new generation. This turns the LLM into an automated image analysis and prompt generation pipeline. The ComfyUI-LLM-Session node pack supports image inputs through its vision handler. The fxd0h LLM text processor nodes load a Gemma 4 uncensored GGUF that interprets both text instructions and image context.

What do Mac memory limits mean for your setup?

Apple Silicon uses unified memory. The GPU and the CPU share the same pool. That is an advantage for data transfer (no PCIe bottleneck) but a hard constraint for running two models at once.

If you load a Flux dev variant and an 8B parameter LLM simultaneously, the memory demand adds up. Flux dev in GGUF Q4_K_S with the T5-XXL encoder needs roughly 12 to 14 GB peak including activations and buffers, according to testing documented on the MACGPU blog. An 8B LLM in Q4_K_M needs another 5 to 6 GB. On a 16 GB Mac, that leaves almost no room for the OS or other applications. On a 32 GB Mac it is tight but workable. On a 64 GB Mac the combined pipeline runs comfortably.

The separation pattern helps. When Ollama runs as a separate server and the LLM is not actively generating, Ollama unloads the model from memory. The ComfyUI process keeps the image model loaded. They share the memory pool but not simultaneously for most of the workflow. The LLM only needs to be resident during the few seconds it takes to expand a prompt. After that, ComfyUI does the image generation without competition.

The PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 environment variable is worth setting in any Mac ComfyUI deployment. By default, PyTorch reserves GPU memory in chunks and does not release it back. On a shared memory system, that starves other processes. Setting it to zero forces PyTorch to allocate and release on demand. This is documented in multiple production ComfyUI deployments on Mac, including the guide published on zolty.systems in April 2026.

Text encoders on Apple Silicon have their own issue. A March 2026 pull request to the Comfy-Org/ComfyUI repository identified that the text_encoder_device() function forces text encoders to run on CPU instead of MPS GPU on Apple Silicon. The fix is a single-line change that adds VRAMState.SHARED to the device-selection condition. Without this fix, non-quantized text encoders like a bf16 Gemma 3 12B run entirely on CPU, slowing down prompt encoding significantly. The fix has been merged in recent ComfyUI builds.

Which custom node approach fits your needs?

Three approaches exist, and the right one depends on your memory budget and workflow complexity.

The Ollama API approach runs the LLM as a separate process. Nodes like comfyui-ollama-image-to-prompt, the stavsap comfyui-ollama nodes, or the Aditya prompt generator connect to Ollama over HTTP. This is the simplest to set up, the easiest to debug (Ollama logs are separate from ComfyUI logs), and the best for memory-constrained Macs since Ollama can unload models between calls. It is the approach used in the Civitai Ideogram 4 Mac workflow, which combines an Ollama-hosted Gemma 4 for prompt expansion with a GGUF-loaded Ideogram 4 for image generation.

The embedded GGUF approach loads the LLM directly inside ComfyUI using llama.cpp bindings. The ComfyUI-LLM-Session node pack and the ComfyUI_LocalLLMNodes pack both take this route. The advantage is zero external dependencies. No Ollama installation, no separate server to manage. The disadvantage is memory pressure. The LLM stays loaded in ComfyUI’s process space and competes directly with the image model. This approach works best on machines with 48 GB or more of unified memory.

The hybrid approach runs a lightweight LLM inside ComfyUI for basic prompt expansion and a larger model via Ollama for complex tasks like image description or structured prompt generation. Some workflows chain both. A vision model on Ollama describes the input image, then a smaller local model inside ComfyUI formats the description into the prompt syntax the image model expects. This is the most flexible pattern but also the one with the most nodes to wire up.

Where is this workflow going next?

The most visible trend is agentic ComfyUI workflows. The comfyui_LLM_party node pack and the guides published on the Creepybits blog in April 2026 demonstrate multi-server setups where an executive LLM (Gemma 4 via vLLM) decides which sub-workflow to run, generates the prompt, and triggers the image generation, all within ComfyUI’s node graph. The LLM is no longer just expanding prompts. It is orchestrating the pipeline.

On the Mac side, the patched GGUF loaders and the MPS text encoder fix mean the platform-specific blockers are being resolved one by one. The ComfyUI community has moved from “it does not work on Mac” to “it works but watch your memory budget” in about six months. The remaining performance gap between Apple Silicon and NVIDIA for this specific workload is the GGUF dequantization overhead, which is not a software fix. It is a hardware limitation of the MPS backend lacking fp8 tensor cores. But for the typical Mac user generating a few dozen images per session, the difference between a 30-second prompt expansion and a 3-second one is not the bottleneck. The image generation step itself is.

The local LLM in ComfyUI is not a gimmick. It is the difference between prompts that feel like search queries and prompts that feel like directions. On a Mac with Apple Silicon, that difference runs entirely on your own hardware.

How do local LLMs connect to ComfyUI?

What do you need to install for a Mac workflow?

How does prompt enhancement change image quality?

What do Mac memory limits mean for your setup?

Which custom node approach fits your needs?

Where is this workflow going next?

More from playbooks.

Local AI coding agent setup with OpenCode and Ollama

How to run Gemma 4 12B locally on a Mac with Ollama

OpenClaw with a local model: a private AI assistant