The gap between what local AI video models promise and what they deliver on consumer hardware is still wide. I spent a few hours this week trying to close it on a single machine: an Apple M4 Pro with 48 GB of unified memory, running ComfyUI v0.25.0 and Lightricks’ LTX-Video model. The result was a 3.9-second, 768-by-512 clip of a boy running through a field. It is blurry and not usable for anything. But the process of getting there, through every error and fix, is more instructive than the output.
This is a walkthrough of what broke and what needs to change before local video generation is ready for real work.
Why run AI video generation locally?
Running a video model locally means no API bills, no data leaving your machine, and no rate limits. For a small studio doing research and media production, that is the difference between experimenting freely and rationing every generation. The trade-off is quality. Cloud video models from Runway, Pika, or Kling produce results that hold up in client work. Local models produce results that hold up in a blog post about how far local models still need to go.
LTX-Video is one of the few open-weight video models that fits on consumer hardware at all. Lightricks released it in late 2024 as a 2-billion-parameter model designed for efficiency. The version I used (0.9.8) is the distilled FP8 quantisation, which shrinks the model to 4.2 GB at the cost of output quality. The full BF16 version is about 8 GB and requires more VRAM than a Mac’s unified memory handles gracefully.
How do you set up LTX-Video on ComfyUI?
ComfyUI v0.25.0 runs on the MPS backend (Apple’s Metal Performance Shaders) with full GPU acceleration. The LTX-Video workflow needs three model files:
- The diffusion model: ltxv-2b-0.9.8-distilled-fp8.safetensors (4.2 GB, HuggingFace)
- A text encoder: t5xxl_fp8_e4m3fn.safetensors (5.5 GB, T5 XXL, the only encoder LTX-Video uses)
- A VAE: the autoencoder that compresses video frames into latent space and back
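Before opening ComfyUI, it can save a failed run to confirm that all three files are where the loader nodes will look. The sketch below is a minimal check, not part of ComfyUI itself. The folder names are assumptions about a typical install (older installs may use different subfolders), the helper name missing_models is hypothetical, and the VAE filename is the converted checkpoint this article ends up using.

```python
from pathlib import Path

# Assumed folder layout for a typical ComfyUI install; adjust to match yours.
EXPECTED_MODELS = {
    "models/diffusion_models": "ltxv-2b-0.9.8-distilled-fp8.safetensors",
    "models/text_encoders": "t5xxl_fp8_e4m3fn.safetensors",
    "models/vae": "LTX-Video-VAE-BF16.safetensors",
}


def missing_models(comfy_root, expected=EXPECTED_MODELS):
    """Return the relative paths of expected model files that are absent."""
    root = Path(comfy_root)
    missing = []
    for folder, filename in expected.items():
        candidate = root / folder / filename
        if not candidate.is_file():
            missing.append(f"{folder}/{filename}")
    return missing
```

Calling missing_models("/path/to/ComfyUI") returns an empty list when everything is in place, and otherwise names the files still to download.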
The VAE was where everything fell apart first.
Error 1. The wrong VAE format
ComfyUI expects VideoVAE tensors in a specific layout: 10 encoder down-blocks and 10 decoder up-blocks, each containing res-blocks and compress/residual layers. The official HuggingFace diffusers release of the LTX-Video VAE uses a different architecture: 4 encoder blocks with embedded stride convolutions and a conv_out layer instead of the ComfyUI layout. When ComfyUI’s sd.py loader looks for post_quant_conv.weight in the diffusers checkpoint, it does not find it, and the entire load fails with a KeyError.
I downloaded the 1.6 GB diffusers VAE first, hit this error, then found a pre-converted version on HuggingFace by user city96. His LTX-Video-VAE-BF16.safetensors (800 MB) uses the correct 10-block architecture and loads without issues. The diffusers VAE is not usable in ComfyUI without a manual conversion script.
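One way to spot this mismatch before loading is to list the tensor names stored in each checkpoint. A .safetensors file starts with an 8-byte little-endian header length followed by a JSON header, so the names can be read without loading any weights. The sketch below uses only the standard library. The function names are hypothetical, and the idea of comparing prefixes is my own diagnostic suggestion, not something ComfyUI does.

```python
import json
import struct


def read_tensor_names(path):
    """Return tensor names from a .safetensors header without loading weights."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned 64-bit little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(name for name in header if name != "__metadata__")


def summarize_prefixes(names, depth=2):
    """Count tensors per dotted prefix, e.g. 'encoder.down_blocks'."""
    counts = {}
    for name in names:
        prefix = ".".join(name.split(".")[:depth])
        counts[prefix] = counts.get(prefix, 0) + 1
    return counts
```

Running summarize_prefixes(read_tensor_names(path)) on both the diffusers VAE and the converted one should make the layout difference visible at a glance, and a plain membership test on the names shows whether a key such as post_quant_conv.weight is present.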
Error 2. The DualCLIPLoader bug
LTX-Video uses only the T5 XXL text encoder. It does not need CLIP-L or Gemma. But ComfyUI’s DualCLIPLoader node, when given the type=ltxv parameter and two encoder files (T5 + CLIP-L), routes through a code path in comfy/sd.py (line 1677 in v0.25.0) that expects a Gemma3 tokenizer. This code path calls SPieceTokenizer.from_pretrained() looking for spiece.model, and when the file exists but is meant for a different tokenizer, it raises ValueError: invalid tokenizer.
The fix: use a single-file CLIPLoader node instead of DualCLIPLoader. With only one file, ComfyUI takes the len(clip_data)==1 branch, correctly identifies it as T5 XXL, and instantiates the LTXVT5Tokenizer (which wraps HuggingFace’s T5TokenizerFast). No spiece model needed.
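If the workflow is saved in ComfyUI's API format (a JSON object mapping node ids to a class_type and its inputs), the swap can be scripted. The sketch below is one possible approach. The function name is hypothetical, picking the T5 file by looking for "t5" in the filename is a heuristic, and it assumes downstream links can stay untouched because both loaders expose their CLIP output at index 0.

```python
import copy


def use_single_clip_loader(workflow, t5_hint="t5"):
    """Return a copy of an API-format workflow in which DualCLIPLoader nodes
    are replaced by single-file CLIPLoader nodes loading only the T5 encoder."""
    patched = copy.deepcopy(workflow)
    for node in patched.values():
        if node.get("class_type") != "DualCLIPLoader":
            continue
        inputs = node.get("inputs", {})
        names = [inputs.get("clip_name1", ""), inputs.get("clip_name2", "")]
        # Heuristic: treat the file with "t5" in its name as the T5 encoder.
        t5_files = [name for name in names if t5_hint in name.lower()]
        if not t5_files:
            continue  # leave the node alone rather than guess
        node["class_type"] = "CLIPLoader"
        node["inputs"] = {
            "clip_name": t5_files[0],
            "type": inputs.get("type", "ltxv"),
        }
    return patched
```

The original dictionary is left unmodified, so the patched copy can be compared against it before being written back to disk.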
Error 3. Schema drift between ComfyUI versions
The workflow that works on ComfyUI v0.23.x does not work on v0.25.0. Three nodes changed:
SaveVideo. Previously accepted an IMAGE input. Now requires a VIDEO input connection plus two new fields: format (auto/mp4) and codec (auto/h264). A workflow that routes VAEDecode directly into SaveVideo with the old images parameter fails validation with required_input_missing.
SamplerCustom. Three new required fields: cfg (float, formerly optional, default now enforced), add_noise (boolean), and noise_seed (integer). Omitting them produces required_input_missing errors with no obvious hint about which fields are missing.
CreateVideo. The frame_rate parameter was renamed to fps. Old workflows still pass frame_rate, which the node silently ignores, so the video comes out at the default 30 fps instead of the intended 25.
These are not breaking changes for new workflows, but any saved workflow from before May 2026 will need manual updates.
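A small linter can flag these three changes in a saved workflow before ComfyUI rejects it. The sketch below encodes only the changes described in this article and assumes the workflow is in API format. The function name is hypothetical, and the SaveVideo input name "video" is an assumption, since the article only says the node now takes a VIDEO connection.

```python
# Required inputs as described in this article for v0.25.0.
# The SaveVideo input name "video" is an assumption.
REQUIRED_INPUTS = {
    "SaveVideo": {"video", "format", "codec"},
    "SamplerCustom": {"cfg", "add_noise", "noise_seed"},
}
RENAMED_INPUTS = {"CreateVideo": {"frame_rate": "fps"}}


def find_schema_drift(workflow):
    """Return human-readable problems found in an API-format workflow dict."""
    problems = []
    for node_id, node in sorted(workflow.items()):
        class_type = node.get("class_type")
        inputs = node.get("inputs", {})
        required = REQUIRED_INPUTS.get(class_type, set())
        for name in sorted(required - set(inputs)):
            problems.append(f"node {node_id} ({class_type}): missing '{name}'")
        for old, new in RENAMED_INPUTS.get(class_type, {}).items():
            if old in inputs and new not in inputs:
                problems.append(
                    f"node {node_id} ({class_type}): rename '{old}' to '{new}'"
                )
    return problems
```

An empty list means none of the three known changes apply. It does not guarantee that the workflow validates against the installed ComfyUI version.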
Why does local LTX-Video output look bad?
The final generation used 20 sampling steps with the euler sampler at CFG 4.0, producing 97 frames at 768×512. The model took just over three and a half minutes on the M4 Pro.
The quality problems are structural, not configurable:
- Resolution. LTX-Video was trained at 768×512 and does not generalise well to higher resolutions. That is the ceiling unless you upscale externally.
- FP8 distillation. The quantised FP8 model loses detail in textures, faces, and edges. The full BF16 model produces better results but needs more VRAM than the M4 Pro can allocate comfortably alongside the T5 encoder and VAE.
- Frame count and motion. 97 frames at 25 fps is 3.88 seconds. Within that window, the model has to establish scene composition, animate the subject, and maintain temporal coherence. Eighteen seconds (the model’s trained maximum) gives better results because the sampling has more room to settle. But 18 seconds of 768×512 video at 20 steps takes 10-15 minutes per generation.
- Sampler choice. The euler sampler is fast but noisy at low step counts. dpmpp_2m or dpmpp_sde with a karras scheduler would improve detail at the same step count, at the cost of a small increase in generation time.
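The numbers above are easy to sanity-check. The sketch below reproduces the clip length and gives a rough per-step cost. The 210-second total is my rounding of "just over three and a half minutes", and since that total also covers text encoding and VAE decoding, the per-step figure is an upper bound rather than a measurement.

```python
def clip_seconds(frames, fps):
    """Duration of a clip in seconds."""
    return frames / fps


def frames_for_duration(seconds, fps):
    """Number of frames needed to fill a given duration."""
    return round(seconds * fps)


def seconds_per_step(total_seconds, steps):
    """Average wall-clock time per sampling step."""
    return total_seconds / steps


print(clip_seconds(97, 25))        # 3.88 seconds for this run
print(frames_for_duration(18, 25)) # 450 frames for an 18-second clip
print(seconds_per_step(210, 20))   # 10.5 seconds per step, upper bound
```

An 18-second clip needs roughly 4.6 times as many frames as the 97-frame run, which is why the longer generation costs so much more wall-clock time.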
What would make local AI video production-ready?
For a studio that needs publishable video output, local LTX-Video is not workable today. Three things would close the gap:
A higher-resolution fine-tune. The model architecture supports larger latent sizes. A community fine-tune at 1024×576 or 1280×720 would make the output usable without external upscaling.
BF16 support on unified memory. The M4 Pro has 48 GB of unified memory, which is enough to hold the BF16 model, the T5 encoder, and the VAE simultaneously. But ComfyUI’s memory management on MPS is not aggressive enough with offloading, and OOM errors are frequent. Better memory scheduling would let the BF16 model run without crashes.
Composable video workflows. The most useful local video setup today combines LTX-Video for base generation with a post-pipeline: upscaling via Real-ESRGAN or similar, frame interpolation to smooth motion, and temporal filtering to reduce flicker. None of this is built into a single ComfyUI workflow yet, and each stage requires its own model and VRAM budget.
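To make the temporal filtering stage concrete, here is a deliberately simple stand-in: an exponential moving average across frames. This is not what a production deflicker node does and not something I ran in this pipeline. It is a sketch that assumes frames are flat lists of pixel values, with a hypothetical function name.

```python
def temporal_ema(frames, alpha=0.6):
    """Blend each frame with the running average of the frames before it.

    frames: list of equal-length lists of pixel values.
    alpha:  weight of the current frame; lower values smooth flicker more
            but leave stronger ghosting on moving subjects.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = []
    previous = None
    for frame in frames:
        if previous is None:
            current = list(frame)  # first frame has nothing to blend with
        else:
            current = [
                alpha * x + (1.0 - alpha) * p for x, p in zip(frame, previous)
            ]
        smoothed.append(current)
        previous = current
    return smoothed
```

The trade-off is visible even in this toy version: a single bright frame is damped, but its residue lingers in the frames that follow, which is why real pipelines tend to use motion-aware filters.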
Is local AI video worth it yet?
A few hours of work produced 3.9 seconds of unusable video. That sounds like a bad result, but it is progress. A year ago, running a 2-billion-parameter video model on a laptop was not possible at all. The models are getting smaller, the tooling is getting more stable, and the community is producing converted checkpoints that eliminate the worst onboarding friction.
The VAE format trap and the DualCLIPLoader bug are solvable tooling problems that a future release could fix. The resolution and quality limits come from how the model was trained and from what the hardware can hold, and they should soften with each generation of Apple Silicon and each new model distillation.
For now, if you need a usable video for client work, use a cloud API. If you want to understand how the pipeline works end to end, set up the local stack, fix every error that hits you, and watch a 3.9-second clip of a boy running through a field that looks like it was shot through a fogged window. That education is worth the time.
Disclosure: StrideNote Studio is a research and media production studio that evaluates local AI tools as part of its workflow. No cloud video service provided compensation or access for this article.