A documentary interview never arrives clean. There is music bleeding in from a cafe, an HVAC unit droning under the dialogue, two people talking over each other. We wanted a chain that takes that messy file and hands back a clean, speaker-labeled transcript without uploading a second of audio to anyone. We built it from three open tools, and it runs on one machine.
This is the build log for that chain: Demucs to pull the voice out, WhisperX to turn the voice into text, and ffmpeg to glue the steps together. We try to be honest about what it is good for and where it falls down.
The goal, stated plainly
We wanted two things that used to be separate problems solved in one pass. First, clean speech out of a recording that has more than speech in it. Second, a transcript that says who said what, with timestamps, in a format an editor can actually use.
The second problem is the one that costs money. Plain transcription gives you a wall of text. Diarized transcription, the kind that labels each speaker and timestamps every word, is what professional services charge around 1.50 dollars per audio minute for, with turnaround measured in weeks. We wanted that output, free, local, and fast enough to be useful the same day.
The stack
Three tools, all local, all batch.
- Demucs, installed with pip install demucs into its own virtual environment. It is a source separation tool from Facebook Research that splits an audio file into vocals, drums, bass, and an “other” track using a trained neural network. We use it for one thing here: pull the voice out and leave the noise behind.
- WhisperX, installed with pip install whisperx. It is Whisper transcription plus speaker diarization, word-level timestamps, and forced alignment in one tool. The diarization is what plain Whisper cannot do.
- ffmpeg, installed with brew install ffmpeg. It does the format conversions between the two AI steps. No glamour, total dependency.
On Apple Silicon, both Demucs and WhisperX default to CPU, with Metal acceleration available on newer setups. We planned around the CPU timing rather than fight it, which matters for how long the chain takes.
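Before the first run it can save time to confirm that all three tools are actually reachable. A minimal sketch in Python, assuming the executables are named demucs, whisperx, and ffmpeg as the installs above leave them. The helper name missing_tools is ours and not part of any of the tools.

```python
import shutil

# Executable names as the installs above leave them. Assumption: the
# virtual environment that holds each tool is active when this runs.
REQUIRED_TOOLS = ("demucs", "whisperx", "ffmpeg")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All three tools found.")
```

Because Demucs lives in its own virtual environment, a check like this may need to run once per environment rather than once overall.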
Step one: isolate the voice with Demucs
The first link is separation. We point Demucs at the interview file:
demucs --two-stems vocals interview.wav
The --two-stems vocals flag is the one we reach for most, because we do not need drums and bass broken out of an interview. We need the voice on one side and everything else on the other. Demucs writes its output as WAV files into separated/htdemucs/interview/, and the vocals.wav it produces is the clean dialogue, pulled out from under the music or the room.
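When the chain is scripted, the next step needs to know where the vocal stem landed. The sketch below predicts that path from the input filename, assuming Demucs is run with its default output root and the htdemucs model, which matches the separated/htdemucs/interview/ layout described above. The function names are illustrative, not part of Demucs.

```python
from pathlib import Path


def demucs_vocals_path(input_file, model="htdemucs", out_root="separated"):
    """Predict where `demucs --two-stems vocals` writes the vocal stem.

    Assumes the default output root and model name; a custom output
    directory or a different model changes the path.
    """
    track = Path(input_file).stem  # "interview.wav" -> "interview"
    return Path(out_root) / model / track / "vocals.wav"


def demucs_command(input_file):
    """Build the same Demucs invocation used above, as an argument list."""
    return ["demucs", "--two-stems", "vocals", str(input_file)]
```

Passing demucs_command("interview.wav") to subprocess.run with check=True would run the step, and demucs_vocals_path("interview.wav") gives the file to hand to the conversion.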
The reaction the first time you hear it is the whole pitch: isolated speech, noticeably cleaner than the source, from a recording you thought was unusable. This used to require master tapes and a studio. It runs on a laptop in a couple of minutes.
Two honest notes from our own use. Demucs is excellent on music and a little less optimized for pure-speech separation in genuinely messy field conditions, so for the worst recordings we layer it with classical denoising downstream. And the separation is impressive, not magic. There is always some bleed between stems, and the result still benefits from EQ afterward.
Step two: convert with ffmpeg
WhisperX wants its input in a predictable shape. The format that transcription tools like best is 16kHz mono WAV, and Demucs hands back stereo. So ffmpeg sits between the two steps and does the conversion:
ffmpeg -i separated/htdemucs/interview/vocals.wav -ar 16000 -ac 1 -af loudnorm voice_16k.wav
That one line downsamples to 16kHz with -ar 16000, folds stereo to mono with -ac 1, and normalizes the loudness so a quiet interview does not transcribe worse than a loud one. With ffmpeg the flag order matters: options that apply to the input go before -i, and options that apply to the output go after the input file and before the output filename. That is the rule we keep memorized, and it is the reason the conversion is reliable rather than fiddly.
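The ordering rule is easier to keep straight when the command is assembled in pieces. A small sketch, assuming the conversion is driven from Python. The function name ffmpeg_to_16k_mono is hypothetical, and the flags are the ones from the command above.

```python
def ffmpeg_to_16k_mono(src, dst):
    """Build the conversion command above as an argument list.

    ffmpeg applies each option to the next file named on the command
    line, so everything after the input and before `dst` is an output
    option.
    """
    input_args = ["-i", str(src)]
    output_opts = [
        "-ar", "16000",     # resample to 16 kHz
        "-ac", "1",         # fold to a single channel
        "-af", "loudnorm",  # loudness normalization filter
    ]
    return ["ffmpeg", *input_args, *output_opts, str(dst)]
```

Keeping the input and output options in separate lists makes it harder to put a flag on the wrong side of -i by accident.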
Step three: transcribe with WhisperX
The clean, correctly formatted voice goes into WhisperX with diarization on:
whisperx voice_16k.wav --model large-v3 --diarize --hf_token $HF_TOKEN --output_format all
The --diarize flag is what earns the chain its keep. The output comes back labeled [SPEAKER_00], [SPEAKER_01], timestamped, and --output_format all writes it as SRT, VTT, JSON, and plain text in one pass, so the editor gets subtitle files and the archive gets searchable text from the same run. We use large-v3 for the best quality and accept the slower run for documentary work where the words have to be right.
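The JSON file is the easiest of those outputs to reshape for an editor. The sketch below turns it into one line per segment. It assumes the JSON holds a top-level "segments" list whose items carry "start", "text", and, when diarization assigned one, "speaker". That shape is our reading of the output and should be checked against a real file. All function names are illustrative.

```python
import json


def format_timestamp(seconds):
    """Render seconds as H:MM:SS for a readable transcript."""
    total = int(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{secs:02d}"


def speaker_lines(result):
    """Turn a WhisperX-style result dict into '[time] speaker: text' lines.

    Assumed shape: {"segments": [{"start", "text", "speaker"}, ...]}.
    Segments without a speaker key are labeled UNKNOWN, not dropped.
    """
    lines = []
    for seg in result.get("segments", []):
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg.get("text", "").strip()
        lines.append(f"[{format_timestamp(seg['start'])}] {speaker}: {text}")
    return lines


def load_speaker_lines(json_path):
    """Read a JSON transcript from disk and format it."""
    with open(json_path, encoding="utf-8") as handle:
        return speaker_lines(json.load(handle))
```

Keeping unlabeled segments visible as UNKNOWN is a deliberate choice: a dropped line is harder to notice in review than a mislabeled one.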
The setup snag worth flagging once: diarization needs a free HuggingFace token, and you have to accept the pyannote model license on HuggingFace before it works. Skip that and you get a confusing error rather than a clear one. It is annoying the first time and trivial forever after.
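One way to trade the confusing error for a clear one is to check for the token before starting a long run. A minimal sketch, assuming the token lives in the HF_TOKEN environment variable as in the command above. The helper hf_token_problem is hypothetical.

```python
import os


def hf_token_problem(env=os.environ, var="HF_TOKEN"):
    """Return a message if the token is missing, else None.

    This only checks that the variable is set and non-empty. It cannot
    tell whether the pyannote license was accepted on the account.
    """
    token = env.get(var, "").strip()
    if not token:
        return f"{var} is not set; diarization will fail without it."
    return None
```

The license acceptance still has to be done by hand on HuggingFace, so a passing check here rules out only one of the two causes.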
The timing is real and worth planning around. Our notes put a 60-minute interview at roughly 2 minutes on a fast NVIDIA GPU and roughly 20 minutes on a Mac CPU. This is a batch chain, not a live one. You start it and do something else.
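Those two figures are enough for a rough planning estimate. The sketch below assumes run time scales linearly with audio length, which is our simplification and not something measured across lengths. The names are illustrative.

```python
# Minutes of processing per minute of audio, from the figures above:
# about 2 per 60 on a fast NVIDIA GPU, about 20 per 60 on a Mac CPU.
PROCESSING_RATIO = {"gpu": 2 / 60, "cpu": 20 / 60}


def estimated_minutes(audio_minutes, device="cpu"):
    """Rough run time, assuming it scales linearly with audio length."""
    return audio_minutes * PROCESSING_RATIO[device]


def batch_minutes(lengths, device="cpu"):
    """Total estimate for several interviews run back to back."""
    return sum(estimated_minutes(m, device) for m in lengths)
```

Under that assumption, three interviews of 60, 45, and 30 minutes come to about 45 minutes on the Mac CPU figure, which is the kind of number that decides whether a batch starts before lunch or overnight.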
Where the chain struggles
We will not oversell it. The places it gets shaky are specific and knowable.
- Crowded conversations. WhisperX diarization is excellent for 2 to 4 speakers and gets shakier past 5, especially when people overlap. Expect the occasional speaker swap on a busy panel.
- Truly degraded source. Demucs results degrade with bad source material. There is a floor below which no separation saves a recording.
- Disk and patience. Demucs writes large WAV files per input, which adds up on batch jobs, and neither tool is real time. This is an interview-and-podcast chain, not a streaming one.
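The disk cost can be estimated up front from the uncompressed WAV arithmetic. This sketch assumes Demucs writes 44.1 kHz, stereo, 16-bit stems, which we believe is its default but should be confirmed against your own output, and that a --two-stems run leaves two files, vocals.wav and no_vocals.wav. The function names are ours.

```python
def wav_bytes(seconds, sample_rate=44_100, channels=2, bytes_per_sample=2):
    """Size of uncompressed PCM audio, ignoring the small WAV header."""
    return int(seconds * sample_rate * channels * bytes_per_sample)


def demucs_output_bytes(audio_minutes, stems=2):
    """Disk used by one --two-stems run: vocals.wav plus no_vocals.wav."""
    return stems * wav_bytes(audio_minutes * 60)
```

Under those assumptions a 60-minute interview comes to roughly 1.27 GB across the two stems, so a ten-interview batch wants well over 10 GB free before it starts.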
For a single-speaker voiceover we skip this whole thing and reach for a lighter, faster transcriber, because the diarization overhead buys nothing when there is only one voice.
The takeaway
The chain is at its best on exactly the job we built it for: a two-to-four-person interview recorded somewhere imperfect, which needs to become a clean, speaker-labeled, timestamped transcript by end of day, without the audio ever leaving the building. Demucs lifts the voice out, ffmpeg shapes it, WhisperX names the speakers and writes the text. Three open tools, one machine, no per-minute bill and no upload.
We did not tell you it beats a studio engineer on the worst recording in the pile. We told you what we wired together and where it holds. Curious about these things. You should be too.
Harness your curiosity.
— Stridenote · № 007