Plain Whisper is excellent at one job: turning audio into text. What it cannot do is tell you who said what. For a single voiceover that does not matter. For an interview, a panel, or any multi-speaker recording, it is the difference between a usable transcript and a wall of text you still have to untangle by hand.
WhisperX adds the parts that make a transcript actually workable: speaker labels (diarization), word-level timestamps, and a faster Whisper backend underneath. The output is what documentary editors and journalists need, not just what a model can produce. We run it daily on our own machines, and it has replaced the transcription services we used to pay by the minute. Here is how to set it up.
What you will end up with
- WhisperX installed in its own Python environment.
- A free HuggingFace token wired in, so diarization works.
- One command that turns an audio file into a timestamped, speaker-labeled transcript, with nothing leaving your machine.
No subscription, no per-minute billing, no upload of recordings you would rather keep private.
Before you start
You need a Mac, Windows, or Linux machine with at least 8GB of RAM. 16GB or more is more comfortable. A GPU is optional but makes a real difference to speed: on an NVIDIA 4090, a 60-minute interview transcribes in roughly two minutes; on a Mac CPU, expect closer to twenty.
You also need Python 3 and the ability to open a terminal. Unlike some tools we cover, WhisperX is command-line first. The commands are short, but there is no desktop app to double-click.
One more thing to line up before you install: a free HuggingFace account. Diarization runs on gated models, so you have to accept two licenses and create a token once. We will do that in step 2.
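To plan around the speed gap, the timings quoted above can be turned into a rough wait-time estimate. This is a back-of-the-envelope sketch only: it assumes processing time scales linearly with audio length, and the `estimate_wait` name and the `gpu`/`cpu` labels are ours, not part of WhisperX.

```shell
# Rough wait-time estimate from the two data points quoted above:
# 60 audio minutes take ~2 minutes on an NVIDIA 4090 and ~20 on a Mac CPU.
estimate_wait() {
    audio_min="$1"
    device="$2"          # "gpu" or "cpu" (our labels, not WhisperX flags)
    case "$device" in
        gpu) per_hour=2 ;;
        cpu) per_hour=20 ;;
        *)   echo "unknown device: $device" >&2; return 1 ;;
    esac
    # Integer arithmetic, rounded up to the next whole minute.
    echo $(( (audio_min * per_hour + 59) / 60 ))
}

estimate_wait 90 cpu    # prints 30
estimate_wait 90 gpu    # prints 3
```

Treat the result as an order of magnitude. Model size, diarization, and your exact hardware all move the number.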
Step 1: Install WhisperX in a virtual environment
Always install WhisperX into its own Python environment. It pulls in a lot of dependencies, and a venv keeps them from colliding with anything else on your system.
# create and activate a virtual environment
python3 -m venv whisperx-env
source whisperx-env/bin/activate
# install
pip install whisperx
On Windows, activate with whisperx-env\Scripts\activate instead, and install CUDA-enabled PyTorch first if you have an NVIDIA card:
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install whisperx
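On Linux with an NVIDIA GPU, do the same: install the CUDA build of PyTorch first, then WhisperX. Apple Silicon runs on CPU. As far as we know, CTranslate2 (the engine under the faster Whisper backend) has no Metal backend, so do not count on GPU acceleration on a Mac. If a CPU run fails with a float16 error, add --compute_type int8 to the command.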
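Before moving on, it can help to confirm the environment is wired up. The sketch below is one way to do that, not an official check. `check_install` is a hypothetical helper name, and it assumes only that the install put a `whisperx` command on your PATH and that PyTorch is importable from `python3` inside the activated venv.

```shell
# Hypothetical helper: report whether the whisperx CLI and a CUDA-capable
# PyTorch are visible from the current (activated) environment.
check_install() {
    if command -v whisperx >/dev/null 2>&1; then
        echo "whisperx: found at $(command -v whisperx)"
    else
        echo "whisperx: not on PATH (is the venv activated?)"
    fi

    # torch.cuda.is_available() reports True only with a CUDA build and a usable GPU.
    if command -v python3 >/dev/null 2>&1 \
        && python3 -c "import torch" >/dev/null 2>&1; then
        echo "cuda: $(python3 -c 'import torch; print(torch.cuda.is_available())')"
    else
        echo "cuda: PyTorch not importable from python3"
    fi
}

check_install
```

On a Mac, or on any machine without an NVIDIA card, `cuda: False` is expected and not an error.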
Step 2: Get a HuggingFace token for diarization
This is the step that trips up first-timers. Diarization (the speaker labels) needs access to two gated pyannote models. You accept the licenses once and generate a token, and then you never think about it again.
- Sign up at https://huggingface.co.
- Accept the license at https://huggingface.co/pyannote/segmentation-3.0.
- Accept the license at https://huggingface.co/pyannote/speaker-diarization-3.1.
- Create a token at https://huggingface.co/settings/tokens.
- Export it in your terminal so WhisperX can read it:
export HF_TOKEN=hf_xxxxx
Skip the license acceptance and you will get a confusing error when you first run diarization. Do it now and the rest is smooth.
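A small preflight check can catch the most common slip, a token that was never exported in the current terminal. This is a sketch under assumptions: `check_hf_token` is a name we made up, and it only checks that the variable is set and starts with `hf_`, as in the example above. It cannot tell whether you accepted the two licenses.

```shell
# Hypothetical preflight: fail early if HF_TOKEN is missing or oddly shaped.
# This checks the shape of the value only; it does not contact HuggingFace.
check_hf_token() {
    case "${HF_TOKEN:-}" in
        "")    echo "HF_TOKEN is not set" >&2; return 1 ;;
        hf_?*) echo "HF_TOKEN looks plausible"; return 0 ;;
        *)     echo "HF_TOKEN does not start with hf_" >&2; return 1 ;;
    esac
}

check_hf_token || echo "fix the token before running with --diarize"
```

Remember that `export` only lasts for the current terminal session. Open a new window and you will need to export the token again, or add the line to your shell profile.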
Step 3: Run your first transcription
Start simple, with no diarization, just to confirm the install works:
whisperx audio.mp3
Output files appear in the current directory with the same base name as the input. Now add the speaker labels:
whisperx audio.mp3 --diarize --hf_token $HF_TOKEN
A few flags worth knowing:
# best quality, larger model (~3GB download the first time)
whisperx audio.mp3 --model large-v3 --diarize --hf_token $HF_TOKEN
# pin the language instead of auto-detecting
whisperx audio.mp3 --language en --diarize --hf_token $HF_TOKEN
# emit every format at once: SRT, VTT, JSON, TXT
whisperx audio.mp3 --output_format all --diarize --hf_token $HF_TOKEN
The default model is small at the time of writing (whisperx --help shows the current default). Reach for large-v3 when quality matters and you can spare the disk and the time.
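Once a single file works, the same flags extend naturally to a folder of recordings. The loop below is a minimal sketch, not a WhisperX feature: `transcribe_dir` is a hypothetical helper, the `.mp3` extension and folder layout are assumptions, and it uses only the flags shown above.

```shell
# Hypothetical helper: run the same whisperx command over every .mp3 in a folder.
# Output lands in the current directory, as with a single-file run.
transcribe_dir() {
    dir="$1"
    for f in "$dir"/*.mp3; do
        # Skip the unexpanded glob when the folder holds no .mp3 files.
        [ -e "$f" ] || continue
        whisperx "$f" --model large-v3 --language en \
            --output_format all --diarize --hf_token "$HF_TOKEN"
    done
}

# Usage: transcribe_dir ./interviews
```

Processing one file per invocation is slower than it could be, since the model loads each time, but a failure on one recording does not stop the rest.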
Prove it works
Open the output .txt or .srt file. You are looking for two things that plain Whisper cannot give you:
- Timestamps on each segment.
- Speaker tags like [SPEAKER_01] and [SPEAKER_02] marking who is talking.
For a quick test, grab a two- to three-minute clip with at least two distinct voices (a podcast snippet works well) and run it with --diarize. On a Mac CPU it takes a few minutes; on a GPU, tens of seconds. When the file opens with the right voices attached to the right lines, you have what professional services charge by the minute for, running on your own laptop, for free.
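If you would rather check from the terminal than open the file, counting the distinct speaker tags is a quick sanity test. A sketch, assuming the tags follow the `SPEAKER_` plus digits pattern described above. `list_speakers` is a hypothetical helper, and the sample text is hand-written for illustration, not real WhisperX output.

```shell
# Hypothetical helper: list the distinct speaker tags found in a transcript file.
list_speakers() {
    grep -o 'SPEAKER_[0-9][0-9]*' "$1" | sort -u
}

# Hand-written stand-in for a transcript; real output will differ.
sample=$(mktemp)
cat > "$sample" <<'EOF'
[SPEAKER_01]: Thanks for joining us today.
[SPEAKER_02]: Happy to be here.
[SPEAKER_01]: Let's start at the beginning.
EOF

list_speakers "$sample"    # prints SPEAKER_01 and SPEAKER_02, one per line
rm -f "$sample"
```

If the command prints nothing for your own file, diarization did not run. The usual causes are a missing --diarize flag or a token problem from step 2.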
Trade-offs and gotchas
WhisperX is genuinely strong, but it is not magic, and a few things are worth knowing up front.
- Diarization quality drops with five or more speakers. For two to four people it is excellent. For a crowded multi-host panel with overlapping talk, expect some speaker swaps you will need to fix by hand.
- It is batch, not real-time. WhisperX processes complete files. If you need live or streaming transcription, this is the wrong tool.
- GPU is much faster than CPU. The functionality is identical either way, but on a Mac CPU a long interview is a coffee-and-then-some wait. Plan around it.
- Disk adds up. The large-v3 model is around 3GB, and the pyannote models add more on top. Not huge, but not nothing.
- Noisy field recordings hurt accuracy. WhisperX is happiest with clear recordings at 16kHz or higher. For messy audio, isolate the vocals first with Demucs before transcribing.
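For the noisy-recording case, the Demucs step can be chained in front of WhisperX. Treat this as a sketch under stated assumptions rather than a tested recipe: `clean_then_transcribe` is a hypothetical helper, Demucs must be installed separately, and the `separated/htdemucs/<name>/vocals.wav` path is where we would expect Demucs's default model to write its output. Check where your version puts the file before relying on it.

```shell
# Hypothetical helper: isolate vocals with Demucs, then transcribe the result.
# ASSUMPTION: Demucs writes to separated/htdemucs/<input name>/vocals.wav.
clean_then_transcribe() {
    in="$1"
    name=$(basename "$in")
    name="${name%.*}"                      # strip the file extension

    # Split the recording into vocals and everything else.
    demucs --two-stems=vocals "$in" || return 1

    whisperx "separated/htdemucs/$name/vocals.wav" \
        --diarize --hf_token "$HF_TOKEN"
}

# Usage: clean_then_transcribe field-recording.mp3
```

Source separation is slow on CPU and can smear quiet speech, so it is worth comparing a transcript with and without this step on a short clip first.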
Our verdict, in short: WhisperX is essential for documentary work. Before it, we either paid for diarized transcripts (around $1.50 per audio minute, with weeks of turnaround) or used plain Whisper and accepted a format that was useless for editing multi-speaker recordings. WhisperX gives us both the speaker labels and a properly formatted transcript: free, local, and fast enough. The HuggingFace token is a one-time annoyance, not an ongoing one.
Where to go next
For single-speaker audio, a lone voiceover or a single-host show, skip WhisperX and use whisper.cpp instead. It is faster and carries none of the diarization overhead you will not be using.
Once your transcripts exist, they become inputs. Point a local model at them to clean up disfluencies, summarize, or pull quotes, and feed them into a documents tool so an interview archive becomes searchable. Both reuse work you have already done here. We cover the documents side in a separate Playbook.
You now have transcription with speaker labels and timestamps, running entirely on your own machine, with no per-minute bill attached. We are curious about these things. You should be too.
Harness your curiosity.
— Stridenote · № 007