Whisper vs Faster-Whisper: What Changed
OpenAI’s Whisper is the gold standard for open-source speech-to-text, but its stock PyTorch implementation is slow. Faster-Whisper, a reimplementation built on CTranslate2, delivers roughly 4-6x speedups in our tests with the same accuracy. For anyone running transcription workloads on a dedicated GPU server, this speed difference translates directly into cost savings and higher throughput.
Both tools use the same Whisper model weights, so accuracy is identical. The difference is entirely in the inference engine. For dedicated hosting details, see our Whisper hosting page.
How Faster-Whisper Works
Faster-Whisper converts Whisper’s PyTorch weights to the CTranslate2 format, which applies layer fusion, INT8/FP16 quantisation, and batch decoding optimisations. The result is dramatically lower memory usage and higher throughput with no change to the underlying model architecture. It also supports VAD (voice activity detection) filtering to skip silent sections, further improving real-world speed.
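To make the VAD idea concrete, here is a toy energy-threshold filter. Faster-Whisper actually uses the Silero VAD model, not a simple amplitude threshold, so treat this as an illustration of the concept (skip frames with no speech energy), not the real algorithm:

```python
# Toy illustration of VAD-style filtering: drop low-energy frames so the
# decoder never sees silence. Faster-Whisper uses the Silero VAD model,
# not this energy threshold -- this only sketches the idea.

def energy_vad(samples, frame_size=160, threshold=0.01):
    """Keep frames whose mean absolute amplitude exceeds the threshold."""
    voiced = []
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        if sum(abs(s) for s in frame) / len(frame) > threshold:
            voiced.append(frame)
    return voiced

# 1 s of "silence" followed by 1 s of synthetic "speech" at 16 kHz
silence = [0.0] * 16000
speech = [0.5 if i % 2 else -0.5 for i in range(16000)]
voiced = energy_vad(silence + speech)

# Only the speech half survives: 16000 samples / 160 per frame = 100 frames
print(len(voiced))  # 100
```

Skipping silence this way means the decoder spends compute only on speech, which is why VAD improves wall-clock time on real recordings with pauses.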
Speed Benchmarks by GPU
Tested with a 60-minute English podcast (mono, 16kHz). Times include VAD for Faster-Whisper. See our benchmark tool for additional metrics.
| GPU | Model Size | Whisper (PyTorch FP16) | Faster-Whisper (INT8) | Speedup |
|---|---|---|---|---|
| RTX 3090 | large-v3 | 4m 12s | 0m 52s | 4.8x |
| RTX 3090 | medium | 2m 18s | 0m 29s | 4.8x |
| RTX 4060 | large-v3 | 7m 45s | 1m 18s | 6.0x |
| RTX 4060 | medium | 3m 52s | 0m 41s | 5.7x |
| RTX 4060 Ti | large-v3 | 5m 58s | 1m 02s | 5.7x |
Faster-Whisper delivers consistent 4.8-6x speedups across all tested GPUs. The RTX 3090 processes a full hour of audio in under a minute with the large-v3 model, fast enough for real-time transcription of multiple concurrent streams.
Does Speed Cost Accuracy?
| Model | Backend | WER (LibriSpeech test-clean) |
|---|---|---|
| large-v3 | Whisper (PyTorch FP16) | 2.7% |
| large-v3 | Faster-Whisper (INT8) | 2.7% |
| medium | Whisper (PyTorch FP16) | 3.4% |
| medium | Faster-Whisper (INT8) | 3.4% |
Word Error Rate is identical between both backends. The CTranslate2 optimisations affect only the compute path, not the model behaviour. Browse additional accuracy data in our benchmarks section.
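For readers unfamiliar with the metric: WER is the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. A minimal sketch, not the LibriSpeech evaluation harness:

```python
# Word Error Rate = (subs + ins + dels) / reference word count,
# computed via word-level Levenshtein distance. Toy version for illustration.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") across 6 reference words = 1/6
print(round(wer("the cat sat on the mat", "the cat sat on a mat"), 3))
```

Identical WER between backends is expected here because INT8 quantisation changes numerical precision, not the decoded token sequence in practice.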
Installation and Setup
```bash
# Install Faster-Whisper
pip install faster-whisper
```

```python
# Basic transcription
from faster_whisper import WhisperModel

# Load large-v3 with INT8 quantisation on the GPU
model = WhisperModel("large-v3", device="cuda", compute_type="int8")

# Transcribe with VAD filtering to skip silent sections
segments, info = model.transcribe("audio.mp3", beam_size=5, vad_filter=True)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

```bash
# Original Whisper (for comparison)
pip install openai-whisper
whisper audio.mp3 --model large-v3 --device cuda
```
For a full deployment walkthrough, see our Run Whisper on RTX 4060 guide. Read the self-host guide for server setup fundamentals.
Which to Use
Use Faster-Whisper in almost every scenario. It is faster, uses less VRAM, and produces identical output. The only reason to use stock Whisper is if you need PyTorch-native integration for a specific fine-tuning pipeline or custom model modification.
VRAM considerations: Faster-Whisper large-v3 at INT8 uses approximately 2.5 GB of VRAM, meaning you can run it alongside an LLM on the same GPU. This makes it ideal for multi-model serving on a single RTX 3090.
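A quick budget check shows why co-location works. The 2.5 GB Whisper figure comes from the text above; the ~8 GB figure for a 7B LLM at INT8 is an illustrative assumption that varies by runtime and context length:

```python
# Rough VRAM budget for co-locating Faster-Whisper with an LLM on an
# RTX 3090 (24 GB). The 2.5 GB Whisper figure is from the text above;
# the 7B-at-INT8 LLM figure (~8 GB) is an illustrative assumption.
total_vram_gb = 24.0
whisper_int8_gb = 2.5
llm_7b_int8_gb = 8.0  # assumed; varies by runtime and context length

headroom_gb = total_vram_gb - whisper_int8_gb - llm_7b_int8_gb
print(f"Remaining for KV cache / batching: {headroom_gb:.1f} GB")  # 13.5 GB
```

Even with generous headroom for the KV cache, both models fit comfortably on one card.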
Use our cost calculator to compare transcription hosting costs. Browse more comparisons in the GPU comparisons category.
Deploy This Model Now
Run Faster-Whisper on dedicated GPU servers for high-throughput transcription. No API limits, no per-minute charges.
Browse GPU Servers