Whisper Medium Benchmark Overview
OpenAI Whisper Medium (769M parameters) sits between the smaller Whisper Small and the flagship Large-v3, offering a strong balance of accuracy and speed. For many transcription workloads it provides more than sufficient quality while running significantly faster. Deploying it on a dedicated GPU server keeps latency low and throughput high for production use.
We benchmarked Whisper Medium using faster-whisper (CTranslate2) on GigaGPU servers with a 10-minute English audio sample. The model needs approximately 1.5 GB of VRAM at FP16, making it runnable on every GPU tested. For methodology details, see our benchmark hub.
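The measurement itself is straightforward to script. Below is a minimal sketch of how an RTF run with faster-whisper can look; the audio file name is illustrative, and the exact options are not our benchmark harness, but the `WhisperModel` call pattern and the need to iterate the segment generator (transcription is lazy) are real faster-whisper behaviour.

```python
import time

def rtf(elapsed_s: float, audio_s: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return elapsed_s / audio_s

def benchmark(audio_path: str, compute_type: str = "float16") -> float:
    # Requires the faster-whisper package and a CUDA-capable GPU.
    from faster_whisper import WhisperModel
    model = WhisperModel("medium", device="cuda", compute_type=compute_type)
    start = time.perf_counter()
    segments, info = model.transcribe(audio_path)
    for _ in segments:  # segments is a generator; iterate to force decoding
        pass
    return rtf(time.perf_counter() - start, info.duration)

if __name__ == "__main__":
    # Hypothetical 10-minute sample file, mirroring the setup above.
    print(f"RTF: {benchmark('sample_10min.wav'):.3f}")
```

Timing the full iteration of `segments`, rather than just the `transcribe()` call, is important: returning the generator is nearly instant, and the GPU work happens as you consume it.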
RTF Results by GPU
Lower RTF is better: an RTF below 1.0 means the model transcribes faster than real time.
| GPU | VRAM | Whisper Medium FP16 RTF | Speed vs Real-Time |
|---|---|---|---|
| RTX 3050 | 6 GB | 0.16 | 6.3x real-time |
| RTX 4060 | 8 GB | 0.09 | 11.1x real-time |
| RTX 4060 Ti | 16 GB | 0.065 | 15.4x real-time |
| RTX 3090 | 24 GB | 0.045 | 22.2x real-time |
| RTX 5080 | 16 GB | 0.03 | 33.3x real-time |
| RTX 5090 | 32 GB | 0.02 | 50x real-time |
Whisper Medium is substantially faster than Large-v3, with the RTX 5090 reaching a remarkable 50x real-time speed. Even the budget RTX 3050 manages 6.3x real-time, making it viable for lightweight self-hosted transcription.
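Converting between RTF, real-time multiple, and wall-clock time is simple arithmetic, sketched here using the FP16 figures from the table above:

```python
def speed_multiple(rtf: float) -> float:
    """Speed relative to real time: RTF 0.02 -> 50x real-time."""
    return 1.0 / rtf

def processing_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock time to transcribe a clip at a given RTF."""
    return rtf * audio_seconds

# RTX 5090 at FP16: the 10-minute (600 s) sample finishes in about 12 s.
print(speed_multiple(0.02))            # 50x real-time
print(processing_seconds(0.02, 600))   # ~12 seconds
```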
FP16 vs INT8 Comparison
INT8 quantisation further improves speed. See our quantisation analysis for background on precision trade-offs.
| GPU | FP16 RTF | INT8 RTF | Improvement |
|---|---|---|---|
| RTX 3050 | 0.16 | 0.11 | 31% |
| RTX 4060 | 0.09 | 0.06 | 33% |
| RTX 4060 Ti | 0.065 | 0.044 | 32% |
| RTX 3090 | 0.045 | 0.03 | 33% |
| RTX 5080 | 0.03 | 0.02 | 33% |
| RTX 5090 | 0.02 | 0.014 | 30% |
INT8 gives a consistent ~32% speed boost. The RTX 5090 at INT8 reaches 71x real-time, processing a 1-hour recording in under 51 seconds.
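Switching precision in faster-whisper is a one-argument change via `compute_type`. The helper below is a sketch, not our harness; the keyword names reflect the faster-whisper `WhisperModel` constructor as we understand it, and `"int8_float16"` (INT8 weights, FP16 activations) is the usual INT8 choice on CUDA.

```python
def medium_model_kwargs(precision: str = "float16") -> dict:
    """Constructor arguments for faster_whisper.WhisperModel at a given precision."""
    allowed = {"float16", "int8_float16", "int8"}
    if precision not in allowed:
        raise ValueError(f"unsupported compute_type: {precision}")
    return {
        "model_size_or_path": "medium",
        "device": "cuda",
        "compute_type": precision,
    }

def load_medium(precision: str = "float16"):
    # Requires faster-whisper and a CUDA GPU; imported lazily so the
    # helper above stays usable without the package installed.
    from faster_whisper import WhisperModel
    return WhisperModel(**medium_model_kwargs(precision))
```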
Cost Efficiency Analysis
| GPU | FP16 RTF | Approx. Monthly Cost | Speed per £ (real-time multiple ÷ £/month) |
|---|---|---|---|
| RTX 3050 | 0.16 | ~£45 | 0.139 |
| RTX 4060 | 0.09 | ~£60 | 0.185 |
| RTX 4060 Ti | 0.065 | ~£75 | 0.205 |
| RTX 3090 | 0.045 | ~£110 | 0.202 |
| RTX 5080 | 0.03 | ~£160 | 0.208 |
| RTX 5090 | 0.02 | ~£250 | 0.200 |
The RTX 5080 narrowly leads on cost efficiency, with the RTX 4060 Ti close behind. If you want the best GPU for Whisper on a budget, the RTX 4060 Ti is the clear pick.
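The Speed/Pound metric is simply the real-time multiple divided by the monthly cost, which you can reproduce from the table's own numbers:

```python
# FP16 RTF and approximate monthly cost (£) from the table above.
fp16 = {
    "RTX 3050":    (0.16,  45),
    "RTX 4060":    (0.09,  60),
    "RTX 4060 Ti": (0.065, 75),
    "RTX 3090":    (0.045, 110),
    "RTX 5080":    (0.03,  160),
    "RTX 5090":    (0.02,  250),
}

def speed_per_pound(rtf: float, monthly_cost: float) -> float:
    """Real-time multiple delivered per pound of monthly spend."""
    return (1.0 / rtf) / monthly_cost

for gpu, (rtf, cost) in fp16.items():
    print(f"{gpu}: {speed_per_pound(rtf, cost):.3f}")
```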
GPU Recommendations
- Budget: RTX 4060 — 11x real-time is excellent for moderate transcription volumes at low cost.
- Best value: RTX 4060 Ti — top cost efficiency with 15x real-time speed.
- High volume: RTX 5080 — 33x real-time handles heavy transcription pipelines.
- Maximum speed: RTX 5090 — 50x real-time for time-critical applications.
If you need better accuracy, see the Whisper Large-v3 RTF benchmark. For a detailed comparison across model sizes, check the Whisper Tiny vs Base vs Small comparison. Browse all data in the Benchmarks category.
Conclusion
Whisper Medium is the sweet spot for most transcription workloads, offering near-Large-v3 accuracy with roughly double the speed. It runs on every GPU we tested and delivers exceptional cost efficiency on mid-range cards. For teams that do not need the absolute best multilingual accuracy, Whisper Medium on dedicated hardware is the practical choice.
Fast Transcription with Whisper on Dedicated GPUs
Bare-metal GPU servers for speech-to-text workloads. From budget to high-end, find the right server for your volume.
Browse GPU Servers