
Whisper Large-v3 RTF by GPU

Benchmark results for OpenAI Whisper Large-v3 real-time factor (RTF) across six GPUs, with FP16 vs INT8 comparisons and a cost analysis for dedicated GPU hosting.

Whisper Large-v3 Benchmark Overview

OpenAI Whisper Large-v3 is one of the most accurate open speech-to-text models, with 1.55 billion parameters supporting 100+ languages. The key metric for transcription is Real-Time Factor (RTF) — a value below 1.0 means the model transcribes faster than real time. Deploying Whisper Large-v3 on a dedicated GPU server ensures consistent low-latency transcription for production workloads.

Tests used faster-whisper (CTranslate2) on GigaGPU servers with a 10-minute English audio sample at 16 kHz. Whisper Large-v3 requires approximately 3 GB of VRAM at FP16. For comparisons with smaller models, see our Whisper Tiny vs Base vs Small benchmark.
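The measurement itself is straightforward. A minimal sketch of the loop we use (audio path and timings are illustrative, not our actual test harness; note that faster-whisper's `transcribe` returns a lazy generator, so the segments must be consumed for the decoding to be timed at all):

```python
import time

def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: below 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

def measure_rtf(audio_path: str, audio_seconds: float,
                compute_type: str = "float16") -> float:
    """Transcribe with faster-whisper on GPU and return the RTF.

    Requires a CUDA GPU; downloads ~3 GB of model weights on first run.
    """
    from faster_whisper import WhisperModel  # lazy import: GPU-only path
    model = WhisperModel("large-v3", device="cuda", compute_type=compute_type)
    start = time.perf_counter()
    segments, info = model.transcribe(audio_path)
    for _ in segments:  # transcribe() is lazy; consume to drive decoding
        pass
    return rtf(time.perf_counter() - start, audio_seconds)

# Example of the arithmetic: a 600 s clip processed in 192 s is RTF 0.32
print(round(rtf(192, 600), 2))
```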

RTF Results by GPU

Lower RTF is better. An RTF of 0.10 means 10 minutes of audio is transcribed in 1 minute.

| GPU | VRAM | Whisper Large-v3 FP16 RTF | Speed vs Real-Time |
| --- | --- | --- | --- |
| RTX 3050 | 6 GB | 0.32 | 3.1x real-time |
| RTX 4060 | 8 GB | 0.18 | 5.6x real-time |
| RTX 4060 Ti | 16 GB | 0.13 | 7.7x real-time |
| RTX 3090 | 24 GB | 0.09 | 11.1x real-time |
| RTX 5080 | 16 GB | 0.06 | 16.7x real-time |
| RTX 5090 | 32 GB | 0.04 | 25x real-time |

Every GPU tested runs Whisper Large-v3 faster than real-time. The RTX 5090 achieves 25x real-time speed, meaning a 1-hour podcast is transcribed in under 2.5 minutes.
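The conversion is simple arithmetic: wall-clock time is audio length times RTF, and speed-vs-real-time is the reciprocal of RTF. A quick sketch:

```python
def transcription_minutes(audio_minutes: float, rtf: float) -> float:
    """Wall-clock minutes needed to transcribe a clip at a given RTF."""
    return audio_minutes * rtf

def speedup(rtf: float) -> float:
    """Speed vs real-time, e.g. RTF 0.10 -> 10x."""
    return 1.0 / rtf

# RTX 5090 at RTF 0.04: a 60-minute podcast in about 2.4 minutes, ~25x
print(round(transcription_minutes(60, 0.04), 2),
      round(speedup(0.04), 1))
```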

FP16 vs INT8 Comparison

CTranslate2 supports INT8 quantisation for additional speed. Below we compare FP16 and INT8 RTF across all GPUs. For more quantisation analysis, see our quantisation speed comparison.

| GPU | FP16 RTF | INT8 RTF | Improvement |
| --- | --- | --- | --- |
| RTX 3050 | 0.32 | 0.22 | 31% |
| RTX 4060 | 0.18 | 0.12 | 33% |
| RTX 4060 Ti | 0.13 | 0.09 | 31% |
| RTX 3090 | 0.09 | 0.06 | 33% |
| RTX 5080 | 0.06 | 0.04 | 33% |
| RTX 5090 | 0.04 | 0.028 | 30% |

INT8 delivers a consistent 30-33% improvement in RTF with negligible impact on transcription accuracy. For production deployments, INT8 is strongly recommended.
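In faster-whisper, switching to INT8 is a one-argument change (`compute_type="int8"` on `WhisperModel`), and the improvement column above is just the relative RTF reduction. A sketch of that calculation, using two rows from the table:

```python
def improvement_pct(fp16_rtf: float, int8_rtf: float) -> float:
    """Relative RTF reduction from INT8 quantisation, as a percentage."""
    return (fp16_rtf - int8_rtf) / fp16_rtf * 100

# Selecting INT8 in faster-whisper is a single parameter (GPU path shown):
#   model = WhisperModel("large-v3", device="cuda", compute_type="int8")

for gpu, fp16, int8 in [("RTX 3050", 0.32, 0.22), ("RTX 5090", 0.04, 0.028)]:
    print(f"{gpu}: {improvement_pct(fp16, int8):.0f}% faster at INT8")
```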

Cost Efficiency Analysis

We measure cost efficiency as transcription speed (inverse of RTF) per pound of monthly hosting cost.

| GPU | FP16 RTF | Approx. Monthly Cost | Speed/Pound |
| --- | --- | --- | --- |
| RTX 3050 | 0.32 | ~£45 | 0.069 |
| RTX 4060 | 0.18 | ~£60 | 0.093 |
| RTX 4060 Ti | 0.13 | ~£75 | 0.103 |
| RTX 3090 | 0.09 | ~£110 | 0.101 |
| RTX 5080 | 0.06 | ~£160 | 0.104 |
| RTX 5090 | 0.04 | ~£250 | 0.100 |
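As a sanity check on the metric, the speed-per-pound column can be reproduced from RTF and price alone (prices are the approximate monthly figures from the table):

```python
def speed_per_pound(rtf: float, monthly_cost_gbp: float) -> float:
    """Cost efficiency: transcription speed (1/RTF) per pound per month."""
    return (1.0 / rtf) / monthly_cost_gbp

# Reproducing two rows of the cost-efficiency table
for gpu, rtf_val, cost in [("RTX 4060 Ti", 0.13, 75), ("RTX 5080", 0.06, 160)]:
    print(f"{gpu}: {speed_per_pound(rtf_val, cost):.3f}")
```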

The RTX 5080 and RTX 4060 Ti offer the best value, with the RTX 3090 close behind; as a budget pick for Whisper, the 4060 Ti is hard to beat.

GPU Recommendations

  • Budget: RTX 4060 Ti — 7.7x real-time at FP16, 11x at INT8. Excellent for moderate transcription volumes.
  • Best value: RTX 5080 — 16.7x real-time makes it ideal for high-volume transcription services.
  • Fastest: RTX 5090 — 25x real-time for mission-critical, low-latency pipelines.
  • Entry level: RTX 3050 — still 3x real-time, suitable for light-use self-hosted transcription.

For the smaller model variant, check the Whisper Medium RTF benchmark. You can also compare model sizes in our Whisper Tiny vs Base vs Small comparison. Browse all results in the Benchmarks category.

Conclusion

Whisper Large-v3 runs faster than real-time on every GPU we tested, and INT8 quantisation further boosts speed with no meaningful accuracy loss. Whether you are building a transcription API, a meeting notes service, or a podcast indexer, a dedicated GPU server with the right hardware delivers consistent, reliable performance.

Deploy Whisper Large-v3 on Dedicated Servers

Fast, reliable transcription on bare-metal GPU hardware. Choose from budget to high-end configurations.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
