Whisper Medium Benchmark Overview
OpenAI Whisper Medium (769M parameters) sits between the smaller Whisper Small and the flagship Large-v3, offering a strong balance of accuracy and speed. For many transcription workloads it provides more than sufficient quality while running significantly faster. Deploying it on a dedicated GPU server keeps latency low and throughput high for production use.
We benchmarked Whisper Medium using faster-whisper (CTranslate2) on GigaGPU servers with a 10-minute English audio sample. The model needs approximately 1.5 GB of VRAM at FP16, making it runnable on every GPU tested. For methodology details, see our benchmark hub.
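The measurement itself is straightforward to script. Below is a minimal sketch of how an RTF run with faster-whisper can look; the audio file name is illustrative, and the exact options are not our benchmark harness, but the `WhisperModel` call pattern and the need to iterate the segment generator (transcription is lazy) are real faster-whisper behaviour.

```python
import time

def rtf(elapsed_s: float, audio_s: float) -> float:
    """Real-time factor: processing time divided by audio duration."""
    return elapsed_s / audio_s

def benchmark(audio_path: str, compute_type: str = "float16") -> float:
    # Requires the faster-whisper package and a CUDA-capable GPU.
    from faster_whisper import WhisperModel
    model = WhisperModel("medium", device="cuda", compute_type=compute_type)
    start = time.perf_counter()
    segments, info = model.transcribe(audio_path)
    for _ in segments:  # segments is a generator; iterate to force decoding
        pass
    return rtf(time.perf_counter() - start, info.duration)

if __name__ == "__main__":
    # Hypothetical 10-minute sample file, mirroring the setup above.
    print(f"RTF: {benchmark('sample_10min.wav'):.3f}")
```

Timing the full iteration of `segments`, rather than just the `transcribe()` call, is important: returning the generator is nearly instant, and the GPU work happens as you consume it.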
RTF Results by GPU
Lower RTF is better: an RTF below 1.0 means the model transcribes faster than real time.
| GPU | VRAM | Whisper Medium FP16 RTF | Speed vs Real-Time |
|---|---|---|---|
| RTX 3050 | 6 GB | 0.16 | 6.3x real-time |
| RTX 4060 | 8 GB | 0.09 | 11.1x real-time |
| RTX 4060 Ti | 16 GB | 0.065 | 15.4x real-time |
| RTX 3090 | 24 GB | 0.045 | 22.2x real-time |
| RTX 5080 | 16 GB | 0.03 | 33.3x real-time |
| RTX 5090 | 32 GB | 0.02 | 50x real-time |
Whisper Medium is substantially faster than Large-v3, with the RTX 5090 reaching a remarkable 50x real-time speed. Even the budget RTX 3050 manages 6.3x real-time, making it viable for lightweight self-hosted transcription.
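Converting between RTF, real-time multiple, and wall-clock time is simple arithmetic, sketched here using the FP16 figures from the table above:

```python
def speed_multiple(rtf: float) -> float:
    """Speed relative to real time: RTF 0.02 -> 50x real-time."""
    return 1.0 / rtf

def processing_seconds(rtf: float, audio_seconds: float) -> float:
    """Wall-clock time to transcribe a clip at a given RTF."""
    return rtf * audio_seconds

# RTX 5090 at FP16: the 10-minute (600 s) sample finishes in about 12 s.
print(speed_multiple(0.02))            # 50x real-time
print(processing_seconds(0.02, 600))   # ~12 seconds
```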
FP16 vs INT8 Comparison
INT8 quantisation further improves speed. See our quantisation analysis for background on precision trade-offs.
| GPU | FP16 RTF | INT8 RTF | Improvement |
|---|---|---|---|
| RTX 3050 | 0.16 | 0.11 | 31% |
| RTX 4060 | 0.09 | 0.06 | 33% |
| RTX 4060 Ti | 0.065 | 0.044 | 32% |
| RTX 3090 | 0.045 | 0.03 | 33% |
| RTX 5080 | 0.03 | 0.02 | 33% |
| RTX 5090 | 0.02 | 0.014 | 30% |
INT8 gives a consistent ~32% speed boost. The RTX 5090 at INT8 reaches 71x real-time, processing a 1-hour recording in under 51 seconds.
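Switching precision in faster-whisper is a one-argument change via `compute_type`. The helper below is a sketch, not our harness; the keyword names reflect the faster-whisper `WhisperModel` constructor as we understand it, and `"int8_float16"` (INT8 weights, FP16 activations) is the usual INT8 choice on CUDA.

```python
def medium_model_kwargs(precision: str = "float16") -> dict:
    """Constructor arguments for faster_whisper.WhisperModel at a given precision."""
    allowed = {"float16", "int8_float16", "int8"}
    if precision not in allowed:
        raise ValueError(f"unsupported compute_type: {precision}")
    return {
        "model_size_or_path": "medium",
        "device": "cuda",
        "compute_type": precision,
    }

def load_medium(precision: str = "float16"):
    # Requires faster-whisper and a CUDA GPU; imported lazily so the
    # helper above stays usable without the package installed.
    from faster_whisper import WhisperModel
    return WhisperModel(**medium_model_kwargs(precision))
```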
Cost Efficiency Analysis
| GPU | FP16 RTF | Approx. Monthly Cost | Speed per £ (real-time multiple ÷ £/month) |
|---|---|---|---|
| RTX 3050 | 0.16 | ~£45 | 0.139 |
| RTX 4060 | 0.09 | ~£60 | 0.185 |
| RTX 4060 Ti | 0.065 | ~£75 | 0.205 |
| RTX 3090 | 0.045 | ~£110 | 0.202 |
| RTX 5080 | 0.03 | ~£160 | 0.208 |
| RTX 5090 | 0.02 | ~£250 | 0.200 |
The RTX 5080 narrowly leads on cost efficiency, with the RTX 4060 Ti close behind. If you want the best GPU for Whisper on a budget, the RTX 4060 Ti is the clear pick.
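The Speed/Pound metric is simply the real-time multiple divided by the monthly cost, which you can reproduce from the table's own numbers:

```python
# FP16 RTF and approximate monthly cost (£) from the table above.
fp16 = {
    "RTX 3050":    (0.16,  45),
    "RTX 4060":    (0.09,  60),
    "RTX 4060 Ti": (0.065, 75),
    "RTX 3090":    (0.045, 110),
    "RTX 5080":    (0.03,  160),
    "RTX 5090":    (0.02,  250),
}

def speed_per_pound(rtf: float, monthly_cost: float) -> float:
    """Real-time multiple delivered per pound of monthly spend."""
    return (1.0 / rtf) / monthly_cost

for gpu, (rtf, cost) in fp16.items():
    print(f"{gpu}: {speed_per_pound(rtf, cost):.3f}")
```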
GPU Recommendations
- Budget: RTX 4060 — 11x real-time is excellent for moderate transcription volumes at low cost.
- Best value: RTX 4060 Ti — top cost efficiency with 15x real-time speed.
- High volume: RTX 5080 — 33x real-time handles heavy transcription pipelines.
- Maximum speed: RTX 5090 — 50x real-time for time-critical applications.
If you need better accuracy, see the Whisper Large-v3 RTF benchmark. For a detailed comparison across model sizes, check the Whisper Tiny vs Base vs Small comparison. Browse all data in the Benchmarks category.
Conclusion
Whisper Medium is the sweet spot for most transcription workloads, offering near-Large-v3 accuracy with roughly double the speed. It runs on every GPU we tested and delivers exceptional cost efficiency on mid-range cards. For teams that do not need the absolute best multilingual accuracy, Whisper Medium on dedicated hardware is the practical choice.
Fast Transcription with Whisper on Dedicated GPUs
Bare-metal GPU servers for speech-to-text workloads. From budget to high-end, find the right server for your volume.
Browse GPU Servers