
Build AI Transcription API with Whisper on GPU

Build a production transcription API with OpenAI Whisper on a dedicated GPU server. Serve real-time and batch audio-to-text with word-level timestamps, language detection, and speaker diarisation — no per-minute billing or cloud dependencies.

What You’ll Build

In 30 minutes, you will have a production-ready transcription API that accepts audio files via HTTP, returns accurate transcripts with word-level timestamps and language detection, and handles concurrent requests with automatic queuing. Running Whisper large-v3 on a dedicated GPU server, your API transcribes one hour of audio in under three minutes at zero per-minute cost — whether you process 10 files or 10,000 per day.

Cloud transcription services charge $0.006-$0.024 per minute and impose rate limits that block batch workloads. At 1,000 hours of audio monthly (60,000 minutes), that is $360-$1,440 in API fees alone, every month, scaling linearly with volume. Self-hosted Whisper on GPU hardware delivers identical or better accuracy with predictable monthly costs and no data leaving your infrastructure.

Architecture Overview

The API wraps Whisper behind a FastAPI service with an async task queue. Audio uploads land in a processing queue backed by Redis. Worker processes pull tasks and run them through Whisper on the GPU. Results are stored and returned via a polling endpoint or webhook callback. A pre-processing step normalises audio formats, resamples to 16kHz, and splits files exceeding 30 minutes into overlapping chunks for parallel processing.
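The splitting step above can be sketched as a pure function that yields overlapping (start, end) offsets in seconds. A minimal sketch: the 30-minute chunk length follows the text, while the 5-second overlap and the function name are illustrative assumptions.

```python
def chunk_boundaries(duration_s: float, chunk_s: float = 1800.0,
                     overlap_s: float = 5.0) -> list[tuple[float, float]]:
    """Split an audio duration into overlapping chunks for parallel transcription.

    Files at or under `chunk_s` (30 min) come back as a single chunk; longer
    files get chunks that overlap by `overlap_s` so no words are lost at cuts.
    """
    if duration_s <= chunk_s:
        return [(0.0, duration_s)]
    bounds, start = [], 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break
        start = end - overlap_s  # next chunk re-reads the overlap region
    return bounds
```

Overlapping transcripts are then merged by de-duplicating words in the overlap window before the result is stored.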

The API layer exposes OpenAI-compatible transcription endpoints, so existing integrations that call the OpenAI Whisper API work with your self-hosted version by changing the base URL. Optional post-processing runs the transcript through an LLM for punctuation restoration, paragraph segmentation, and speaker label assignment.

GPU Requirements

| Concurrency | Recommended GPU | VRAM | Realtime Factor |
|-------------|-----------------|------|-----------------|
| 1-2 streams | RTX 5090 | 32 GB | ~20x realtime |
| 3-5 streams | RTX 6000 Pro | 40 GB | ~30x realtime |
| 5+ streams | RTX 6000 Pro 96 GB | 80 GB | ~40x realtime |

Whisper large-v3 uses approximately 10GB VRAM. The remaining capacity handles concurrent requests and batch processing. For real-time streaming transcription, Whisper runs in chunked mode with 5-second segments. Check our self-hosted LLM guide for pairing Whisper with an LLM on the same GPU.

Step-by-Step Build

Deploy Whisper and FastAPI on your GPU server. Set up Redis for task queuing and build the transcription endpoints.

from typing import Optional
from fastapi import FastAPI, UploadFile
import whisper, torch, os, tempfile

app = FastAPI()
model = whisper.load_model("large-v3", device="cuda")

@app.post("/v1/audio/transcriptions")
async def transcribe(file: UploadFile, language: Optional[str] = None,
                     response_format: str = "json"):
    # Persist the upload to disk; whisper's transcribe() expects a file path.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name

    try:
        result = model.transcribe(
            tmp_path,
            language=language,          # None triggers automatic detection
            word_timestamps=True,
            fp16=torch.cuda.is_available()
        )
    finally:
        os.unlink(tmp_path)             # always clean up the temp file

    if response_format == "verbose_json":
        segments = result["segments"]
        return {
            "text": result["text"],
            "language": result["language"],
            "segments": segments,
            "duration": segments[-1]["end"] if segments else 0.0
        }
    return {"text": result["text"]}

@app.get("/health")
async def health():
    return {"status": "ok", "model": "whisper-large-v3"}

Add authentication middleware and rate limiting for production. Note that model.transcribe() is a blocking call, so under concurrent load route jobs through the Redis-backed worker queue described above rather than running them inline in the request handler. The OpenAI-compatible endpoint format means existing client libraries work without modification. See production setup for scaling and monitoring configuration.
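One way to sketch the rate-limiting piece is a per-key token bucket in plain Python; the class name, limits, and `check_rate` helper are illustrative assumptions, not part of the service above.

```python
import time

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# One bucket per API key; reject with HTTP 429 when allow() returns False.
buckets: dict[str, TokenBucket] = {}

def check_rate(api_key: str, rate: float = 5.0, burst: float = 10.0) -> bool:
    bucket = buckets.setdefault(api_key, TokenBucket(rate, burst))
    return bucket.allow()
```

Wiring this into a FastAPI dependency keeps the 429 logic out of the endpoint bodies.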

Scaling and Monitoring

Monitor GPU utilisation and queue depth to track capacity. When average queue wait time exceeds your SLA, add a second worker process or scale to a larger GPU. For burst workloads, pre-process audio into chunks and distribute across multiple workers on the same GPU.
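For rough capacity planning, the realtime factors in the table above translate directly into worker counts. A minimal sketch, assuming a 70% utilisation target; that target and the function name are assumptions for illustration.

```python
import math

def workers_needed(audio_hours_per_day: float,
                   realtime_factor: float = 20.0,
                   utilisation_target: float = 0.7) -> int:
    """Estimate concurrent workers for a daily batch volume.

    realtime_factor: hours of audio transcribed per GPU-hour.
    utilisation_target: fraction of the day a worker is realistically busy.
    """
    gpu_hours = audio_hours_per_day / realtime_factor
    return max(1, math.ceil(gpu_hours / (24 * utilisation_target)))
```

At 1,000 audio hours per day on a ~20x-realtime GPU this suggests about three workers; below roughly 330 hours per day a single worker suffices.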

Track transcription accuracy by sampling outputs and comparing against manual transcripts. Whisper large-v3 achieves word error rates below 5% for clear English audio and below 10% for most supported languages. Log request latency, audio duration, and detected language for capacity planning.
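Accuracy sampling needs a word error rate calculation. A minimal word-level Levenshtein sketch, using naive whitespace tokenisation and no punctuation normalisation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    r = reference.lower().split()
    h = hypothesis.lower().split()
    prev = list(range(len(h) + 1))          # row for the empty reference
    for i, rw in enumerate(r, 1):
        cur = [i]
        for j, hw in enumerate(h, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution
        prev = cur
    return prev[-1] / max(len(r), 1)
```

In production, normalise punctuation and numerals on both sides first, or the measured WER will overstate real errors.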

Deploy Your Transcription API

A self-hosted transcription API eliminates per-minute billing while delivering Whisper-quality accuracy under your complete control. Serve internal teams, integrate with your products, or offer it as a service. Launch on GigaGPU dedicated GPU hosting and start transcribing at scale. Browse more API use cases and tutorials in our library.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
