Transcribing 10,000 Hours of Audio Per Month Through an API Is Financial Self-Harm
A podcast analytics company transcribes every episode from 3,000 active podcasts, roughly 10,000 hours of audio monthly. They started on Replicate using Whisper large-v3, paying per second of GPU processing time. Whisper processes audio at approximately 10-15x real-time speed on an RTX 6000 Pro, meaning each hour of audio requires 4-6 minutes of GPU time. For 10,000 hours of audio, that’s 40,000-60,000 minutes of GPU time monthly. At Replicate’s pricing, the monthly bill settled around $5,800. Meanwhile, a single RTX 6000 Pro 96 GB running Faster-Whisper can transcribe the same volume in about 14 days of continuous processing, or 7 days on two RTX 6000 Pros. The annual cost difference: $69,600 on Replicate versus $43,200 for two dedicated RTX 6000 Pros. And the dedicated hardware still has capacity for the other 16 days of the month.
Audio transcription at scale is a perfect candidate for dedicated GPU infrastructure. The workload is predictable, the models are stable, and the volume makes per-second billing painful.
Replicate vs. Dedicated for Transcription
| Transcription Feature | Replicate | Dedicated GPU |
|---|---|---|
| Model options | Whisper variants on Replicate | Any Whisper variant, Faster-Whisper, custom |
| Real-time transcription | Not supported (batch only) | Streaming with WhisperLive or custom |
| Cost per audio hour | ~$0.58 (varies by model/speed) | ~$0.06 (amortised RTX 6000 Pro monthly) |
| Custom vocabulary | Not configurable | Full control over decoding params |
| Speaker diarisation | Separate model call required | Integrated pipeline on same GPU |
| Language models | Fixed to Replicate’s versions | Fine-tuned models for domain audio |
Setting Up Self-Hosted Transcription
Step 1: Choose your transcription engine. Faster-Whisper (CTranslate2 backend) delivers 4x the throughput of standard Whisper with identical accuracy. On a GigaGPU dedicated server, install it alongside complementary audio processing tools:
```bash
pip install faster-whisper
pip install pyannote.audio   # speaker diarisation
pip install pydub            # audio preprocessing
pip install fastapi uvicorn  # API layer
```
Step 2: Build your transcription API. Create an endpoint that mirrors Replicate’s interface while adding capabilities that weren’t available through their API:
```python
from typing import Optional
import tempfile

from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel

app = FastAPI()
model = WhisperModel("large-v3", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile, language: Optional[str] = None,
                     word_timestamps: bool = True):
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        segments, info = model.transcribe(
            tmp.name,
            language=language,
            word_timestamps=word_timestamps,
            beam_size=5,
            vad_filter=True,  # skip silence for faster processing
        )
        # Consume the lazy segment generator while the temp file still
        # exists, and convert Word objects into plain JSON-safe dicts
        results = [{"start": s.start, "end": s.end, "text": s.text,
                    "words": [{"start": w.start, "end": w.end, "word": w.word}
                              for w in (s.words or [])]}
                   for s in segments]
    return {"language": info.language,
            "duration": info.duration,
            "segments": results}
```
Step 3: Add speaker diarisation. A major limitation of Replicate’s Whisper deployment is that speaker identification requires a separate model call. On dedicated hardware, run diarisation alongside transcription in a unified pipeline:
```python
import torch
from pyannote.audio import Pipeline as DiarizePipeline

diarize = DiarizePipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="...")
diarize.to(torch.device("cuda"))

# After transcription, align segments with speakers
diarization = diarize(audio_file)
# Merge transcription segments with speaker labels
```
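The merge step reduces to interval matching: give each transcript segment the speaker whose diarisation turn overlaps it most. A stdlib-only sketch, where the segment dicts and `(start, end, speaker)` turn tuples are assumed shapes rather than pyannote's native objects:

```python
def assign_speakers(segments, turns):
    """Label each segment with the speaker of maximal temporal overlap.

    segments: [{"start": float, "end": float, "text": str}, ...]
    turns:    [(start, end, speaker), ...] from the diarisation pass
    """
    labelled = []
    for seg in segments:
        best, best_overlap = "UNKNOWN", 0.0
        for t_start, t_end, speaker in turns:
            # Overlap of [seg.start, seg.end] with [t_start, t_end]
            overlap = min(seg["end"], t_end) - max(seg["start"], t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labelled.append({**seg, "speaker": best})
    return labelled
```

Maximal-overlap assignment tolerates the small boundary disagreements that are normal between Whisper's segmentation and pyannote's turns.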
Step 4: Set up batch processing. For podcast-scale workloads, create a processing queue that continuously transcribes audio files from an input directory:
```python
# Batch transcription worker -- reuses `model` from Step 2
import json
from pathlib import Path

input_dir = Path("/data/audio/inbox")
output_dir = Path("/data/audio/transcripts")
processed_dir = input_dir / "processed"
output_dir.mkdir(parents=True, exist_ok=True)
processed_dir.mkdir(parents=True, exist_ok=True)

for audio_file in sorted(input_dir.glob("*.mp3")):
    segments, info = model.transcribe(str(audio_file),
                                      vad_filter=True, word_timestamps=True)
    transcript = [{"start": s.start, "end": s.end, "text": s.text}
                  for s in segments]
    out_file = output_dir / f"{audio_file.stem}.json"
    out_file.write_text(json.dumps(transcript, indent=2))
    audio_file.rename(processed_dir / audio_file.name)
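If the worker restarts mid-batch, re-transcribing finished files wastes GPU time. A small stdlib guard that makes the loop resumable by skipping files whose transcript already exists; `pending_files` is a hypothetical helper, not part of faster-whisper:

```python
from pathlib import Path

def pending_files(input_dir: Path, output_dir: Path, pattern: str = "*.mp3"):
    """Yield audio files in input_dir that don't yet have a transcript JSON."""
    for audio_file in sorted(input_dir.glob(pattern)):
        if not (output_dir / f"{audio_file.stem}.json").exists():
            yield audio_file
```

Iterating over `pending_files(input_dir, output_dir)` instead of `input_dir.glob("*.mp3")` makes the worker safe to restart at any point.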
Throughput and Quality
Faster-Whisper on dedicated RTX 6000 Pro hardware processes audio at 30-50x real-time speed with the large-v3 model — significantly faster than standard Whisper on Replicate. This means:
- 1 hour of audio: Transcribed in 72-120 seconds (vs. 4-6 minutes on Replicate including queue time)
- 10,000 hours/month: Processable on a single RTX 6000 Pro in ~14 days of continuous operation
- Real-time capable: With streaming setups, transcribe live audio with <2 second latency — impossible through Replicate's batch API
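The figures above fall out of simple arithmetic over the real-time factor. A sketch, using the 30-50x estimates from this section (estimates, not guarantees):

```python
def seconds_per_audio_hour(realtime_factor: float) -> float:
    """Wall-clock seconds of GPU time per hour of audio."""
    return 3600 / realtime_factor

def processing_days(audio_hours: float, realtime_factor: float) -> float:
    """Days of continuous GPU time to transcribe a given audio volume."""
    return audio_hours / realtime_factor / 24

print(round(seconds_per_audio_hour(50)))      # 72 seconds per audio hour at 50x
print(round(seconds_per_audio_hour(30)))      # 120 seconds at 30x
print(round(processing_days(10_000, 30), 1))  # 13.9 days for the monthly volume
```

At the conservative 30x end, the 10,000-hour monthly workload fits in about 14 of the month's 30 days on one card.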
For specialised domains, fine-tuned Whisper models trained on your specific audio type (medical dictation, legal proceedings, accented speech) run natively on dedicated hardware. See open-source model hosting for deploying custom models.
Cost Comparison
| Monthly Audio Hours | Replicate Monthly | GigaGPU Monthly | Savings |
|---|---|---|---|
| 500 hours | ~$290 | ~$1,800 | Replicate cheaper |
| 2,000 hours | ~$1,160 | ~$1,800 | Replicate cheaper |
| 4,000 hours | ~$2,320 | ~$1,800 | 22% savings on dedicated |
| 10,000 hours | ~$5,800 | ~$1,800 | 69% savings on dedicated |
| 25,000 hours | ~$14,500 | ~$3,600 (2x RTX 6000 Pro) | 75% savings on dedicated |
Dedicated hardware breaks even at approximately 3,100 hours of audio per month. The LLM cost calculator helps model audio workload economics alongside text-based inference costs.
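The break-even point is just the fixed monthly cost divided by the per-hour API rate. A sketch using the approximate prices quoted above (~$0.58 per audio hour on Replicate, ~$1,800/month for a dedicated RTX 6000 Pro):

```python
def breakeven_hours(monthly_fixed: float, api_per_hour: float) -> float:
    """Audio hours per month above which dedicated beats per-hour API billing."""
    return monthly_fixed / api_per_hour

print(round(breakeven_hours(1800, 0.58)))  # ~3103 hours/month
```

Below that volume the per-second API is the cheaper option, which is why the 500- and 2,000-hour rows in the table favour Replicate.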
Transcription as Infrastructure, Not a Service Call
When transcription runs on your own hardware, it becomes a reliable infrastructure component rather than an external API dependency. No cold starts, no rate limits, no surprise pricing changes. For compliance-sensitive audio — legal recordings, medical dictation, financial calls — private AI hosting ensures recordings never leave your infrastructure.
Related guides: our Replicate alternative page, the GPU vs API cost comparison, and more in the tutorials section. For LLM-based post-processing of transcripts, explore vLLM hosting and the cost analysis section.
Transcribe Thousands of Hours at Fixed Cost
Self-hosted Whisper on GigaGPU dedicated servers processes audio at 30-50x real-time. Predictable monthly pricing, no per-minute charges, no cold starts.
Browse GPU Servers

Filed under: Tutorials