
Migrate from Replicate to Dedicated GPU: Audio Transcription

Self-host Whisper-based transcription on dedicated GPUs instead of Replicate for real-time processing, zero per-minute audio costs, and support for custom fine-tuned speech models.

Transcribing 10,000 Hours of Audio Per Month Through an API Is Financial Self-Harm

A podcast analytics company transcribes every episode from 3,000 active podcasts — roughly 10,000 hours of audio monthly. They started on Replicate using Whisper large-v3, paying per second of GPU processing time. Whisper processes audio at approximately 10-15x real-time speed on an RTX 6000 Pro, meaning each hour of audio requires 4-6 minutes of GPU time. For 10,000 hours of audio, that’s 40,000-60,000 minutes of GPU time monthly. At Replicate’s pricing, the monthly bill settled around $5,800. Meanwhile, a single RTX 6000 Pro 96 GB running Faster-Whisper at 30-50x real-time can transcribe the same volume in about 14 days of continuous processing — or 7 days on two RTX 6000 Pros. The annual cost difference: $69,600 on Replicate versus $21,600 for a single dedicated RTX 6000 Pro. And the dedicated hardware still has capacity for the other 16 days of the month.

Audio transcription at scale is a perfect candidate for dedicated GPU infrastructure. The workload is predictable, the models are stable, and the volume makes per-second billing painful.

Replicate vs. Dedicated for Transcription

| Transcription Feature | Replicate | Dedicated GPU |
| --- | --- | --- |
| Model options | Whisper variants on Replicate | Any Whisper variant, Faster-Whisper, custom |
| Real-time transcription | Not supported (batch only) | Streaming with WhisperLive or custom |
| Cost per audio hour | ~$0.58 (varies by model/speed) | ~$0.06 (amortised RTX 6000 Pro monthly) |
| Custom vocabulary | Not configurable | Full control over decoding params |
| Speaker diarisation | Separate model call required | Integrated pipeline on same GPU |
| Language models | Fixed to Replicate’s versions | Fine-tuned models for domain audio |

Setting Up Self-Hosted Transcription

Step 1: Choose your transcription engine. Faster-Whisper (CTranslate2 backend) delivers up to 4x the throughput of standard Whisper at the same accuracy. On a GigaGPU dedicated server, install it alongside complementary audio processing tools:

pip install faster-whisper
pip install pyannote.audio   # speaker diarisation
pip install pydub            # audio preprocessing
pip install fastapi uvicorn  # API layer

Step 2: Build your transcription API. Create an endpoint that mirrors Replicate’s interface while adding capabilities that weren’t available through their API:

from faster_whisper import WhisperModel
from fastapi import FastAPI, UploadFile
import tempfile

app = FastAPI()
model = WhisperModel("large-v3", device="cuda",
                     compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile, language: str | None = None,
                     word_timestamps: bool = True):
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        segments, info = model.transcribe(
            tmp.name,
            language=language,
            word_timestamps=word_timestamps,
            beam_size=5,
            vad_filter=True  # Skip silence — faster processing
        )
        results = [{"start": s.start, "end": s.end,
                    "text": s.text, "words": s.words}
                   for s in segments]
    return {"language": info.language,
            "duration": info.duration,
            "segments": results}

Step 3: Add speaker diarisation. A major limitation of Replicate’s Whisper deployment is that speaker identification requires a separate model call. On dedicated hardware, run diarisation alongside transcription in a unified pipeline:

import torch
from pyannote.audio import Pipeline as DiarizePipeline

diarize = DiarizePipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="...")
diarize.to(torch.device("cuda"))

# After transcription, align segments with speakers
diarization = diarize(audio_file)
# Merge transcription segments with speaker labels
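A simple merging strategy is to give each transcript segment the speaker whose diarisation turn overlaps it most. The sketch below assumes turns have already been flattened to plain `(start, end, speaker)` tuples (for example via pyannote's `diarization.itertracks(yield_label=True)`); `assign_speakers` is our own helper, not a library function:

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def assign_speakers(segments, turns):
    """segments: [{'start','end','text'}]; turns: [(start, end, speaker)].
    Labels each segment with the speaker whose turn overlaps it most."""
    out = []
    for seg in segments:
        best = max(turns,
                   key=lambda t: overlap(seg["start"], seg["end"], t[0], t[1]),
                   default=None)
        has_overlap = best and overlap(seg["start"], seg["end"],
                                       best[0], best[1]) > 0
        out.append({**seg, "speaker": best[2] if has_overlap else None})
    return out
```

More sophisticated alignments split segments at speaker boundaries, but maximum-overlap assignment is usually good enough for podcast audio.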

Step 4: Set up batch processing. For podcast-scale workloads, create a processing queue that continuously transcribes audio files from an input directory:

# Batch transcription worker
import json
from pathlib import Path

input_dir = Path("/data/audio/inbox")
output_dir = Path("/data/audio/transcripts")
processed_dir = input_dir / "processed"
output_dir.mkdir(parents=True, exist_ok=True)
processed_dir.mkdir(exist_ok=True)

for audio_file in sorted(input_dir.glob("*.mp3")):
    segments, info = model.transcribe(str(audio_file),
        vad_filter=True, word_timestamps=True)
    transcript = [{"start": s.start, "end": s.end,
                   "text": s.text} for s in segments]
    out_file = output_dir / f"{audio_file.stem}.json"
    out_file.write_text(json.dumps(transcript, indent=2))
    audio_file.rename(processed_dir / audio_file.name)
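For a long-running queue you will want a corrupt or unreadable file quarantined rather than crashing the whole loop. A minimal sketch of a per-file wrapper (the failed/ directory convention and the `transcribe_fn` callback are our own choices, not part of faster-whisper):

```python
import json
from pathlib import Path

def process_one(audio_file: Path, out_dir: Path, transcribe_fn) -> bool:
    """Transcribe one file; move it to processed/ on success, failed/ on error.
    transcribe_fn is any callable taking a path and returning segment dicts."""
    try:
        transcript = transcribe_fn(str(audio_file))
        (out_dir / f"{audio_file.stem}.json").write_text(
            json.dumps(transcript, indent=2))
        dest = audio_file.parent / "processed"
    except Exception:
        dest = audio_file.parent / "failed"   # quarantine, keep the loop alive
    dest.mkdir(exist_ok=True)
    audio_file.rename(dest / audio_file.name)
    return dest.name == "processed"
```

The worker loop then calls `process_one` per file and a periodic check of the failed/ directory surfaces problem episodes without blocking throughput.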

Throughput and Quality

Faster-Whisper on dedicated RTX 6000 Pro hardware processes audio at 30-50x real-time speed with the large-v3 model — significantly faster than standard Whisper on Replicate. This means:

  • 1 hour of audio: Transcribed in 72-120 seconds (vs. 4-6 minutes on Replicate including queue time)
  • 10,000 hours/month: Processable on a single RTX 6000 Pro in ~14 days of continuous operation
  • Real-time capable: With streaming setups, transcribe live audio with <2 second latency — impossible through Replicate's batch API
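Those figures follow directly from the real-time factor; a quick sanity check of the arithmetic:

```python
# Seconds of GPU time per hour of audio at a given real-time factor
def gpu_seconds_per_audio_hour(rtf: float) -> float:
    return 3600 / rtf

print(gpu_seconds_per_audio_hour(50))   # 72.0  (fast end of 30-50x)
print(gpu_seconds_per_audio_hour(30))   # 120.0 (slow end)

# Days of continuous processing for 10,000 audio hours at 30x real-time
print(round(10_000 / 30 / 24, 1))       # 13.9, roughly the ~14 days quoted
```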

For specialised domains, fine-tuned Whisper models trained on your specific audio type (medical dictation, legal proceedings, accented speech) run natively on dedicated hardware. See open-source model hosting for deploying custom models.

Cost Comparison

| Monthly Audio Hours | Replicate Monthly | GigaGPU Monthly | Savings |
| --- | --- | --- | --- |
| 500 hours | ~$290 | ~$1,800 | Replicate cheaper |
| 2,000 hours | ~$1,160 | ~$1,800 | Replicate cheaper |
| 4,000 hours | ~$2,320 | ~$1,800 | 22% savings on dedicated |
| 10,000 hours | ~$5,800 | ~$1,800 | 69% savings on dedicated |
| 25,000 hours | ~$14,500 | ~$3,600 (2x RTX 6000 Pro) | 75% savings on dedicated |

Dedicated hardware breaks even at approximately 3,100 hours of audio per month. The LLM cost calculator helps model audio workload economics alongside text-based inference costs.
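The break-even point falls out of the two pricing models; a quick check using the table's approximate per-hour figure (the exact Replicate rate varies by model and speed):

```python
replicate_per_audio_hour = 0.58   # approximate Replicate cost per audio hour
dedicated_monthly = 1800.0        # single RTX 6000 Pro, fixed monthly price

# Hours per month at which the fixed server matches Replicate's metered bill
break_even_hours = dedicated_monthly / replicate_per_audio_hour
print(round(break_even_hours))    # about 3,100 hours/month
```

Below that volume, per-second billing wins; above it, every additional hour is effectively free on the dedicated box.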

Transcription as Infrastructure, Not a Service Call

When transcription runs on your own hardware, it becomes a reliable infrastructure component rather than an external API dependency. No cold starts, no rate limits, no surprise pricing changes. For compliance-sensitive audio — legal recordings, medical dictation, financial calls — private AI hosting ensures recordings never leave your infrastructure.

Related guides: our Replicate alternative page, the GPU vs API cost comparison, and more in the tutorials section. For LLM-based post-processing of transcripts, explore vLLM hosting and the cost analysis section.

Transcribe Thousands of Hours at Fixed Cost

Self-hosted Whisper on GigaGPU dedicated servers processes audio at 30-50x real-time. Predictable monthly pricing, no per-minute charges, no cold starts.

Browse GPU Servers

Filed under: Tutorials


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
