
RTX 4090 24GB for End-to-End Voice Assistants

A complete voice assistant stack on the RTX 4090 24GB: Whisper large-v3-turbo, Llama 3.1 8B FP8 and XTTS v2 at ~1.1 s end-to-end. Stage-by-stage budgets, 8-12 concurrent calls per card, and production gotchas for telephony deployments.

Voice assistants are latency monsters. Speech-to-text, language model inference and text-to-speech all serialise into a single user-perceived round-trip, and humans notice every extra hundred milliseconds beyond about 800. Cloud APIs make this worse by adding 200-400ms of network round-trip per stage. The RTX 4090 24GB can host the entire pipeline on a single card and deliver under 1.1-second end-to-end turn latency for short utterances. This post documents the named workload, a support copilot for a UK fintech with 200,000 monthly active customers taking ~80 concurrent calls at peak, and the stage-by-stage numbers we measure on a stock UK 4090 host.

Named workload: fintech support copilot

The reference workload is a UK fintech with 200,000 monthly active customers running a voice support copilot embedded in their mobile app and on inbound phone (Twilio Programmable Voice). Mean call duration is 4.2 minutes; mean turn count is 14; mean utterance length is 8 seconds. Peak concurrent calls observed in the last 90 days: 78. SLA target: 95th percentile end-to-end turn latency under 1.5 seconds, hard budget 2.0 seconds before users perceive “this is broken”.

The previous architecture used three separate cloud APIs (Deepgram for ASR, Anthropic for LLM, ElevenLabs for TTS) and averaged 2.1 seconds turn latency with 5% of turns over 4 seconds. The migration to two 4090s running the local stack hit 0.86 seconds median, 1.34 seconds p95, with no calls over 2 seconds.

Pipeline architecture

| Stage | Component | Server | VRAM | Throughput |
| --- | --- | --- | --- | --- |
| VAD | Silero VAD v4 | onnxruntime CPU | 0 GB | 200x RT |
| ASR | Whisper large-v3-turbo INT8 | faster-whisper | 1.7 GB | 80-175x RT |
| LLM | Llama 3.1 8B FP8 + FP8 KV | vLLM 0.6.4 | 10 GB resident + KV | 198 t/s b=1 |
| TTS | XTTS v2 | coqui-tts | 2.0 GB | RTF 0.07 |
| Audio I/O | WebRTC / Twilio Media Streams | CPU | — | 20ms frames |

Total resident VRAM ~14 GB, leaving 10 GB for KV cache growth, voice clone embeddings, and burst headroom. All three GPU models stay warm; no cold-load latency between turns. The 80ms VAD tail and 20ms audio frame size are the standard telephony defaults that work cleanly with both WebRTC and Twilio.

ASR latency: Whisper Turbo

Whisper large-v3-turbo is the fast distilled variant of large-v3, with 4 decoder layers instead of 32. INT8 quantisation via faster-whisper’s CTranslate2 backend gets it down to 1.7 GB resident on the 4090. Streaming VAD with an 80ms tail detects end-of-utterance cleanly without cutting off speakers mid-thought:

| Utterance length | VAD tail | Whisper time | Total ASR |
| --- | --- | --- | --- |
| 2 s | 80 ms | 25 ms | 105 ms |
| 4 s | 80 ms | 50 ms | 130 ms |
| 10 s | 80 ms | 125 ms | 205 ms |
| 30 s (long answer) | 80 ms | 375 ms | 455 ms |

For batched ASR (multiple concurrent calls hitting Whisper simultaneously), aggregate throughput climbs from 80x real-time at batch 1 to 175x at batch 16. In the named fintech workload with 78 concurrent calls, Whisper averages batch 4-6 in any given 100ms window, comfortably inside the 130ms ASR budget.
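The 80ms VAD tail over 20ms frames reduces to a small counter: declare end-of-utterance after four consecutive non-speech frames. A minimal sketch of that endpointer; in production the per-frame speech flag comes from Silero VAD, here it is just a boolean input:

```python
FRAME_MS = 20                        # telephony audio frame size
TAIL_MS = 80                         # silence required to declare end-of-utterance
TAIL_FRAMES = TAIL_MS // FRAME_MS    # 4 consecutive silent frames

class EndpointDetector:
    """Fires once per utterance, after TAIL_FRAMES consecutive silent frames."""

    def __init__(self) -> None:
        self.silent_frames = 0
        self.in_speech = False

    def push(self, is_speech: bool) -> bool:
        """Feed one 20ms frame's VAD verdict; returns True when the utterance ends."""
        if is_speech:
            self.in_speech = True
            self.silent_frames = 0
            return False
        if not self.in_speech:
            return False              # leading silence: nothing to endpoint yet
        self.silent_frames += 1
        if self.silent_frames >= TAIL_FRAMES:
            self.in_speech = False    # reset for the next utterance
            self.silent_frames = 0
            return True
        return False
```

A 60ms mid-sentence pause (three silent frames) does not fire; a full 80ms tail does, which is why the 80ms default feels responsive without clipping speakers.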

LLM latency: Llama 3.1 8B FP8

Llama 3.1 8B FP8 with FP8 KV cache and prefix caching is the right default for voice. FP8 weights and KV bring the model to roughly 10 GB resident, and the 4090's fourth-generation Tensor Cores (which support FP8 natively) decode at 198 tokens/sec single-stream. Prefix caching is critical: the system prompt for a support copilot is typically 1,200-2,000 tokens of policy, tools and examples, and reusing its KV across turns shaves the prefill from ~140ms to ~15ms.

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 65536 --max-num-seqs 32 \
  --enable-chunked-prefill --enable-prefix-caching \
  --gpu-memory-utilization 0.92
```

| Output | TTFT (cached prompt) | Time to 30 tok | Time to 80 tok |
| --- | --- | --- | --- |
| Streaming, b=1 | 60 ms | 210 ms | 410 ms |
| Streaming, b=8 concurrent | 180 ms | 340 ms | 590 ms |
| Streaming, b=16 concurrent | 340 ms | 520 ms | 820 ms |

The LLM streams tokens directly into the TTS as soon as the first sentence boundary appears (typically 25-40 tokens). You don’t wait for the full response — you start synthesising audio while the LLM continues generating. This pipelining is the single biggest perceived-latency win in the entire stack.
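A sketch of that hand-off. The two stubs stand in for the vLLM token stream and XTTS synthesis; the names and the fake tokens are illustrative, not a real API:

```python
import asyncio

async def llm_stream(prompt: str):
    # Stub for the vLLM streaming response: yields tokens as they decode.
    for tok in ["Your", " balance", " is", " £120.", " Anything", " else?"]:
        yield tok

async def synthesise(sentence: str) -> bytes:
    # Stub for XTTS v2: returns one audio chunk per sentence.
    return sentence.strip().encode()

async def stream_reply(prompt: str) -> list[bytes]:
    """Flush each sentence to TTS as soon as its boundary appears,
    instead of waiting for the full LLM response."""
    chunks, buf = [], ""
    async for tok in llm_stream(prompt):
        buf += tok
        if buf.rstrip().endswith((".", "!", "?")):
            chunks.append(await synthesise(buf))   # first audio starts here
            buf = ""
    if buf:                                        # trailing partial sentence
        chunks.append(await synthesise(buf))
    return chunks
```

In production the first `synthesise` call fires 25-40 tokens into the reply, which is where the sub-500ms first-audio figure below comes from.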

TTS latency: XTTS v2

XTTS v2 is the production-grade zero-shot voice cloning model from Coqui. RTF (real-time factor) on the 4090 is 0.07 — one second of output audio takes 70 ms to synthesise. Voice cloning encode (the 6-second reference clip processing) takes 280ms but is cacheable per speaker, so you pay it once per call, not per turn:

| Output | Encode | Synthesis | Total |
| --- | --- | --- | --- |
| 5s reply, cached voice | 0 ms | 350 ms | 350 ms |
| 5s reply, fresh voice | 280 ms | 350 ms | 630 ms |
| First sentence (1.5s, streamed) | 0 ms | 105 ms | 105 ms |
| Full 6s reply, streamed in chunks | 0 ms | 420 ms (interleaved) | ~420 ms total wall |

Streaming TTS in 1-2 sentence chunks as the LLM produces them is essential. The user hears the first audio at LLM TTFT + first-sentence decode + first-chunk synthesis (~105 ms for a 1.5 s sentence), typically under 500ms, and the rest streams in continuously. Without streaming, the user waits for the full LLM decode plus full TTS synthesis: a 6-second reply at 80 tok/s loaded decode plus 420ms synthesis is 1.4 seconds of dead air after their question ends.
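The once-per-call voice-clone encode is a straightforward per-speaker memo. A sketch, where `encode_fn` is a stand-in for XTTS's reference-clip conditioning step (not the real coqui-tts API):

```python
class VoiceCloneCache:
    """Caches the ~280ms XTTS reference-clip encode per speaker,
    so it is paid once per call rather than once per turn."""

    def __init__(self, encode_fn):
        self._encode = encode_fn           # stand-in for XTTS conditioning
        self._cache: dict[str, object] = {}
        self.encodes = 0                   # instrumentation: real encodes done

    def embedding(self, speaker_id: str, reference_clip: bytes):
        if speaker_id not in self._cache:
            self._cache[speaker_id] = self._encode(reference_clip)
            self.encodes += 1
        return self._cache[speaker_id]

    def invalidate(self, speaker_id: str) -> None:
        """Drop a stale embedding, e.g. periodically on long calls."""
        self._cache.pop(speaker_id, None)
```

Keying by caller ID also gives sticky routing for free: calls from the same speaker land on the card that already holds their embedding.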

End-to-end latency budget

The full turn for a 10-second utterance with a 6-second reply, on a warm pipeline:

| Stage | Latency | Cumulative |
| --- | --- | --- |
| VAD tail (end-of-utterance detect) | 80 ms | 80 ms |
| Whisper Turbo (10s in) | 130 ms | 210 ms |
| LLM TTFT (cached system prompt) | 60 ms | 270 ms |
| LLM first sentence (~30 tokens) | 150 ms | 420 ms |
| XTTS first audio chunk (1.5s) | 105 ms | 525 ms |
| Network + jitter buffer | 200 ms | 725 ms |
| User hears first audio | — | ~725 ms |
| Continued streaming (rest of 6s reply) | ~410 ms (overlaps playback) | — |
| Total turn budget (perceived) | — | ~1.1 s |

Comfortably under the 1.5-second p95 target, often under one second on shorter utterances. Note that the 200ms network jitter buffer is the single largest line item — for in-app voice (WebRTC over good networks) it drops to ~80ms, putting median turns at 600ms.
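The first-audio path is just a sum of the serial stages, so the two network scenarios are easy to check (figures taken from the budget table above):

```python
# Serial stages on the first-audio path, in ms, from the budget table.
STAGES = {
    "vad_tail": 80,
    "asr": 130,                 # Whisper Turbo, 10s utterance
    "llm_ttft": 60,             # cached system prompt
    "llm_first_sentence": 150,  # ~30 tokens
    "tts_first_chunk": 105,     # 1.5s of audio at RTF 0.07
}

def time_to_first_audio(network_ms: int) -> int:
    """User-perceived ms from end of utterance to first audio out."""
    return sum(STAGES.values()) + network_ms

telephony = time_to_first_audio(200)  # Twilio jitter buffer
in_app = time_to_first_audio(80)      # WebRTC on a good network
```

Which gives 725 ms for telephony and 605 ms in-app, matching the ~600ms median quoted above for WebRTC.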

Capacity and scaling triggers

Voice assistants are heavier per user than text chat because all three models do work for each turn, and turns are bursty (everyone talks at once at handoff moments). On a single 4090 with the stack above, sustained capacity inside the 1.5s p95 budget:

| Concurrent calls | p50 turn latency | p95 turn latency | VRAM | Notes |
| --- | --- | --- | --- | --- |
| 4 | 720 ms | 980 ms | 15 GB | Comfortable |
| 8 | 880 ms | 1.34 s | 17 GB | Production target |
| 10 | 1.05 s | 1.61 s | 19 GB | SLA breach starts |
| 12 | 1.31 s | 2.10 s | 21 GB | Add a card |
| 16 | 1.78 s | 2.95 s | 23 GB | Visibly degraded |

Scaling triggers for the named fintech workload:

  • Add a second 4090 at 8 concurrent calls sustained. One card per ~8 concurrent gives clean SLA headroom for the bursts.
  • Move Whisper to a separate smaller card (e.g. 5060 Ti) at 20+ concurrent. ASR is independent and the cheapest stage to displace.
  • Pin the LLM to one 4090, shard ASR + TTS across remaining cards at 50+ concurrent. The LLM is the bottleneck above this point; voice clone caches benefit from sticky routing.
  • Switch LLM to Qwen 14B AWQ if answer quality complaints rise. Costs ~30% concurrency, gains noticeable accuracy on policy questions.
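The first trigger reduces to simple capacity arithmetic. A sketch under the one-card-per-8-calls rule; the one-spare-for-failover default is an assumption, since the post does not break down the fintech's card count:

```python
from math import ceil

CALLS_PER_CARD = 8  # sustained concurrent calls per 4090 inside the p95 SLA

def cards_needed(peak_concurrent: int, spare_cards: int = 1) -> int:
    """4090s required for a given peak, plus spares for failover.
    The spare count is an assumption, not a figure from this post."""
    return ceil(peak_concurrent / CALLS_PER_CARD) + spare_cards
```

For the 78-call peak this gives 10 serving cards plus a spare, consistent with the fleet size below.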

The fintech runs 11 4090s across two UK regions to serve the 78-call peak with regional failover; their previous cloud spend was 4.2x the dedicated hosting bill.

Production gotchas

  • VAD tuning is more important than model choice. An 80ms tail catches most natural pauses; 200ms feels laggy. Tune the energy threshold per audio source — Twilio’s narrowband audio needs different settings than WebRTC’s wideband.
  • XTTS voice clone embeddings drift on long calls. Re-encode the speaker reference every 30 turns; the cached embedding subtly degrades over a long session.
  • Whisper hallucinates on silence. If VAD passes a chunk with no speech (background noise, music), Whisper may produce phantom transcriptions like “Thanks for watching” (its training data leakage). Run a confidence threshold and drop low-confidence outputs.
  • Llama 3 emojis break TTS. Some tokenisers pass emoji codepoints through to XTTS, which can crash the synthesiser. Strip emoji and other non-speech codepoints in a post-processing step.
  • Twilio Media Streams add 80-150ms one-way. Bake this into your latency budget. WebRTC inside your own app saves this.
  • Don’t terminate TLS in your inference container. Push it to nginx or a sidecar; the TLS handshake on cold connections adds 40-80ms that compounds on every reconnect.
  • Pre-warm the LLM with a dummy turn. The first inference after vLLM startup takes 8-12 seconds for CUDA graph compilation; don’t accept traffic until it’s done.
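For the emoji gotcha above, a conservative filter that drops emoji codepoints before text reaches XTTS. The ranges cover the common emoji blocks; this is a sketch, not an exhaustive Unicode treatment:

```python
import re

# Common emoji / pictograph Unicode blocks. Not exhaustive, but covers
# what an instruct-tuned LLM typically emits in support-style replies.
_EMOJI = re.compile(
    "["
    "\U0001F300-\U0001FAFF"   # symbols, pictographs, emoticons, extended-A
    "\U00002600-\U000027BF"   # misc symbols and dingbats
    "\U0001F1E6-\U0001F1FF"   # regional indicators (flag pairs)
    "\uFE0F\u200D"            # variation selector-16, zero-width joiner
    "]+"
)

def sanitise_for_tts(text: str) -> str:
    """Strip emoji before synthesis, then tidy the doubled spaces left behind."""
    return re.sub(r"  +", " ", _EMOJI.sub("", text)).strip()
```

Run this on every LLM sentence chunk before it is handed to the synthesiser.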

Verdict: when to pick a 4090 for voice

Pick the RTX 4090 24GB for voice assistants when you have steady traffic above ~5 concurrent calls and care about either latency consistency or per-call cost. The named fintech workload gets 8 concurrent calls at SLA per card with all three models resident — a level of integration impossible on a 16GB card. For lower-volume use cases (under 4 concurrent), a single 4090 still wins on latency consistency over cloud APIs but the cost equation depends on call volume. For ultra-low-latency in-app voice with <500ms turns, consider a 5090 for the extra LLM headroom. Throughput numbers are corroborated by our Whisper benchmark and Llama 8B benchmark.

Sub-second voice on one card

Whisper + Llama + XTTS resident together, no per-minute API fees, predictable monthly cost. UK dedicated hosting.

Order the RTX 4090 24GB

See also: Whisper benchmark, FP8 Llama deployment, prefill / decode benchmarks, Llama 8B benchmark, chatbot backend use case, vLLM setup, concurrent users, 4090 spec breakdown.
