Voice assistants are latency monsters. Speech-to-text, language model inference and text-to-speech all serialise into a single user-perceived round-trip, and humans notice every extra hundred milliseconds beyond about 800. Cloud APIs make this worse by adding 200-400ms of network round-trip per stage. The RTX 4090 24GB can host the entire pipeline on a single card and deliver end-to-end turn latency under 1.1 seconds for short utterances. This post documents the named workload, a 200k-MAU support copilot for a UK fintech taking 80 concurrent calls at peak, and the stage-by-stage numbers we measure on a stock UK 4090 host.
Contents
- Named workload: fintech support copilot
- Pipeline architecture
- ASR latency: Whisper Turbo
- LLM latency: Llama 3.1 8B FP8
- TTS latency: XTTS v2
- End-to-end latency budget
- Capacity and scaling triggers
- Production gotchas
- Verdict: when to pick a 4090
Named workload: fintech support copilot
The reference workload is a UK fintech with 200,000 monthly active customers running a voice support copilot embedded in their mobile app and on inbound phone (Twilio Programmable Voice). Mean call duration is 4.2 minutes; mean turn count is 14; mean utterance length is 8 seconds. Peak concurrent calls observed in the last 90 days: 78. SLA target: 95th percentile end-to-end turn latency under 1.5 seconds, hard budget 2.0 seconds before users perceive “this is broken”.
The previous architecture used three separate cloud APIs (Deepgram for ASR, Anthropic for LLM, ElevenLabs for TTS) and averaged 2.1 seconds turn latency with 5% of turns over 4 seconds. The migration to two 4090s running the local stack hit 0.86 seconds median, 1.34 seconds p95, with no calls over 2 seconds.
Pipeline architecture
| Stage | Component | Server | VRAM | Throughput |
|---|---|---|---|---|
| VAD | Silero VAD v4 | onnxruntime CPU | 0 GB | 200x RT |
| ASR | Whisper large-v3-turbo INT8 | faster-whisper | 1.7 GB | 80-175x RT |
| LLM | Llama 3.1 8B FP8 + FP8 KV | vLLM 0.6.4 | 10 GB resident + KV | 198 t/s b=1 |
| TTS | XTTS v2 | coqui-tts | 2.0 GB | RTF 0.07 |
| Audio I/O | WebRTC / Twilio Media Streams | CPU | — | 20ms frames |
Total resident VRAM ~14 GB, leaving 10 GB for KV cache growth, voice clone embeddings, and burst headroom. All three GPU models stay warm; no cold-load latency between turns. The 80ms VAD tail and 20ms audio frame size are the standard telephony defaults that work cleanly with both WebRTC and Twilio.
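Before any GPU work happens, the VAD has to decide the caller is done. Below is a minimal sketch of that end-of-utterance logic with Silero VAD's VADIterator; the 80 ms silence tail matches the table, while the torch.hub loading path and the frame-buffering detail are assumptions about how the model is wired in, not a prescription.

```python
import torch

# Load Silero VAD with the ONNX backend so it runs on CPU, as in the table above.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad", onnx=True)
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

# 80 ms of trailing silence closes the utterance; tune the threshold per audio
# source (Twilio narrowband vs WebRTC wideband behave differently).
vad = VADIterator(model, threshold=0.5, sampling_rate=16000,
                  min_silence_duration_ms=80)

def utterance_ended(window: torch.Tensor) -> bool:
    """Feed one audio window; return True once end-of-utterance is detected.

    20 ms telephony frames are smaller than the window the model expects,
    so in practice frames are buffered up to the model's window size first.
    """
    event = vad(window, return_seconds=True)
    return bool(event and "end" in event)
```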
ASR latency: Whisper Turbo
Whisper large-v3-turbo is the fast distilled variant of large-v3, with 4 decoder layers instead of 32. INT8 quantisation via faster-whisper’s CTranslate2 backend gets it down to 1.7 GB resident on the 4090. Streaming VAD with an 80ms tail detects end-of-utterance cleanly without cutting off speakers mid-thought:
| Utterance length | VAD tail | Whisper time | Total ASR |
|---|---|---|---|
| 2 s | 80 ms | 25 ms | 105 ms |
| 4 s | 80 ms | 50 ms | 130 ms |
| 10 s | 80 ms | 125 ms | 205 ms |
| 30 s (long answer) | 80 ms | 375 ms | 455 ms |
For batched ASR (multiple concurrent calls hitting Whisper simultaneously), aggregate throughput climbs from 80x real-time at batch 1 to 175x at batch 16. In the named fintech workload with 78 concurrent calls, Whisper averages batch 4-6 in any given 100ms window, comfortably inside the 130ms ASR budget.
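For reference, a minimal faster-whisper configuration consistent with the numbers above might look like the sketch below. It assumes a faster-whisper release recent enough to resolve the large-v3-turbo checkpoint name; greedy decoding (beam_size=1) is what keeps per-utterance latency this low.

```python
from faster_whisper import WhisperModel

# large-v3-turbo with INT8 weights on the CTranslate2 backend (~1.7 GB resident).
asr = WhisperModel("large-v3-turbo", device="cuda", compute_type="int8")

def transcribe_utterance(wav_path: str) -> str:
    # Greedy decoding for latency; the external Silero VAD already ran,
    # so faster-whisper's built-in VAD filter stays off.
    segments, _info = asr.transcribe(wav_path, beam_size=1, vad_filter=False)
    return " ".join(seg.text.strip() for seg in segments)
```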
LLM latency: Llama 3.1 8B FP8
Llama 3.1 8B FP8 with FP8 KV cache and prefix caching is the right default for voice. FP8 quantisation brings the model down to roughly 10 GB resident, and the 4090's fourth-generation tensor cores run FP8 natively, giving 198 tokens/sec single-stream decode. Prefix caching is critical: the system prompt for a support copilot is typically 1,200-2,000 tokens of policy, tools and examples, and reusing its KV across turns shaves the prefill from ~140ms to ~15ms.
```bash
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --quantization fp8 --kv-cache-dtype fp8 \
    --max-model-len 65536 --max-num-seqs 32 \
    --enable-chunked-prefill --enable-prefix-caching \
    --gpu-memory-utilization 0.92
```
| Output | TTFT (cached prompt) | Time to 30 tok | Time to 80 tok |
|---|---|---|---|
| Streaming, b=1 | 60 ms | 210 ms | 410 ms |
| Streaming, b=8 concurrent | 180 ms | 340 ms | 590 ms |
| Streaming, b=16 concurrent | 340 ms | 520 ms | 820 ms |
The LLM streams tokens directly into the TTS as soon as the first sentence boundary appears (typically 25-40 tokens). You don’t wait for the full response — you start synthesising audio while the LLM continues generating. This pipelining is the single biggest perceived-latency win in the entire stack.
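A rough sketch of that handoff, against the OpenAI-compatible endpoint the vLLM command above exposes; the `synthesise_chunk` callback, the port and the sentence-boundary regex are placeholders, not measured parts of the stack. Keeping the system prompt byte-identical across turns is what lets the prefix cache hit.

```python
import re
from openai import OpenAI

# vLLM's OpenAI-compatible server started above; the API key is unused locally.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

SENTENCE_END = re.compile(r"[.!?]\s")

def stream_reply(messages, synthesise_chunk):
    """Stream LLM tokens and hand each complete sentence to TTS immediately."""
    buffer = ""
    stream = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=messages,   # identical system prompt every turn -> prefix cache hit
        stream=True,
        max_tokens=256,
    )
    for event in stream:
        buffer += event.choices[0].delta.content or ""
        # Flush whenever a sentence boundary appears, so synthesis starts
        # while the LLM is still decoding the rest of the reply.
        while (match := SENTENCE_END.search(buffer)):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            synthesise_chunk(sentence)
    if buffer.strip():
        synthesise_chunk(buffer)
```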
TTS latency: XTTS v2
XTTS v2 is the production-grade zero-shot voice cloning model from Coqui. RTF (real-time factor) on the 4090 is 0.07 — one second of output audio takes 70 ms to synthesise. Voice cloning encode (the 6-second reference clip processing) takes 280ms but is cacheable per speaker, so you pay it once per call, not per turn:
| Output | Encode | Synthesis | Total |
|---|---|---|---|
| 5s reply, cached voice | 0 ms | 350 ms | 350 ms |
| 5s reply, fresh voice | 280 ms | 350 ms | 630 ms |
| First sentence (1.5s, streamed) | 0 ms | 105 ms | 105 ms |
| Full 6s reply, streamed in chunks | 0 ms | 420 ms (interleaved) | ~420 ms total wall |
Streaming TTS in 1-2 sentence chunks as the LLM produces them is essential. The user hears the first audio at LLM TTFT plus first-sentence decode plus ~105 ms of synthesis for a 1.5-second chunk, typically under 500 ms after the transcript lands, and the rest streams in continuously. Without streaming, the user waits for the full LLM decode plus the full TTS synthesis: a 6-second reply (~80 tokens decoding at 80 tok/s under load) plus 420 ms of synthesis is roughly 1.4 seconds of dead air after their question ends.
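A hedged sketch of the per-call conditioning cache and chunked synthesis, using Coqui's lower-level XTTS streaming interface; the checkpoint paths and cache keying are illustrative rather than the production code.

```python
import torch
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load XTTS v2 once at startup; paths are placeholders for a local checkpoint.
config = XttsConfig()
config.load_json("/models/xtts_v2/config.json")
xtts = Xtts.init_from_config(config)
xtts.load_checkpoint(config, checkpoint_dir="/models/xtts_v2/", eval=True)
xtts.cuda()

_voice_cache: dict[str, tuple[torch.Tensor, torch.Tensor]] = {}

def get_voice(call_id: str, reference_wav: str):
    """Pay the ~280 ms reference-clip encode once per call, not per turn."""
    if call_id not in _voice_cache:
        _voice_cache[call_id] = xtts.get_conditioning_latents(audio_path=[reference_wav])
    return _voice_cache[call_id]

def synthesise_streaming(call_id: str, reference_wav: str, text: str):
    gpt_cond_latent, speaker_embedding = get_voice(call_id, reference_wav)
    # inference_stream yields 24 kHz audio chunks as they are generated,
    # so playback can start before the whole sentence has been synthesised.
    yield from xtts.inference_stream(text, "en", gpt_cond_latent, speaker_embedding)
```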
End-to-end latency budget
The full turn for a 10-second utterance with a 6-second reply, on a warm pipeline:
| Stage | Latency | Cumulative |
|---|---|---|
| VAD tail (end-of-utterance detect) | 80 ms | 80 ms |
| Whisper Turbo (10s in) | 130 ms | 210 ms |
| LLM TTFT (cached system prompt) | 60 ms | 270 ms |
| LLM first sentence (~30 tokens) | 150 ms | 420 ms |
| XTTS first audio chunk (1.5s) | 105 ms | 525 ms |
| Network + jitter buffer | 200 ms | 725 ms |
| User hears first audio | — | ~725 ms |
| Continued streaming (rest of 6s reply) | ~410 ms (overlaps audio playback) | — |
| Total turn budget (perceived) | — | ~1.1 s |
Comfortably under the 1.5-second p95 target, often under one second on shorter utterances. Note that the 200ms network jitter buffer is the single largest line item — for in-app voice (WebRTC over good networks) it drops to ~80ms, putting median turns at 600ms.
Capacity and scaling triggers
Voice assistants are heavier per user than text chat because all three models do work for each turn, and turns are bursty (everyone talks at once at handoff moments). On a single 4090 with the stack above, sustained capacity inside the 1.5s p95 budget:
| Concurrent calls | p50 turn latency | p95 turn latency | VRAM | Notes |
|---|---|---|---|---|
| 4 | 720 ms | 980 ms | 15 GB | Comfortable |
| 8 | 880 ms | 1.34 s | 17 GB | Production target |
| 10 | 1.05 s | 1.61 s | 19 GB | SLA breach starts |
| 12 | 1.31 s | 2.10 s | 21 GB | Add a card |
| 16 | 1.78 s | 2.95 s | 23 GB | Visibly degraded |
Scaling triggers for the named fintech workload:
- Add a second 4090 at 8 concurrent calls sustained. One card per ~8 concurrent gives clean SLA headroom for the bursts.
- Move Whisper to a separate smaller card (e.g. 5060 Ti) at 20+ concurrent. ASR is independent and the cheapest stage to displace.
- Pin the LLM to one 4090, shard ASR + TTS across remaining cards at 50+ concurrent. The LLM is the bottleneck above this point; voice clone caches benefit from sticky routing.
- Switch LLM to Qwen 14B AWQ if answer quality complaints rise. Costs ~30% concurrency, gains noticeable accuracy on policy questions.
The fintech runs 11 4090s across two UK regions to serve the 78-call peak with regional failover; their previous cloud spend was 4.2x the dedicated hosting bill.
Production gotchas
- VAD tuning is more important than model choice. An 80ms tail catches most natural pauses; 200ms feels laggy. Tune the energy threshold per audio source — Twilio’s narrowband audio needs different settings than WebRTC’s wideband.
- XTTS voice clone embeddings drift on long calls. Re-encode the speaker reference every 30 turns; the cached embedding subtly degrades over a long session.
- Whisper hallucinates on silence. If VAD passes a chunk with no speech (background noise, music), Whisper may produce phantom transcriptions like “Thanks for watching”, an artefact of the captioned video in its training data. Run a confidence threshold and drop low-confidence outputs (see the sketch after this list).
- Llama 3 emoji output breaks TTS. Emoji codepoints that reach XTTS can crash the synthesiser. Strip emoji and other non-text codepoints in a post-processing step before synthesis (also covered in the sketch after this list).
- Twilio Media Streams add 80-150ms one-way. Bake this into your latency budget. WebRTC inside your own app saves this.
- Don’t terminate TLS in your inference container. Push it to nginx or a sidecar; the TLS handshake on cold connections adds 40-80ms that compounds on every reconnect.
- Pre-warm the LLM with a dummy turn. The first inference after vLLM startup takes 8-12 seconds for CUDA graph compilation; don’t accept traffic until it’s done.
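Two of the gotchas above (silence hallucinations and emoji reaching the synthesiser) reduce to a few lines of filtering. The thresholds and codepoint ranges below are starting points to tune, not values taken from the workload.

```python
import re

def filter_segments(segments, no_speech_max=0.6, logprob_min=-1.0):
    """Drop faster-whisper segments that look like silence hallucinations."""
    return [
        seg for seg in segments
        if seg.no_speech_prob < no_speech_max and seg.avg_logprob > logprob_min
    ]

# Rough cut of emoji and pictograph ranges; widen or narrow as needed.
EMOJI = re.compile("[\U0001F000-\U0001FAFF\U00002600-\U000027BF\uFE0F]")

def sanitise_for_tts(text: str) -> str:
    """Strip emoji before text reaches XTTS."""
    return EMOJI.sub("", text)
```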
Verdict: when to pick a 4090 for voice
Pick the RTX 4090 24GB for voice assistants when you have steady traffic above ~5 concurrent calls and care about either latency consistency or per-call cost. The named fintech workload gets 8 concurrent calls at SLA per card with all three models resident — a level of integration impossible on a 16GB card. For lower-volume use cases (under 4 concurrent), a single 4090 still wins on latency consistency over cloud APIs but the cost equation depends on call volume. For ultra-low-latency in-app voice with <500ms turns, consider a 5090 for the extra LLM headroom. Throughput numbers are corroborated by our Whisper benchmark and Llama 8B benchmark.
Sub-second voice on one card
Whisper + Llama + XTTS resident together, no per-minute API fees, predictable monthly cost. UK dedicated hosting.
Order the RTX 4090 24GB
See also: Whisper benchmark, FP8 Llama deployment, prefill / decode benchmarks, Llama 8B benchmark, chatbot backend use case, vLLM setup, concurrent users, 4090 spec breakdown.