Chatbots are the workload the RTX 4090 24GB was born for. With 24 GB of GDDR6X at 1008 GB/s, native FP8 fourth-generation tensor cores, and the maturity of vLLM 0.6’s chunked-prefill and prefix-caching paths, a single card can serve roughly 30 concurrent active sessions at chat-quality latency — enough for a 200k-MAU SaaS or a 12-engineer internal copilot. This post pulls together the configuration, prefix-cache maths and operational tips we use across production UK 4090 hosts, with the named 200k-MAU workload as the reference design point.
Contents
- Named workload: 200k-MAU SaaS
- Choosing the model
- Prefix caching maths
- Concurrency and SLA tables
- Chunked prefill physics
- Production vLLM configuration
- Scaling triggers
- Production gotchas
- Verdict: when to pick a 4090
Named workload: 200k-MAU SaaS
The reference deployment: a UK B2B SaaS support copilot embedded in a customer-facing dashboard. 200,000 monthly active users, ~12,000 daily active, with a sticky weekly cohort of ~1,500 power users. Average session length: 6.4 turns; mean turn 280 input tokens / 180 output tokens; system prompt 1,400 tokens stable per tenant.
Peak concurrent active sessions observed in the last 90 days: 28 with bursts to 42. SLA targets: p95 TTFT under 800ms, p95 first 30 tokens under 1.2 seconds, complete-response p95 under 4.5 seconds, no failed requests. The previous architecture used a token-metered API and was billing approximately £6,800/month with unpredictable per-tenant cost spikes; the dedicated 4090 deployment runs at fixed cost with 30% headroom for growth.
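A quick sanity check of those targets against the per-user figures measured later in this post (p95 TTFT of ~180 ms and ~30 t/s per user for Llama 3.1 8B FP8 with prefix caching at 30 sessions) — a back-of-envelope sketch, not a benchmark:

# Back-of-envelope check of the "first 30 tokens" SLA against the per-user
# figures assumed from the load-test table later in this post.
p95_ttft_s = 0.180        # p95 TTFT, Llama 3.1 8B FP8 + prefix cache, 30 sessions
tps_per_user = 30         # median decode tokens/sec per user at 30 sessions
first_30_tokens_s = p95_ttft_s + 30 / tps_per_user
print(f"first 30 tokens: {first_30_tokens_s:.2f} s vs 1.2 s target")  # ~1.18 s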
Choosing the model
| Need | Model | Decode t/s (batch 1) | Concurrent sessions | VRAM | MMLU |
|---|---|---|---|---|---|
| Volume / cheap | Llama 3.1 8B FP8 | 198 t/s | 30 | 11 GB | 69.4 |
| Volume / Mistral preference | Mistral 7B FP8 | 215 t/s | 34 | 10 GB | 62.5 |
| Quality / instruction | Qwen 2.5 14B AWQ | 135 t/s | 16 | 17 GB | 79.7 |
| Mixture quality | Mixtral 8x7B AWQ | 85 t/s | 10 | 17 GB | 70.6 |
| Tiny / inline | Phi-3 mini FP8 | 480 t/s | 60 | 5 GB | 69.0 |
| Long-context | Llama 3.1 8B FP8 (128k) | 198 t/s | 16 (KV-bound) | 20 GB | 69.4 |
Llama 3.1 8B FP8 is the default for the named SaaS workload — it sits at the sweet spot of throughput, quality and concurrent capacity. The team A/B-tested Qwen 2.5 14B AWQ for two weeks and found measurably better answers (56-44 preference) but at roughly half the concurrent capacity, which would have required a second card just to hold SLA. They kept Llama 8B as primary and route ~3% of complex queries (detected by a small classifier) to a dedicated Qwen 14B endpoint.
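A sketch of that routing split, assuming two OpenAI-compatible vLLM endpoints; the hostnames and the is_complex heuristic are illustrative stand-ins for the real classifier:

from openai import OpenAI

# Two vLLM OpenAI-compatible servers: the Llama 8B primary and the Qwen 14B fallback.
fast = OpenAI(base_url="http://llama-8b:8000/v1", api_key="unused")
smart = OpenAI(base_url="http://qwen-14b:8000/v1", api_key="unused")

def is_complex(user_msg: str) -> bool:
    # Placeholder for the small classifier; should return True for ~3% of traffic.
    return len(user_msg) > 1200 or "compliance" in user_msg.lower()

def answer(messages: list[dict]) -> str:
    route_to_quality = is_complex(messages[-1]["content"])
    client = smart if route_to_quality else fast
    model = ("Qwen/Qwen2.5-14B-Instruct-AWQ" if route_to_quality
             else "meta-llama/Llama-3.1-8B-Instruct")
    out = client.chat.completions.create(model=model, messages=messages, max_tokens=512)
    return out.choices[0].message.content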
Prefix caching maths
Most chatbots have a stable system prompt (300-2,000 tokens) followed by short user messages. vLLM’s prefix cache stores the KV for the system part as immutable blocks; subsequent requests against the same prefix skip its prefill almost entirely. The maths is straightforward: prefill compute is roughly linear in input tokens at ~90 ms per 1,000 tokens for Llama 8B FP8 (roughly 11,000 tokens/s), so a 1,500-token system prompt costs ~135 ms cold versus ~1 ms for the cache lookup.
| System prompt | User msg | Cold prefill | Hot prefill | TTFT cold | TTFT hot |
|---|---|---|---|---|---|
| 1,500 tokens | 50 tokens | 1,550 tokens | 50 tokens | 140 ms | 15 ms |
| 500 tokens | 50 tokens | 550 tokens | 50 tokens | 55 ms | 15 ms |
| 2,000 tokens | 200 tokens | 2,200 tokens | 200 tokens | 200 ms | 22 ms |
| 3,000 tokens (multi-tool) | 100 tokens | 3,100 tokens | 100 tokens | 280 ms | 17 ms |
For a chatbot with a 1,500-token system prompt, prefix caching shaves ~125 ms off every TTFT. Across 12,000 daily turns, that’s 25 GPU-minutes saved per day per card — prefill compute that is freed exactly when the batch is busiest, which is what lets the card hold roughly 10 more concurrent sessions inside SLA. The cache is per-prefix-hash, so multi-tenant deployments share a single cache automatically when the system prompt is identical across tenants.
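The practical rule that makes those hits happen: the prefix is hashed over token blocks, so it must be byte-identical across requests. A sketch of how we build messages (the prompt file path and tenant key are illustrative):

# Prefix-cache hits require the cached prefix to be byte-identical across requests.
# Rule of thumb: keep the per-tenant system prompt constant and push anything that
# changes per turn (user name, timestamp, retrieved context) into the user message.
SYSTEM_PROMPTS = {
    "tenant-a": open("prompts/tenant_a.txt").read(),   # loaded once at startup
}

def build_messages(tenant: str, history: list[dict],
                   user_msg: str, user_name: str) -> list[dict]:
    # Per-turn variables go AFTER the stable prefix; interpolating them into the
    # system prompt would change the prefix hash and force a full cold prefill.
    return [
        {"role": "system", "content": SYSTEM_PROMPTS[tenant]},
        *history,
        {"role": "user", "content": f"[user: {user_name}] {user_msg}"},
    ]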
Concurrency and SLA tables
Numbers from production load tests with max-num-seqs 64, max-num-batched-tokens 4096 and the FP8 KV cache; prefix caching and chunked prefill are toggled per row:
| Model | Active | p50 TTFT | p95 TTFT | p99 TTFT | Median t/s/user | VRAM |
|---|---|---|---|---|---|---|
| Llama 3.1 8B FP8 (no cache) | 30 | 180 ms | 460 ms | 720 ms | 28 | 16 GB |
| Llama 3.1 8B FP8 + prefix cache | 30 | 40 ms | 180 ms | 340 ms | 30 | 17 GB |
| Llama 3.1 8B FP8 + cache + chunked | 30 | 40 ms | 175 ms | 290 ms | 30 | 17 GB |
| Llama 3.1 8B FP8 + everything | 50 | 85 ms | 420 ms | 780 ms | 22 | 20 GB |
| Qwen 14B AWQ + everything | 16 | 120 ms | 620 ms | 980 ms | 22 | 18 GB |
| Qwen 14B AWQ + everything | 26 | 240 ms | 980 ms | 1.6 s | 16 | 20 GB |
The named SaaS holds 30 concurrent active sessions on Llama 8B FP8 + prefix cache + chunked prefill within the 800ms p95 TTFT target with comfortable headroom. The 50-session row shows what happens if you push: p95 TTFT degrades to 420ms (still acceptable) but tail latency climbs and burst protection thins.
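To reproduce the TTFT percentiles on your own hardware, a minimal async probe against the OpenAI-compatible endpoint is enough. This is a simplified sketch, not the exact harness behind the table; the endpoint, model name and prompt sizes are illustrative:

import asyncio, time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="unused")
SYSTEM = "You are a support copilot." + " context" * 1200  # stands in for the ~1,400-token stable prompt

async def one_turn() -> float:
    start = time.perf_counter()
    ttft, first = 0.0, True
    stream = await client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=[{"role": "system", "content": SYSTEM},
                  {"role": "user", "content": "How do I rotate my API key?"}],
        max_tokens=180, stream=True,
    )
    async for _chunk in stream:          # consume the whole stream for realistic decode load
        if first:
            ttft = time.perf_counter() - start   # time to first streamed chunk = TTFT
            first = False
    return ttft

async def main(concurrency: int = 30, rounds: int = 10):
    ttfts = []
    for _ in range(rounds):
        ttfts += await asyncio.gather(*[one_turn() for _ in range(concurrency)])
    ttfts.sort()
    print(f"p50 {ttfts[len(ttfts) // 2] * 1000:.0f} ms, "
          f"p95 {ttfts[int(len(ttfts) * 0.95)] * 1000:.0f} ms")

asyncio.run(main())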
Chunked prefill physics
Without chunked prefill, vLLM serialises long prompts: a 4,000-token user paste blocks the decode loop for ~280 ms while it processes prefill in one go. During that window, every other concurrent session stops generating tokens — visible as latency spikes on dashboards. Chunked prefill breaks long prompts into 512-token chunks that interleave with decode steps, so any one session’s prefill costs only ~36 ms of decode pause at a time.
The trade-off: total prefill latency for the long-prompt request is slightly worse (~310 ms vs 280 ms) because of the chunk overhead, but p99 TTFT for all other concurrent sessions drops from ~1.4 s back down to ~290 ms. For interactive chatbot workloads where most messages are short and a few are large pastes, chunked prefill is non-negotiable.
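The same numbers as a worked calculation, assuming the ~14,000 tokens/s long-prompt prefill rate implied by the 280 ms / 4,000-token figure above:

# Chunked-prefill arithmetic (a sketch; prefill_tps is derived from the figures above).
prompt_tokens = 4000
prefill_tps = prompt_tokens / 0.280            # ~14,300 tokens/s on long prefills
chunk_tokens = 512

monolithic_pause_ms = prompt_tokens / prefill_tps * 1000   # ~280 ms in one block
per_chunk_pause_ms = chunk_tokens / prefill_tps * 1000     # ~36 ms per chunk
num_chunks = -(-prompt_tokens // chunk_tokens)              # 8 chunks (ceiling division)

print(f"one {monolithic_pause_ms:.0f} ms stall vs {num_chunks} stalls of {per_chunk_pause_ms:.0f} ms")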
Production vLLM configuration
The reference launch command for the named SaaS workload:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 --kv-cache-dtype fp8 \
--max-model-len 65536 --max-num-seqs 32 \
--enable-chunked-prefill --enable-prefix-caching \
--gpu-memory-utilization 0.92 \
--max-num-batched-tokens 4096 \
--disable-log-requests
Key knobs explained:
- max-num-seqs 32 caps concurrent sessions at 32 — set this to the concurrency your SLA supports, not the maximum the GPU could physically hold.
- max-model-len 65536 permits long-context users, but prefix caching and the FP8 KV cache mean you are not paying the worst-case context cost on every request (the KV arithmetic below shows why).
- gpu-memory-utilization 0.92 leaves roughly 2 GB for the OS, the monitoring agent and burst headroom — more conservative than the 0.95 many deployments run, and worth it for stability.
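The KV arithmetic behind the second point, as a sketch — the Llama 3.1 8B attention shape (32 layers, 8 KV heads, head dim 128) is the published architecture; the ~12 GB left for KV after FP8 weights and runtime overhead is an assumption inside the 0.92 utilisation cap:

# KV-cache sizing for Llama 3.1 8B with fp8 KV on a 24 GB card (a sketch).
layers, kv_heads, head_dim, bytes_per_val = 32, 8, 128, 1   # fp8 = 1 byte per value
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_val  # K and V
print(kv_bytes_per_token / 1024)              # 64 KiB of KV per token

free_for_kv_gib = 12                           # assumed headroom after weights/overhead
total_kv_tokens = free_for_kv_gib * 2**30 // kv_bytes_per_token
print(total_kv_tokens)                         # ~196,000 tokens of KV across all sessions

# 30 sessions sharing one cached 1,500-token system prompt plus ~4k tokens of
# per-session history fits comfortably inside that pool.
sessions, per_session_history = 30, 4000
print(1500 + sessions * per_session_history)   # ~121,500 tokens needed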
For Qwen 14B AWQ as the high-quality fallback endpoint:
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-14B-Instruct-AWQ \
--quantization awq_marlin --kv-cache-dtype fp8 \
--max-model-len 32768 --max-num-seqs 16 \
--enable-chunked-prefill --enable-prefix-caching \
--gpu-memory-utilization 0.93
Scaling triggers
One 4090 covers most B2B chatbots. At typical engagement rates that is roughly 5,000 MAU before you hit the card’s ~30-concurrent ceiling; the named workload gets to 200,000 MAU on the same peak only because its per-user engagement is far lower. Concrete scaling triggers for the named workload:
- Add a card at sustained 25 concurrent active sessions. One card per ~30 concurrent gives clean SLA headroom. Use a least-loaded HTTP balancer; vLLM scales linearly across replicas.
- Enable a per-tenant token-rate ceiling at 40 concurrent. Without rate limits, a single power user can dominate the batch. Cap at 80 t/s per tenant; users won’t notice (a token-bucket sketch follows after this list).
- Add Qwen 14B fallback card at 5%+ “complex query” routing. If your classifier routes more than 5% of traffic to the quality fallback, dedicate a card to it.
- Move to 5090 32GB when 32k context becomes common. The extra 8GB lets you run Qwen 14B at 32k with 16 concurrent users instead of 8.
- Promote tenant-specific prefix to its own KV pool at 100k+ daily turns/tenant. Single-tenant heavy users benefit from dedicated cache space.
- Frontier-quality fallback to a separate Llama 70B INT4 box for the top 1% of queries. 70B at INT4 does not fit in 24 GB (the quantised weights alone are ~35-40 GB), so the box needs two cards; a 2×4090 serves ~3 concurrent at SLA — plenty for the actual escalation rate.
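The token-bucket sketch for the per-tenant ceiling from the second trigger, as it would sit in the FastAPI proxy described under the gotchas below. The 80 t/s cap is the figure above; the burst size and tenant lookup are illustrative:

import time
from collections import defaultdict

CAP_TPS = 80          # per-tenant output-token ceiling
BURST_TOKENS = 400    # allow short bursts before throttling kicks in (assumed)

class TokenBucket:
    def __init__(self):
        self.tokens = BURST_TOKENS
        self.last = time.monotonic()

    def try_spend(self, n: int) -> bool:
        # Refill at CAP_TPS, capped at the burst size, then try to spend n tokens.
        now = time.monotonic()
        self.tokens = min(BURST_TOKENS, self.tokens + (now - self.last) * CAP_TPS)
        self.last = now
        if self.tokens >= n:
            self.tokens -= n
            return True
        return False

buckets: dict[str, TokenBucket] = defaultdict(TokenBucket)

def charge_tenant(tenant_id: str, generated_tokens: int) -> bool:
    """Called per streamed chunk; returns False when the tenant should be paced."""
    return buckets[tenant_id].try_spend(generated_tokens)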
Production gotchas
- Always stream tokens to the client. Perceived latency is dominated by TTFT plus the first 30-50 tokens; once a coherent first sentence appears, users will tolerate the rest. Buffering until completion ruins the UX.
- Pin max_model_len to your real maximum. Permitting 128k context allocates KV cache for the heaviest possible tail and silently cuts batch capacity. Most chats are under 8k — set it there and add a dedicated long-context endpoint for the rare exception.
- Set gpu-memory-utilization 0.92, not 0.97. The extra 2-3 GB of headroom prevents OOM on traffic spikes and leaves room for a co-located embedding model or small reranker if you need one.
- WebSocket keep-alives matter. Default load balancers close idle connections after 60 seconds; if a user pauses mid-thought, the next message reconnects with a fresh TLS handshake (40-80 ms penalty).
- Don’t serialise through Lambda-style request gateways. They add 100-300ms cold-start latency that you can’t optimise away. Run a long-lived FastAPI proxy in front of vLLM and connect WebSockets directly.
- Prefix cache eviction is LRU. If you have many tenants with different system prompts and the total cached prefixes exceed the VRAM budget, less-active tenants get evicted. Tune --num-gpu-blocks-override if you have predictable tenant rotation.
- Pre-warm before opening to traffic. First inference after vLLM startup takes 8-12 seconds for CUDA graph compilation. Run a dummy turn before adding the host to the load balancer (a sketch follows this list).
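The pre-warm step from the last gotcha, as a minimal sketch; the endpoint and model name are illustrative, and it is meant to run from the deploy script before the host is registered with the balancer:

import requests

# One dummy turn so the first real user doesn't pay the post-startup warm-up cost.
resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [
            {"role": "system", "content": "You are a support copilot."},
            {"role": "user", "content": "warm-up"},
        ],
        "max_tokens": 16,
    },
    timeout=120,   # generous: the first call is the slow one
)
resp.raise_for_status()
print("warm — safe to add to the load balancer")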
Verdict: when to pick a 4090 for chatbots
Pick the RTX 4090 24GB for chatbot backends when you have steady traffic above ~10 concurrent active sessions and want predictable monthly cost. The named 200k-MAU SaaS workload runs comfortably on one 4090 with 30% growth headroom — moving to the dedicated card cut their monthly bill from £6,800 to roughly £700, with better p95 latency than the metered API it replaced. Step down to a 5060 Ti only for solo dev/test or under-5-concurrent workloads. Step up to the 5090 32GB when 32k context becomes the median or you want Qwen 14B as primary. For tail-latency-sensitive workloads the right answer is multiple 4090s behind a balancer, not a single bigger card.
Run a chatbot you can size
30 active sessions per card, p99 TTFT under 800ms with prefix caching and chunked prefill. Predictable monthly bill, no per-message API tax. UK dedicated hosting.
Order the RTX 4090 24GB
See also: concurrent users benchmark, prefill and decode, vLLM setup, FP8 Llama deployment, Llama 8B benchmark, Qwen 14B benchmark, SaaS RAG, startup MVP, 4090 spec breakdown.