Self-hosting a chatbot has stopped being a hobbyist exercise. With a Blackwell RTX 5060 Ti 16GB on Gigagpu UK dedicated hosting you can serve Llama 3 8B FP8 at 112 tokens per second and keep 10-20 generation streams in flight on a single card. This post sizes two realistic deployments (an 8B general assistant and a Phi-3-backed scoped bot), lays out the benefits of prefix caching, and compares the monthly bill against the ChatGPT API for the same traffic.
Contents
- Picking the model
- Prefix caching for system prompts
- Latency table
- Concurrency
- Monthly cost vs ChatGPT API
- Deployment notes
Picking the model
For most customer-facing assistants, Llama 3 8B Instruct at FP8 gives ChatGPT-3.5-class quality. For narrow, high-volume bots (intent routing, FAQ answering, form-filling agents) Phi-3-mini 3.8B is a better fit: it runs at 285 t/s and frees the GPU for more parallel streams.
| Model | Use case | VRAM | Single t/s | Aggregate t/s |
|---|---|---|---|---|
| Llama 3 8B FP8 | General assistant | 11.3 GB | 112 | 720 @ 16 streams |
| Phi-3-mini 3.8B FP8 | Scoped / FAQ bot | 4.9 GB | 285 | 1,850 @ 32 streams |
| Mistral 7B FP8 | Multilingual | 9.8 GB | 122 | 780 @ 16 streams |
| Qwen 2.5 14B AWQ | Reasoning-heavy | 13.6 GB | 70 | 310 @ 8 streams |
Prefix caching for system prompts
Most production chatbots carry a 1,500-3,000-token system prompt (persona, tool schemas, safety rules). vLLM's automatic prefix caching reuses the KV cache for that shared prefix across every conversation. For a 2,000-token system prompt plus a 150-token user turn, prefill time drops from 215 ms to 18 ms, roughly a 12x reduction, and time to first token falls below 80 ms.
| Scenario | Prefill tokens | TTFT (no cache) | TTFT (prefix cache) |
|---|---|---|---|
| Short turn, 2k sys prompt | 2,150 | 215 ms | 78 ms |
| Long turn, 2k sys + 1k user | 3,000 | 298 ms | 142 ms |
| RAG turn, 2k sys + 4k ctx | 6,150 | 612 ms | 418 ms |
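You can measure the effect yourself against the OpenAI-compatible endpoint vLLM exposes. The sketch below times TTFT for two turns that share the same system prompt; the `base_url`, model name and prompt file are placeholders for your own deployment, and the absolute numbers will depend on your prompt length and hardware.

```python
import time
from openai import OpenAI

# Assumes a vLLM server on this box started with --enable-prefix-caching;
# base_url, api_key and the model name are placeholders for your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# The shared ~2,000-token persona / tool / safety prompt. It must be byte-identical
# across requests for the cached prefix to be reused.
SYSTEM_PROMPT = open("system_prompt.txt").read()

def ttft(user_turn: str) -> float:
    """Time from sending the request to receiving the first streamed chunk."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_turn},
        ],
        max_tokens=180,
        stream=True,
    )
    next(iter(stream))  # first chunk marks time to first token
    return time.perf_counter() - start

# The first call prefills the full 2k-token prefix; the second should hit the cache.
print(f"cold prefix TTFT: {ttft('What are your opening hours?'):.3f} s")
print(f"warm prefix TTFT: {ttft('Do you ship to Ireland?'):.3f} s")
```

Keeping the system prompt byte-identical across conversations is the one hard requirement: any per-session text (user names, timestamps) should go into the user turn, not the prefix.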
Latency table
Assuming a typical 180-token response and the 2k system prompt described above:
| Model | TTFT (cached) | Tokens/s | Full response |
|---|---|---|---|
| Llama 3 8B FP8 | 78 ms | 112 | 1.68 s |
| Phi-3-mini FP8 | 31 ms | 285 | 0.66 s |
| Mistral 7B FP8 | 71 ms | 122 | 1.55 s |
| Qwen 2.5 14B AWQ | 142 ms | 70 | 2.71 s |
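The full-response column is just cached TTFT plus decode time for the 180 output tokens. A quick back-of-the-envelope check, using the figures from the tables above, reproduces it to within rounding:

```python
# Full response time ≈ TTFT (warm prefix cache) + output_tokens / decode rate.
for name, ttft_ms, tokens_per_s in [
    ("Llama 3 8B FP8", 78, 112),
    ("Phi-3-mini FP8", 31, 285),
    ("Mistral 7B FP8", 71, 122),
    ("Qwen 2.5 14B AWQ", 142, 70),
]:
    total = ttft_ms / 1000 + 180 / tokens_per_s
    print(f"{name}: {total:.2f} s")
```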
Concurrency
At a realistic user cadence of one turn every 20 seconds, the card supports:
| Model | Concurrent users | Turns / hour | p95 latency |
|---|---|---|---|
| Llama 3 8B FP8 | 180 | 32,400 | 2.4 s |
| Phi-3-mini FP8 | 460 | 82,800 | 0.9 s |
| Mistral 7B FP8 | 195 | 35,100 | 2.2 s |
| Qwen 2.5 14B AWQ | 78 | 14,040 | 3.8 s |
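The turns-per-hour column follows directly from the cadence: at one turn every 20 seconds, each user contributes 3,600 / 20 = 180 turns an hour, so hourly throughput is simply concurrent users times 180:

```python
# At one turn per user every 20 s, each user produces 3,600 / 20 = 180 turns/hour.
TURNS_PER_USER_PER_HOUR = 3600 / 20

for model, users in [("Llama 3 8B FP8", 180), ("Phi-3-mini FP8", 460),
                     ("Mistral 7B FP8", 195), ("Qwen 2.5 14B AWQ", 78)]:
    print(f"{model}: {int(users * TURNS_PER_USER_PER_HOUR):,} turns/hour")
```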
Monthly cost vs ChatGPT API
Take a mid-range SaaS chatbot doing 1 million turns/month, each with a 2k-token prompt and 300 output tokens:
| Provider | Input cost | Output cost | Monthly total |
|---|---|---|---|
| GPT-4o-mini API | $300 | $180 | ~$480 |
| GPT-4o API | $5,000 | $3,000 | ~$8,000 |
| Claude 3.5 Sonnet API | $6,000 | $4,500 | ~$10,500 |
| Gigagpu 5060 Ti 16GB | Flat monthly rental | Flat monthly rental | From ~£160 |
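The API rows fall straight out of per-million-token list prices. The prices below are assumptions of list pricing at the time of writing, so re-check them before making a decision; the arithmetic itself is the point:

```python
# Assumed per-million-token list prices (input $/1M, output $/1M) -- verify current pricing.
PRICES = {
    "GPT-4o-mini": (0.15, 0.60),
    "GPT-4o": (2.50, 10.00),
    "Claude 3.5 Sonnet": (3.00, 15.00),
}

TURNS = 1_000_000      # turns per month
PROMPT_TOKENS = 2_000  # per turn
OUTPUT_TOKENS = 300    # per turn

for name, (p_in, p_out) in PRICES.items():
    input_cost = TURNS * PROMPT_TOKENS / 1e6 * p_in
    output_cost = TURNS * OUTPUT_TOKENS / 1e6 * p_out
    print(f"{name}: ${input_cost:,.0f} in + ${output_cost:,.0f} out = ${input_cost + output_cost:,.0f}/month")
```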
For any workload that would otherwise hit the premium APIs, a single 5060 Ti pays for itself many times over – and every additional token is essentially free.
Deployment notes
Use vLLM 0.6 with --enable-prefix-caching, pin --max-model-len to what your longest conversation actually needs (16k is plenty for most chat), and set --gpu-memory-utilization 0.85 to give the continuous batcher maximum room.
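The same knobs are available from vLLM's Python API if you embed the engine directly rather than running the OpenAI server. A minimal sketch, assuming the Llama 3 8B Instruct checkpoint and vLLM's fp8 weight quantization (swap in your own model and limits):

```python
from vllm import LLM, SamplingParams

# Mirrors the CLI flags above: prefix caching on, context capped at 16k,
# 85% of VRAM handed to the engine so the continuous batcher has headroom.
# The model name and fp8 quantization setting are assumptions for this sketch.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    quantization="fp8",
    enable_prefix_caching=True,
    max_model_len=16384,
    gpu_memory_utilization=0.85,
)

params = SamplingParams(max_tokens=180, temperature=0.7)
out = llm.chat(
    [{"role": "system", "content": "You are a concise support assistant."},
     {"role": "user", "content": "What payment methods do you accept?"}],
    sampling_params=params,
)
print(out[0].outputs[0].text)
```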
Host your chatbot on a dedicated UK GPU
Llama 3 8B, Phi-3 or Qwen – one card, hundreds of concurrent users. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: 5060 Ti chatbot backend, prefix caching guide, FP8 Llama deployment, vLLM setup, Phi-3 mini benchmark.