Running on-device inference is increasingly viable for small models, but anything above the ~3B-parameter class still needs a backend. An RTX 5060 Ti 16GB on our dedicated GPU hosting sits neatly in the gap – enough compute to serve Llama 3.1 8B FP8 to 100+ concurrent users, yet small enough to run regionally for sub-second latency to mobile and IoT clients.
Contents
- Request profile
- Model selection
- Architecture
- Latency budget
- Batching and async queues
- Regional placement
Request Profile
Edge clients differ from web apps in shape:
| Attribute | Typical value | Implication |
|---|---|---|
| Prompt size | 200-800 tokens | Prefill fast, decode dominates |
| Response size | 50-300 tokens | Short but latency-sensitive |
| Concurrency | 50-500 devices | Batching critical |
| Latency SLA | <1 s end-to-end | Streaming mandatory |
| Connection | Mobile 4G/5G, often intermittent | Retry-safe idempotent APIs |
| Auth | Device token / JWT | Per-request validation budget <5 ms |
| Payload | Often small JSON | HTTP/2 multiplexed helps |
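The "retry-safe idempotent APIs" row deserves a concrete shape: the client generates one idempotency key per logical request and reuses it on retry, and the gateway caches the first reply so a 4G retransmit never triggers a second inference. A minimal sketch – the function names and in-memory store are ours for illustration; production would use Redis with a TTL:

```python
import uuid

# Hypothetical in-memory dedupe store; in production this would be
# Redis with a TTL so retries within a few minutes hit the cache.
_seen: dict[str, str] = {}

def run_inference(prompt: str) -> str:
    return f"echo: {prompt}"                   # stand-in for the real vLLM call

def handle_chat(idempotency_key: str, prompt: str) -> str:
    """Return the cached reply if this key was already processed."""
    if idempotency_key in _seen:
        return _seen[idempotency_key]          # duplicate: no second inference
    reply = run_inference(prompt)
    _seen[idempotency_key] = reply
    return reply

# A mobile client generates one key per logical request, reused on retry:
key = str(uuid.uuid4())
first = handle_chat(key, "hello")
retry = handle_chat(key, "hello")              # network retry, same key, same reply
```

The key point is that retries are free: an intermittent connection can resend the same request any number of times without duplicating work or billing.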
Model Selection
| Model | Use case | VRAM | Concurrent users |
|---|---|---|---|
| Phi-3-mini 3.8B Q4 | Quick classify, short replies | ~3 GB | 300+ |
| Mistral 7B FP8 | Chat, summarise | ~9 GB | 150-200 |
| Llama 3.1 8B FP8 | General chat | ~10 GB | 100-150 |
| Qwen 2.5 7B AWQ | Multilingual chat | ~8 GB | 150-180 |
| BGE-M3 embedder | On-device RAG support | ~2 GB | thousands/s |
For a mobile app doing chat with a dash of RAG, pair Llama 3.1 8B FP8 (~10 GB) with BGE-M3 (~2 GB) on one 5060 Ti – 12 GB used, 4 GB headroom for KV cache.
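That 4 GB headroom figure can be sanity-checked with back-of-envelope arithmetic. Assuming the published Llama 3.1 8B geometry (32 layers, 8 KV heads under GQA, head dim 128) and 1-byte FP8 KV entries – verify against your own vLLM startup logs, which report the actual KV block count:

```python
# Back-of-envelope KV-cache budget for the 12 GB pairing above.
# Assumed Llama 3.1 8B geometry: 32 layers, 8 KV heads (GQA), head dim 128.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BYTES_PER_ELEM = 1                       # FP8 KV cache

# K and V per token, summed across all layers
kv_bytes_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
headroom_gb = 16 - 10 - 2                # card minus weights minus embedder

tokens_in_cache = headroom_gb * 1024**3 // kv_bytes_per_token
print(kv_bytes_per_token)     # 65536 bytes = 64 KiB per token
print(tokens_in_cache)        # 65536 tokens of KV cache

# At ~800 tokens of prompt + response per request:
concurrent_sequences = tokens_in_cache // 800
print(concurrent_sequences)   # ~81 sequences resident at once
```

In practice vLLM reserves some of that headroom for activations and CUDA graphs, so treat ~80 resident sequences as an upper bound, not a promise.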
Architecture
```
[mobile / IoT]                        [dedicated 5060 Ti server]

  POST /chat
  Accept: text/event-stream
      ------- HTTPS (TLS 1.3) ----->  nginx (TLS term, auth)
                                           |
                                           v
                                      FastAPI gateway
                                        - JWT validate
                                        - rate-limit per device
                                        - shape response
                                           |
                                           v
                                      vLLM OpenAI server (:8000)
                                        - Llama 3.1 8B FP8
                                        - continuous batching
                                        - streaming out
      <-------- SSE tokens --------------
```
Latency Budget
| Stage | Budget (mobile 4G) | Budget (mobile 5G) | Budget (IoT Wi-Fi) |
|---|---|---|---|
| Device -> edge node | 30-80 ms | 10-25 ms | 15-30 ms |
| TLS handshake (reused) | 0-5 ms | 0-2 ms | 0-2 ms |
| JWT validate | 1-3 ms | 1-3 ms | 1-3 ms |
| Prefill (500 prompt tokens) | ~70 ms | ~70 ms | ~70 ms |
| TTFT (first token) | ~100 ms | ~80 ms | ~90 ms |
| Stream to user (100 tokens) | ~900 ms | ~900 ms | ~900 ms |
| Total perceived | ~1.0 s | ~980 ms | ~990 ms |
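The budget can be sanity-checked with simple addition, taking mid-range values from each row (assumed, not measured; it lands close to the table's first-token and end-to-end figures):

```python
# Sanity check on the 4G column, taking mid-range values from the table.
network_ms = 55          # device -> edge node, midpoint of 30-80 ms
tls_ms = 2               # resumed TLS handshake
jwt_ms = 2               # JWT validation
prefill_ms = 70          # 500 prompt tokens
ttft_ms = network_ms + tls_ms + jwt_ms + prefill_ms
print(ttft_ms)           # 129 ms to first token

# Decode: 100 tokens streamed in ~900 ms => ~9 ms/token per stream,
# i.e. ~110 tokens/s per user while the batch runs.
decode_ms = 900
total_ms = ttft_ms + decode_ms
print(total_ms)          # 1029 ms end-to-end, within the ~1 s budget
```

The takeaway: decode dominates the budget, which is why streaming is mandatory – the user sees the first token in ~130 ms even though the full reply takes a second.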
Batching and Async Queues
- Continuous batching – vLLM merges new and in-flight requests per step; no tuning needed for typical mobile loads
- Async queue for non-interactive tasks – transcription, batch summarisation, overnight reports go via Redis Streams or RabbitMQ so they don’t contend with live chat
- Rate-limiting per device – prevents a single buggy client from starving others; nginx `limit_req zone=per-device burst=10 nodelay`
- Graceful degradation – when queue depth exceeds a threshold, return `429` with a `Retry-After` header and exponential backoff rather than letting tail latency explode
- Warmup on deploy – the first request after model load compiles CUDA graphs and is ~3x slower; send a canary request at startup
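The client-side counterpart to `429` + `Retry-After` is a backoff schedule seeded by the server's hint. A deterministic sketch (function name is ours; production clients should add random jitter so a fleet of devices does not retry in lockstep):

```python
def retry_delays(retry_after_s: float, attempts: int, cap_s: float = 60.0) -> list[float]:
    """Exponential backoff seeded by the server's Retry-After header.

    Deterministic for clarity; add random jitter in production so
    thousands of devices do not hammer the backend in sync.
    """
    return [min(retry_after_s * 2**i, cap_s) for i in range(attempts)]

# Server sends 'Retry-After: 2'; the client waits 2, 4, 8, 16, 32 s:
print(retry_delays(2, 5))   # [2, 4, 8, 16, 32]
```

Capping the delay (here at 60 s) keeps a long outage from pushing retries out indefinitely while still draining the thundering herd.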
Regional Placement
UK hosting suits UK, Ireland, and Western Europe traffic – round-trip times stay under 30 ms for most users. For a global app consider two or three regional backends (UK, US-east, Asia) behind a Geo-DNS or Anycast router. One 5060 Ti per region is usually sufficient up to the low tens of thousands of daily active users on a chat workload.
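The "low tens of thousands of daily active users" claim follows from simple rate arithmetic – the usage figures below are illustrative assumptions, not measurements:

```python
# Illustrative capacity check for one regional 5060 Ti (assumed usage).
dau = 20_000                 # daily active users
chats_per_user = 8           # chat requests per user per day (assumption)
peak_factor = 6              # peak hour carries ~6x the average rate

avg_rps = dau * chats_per_user / 86_400
peak_rps = avg_rps * peak_factor
print(round(avg_rps, 2))     # ~1.85 req/s average
print(round(peak_rps, 1))    # ~11.1 req/s at peak

# Each request holds a slot for ~1 s (TTFT + stream), so peak demand is
# ~11 concurrent sequences -- well inside the 100-150 the card supports.
```

Even with a sharper peak factor there is an order of magnitude of slack, which is why one card per region covers a chat workload at this scale.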
Mobile and IoT Backend Hosting
Blackwell 16 GB with low-latency UK peering. UK dedicated hosting.
Order the RTX 5060 Ti 16GB.

See also: vLLM setup, FP8 deployment, Llama 3 8B benchmark, Docker CUDA setup, first-day checklist.