An agent backend has very different latency demands from a chatbot: many short, structured calls per user-visible action, with strict tool-call grammars and tight round-trip budgets. The RTX 4090 24GB dedicated server is the cheapest credible single-card option for this workload. Qwen 2.5 14B Instruct AWQ at 135 t/s decode handles fast tool calling, Qwen 2.5 32B AWQ at 65 t/s steps in for harder planning, and Llama 3.1 70B AWQ at 22-24 t/s is the heavy fallback for the ~5% of turns that need it. This article is the production playbook: turn anatomy, model menu, latency budget breakdown, concurrent-session tables, prompt and grammar tactics, scaling triggers, ops and gotchas. The wider hardware menu is on dedicated GPU hosting.
Contents
- The named workload: 5-turn task agent, 3 tools per turn
- Turn anatomy and the latency budget
- Model selection for tool use
- Worked latency budget per turn
- Capacity, concurrency and scaling triggers
- Cost vs hosted alternatives
- Prompt, grammar and parser tactics
- Production gotchas, ops and verdict
The named workload: 5-turn task agent, 3 tools per turn
The reference agent is a customer-success assistant that, for every user message, plans, fires three parallel tool calls (CRM lookup, knowledge-base RAG, billing API), observes the results and composes a final reply. The average task spans five user turns end-to-end. Tool-call output is JSON pinned to a schema. The per-turn latency budget is 4-5 seconds at p95 (usability research suggests anything above 7 seconds breaks the user's perception of "fast"); the end-to-end task budget is 25 seconds. The concurrency target is 8-10 active agents per card under the same SLA, scaling to roughly 200-400 distinct daily users at typical 30:1 daily-to-active ratios.
Why the 4090 fits this brief
Decode throughput dominates per-turn latency for short outputs. Qwen 14B AWQ at 135 t/s on the 4090 emits a 100-token tool call in 740 ms and a 250-token compose in 1.85 s. A five-turn task at three calls per turn lands inside the 25-second budget with margin. KV cache at 8k context per session across 8 active sessions costs ~6 GB; the 14B AWQ base sits at 10.2 GB; the total fits in 24 GB with spike headroom. Cross-checked on the Qwen 14B page and the spec breakdown.
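The ~6 GB KV figure can be sanity-checked from the model architecture. A minimal sketch, assuming Qwen 2.5 14B's published shape (48 layers, 8 KV heads of dimension 128 under GQA) and the fp8 KV-cache dtype used in the launch command later in this article; re-check these numbers against the model config before trusting them:

```python
# Back-of-envelope KV-cache sizing for Qwen 2.5 14B with an fp8 KV dtype.
# The architecture numbers (layers, KV heads, head dim) are assumptions
# taken from the published Qwen2.5-14B config; verify per model.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP8 = 48, 8, 128, 1

def kv_bytes_per_token() -> int:
    # K and V each store kv_heads * head_dim values per layer.
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP8

def kv_gib(context_tokens: int, sessions: int) -> float:
    return kv_bytes_per_token() * context_tokens * sessions / 2**30

print(kv_bytes_per_token())        # 98304 bytes, i.e. 96 KiB per token
print(round(kv_gib(8192, 8), 1))   # 6.0 GiB for 8 sessions at 8k context
```

The same arithmetic explains why vertical scaling via longer contexts is brittle: doubling context doubles KV cost per session linearly.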
Turn anatomy and the latency budget
A typical agent turn decomposes into discrete LLM and tool stages. Each is short (50-250 output tokens) but the sequential dependencies compound. If the user expects a 4-second response, you have only ~700 ms of LLM time per call when chained five deep, less if a tool blocks. That makes raw decode throughput more important than absolute model size, and makes prompt-prefix caching a first-class concern.
| Stage | Typical output tokens | Wall clock at 135 t/s | Wall clock at 195 t/s | Wall clock at 24 t/s (70B) |
|---|---|---|---|---|
| Plan | 120 | 0.89 s | 0.62 s | 5.0 s |
| Tool call (single) | 80 | 0.59 s | 0.41 s | 3.3 s |
| Observe / parse | 0 | 0.05 s | 0.05 s | 0.05 s |
| Compose | 250 | 1.85 s | 1.28 s | 10.4 s |
| Prefill (with prefix cache) | — | 0.10 s | 0.08 s | 0.40 s |
| Network round trips | — | 0.20 s | 0.20 s | 0.20 s |
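The wall-clock columns above are just output tokens divided by decode rate; a few lines reproduce them:

```python
def decode_wall_clock(output_tokens: int, tokens_per_sec: float) -> float:
    """Decode-dominated wall clock for a short structured call, in seconds."""
    return output_tokens / tokens_per_sec

# Reproduce the plan and compose rows at the three decode rates.
for tokens in (120, 250):
    print([round(decode_wall_clock(tokens, tps), 2) for tps in (135, 195, 24)])
# 120 tokens -> [0.89, 0.62, 5.0]
# 250 tokens -> [1.85, 1.28, 10.42]
```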
Model selection for tool use
| Model | Quant | Decode t/s | VRAM | Tool-call quality | Use lane |
|---|---|---|---|---|---|
| Qwen 2.5 14B Instruct | AWQ INT4 | 135 | 10.2 GB | Excellent native function calling | Default workhorse |
| Qwen 2.5 32B Instruct | AWQ INT4 | 65 | 19.1 GB | Strong on multi-step plans | Hard reasoning fallback |
| Llama 3.1 70B Instruct | AWQ INT4 | 22-24 | ~22 GB | Highest tool-selection accuracy | Heavy fallback only |
| Llama 3.1 8B Instruct | FP8 | 195 | 9.5 GB | Good with strict grammars | Speed-first lanes |
| Mistral Nemo 12B | FP8 | 145 | 13 GB | Decent multilingual tools | EU-language sub-agents |
| Mistral 7B Instruct | FP8 | 215 | 8 GB | Decent on simpler tools | Cheap classifier |
| Phi-3 mini | FP8 | 480 | 4 GB | Adequate for narrow tools | Routing, sub-agents |
Qwen 2.5 family ships with first-class native function-calling templates supported by vLLM’s tool-call parser; for Llama 3 use the JSON-mode plus grammar-constrained decoding path. See the Qwen 14B page, Qwen 32B page and Llama 70B INT4 page for per-model deployment specifics.
Standard launch for the agent workhorse
```bash
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-14B-Instruct-AWQ \
  --quantization awq --kv-cache-dtype fp8 \
  --max-model-len 16384 --max-num-seqs 32 \
  --enable-chunked-prefill --enable-prefix-caching \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --gpu-memory-utilization 0.92
```
`--enable-auto-tool-choice` together with `--tool-call-parser hermes` activates Qwen's native function-calling template. Pair it with the AWQ quantisation guide for the model-prep workflow.
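On the client side, the tool-call path is a standard OpenAI-style chat-completions request. A sketch of the request body you would POST to the server's `/v1/chat/completions` endpoint; the `crm_lookup` tool, its parameters and the account id are illustrative assumptions, not part of the launch config:

```python
import json

# Request body for the OpenAI-compatible endpoint vLLM exposes
# (default http://localhost:8000/v1/chat/completions; adjust to taste).
# crm_lookup is a hypothetical example tool.
payload = {
    "model": "Qwen/Qwen2.5-14B-Instruct-AWQ",
    "messages": [
        {"role": "user", "content": "Why was account 4411 downgraded?"}
    ],
    "tools": [{
        "type": "function",
        "function": {
            "name": "crm_lookup",
            "description": "Fetch a customer record by account id.",
            "parameters": {
                "type": "object",
                "properties": {"account_id": {"type": "string"}},
                "required": ["account_id"],
            },
        },
    }],
    "tool_choice": "auto",  # pairs with --enable-auto-tool-choice server-side
}
print(len(json.dumps(payload)) > 0)  # serialisable as-is
```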
Worked latency budget per turn
One realistic turn: the agent plans, fires three parallel tool calls (CRM at 200 ms wall, RAG at 350 ms wall, billing at 250 ms wall), observes the results, then composes the final reply. With Qwen 14B AWQ at 135 t/s on the 4090 and prefix cache primed:
| Step | Output tokens | LLM time | Tool wall | Cumulative |
|---|---|---|---|---|
| Plan | 120 | 0.89 s | — | 0.89 s |
| 3 tool calls (parallel emit) | 3 x 80 | 0.59 s (batched) | 0.35 s (max of 3) | 1.83 s |
| Compose | 250 | 1.85 s | — | 3.68 s |
| Prefill (prefix cached) + network | — | 0.40 s | — | 4.08 s |
Inside the 4-5 second p95 budget. A full five-turn task lands around 20 s end-to-end in the realistic case where each turn averages 4 s, with each turn compressing toward ~3.5 s when tool calls overlap the plan stage perfectly. Switching the workhorse to Llama 3 8B FP8 (195 t/s) shaves the same flow to about 3.0 s per turn, at some cost in tool-selection accuracy on harder cases. Falling back to Llama 70B AWQ on the same card keeps very short outputs tolerable at about 1.5 s but stretches long composes to 8-10 s; reserve it for the 5-10% of turns that fail validation on the workhorse.
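The escalation policy in the last paragraph reduces to a validate-then-escalate router: try the workhorse, validate the tool call, and only burn 70B decode time when validation fails. A sketch with a stubbed `call_llm` (the model names and required keys are assumptions; swap in your real client and schema):

```python
import json

MODELS = ["qwen-14b-awq", "qwen-32b-awq", "llama-70b-awq"]  # cheap -> heavy

def call_llm(model: str, prompt: str) -> str:
    """Stub for the real inference call; replace with your client."""
    return '{"tool": "crm_lookup", "args": {"account_id": "4411"}}'

def route_turn(prompt: str, required_keys=("tool", "args")) -> dict:
    # Try the workhorse first; escalate only when validation fails.
    for model in MODELS:
        raw = call_llm(model, prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed JSON: fall through to the next tier
        if all(k in parsed for k in required_keys):
            return {"model": model, "call": parsed}
    raise RuntimeError("all tiers failed validation")

print(route_turn("look up account 4411")["model"])  # qwen-14b-awq
```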
Capacity, concurrency and scaling triggers
vLLM continuous batching makes parallel tool calls within the same agent essentially free as long as KV space allows. With `--max-num-seqs 32` and 8k context per session, the 4090 sustains roughly 8-10 active agent sessions on Qwen 14B AWQ within the quality SLA, scaling to 16 active on Llama 8B FP8 and 3-4 active on Llama 70B AWQ.
| Workhorse model | Active sessions p95 | Sessions/sec turn rate | Daily active users (10x) | MAU (30x) |
|---|---|---|---|---|
| Llama 3 8B FP8 | ~16 | ~5 | ~160 | ~480 |
| Qwen 14B AWQ | ~10 | ~3 | ~100 | ~300 |
| Qwen 32B AWQ | ~5 | ~1.5 | ~50 | ~160 |
| Llama 70B AWQ | ~3-4 | ~0.8 | ~30-40 | ~120 |
Scaling triggers
- p95 turn time > 5 s for two consecutive minutes: add a second 4090 with the same workhorse and round-robin balance, or downgrade the workhorse from 14B to 8B.
- Tool-selection F1 < 0.92 on daily replay: promote workhorse from 14B to 32B AWQ; expect halved capacity but tighter quality.
- Sustained > 12 active sessions: add a second 4090 rather than larger context windows; KV pressure makes vertical scaling brittle.
- Heavy-fallback rate > 15% of turns: upgrade the default model rather than the fallback; the fallback should remain rare. See when to upgrade.
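The first trigger, p95 above 5 s for two consecutive minutes, reduces to a small rolling check. A sketch assuming per-minute p95 readings as input:

```python
from collections import deque

P95_LIMIT_S = 5.0
CONSECUTIVE_MINUTES = 2

class ScaleTrigger:
    """Fires when per-minute p95 exceeds the limit N minutes in a row."""
    def __init__(self):
        self.window = deque(maxlen=CONSECUTIVE_MINUTES)

    def observe(self, minute_p95_s: float) -> bool:
        self.window.append(minute_p95_s)
        return (len(self.window) == CONSECUTIVE_MINUTES
                and all(s > P95_LIMIT_S for s in self.window))

t = ScaleTrigger()
print(t.observe(4.2))  # False: under budget
print(t.observe(5.4))  # False: only one bad minute so far
print(t.observe(5.8))  # True: two consecutive minutes over budget
```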
Cost vs hosted alternatives
An agent backend amplifies token cost: every user-visible action consumes 5-15 LLM calls. A modest 100-DAU support agent at 5 turns and 4 calls per turn generates ~2,000 calls/day; counting the prompt context each call re-processes on top of its ~250 output tokens, total volume reaches roughly 50 M tokens/day, or 1.5 B/month, well inside the 4090's 2.85 B/month sustained capacity at Llama 3 8B FP8.
| Volume | 4090 self-host (Qwen 14B) | GPT-4o-mini ($0.30/M) | GPT-4o ($5/M) | Claude Haiku ($0.58/M) | Claude Sonnet ($7/M) |
|---|---|---|---|---|---|
| 200 M tok/mo | $700 | $60 | $1,000 | $116 | $1,400 |
| 500 M tok/mo | $700 | $150 | $2,500 | $290 | $3,500 |
| 1.5 B tok/mo | $700 | $450 | $7,500 | $870 | $10,500 |
| 2.5 B tok/mo (cap) | $700 | $750 | $12,500 | $1,450 | $17,500 |
Break-even against GPT-4o is 140 M tokens/month (Qwen 14B matches many tool-call tasks at near-equivalent quality); against Sonnet, 100 M/month. Full crossover curves are on the break-even calculator and the vs Anthropic comparison.
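The break-even points follow directly from the flat monthly cost. A sketch assuming the $700/month figure from the table above:

```python
SELF_HOST_USD_PER_MONTH = 700  # flat 4090 dedicated-server cost from the table

def break_even_mtok(api_usd_per_mtok: float) -> float:
    """Monthly volume, in millions of tokens, where the flat card beats the API."""
    return SELF_HOST_USD_PER_MONTH / api_usd_per_mtok

print(break_even_mtok(5.0))   # 140.0 M tok/month vs GPT-4o at $5/M
print(break_even_mtok(7.0))   # 100.0 M tok/month vs Claude Sonnet at $7/M
```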
Prompt, grammar and parser tactics
- Constrained decoding: vLLM supports outlines, lm-format-enforcer and xgrammar; pin tool arguments to a JSON schema. Cuts parse errors from ~5% to <0.5%.
- Tool descriptions short and atomic: long function specs eat prefill latency. Keep each tool under 200 tokens; group related tools rather than nesting parameters.
- Cache the tool registry as an immutable system prompt: with `--enable-prefix-caching` on, the static system prompt plus tool specs cache after the first call and add ~80 ms instead of ~400 ms per turn.
- Few-shot tool exemplars: 2-3 examples of correct tool use cut errors by ~40% on Qwen 14B; less effective on 32B, which is already strong.
- Fallback on parse failure with explicit grammar: re-prompt with the JSON schema embedded rather than retrying blindly. Recover ~80% of failures inside one extra call.
- Bounded tool depth: cap recursive tool calls per turn at 5 and per task at 25. Open-ended loops are how agent costs blow up overnight.
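The last two tactics, schema-embedded retry and bounded depth, combine into one guarded loop. A simplified sketch with a stubbed `call_llm`; the iteration and wall-clock caps mirror the limits above, and the schema hint string is an illustrative assumption:

```python
import json
import time

MAX_ITERATIONS_PER_TURN = 5     # hard cap on tool-call attempts per turn
TURN_WALL_CLOCK_BUDGET_S = 5.0  # hard wall-clock cap per turn

SCHEMA_HINT = '{"tool": "<name>", "args": {...}}'  # embedded on retry

def call_llm(prompt: str) -> str:
    """Stub for the real inference call; replace with your client."""
    return '{"tool": "rag_search", "args": {"query": "refund policy"}}'

def run_turn(prompt: str) -> list[dict]:
    calls, deadline = [], time.monotonic() + TURN_WALL_CLOCK_BUDGET_S
    for _ in range(MAX_ITERATIONS_PER_TURN):
        if time.monotonic() > deadline:
            break  # budget exhausted: stop and compose with what we have
        try:
            calls.append(json.loads(call_llm(prompt)))
        except json.JSONDecodeError:
            # Re-prompt with the schema embedded rather than retrying blindly.
            prompt = f"{prompt}\nReturn ONLY JSON matching: {SCHEMA_HINT}"
            continue
        break  # one valid tool call ends this simplified loop
    return calls

print(len(run_turn("find the refund policy")))  # 1
```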
Production gotchas, ops and verdict
Production gotchas
- Tool-call parser drift between vLLM versions: the Hermes and Mistral parsers in vLLM have changed semantics across 0.5/0.6 releases. Pin vLLM and the parser flag together; test with a hundred-prompt regression suite before each upgrade.
- Streaming + tool calls = partial JSON: streaming the compose stage is fine; streaming a tool-call stage means parsing partial JSON. Either disable streaming on tool stages or use a streaming parser.
- Prefix-cache invalidation on tool-registry change: hot-reloading tool definitions evicts the prefix cache and spikes p95 by 200-400 ms. Treat tool registry as immutable per process; rotate via blue/green deploy.
- Long tool outputs poisoning context: a 5,000-token web-page dump as a tool result eats compose budget. Truncate or summarise tool outputs above 1,000 tokens before re-feeding.
- Open-ended ReAct loops: without a hard turn cap an agent can chain 50 tool calls trying to satisfy an unsatisfiable request. Always set `max_iterations` and a wall-clock budget per task.
- Quiet quality drift on Qwen 14B for low-resource tools: rare tool selections degrade silently. Daily replay against gold traces is the only reliable signal.
- KV starvation under burst load: 32 concurrent 8k-context sessions can hit the KV cap and queue. Cap `--max-num-seqs` at 16 if your p95 matters more than throughput.
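The long-tool-output gotcha is cheap to guard against. A naive sketch that budgets by a rough 4-characters-per-token estimate; that ratio is an assumption, so use your tokenizer for a real count:

```python
MAX_TOOL_OUTPUT_TOKENS = 1000  # the cap suggested above
CHARS_PER_TOKEN = 4            # crude heuristic; swap in a real tokenizer

def clamp_tool_output(text: str) -> str:
    """Truncate oversized tool results before re-feeding them as context."""
    budget_chars = MAX_TOOL_OUTPUT_TOKENS * CHARS_PER_TOKEN
    if len(text) <= budget_chars:
        return text
    return text[:budget_chars] + "\n[... truncated at 1,000-token cap ...]"

print(len(clamp_tool_output("x" * 10_000)) < 10_000)  # True
```

A summarisation pass over the truncated text is the better option when the tail of the dump actually matters.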
Ops and observability
Trace each LLM call and tool call as a span with OpenTelemetry; record token counts, latency, model and parse outcome per span. Each day, replay the previous day's traces against a candidate model and score tool-call F1 (correct tool selected) and final-answer correctness against a graded gold set. See the vLLM setup tutorial for the full launch invocation and the coding-assistant guide for a worked agent pipeline.
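Tool-selection F1 from replayed traces reduces to precision and recall over (turn, tool) pairs. A minimal scorer, assuming gold and candidate traces are already flattened into sets of such pairs:

```python
def tool_selection_f1(gold: set, predicted: set) -> float:
    """F1 over (turn_id, tool_name) pairs from a daily replay."""
    if not gold and not predicted:
        return 1.0
    tp = len(gold & predicted)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

gold = {(1, "crm_lookup"), (1, "rag_search"), (2, "billing_api")}
pred = {(1, "crm_lookup"), (1, "rag_search"), (2, "crm_lookup")}
print(round(tool_selection_f1(gold, pred), 2))  # 0.67
```

This is the number to compare against the F1 < 0.92 scaling trigger.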
Verdict
For an agent backend serving 100-300 daily active users with a 4-5 second p95 turn budget, a single 4090 running Qwen 2.5 14B AWQ as the workhorse and Llama 70B AWQ as occasional fallback is the cheapest credible production option in 2026. Above 12 active sessions or 2 B tokens/month sustained, add a second 4090; above 25 active sessions, evaluate the 5090 for headroom. Below 100 M tokens/month or for spiky low-volume agents, a hosted API is still the right answer; see the cost comparison.
Agent backend on a single card, flat monthly
Qwen 14B AWQ workhorse, 70B AWQ fallback, native tool calling. UK dedicated hosting.
Order the RTX 4090 24GB

See also: Qwen 14B, Qwen 32B, Llama 70B INT4, coding assistant, vLLM setup, AWQ guide, concurrent users.