Not all LLM workloads need the same infrastructure. Real-time chatbots have strict latency requirements; nightly batch summarisation jobs can run at low priority during off-hours. Mixing them on the same hardware needs intentional scheduling.
- Real-time: dedicated capacity, predictable latency, max-num-seqs tuned for p99.
- Batch: queue-based, large batches, lower priority, runs during off-hours.
- Mixed on one GPU: real-time gets reserved capacity; batch fills idle slots.
- Or split tiers: a dedicated batch GPU runs at lower spec, separate from the real-time tier.
Workload shapes
- Real-time interactive: chat, voice agents. Strict TTFT < 2s. Concurrency varies.
- Real-time async: API responses for downstream services. Latency-tolerant up to ~10s.
- Scheduled batch: nightly summarisation, weekly report gen. Latency-tolerant; throughput-anchored.
- Burst batch: ad-hoc large-document processing. Time-sensitive but not interactive.
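These four shapes can be expressed as a small routing table that maps each workload to a queue, a priority, and a latency target. The queue names, priority numbers, and targets here are illustrative assumptions for this sketch, not part of any particular framework:

```python
# Illustrative routing table for the four workload shapes above.
# Lower priority number = served first; None = no latency target (throughput-anchored).
WORKLOAD_ROUTES = {
    "realtime_interactive": {"queue": "realtime", "priority": 0, "target_latency_s": 2},
    "realtime_async":       {"queue": "realtime", "priority": 1, "target_latency_s": 10},
    "scheduled_batch":      {"queue": "batch",    "priority": 2, "target_latency_s": None},
    "burst_batch":          {"queue": "batch",    "priority": 1, "target_latency_s": None},
}

def route(workload: str) -> dict:
    """Return queue/priority for a workload shape; unknown shapes default to batch."""
    return WORKLOAD_ROUTES.get(
        workload, {"queue": "batch", "priority": 2, "target_latency_s": None}
    )
```

Classifying at the edge like this keeps the serving tier simple: everything downstream only sees a queue name and a priority.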
Infrastructure
Three patterns for mixing workloads:
- Dedicated per workload: separate GPUs for real-time vs batch. Cleanest; highest cost.
- Shared GPU + priority queues: same vLLM serves both; priority queue routes real-time first. Cheaper; complex tuning.
- Time-shifted: real-time during business hours; batch jobs run overnight when GPU is otherwise idle.
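The shared-GPU pattern hinges on the priority queue: real-time requests must always dequeue ahead of batch work. A minimal standalone sketch of that queueing logic (assuming lower numbers mean higher priority; this is not the internal implementation of vLLM or any other serving stack):

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Minimal priority queue: lower priority number is served first.
    Real-time requests (priority 0) always dequeue ahead of batch (priority 2)."""

    def __init__(self):
        self._heap = []
        # Monotonic counter breaks ties, preserving FIFO order within a priority.
        self._counter = itertools.count()

    def submit(self, request, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_request(self):
        """Pop the highest-priority (lowest-number) request, or None if empty."""
        if not self._heap:
            return None
        _, _, request = heapq.heappop(self._heap)
        return request
```

The "complex tuning" caveat lives around this structure: batch work must still be admitted in small enough slices that a real-time request never waits behind a long-running batch decode.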
Scheduling
For batch workloads, schedule for cost + capacity efficiency:
- Run during off-peak hours (typically 23:00-07:00 UK time)
- Use lower priority on shared infrastructure
- Larger batch sizes than real-time (max-num-seqs higher)
- Monitor batch progress separately from real-time SLOs
- Idempotent + checkpointed for resume on interruption
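The last point is worth spelling out: a batch job that checkpoints after each item can be killed mid-run (e.g. preempted by real-time demand) and resumed without reprocessing. A minimal sketch, assuming a JSON checkpoint file and a caller-supplied per-item function (both hypothetical names):

```python
import json
import os

def run_batch(items, process, checkpoint_path="batch.ckpt"):
    """Process items in order, checkpointing the next index after each one.
    An interrupted run resumes where it left off instead of restarting,
    so each item is processed at most once per completed run (idempotent resume).
    `process` is the caller's per-item function, e.g. a summarisation call."""
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_index"]
    for i in range(start, len(items)):
        process(items[i])
        with open(checkpoint_path, "w") as f:
            json.dump({"next_index": i + 1}, f)
    os.remove(checkpoint_path)  # clean completion clears the checkpoint
```

Writing the checkpoint only after the item succeeds means a crash mid-item replays that one item, so `process` itself should be safe to retry.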
Verdict
For mixed workloads, time-shifting batch to off-hours is the cheapest pattern. At high volume, dedicated hardware per workload is operationally simpler. Don't mix high-priority real-time traffic with large batch jobs on the same GPU without proper queue priority: real-time latency will suffer.
Bottom line
Time-shift batch; dedicate for real-time at scale. See capacity planning.