
Scheduled Batch vs Real-Time LLM Workloads

Different LLM workload shapes need different infrastructure: a real-time chatbot and a nightly batch summarisation job are fundamentally different problems.

Not all LLM workloads need the same infrastructure. Real-time chatbots have strict latency requirements; nightly batch summarisation jobs can run at low priority during off-hours. Mixing them on the same hardware requires intentional scheduling.

TL;DR

  • Real-time: dedicated capacity, predictable latency, max-num-seqs tuned for p99.
  • Batch: queue-based, large batches, lower priority, runs during off-hours.
  • Mixing on one GPU: real-time gets reserved capacity; batch fills idle slots.
  • Or split: a dedicated batch GPU can run at lower spec, separate from the real-time tier.

Workload shapes

  • Real-time interactive: chat, voice agents. Strict TTFT < 2s. Concurrency varies.
  • Real-time async: API responses for downstream services. Latency-tolerant up to ~10s.
  • Scheduled batch: nightly summarisation, weekly report generation. Latency-tolerant; throughput-anchored.
  • Burst batch: ad-hoc large-document processing. Time-sensitive but not interactive.
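As a rough sketch, these shapes can be encoded as a priority map for a request router. The names, TTFT thresholds, and priority values below are illustrative assumptions, not part of any particular serving stack:

```python
# Illustrative SLO targets per workload shape (names and numbers are
# assumptions for this sketch, not measured values).
WORKLOAD_SLOS = {
    "realtime_interactive": {"ttft_s": 2.0, "priority": 0},
    "realtime_async":       {"ttft_s": 10.0, "priority": 1},
    "scheduled_batch":      {"ttft_s": None, "priority": 2},  # throughput-anchored
    "burst_batch":          {"ttft_s": None, "priority": 2},
}

def priority_for(workload: str) -> int:
    """Lower number = served first; unknown workloads default to batch priority."""
    return WORKLOAD_SLOS.get(workload, {"priority": 2})["priority"]
```

Tagging every request with its workload class at the edge makes the later routing and scheduling decisions mechanical rather than ad hoc.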

Infrastructure

Three patterns for mixing workloads:

  • Dedicated per workload: separate GPUs for real-time vs batch. Cleanest; highest cost.
  • Shared GPU + priority queues: same vLLM serves both; priority queue routes real-time first. Cheaper; complex tuning.
  • Time-shifted: real-time during business hours; batch jobs run overnight when GPU is otherwise idle.
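The shared-GPU pattern stands or falls on the priority queue. A minimal sketch in Python, assuming a single dispatcher sits in front of the inference engine; the class and tier names here are invented for illustration:

```python
import heapq
import itertools

REALTIME, BATCH = 0, 1  # lower value = served first

class PriorityRequestQueue:
    """Real-time requests are dequeued before batch; FIFO within each tier."""
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker preserves FIFO order

    def put(self, priority, request):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def get(self):
        return heapq.heappop(self._heap)[2]

q = PriorityRequestQueue()
q.put(BATCH, "summarise-doc-1")
q.put(REALTIME, "chat-turn-1")
q.put(BATCH, "summarise-doc-2")
# chat-turn-1 is dequeued first, ahead of the earlier batch work
```

The monotonic counter matters: without it, two requests at the same priority would be compared by payload, and FIFO ordering within a tier would be lost.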

Scheduling

For batch workloads, schedule for cost + capacity efficiency:

  • Run during off-peak hours (UK 23:00-07:00 typical)
  • Use lower priority on shared infrastructure
  • Larger batch sizes than real-time (higher max-num-seqs)
  • Monitor batch progress separately from real-time SLOs
  • Idempotent + checkpointed for resume on interruption
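The last point is what saves full reruns when an overnight job is interrupted. A minimal sketch of an idempotent, checkpointed batch loop, assuming item IDs are stable across runs; run_batch and the JSON checkpoint format are invented for illustration:

```python
import json
from pathlib import Path

def run_batch(items, process, checkpoint_path):
    """Process (item_id, payload) pairs, recording completed IDs so an
    interrupted run resumes where it left off instead of redoing work."""
    ckpt = Path(checkpoint_path)
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()
    for item_id, payload in items:
        if item_id in done:
            continue  # idempotent: skip already-completed work
        process(item_id, payload)
        done.add(item_id)
        ckpt.write_text(json.dumps(sorted(done)))  # checkpoint after each item
```

Checkpointing per item is the simplest correct choice; for very large batches you might checkpoint every N items to cut write overhead, at the cost of reprocessing up to N items on resume.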

Verdict

For mixed workloads, time-shifting batch to off-hours is the cheapest pattern. At high volume, dedicated GPUs per workload are operationally simpler. Don't mix high-priority real-time with large batch jobs on the same GPU without proper queue priority: real-time latency will suffer.

Bottom line

Time-shift batch; dedicate for real-time at scale. See capacity planning.
