
Batch Inference Optimisation

Optimising batch inference workloads — daily processing, large-scale extraction, embedding ingest. Different patterns from real-time.

Batch inference (nightly summarisation, weekly reports, embedding ingest) has different optimisation priorities from real-time serving. Latency tolerance is far higher, per-stream throughput matters less, and cost per output matters more than raw speed. Several patterns follow from this.

TL;DR

Patterns for batch: large batch sizes (max-num-seqs much higher than for real-time), scheduling into off-peak GPU time, parallel workers feeding vLLM, and idempotent, checkpointed jobs that can resume after failure. Run on the same GPU as real-time during night-time idle hours, or on a dedicated cheaper GPU. Expect cost per output around 50-70% of real-time.

Differences from real-time

  • Latency tolerance: hours are fine, vs sub-second for real-time
  • Batch size: max-num-seqs > 64, vs 16-32 for real-time (see the sketch after this list)
  • Throughput priority: aggregate tok/s matters; per-stream tok/s much less so
  • Resilience: checkpointing so jobs resume on failure
  • Scheduling: off-peak hours, spot capacity, idle GPU time
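
To make the batch-size difference concrete, here is a minimal sketch using vLLM's offline LLM entry point, which suits batch jobs better than the HTTP server. The model name, prompt count, and sampling settings are illustrative placeholders, not recommendations from this guide.

```python
# Minimal batch-regime sketch using vLLM's offline entry point.
# Model name and sampling settings are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    max_num_seqs=128,    # batch regime: far higher than ~32 for real-time
    max_model_len=4096,  # keep small unless the job truly needs long context
)

params = SamplingParams(temperature=0.2, max_tokens=256)

prompts = [f"Summarise document {i}." for i in range(1000)]

# vLLM batches these internally via continuous batching; aggregate tok/s
# is what matters here, not per-stream latency.
outputs = llm.generate(prompts, params)
for out in outputs:
    print(out.outputs[0].text[:80])
```

The same max-num-seqs and max-model-len knobs exist as --max-num-seqs and --max-model-len flags when serving over HTTP instead.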

Patterns

  • High batch size: --max-num-seqs 96-128 for batch jobs (vs ~32 for real-time)
  • Long max-model-len only if the batch genuinely needs it; otherwise a smaller value saves KV-cache memory
  • Worker pool: 4-8 parallel workers, each calling vLLM with batches of 16 (sketched after this list)
  • Idempotency: every batch unit gets a stable ID, so reruns skip completed work
  • Checkpoint: persist progress every N units
  • Schedule: 23:00-07:00 UK off-peak, when real-time traffic is low
  • Cheaper hardware: a 5060 Ti can be sufficient for batch; reserve the 4090/5090 for real-time
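
A hedged sketch of the worker-pool, idempotency, and checkpoint patterns together, assuming a vLLM server exposing the OpenAI-compatible /v1/completions endpoint on localhost. The endpoint URL, model name, checkpoint path, and unit IDs are assumptions for illustration, not values from this guide.

```python
# Sketch: parallel workers feeding a vLLM OpenAI-compatible server,
# with idempotent units and periodic checkpointing for resume.
# Endpoint, model name, and checkpoint path are assumptions.
import json
import pathlib
from concurrent.futures import ThreadPoolExecutor

import requests

VLLM_URL = "http://localhost:8000/v1/completions"  # assumed server address
MODEL = "meta-llama/Llama-3.1-8B-Instruct"         # placeholder model
CHECKPOINT = pathlib.Path("batch_progress.json")   # illustrative path
BATCH = 16            # per-worker batch size, as in the pattern above
WORKERS = 6           # within the 4-8 worker range suggested above
CHECKPOINT_EVERY = 4  # persist progress every N completed batches

def load_done() -> set[str]:
    """Resume support: read IDs of units already completed."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def save_done(done: set[str]) -> None:
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def process_batch(units: list[tuple[str, str]]) -> list[str]:
    """Send one batch of prompts; the completions API accepts a list."""
    resp = requests.post(VLLM_URL, json={
        "model": MODEL,
        "prompt": [text for _, text in units],
        "max_tokens": 256,
    }, timeout=600)
    resp.raise_for_status()
    # One choice per prompt, in order. A real job would persist these
    # outputs before marking the units done.
    _ = [c["text"] for c in resp.json()["choices"]]
    return [uid for uid, _ in units]  # completed unit IDs

def main() -> None:
    # Idempotency: every unit has a stable ID, so a rerun skips done work.
    units = [(f"doc-{i}", f"Summarise document {i}.") for i in range(10_000)]
    done = load_done()
    todo = [u for u in units if u[0] not in done]
    batches = [todo[i:i + BATCH] for i in range(0, len(todo), BATCH)]

    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        for n, ids in enumerate(pool.map(process_batch, batches), start=1):
            done.update(ids)
            if n % CHECKPOINT_EVERY == 0:
                save_done(done)
    save_done(done)

if __name__ == "__main__":
    main()
```

Because progress is keyed on stable unit IDs and flushed to disk periodically, killing the job mid-run and restarting it resumes from the last checkpoint rather than reprocessing everything.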

Verdict

Batch inference is a different optimisation regime from real-time: larger batch sizes, off-peak scheduling, idempotent and checkpointed jobs, and cheaper hardware is acceptable. Run it on a shared GPU during off-peak hours for the best economics. Treat it as a different workload from real-time, not just "same but slower".

Bottom line

Batch is a different optimisation regime. See batch vs realtime.
