Batch inference (nightly summarisation, weekly reports, embedding ingest) has different optimisation priorities from real-time serving. Latency tolerance is far higher, per-stream throughput matters little, and cost per output token matters most. Several patterns apply.
Patterns for batch: large batch sizes (max-num-seqs much higher than real-time), scheduling for off-peak GPU time, parallel workers feeding vLLM, and idempotent, checkpointed jobs that can resume after failure. Run on the same GPU as real-time during night-time idle hours, or on dedicated cheaper hardware. Expect cost per output around 50-70% of real-time.
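As a rough illustration of where a 50-70% figure can come from (the throughput and price numbers below are assumptions for illustration, not measurements):

```python
# Illustrative cost-per-output arithmetic -- all numbers are assumptions,
# not benchmarks. Batch serving trades latency for aggregate throughput,
# so the same GPU-hour yields more output tokens.

def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    """Cost (same currency as gpu_cost_per_hour) to generate 1M output tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour * 1_000_000 / tokens_per_hour

# Hypothetical figures: same GPU, real-time batch size vs high batch size.
realtime = cost_per_million_tokens(gpu_cost_per_hour=0.50, tokens_per_second=900)
batch = cost_per_million_tokens(gpu_cost_per_hour=0.50, tokens_per_second=1500)

print(f"real-time: {realtime:.3f} per 1M tokens")
print(f"batch:     {batch:.3f} per 1M tokens")
print(f"batch cost ratio: {batch / realtime:.0%}")
```

With these assumed throughputs the ratio lands at 60%, inside the 50-70% range; the real ratio depends entirely on how much extra aggregate throughput the larger batch size buys on your hardware.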
Differences from real-time
- Latency tolerance: hours OK vs sub-second real-time
- Batch size: max-num-seqs > 64 vs 16-32 real-time
- Throughput priority: aggregate tok/s matters; per-stream tok/s less so
- Resilience: checkpointing for resume on failure
- Scheduling: off-peak hours; spot capacity; idle GPU time
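The throughput trade-off in the list above can be sketched numerically. The per-stream decode speeds are assumptions for illustration: packing more sequences per step slows each stream, but aggregate tokens/s still rises.

```python
# Sketch of aggregate vs per-stream throughput. Per-stream figures are
# assumed, not measured: each stream decodes slower at high batch size,
# but the GPU's total output goes up.

def aggregate_tok_s(num_seqs: int, per_stream_tok_s: float) -> float:
    return num_seqs * per_stream_tok_s

realtime = aggregate_tok_s(num_seqs=16, per_stream_tok_s=40.0)
batch = aggregate_tok_s(num_seqs=96, per_stream_tok_s=15.0)

# Each batch stream is slower (15 vs 40 tok/s) -- fine when latency
# tolerance is hours -- but aggregate output more than doubles.
print(realtime, batch, batch / realtime)
```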
Patterns
- High batch size: --max-num-seqs 96-128 for batch jobs (vs ~32 for real-time)
- Context length: set max-model-len high only if the batch genuinely needs it; a smaller value saves KV-cache memory
- Worker pool: 4-8 parallel workers each calling vLLM with batches of 16
- Idempotency: every batch unit identifiable; resume on failure
- Checkpoint: persist progress every N units
- Schedule: 23:00-07:00 UK off-peak when real-time traffic low
- Cheaper hardware: 5060 Ti for batch can be sufficient; reserve 4090/5090 for real-time
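A minimal sketch tying the worker-pool, idempotency, and checkpoint patterns together. `call_vllm` is a hypothetical stand-in for a request to a vLLM endpoint, and the checkpoint filename is an assumption; the batch-of-16 chunking and worker count match the list above.

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor

CHECKPOINT_PATH = "batch_progress.json"   # assumption: local progress file
BATCH_SIZE = 16                           # each worker sends batches of 16
NUM_WORKERS = 6                           # within the 4-8 range above

def call_vllm(batch: list[str]) -> list[str]:
    # Hypothetical stand-in: a real job would POST the batch to a vLLM
    # OpenAI-compatible endpoint. Here it just echoes, to keep the
    # resume/checkpoint logic runnable on its own.
    return [f"summary of {doc}" for doc in batch]

def load_done() -> set[str]:
    if os.path.exists(CHECKPOINT_PATH):
        with open(CHECKPOINT_PATH) as f:
            return set(json.load(f))
    return set()

def save_done(done: set[str]) -> None:
    with open(CHECKPOINT_PATH, "w") as f:
        json.dump(sorted(done), f)

def run(units: dict[str, str]) -> dict[str, str]:
    """units: stable id -> input text. Stable ids make reruns idempotent."""
    done = load_done()
    todo = [uid for uid in units if uid not in done]   # resume: skip finished
    chunks = [todo[i:i + BATCH_SIZE] for i in range(0, len(todo), BATCH_SIZE)]
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        outputs_per_chunk = pool.map(
            lambda c: call_vllm([units[uid] for uid in c]), chunks)
        for chunk, outputs in zip(chunks, outputs_per_chunk):
            for uid, out in zip(chunk, outputs):
                results[uid] = out
                done.add(uid)
            save_done(done)   # checkpoint after every completed batch
    return results
```

Because unit ids are stable and finished ids are persisted, rerunning after a crash only re-sends unfinished batches; at worst one in-flight batch is repeated, which idempotent downstream writes absorb.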
Verdict
Batch inference is a different optimisation regime from real-time: larger batch sizes, off-peak scheduling, idempotent and checkpointed jobs, cheaper hardware where acceptable. Run on a shared GPU during off-peak hours for the best economics. Treat it as a different workload from real-time, not just "same but slower".
Bottom line
Batch is a different optimisation regime. See batch vs realtime.