
Batch Size Tuning for Max Throughput

Tune batch sizes for maximum GPU throughput in AI inference and training. Covers the latency-throughput tradeoff, continuous batching, VRAM limits, finding optimal batch size, and benchmarking methodology.

Your Batch Size Is Either Starving or Crashing the GPU

Batch size 1 leaves 80% of your GPU compute idle. Batch size 256 triggers an OOM crash. Somewhere between those extremes sits the configuration that maximizes tokens per second without exceeding VRAM or violating latency requirements. Finding that number requires understanding how batch size interacts with GPU memory, compute utilization, and the latency-throughput tradeoff specific to your model and hardware on a dedicated GPU server.

The Latency-Throughput Tradeoff

Larger batches increase total throughput but add latency per individual request:

# Batch=1: Low throughput, low latency
# - GPU processes 1 request at a time
# - Each request gets dedicated GPU attention
# - Tokens/sec: ~14 (Llama-3-70B on RTX 6000 Pro)
# - Per-request latency: ~70ms per token

# Batch=16: High throughput, moderate latency
# - GPU processes 16 requests simultaneously
# - Shared GPU resources across requests
# - Tokens/sec: ~160 total (~10 per request)
# - Per-request latency: ~100ms per token

# Batch=64: Maximum throughput, higher latency
# - GPU is fully utilized (compute-bound)
# - Tokens/sec: ~400 total (~6.3 per request)
# - Per-request latency: ~160ms per token

# The sweet spot depends on your SLA:
# Real-time chatbot:    batch 1-8   (latency priority)
# API serving:          batch 16-64 (throughput priority)
# Batch processing:     batch 128+  (max throughput, latency irrelevant)

# Visualizing the tradeoff makes the pattern clear:
# Throughput increases sublinearly: doubling batch rarely doubles throughput
# Latency increases roughly linearly with batch size
# The "knee" of the throughput curve is your optimal operating point

Find Optimal Batch Size Empirically

import torch
import time

def benchmark_batch_size(model, tokenizer, batch_sizes, max_tokens=128):
    """Benchmark throughput at different batch sizes"""
    prompt = "Explain the concept of"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

    results = []
    for bs in batch_sizes:
        # Replicate input for batch
        batched = {k: v.repeat(bs, 1) for k, v in inputs.items()}

        torch.cuda.synchronize()
        torch.cuda.reset_peak_memory_stats()

        try:
            start = time.perf_counter()
            with torch.no_grad():
                output = model.generate(
                    **batched,
                    max_new_tokens=max_tokens,
                    do_sample=False
                )
            torch.cuda.synchronize()
            elapsed = time.perf_counter() - start

            total_tokens = bs * max_tokens
            throughput = total_tokens / elapsed
            # With static batching, every request in the batch finishes
            # together, so per-token latency is elapsed / tokens generated
            latency_ms = elapsed / max_tokens * 1000
            vram_gb = torch.cuda.max_memory_allocated() / 1e9

            results.append({
                "batch_size": bs,
                "throughput_tok_s": throughput,
                "latency_ms": latency_ms,
                "vram_gb": vram_gb
            })
            print(f"BS={bs:3d} | {throughput:7.1f} tok/s | "
                  f"{latency_ms:6.1f}ms/token | {vram_gb:.1f}GB VRAM")

        except torch.cuda.OutOfMemoryError:
            print(f"BS={bs:3d} | OOM")
            torch.cuda.empty_cache()
            break

    return results

# Run sweep
batch_sizes = [1, 2, 4, 8, 16, 32, 64, 128]
results = benchmark_batch_size(model, tokenizer, batch_sizes)
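With the sweep results in hand, choosing a production batch size is a constrained pick: take the highest-throughput configuration that still meets your latency SLA and VRAM budget. A hypothetical helper over result dicts shaped like the ones produced above (the numbers here are illustrative only):

```python
def pick_batch_size(results, max_latency_ms, max_vram_gb):
    """Pick the highest-throughput config that meets the SLA.

    results: list of dicts with keys batch_size, throughput_tok_s,
             latency_ms, vram_gb. Returns the winning dict, or
             None if nothing qualifies.
    """
    feasible = [r for r in results
                if r["latency_ms"] <= max_latency_ms
                and r["vram_gb"] <= max_vram_gb]
    if not feasible:
        return None
    return max(feasible, key=lambda r: r["throughput_tok_s"])

# Illustrative numbers only
results = [
    {"batch_size": 8,  "throughput_tok_s": 95,  "latency_ms": 80,  "vram_gb": 42},
    {"batch_size": 16, "throughput_tok_s": 160, "latency_ms": 100, "vram_gb": 48},
    {"batch_size": 32, "throughput_tok_s": 250, "latency_ms": 140, "vram_gb": 60},
]
best = pick_batch_size(results, max_latency_ms=120, max_vram_gb=96)
print(best["batch_size"])  # 16: batch 32 breaks the latency budget
```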

vLLM Continuous Batching Configuration

vLLM handles batching automatically with continuous batching, but tuning its parameters still matters:

# vLLM continuous batching parameters
vllm serve meta-llama/Llama-3-70B-Instruct \
    --max-num-seqs 64 \
    --max-num-batched-tokens 8192 \
    --gpu-memory-utilization 0.92

# --max-num-seqs: Maximum concurrent sequences (effective batch size)
#   Too low: GPU underutilized
#   Too high: VRAM exhaustion, KV cache thrashing
#   Start at 32, increase until VRAM is 85-90% used

# --max-num-batched-tokens: Max tokens processed per iteration
#   Controls prefill batch size
#   Higher = better GPU utilization during prefill
#   Lower = faster time-to-first-token for new requests

# Monitor vLLM's actual batching behavior
curl -s http://localhost:8000/metrics | grep -E "batch_size|num_requests"
# vllm:num_requests_running — current active batch size
# vllm:num_requests_waiting — queued requests
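If you want those two gauges inside a script rather than a curl pipeline, a minimal parser of the Prometheus text format returned by /metrics is enough (for anything more, use the official prometheus_client library; this sketch only handles the simple gauge lines shown above):

```python
def parse_vllm_gauges(metrics_text):
    """Extract vllm:num_requests_running / vllm:num_requests_waiting
    from Prometheus text-format output, e.g. vLLM's /metrics body."""
    wanted = ("vllm:num_requests_running", "vllm:num_requests_waiting")
    gauges = {}
    for line in metrics_text.splitlines():
        if line.startswith("#"):  # skip HELP/TYPE comment lines
            continue
        for name in wanted:
            if line.startswith(name):
                # Last whitespace-separated field is the sample value
                gauges[name] = float(line.rsplit(None, 1)[-1])
    return gauges

sample = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="llama"} 23.0
vllm:num_requests_waiting{model_name="llama"} 4.0
"""
print(parse_vllm_gauges(sample))
```

A persistently non-zero waiting count with a running count below --max-num-seqs usually points at a VRAM, not a scheduling, bottleneck.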

# Benchmark at different concurrency levels
for CONC in 1 4 8 16 32 64; do
    echo "=== Concurrency: $CONC ==="
    python3 benchmark_serving.py \
        --backend vllm \
        --model meta-llama/Llama-3-70B-Instruct \
        --num-prompts 100 \
        --request-rate inf \
        --max-concurrency $CONC
done

Calculate VRAM Budget for Batch Size

# VRAM consumption formula:
# Total VRAM = Model Weights + KV Cache + Activations + Overhead
#
# Model weights (fixed):
#   70B FP16 = 140GB
#   70B INT8 = 70GB
#   70B INT4 = 35GB
#
# KV cache per token per layer (scales with batch size):
#   Per token = 2 * num_layers * hidden_dim * num_kv_heads/num_heads * 2 bytes
#   Llama-3-70B: ~0.31MB per token (80 layers, 8 KV heads, head dim 128)
#   4096 context * 32 sequences = 131072 tokens * 0.31MB = ~40GB
#   This is why KV cache dominates VRAM at large batches!
#
# Activations (small, proportional to batch):
#   ~100-500MB depending on batch size
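The per-token formula above can be written out directly. The shape parameters used here (80 layers, 8 KV heads, head dim 128) are Llama-3-70B's published grouped-query-attention configuration, and an FP16 cache is assumed:

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """KV cache bytes per token: K and V tensors (factor of 2),
    per layer, per KV head, per head dimension."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Llama-3-70B with GQA (8 KV heads shared by 64 query heads)
per_tok = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
print(per_tok / 2**20)               # MB per token (~0.31)
print(per_tok * 4096 * 32 / 2**30)   # GB for 32 sequences at 4K context (~40)
```

Without grouped-query attention (64 KV heads instead of 8), the same model would need 8x more cache per token, which is why GQA is what makes large-batch serving of 70B models feasible at all.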

# Quick VRAM budget calculator
def vram_budget(model_gb, kv_per_token_mb, max_context, batch_size,
                total_vram_gb=80):
    kv_cache_gb = kv_per_token_mb * max_context * batch_size / 1024
    overhead_gb = 2  # CUDA context, fragmentation
    total_gb = model_gb + kv_cache_gb + overhead_gb
    fits = total_gb <= total_vram_gb
    print(f"BS={batch_size}: Model {model_gb:.0f}GB + "
          f"KV {kv_cache_gb:.1f}GB = {total_gb:.1f}GB "
          f"{'OK' if fits else 'OOM'}")
    return fits

# Llama-3-70B INT4 on RTX 6000 Pro 96 GB
for bs in [1, 4, 8, 16, 32, 64]:
    vram_budget(35, 0.31, 2048, bs, 96)
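Inverting the same arithmetic gives the largest batch that fits directly, instead of checking one size at a time. Same assumptions as the calculator above (2 GB fixed overhead, per-token KV size in MB):

```python
def max_batch_size(model_gb, kv_per_token_mb, max_context,
                   total_vram_gb, overhead_gb=2):
    """Largest batch size whose KV cache still fits in VRAM."""
    free_gb = total_vram_gb - model_gb - overhead_gb
    kv_per_seq_gb = kv_per_token_mb * max_context / 1024
    if free_gb <= 0 or kv_per_seq_gb <= 0:
        return 0
    return int(free_gb / kv_per_seq_gb)

# Llama-3-70B INT4 (~35GB weights), 2048 context, 96GB card,
# ~0.3125 MB/token KV cache (FP16, GQA)
print(max_batch_size(35, 0.3125, 2048, 96))  # 94
```

Treat the result as an upper bound, not a target: fragmentation and prefill activation spikes mean you should leave headroom below it.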

Training Batch Size Tuning

# Training batch size affects convergence, not just throughput

# Step 1: Find maximum batch size that fits in VRAM
# (get_batch() stands in for your own data-loading helper)
bs = 1
while True:
    try:
        batch = get_batch(bs)
        loss = model(**batch).loss
        loss.backward()
        torch.cuda.synchronize()
        model.zero_grad(set_to_none=True)  # clear grads between probes
        bs *= 2
        torch.cuda.empty_cache()
    except torch.cuda.OutOfMemoryError:
        max_bs = bs // 2
        print(f"Max batch size: {max_bs}")
        break

# Step 2: Use gradient accumulation for effective larger batches
# Physical batch = 4 (fits in VRAM)
# Accumulation steps = 8
# Effective batch = 4 * 8 = 32
accumulation_steps = 8
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    loss = model(**batch).loss / accumulation_steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
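When gradient accumulation changes the effective batch size, the learning rate usually needs to follow. The linear scaling rule, a common heuristic rather than a guarantee, scales the learning rate proportionally to the batch size it was tuned at:

```python
def scale_lr(base_lr, base_batch, effective_batch):
    """Linear scaling rule: grow the learning rate proportionally
    with effective batch (physical batch * accumulation steps)."""
    return base_lr * effective_batch / base_batch

# Tuned at batch 8 with lr 1e-4; now running 4 * 8 = 32 effective
print(scale_lr(1e-4, 8, 4 * 8))  # 0.0004
```

Pair a scaled-up learning rate with a warmup schedule; jumping straight to the scaled value early in training is a common source of divergence.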

# Step 3: Benchmark throughput
# Measure samples/second at each batch size
# Throughput plateaus when GPU is fully utilized

Batch size tuning unlocks the throughput potential of your GPU server. Deploy with optimized batching in vLLM using the production guide. Compare throughput numbers in our token benchmarks. Monitor batch utilization with our monitoring setup. Set up PyTorch with our installation guide. Explore more benchmarks and tutorials.

High-Throughput GPU Servers

GigaGPU dedicated servers with the VRAM to handle large batches. Maximize tokens per second on NVIDIA RTX 6000 Pro GPUs.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
