Before committing a dedicated GPU server to production, you need to know exactly what it can deliver. Benchmarking confirms that your hardware, drivers, and software stack perform as expected. This guide walks through a comprehensive benchmarking methodology for AI workloads — from raw CUDA compute to real-world LLM inference throughput — with actionable commands you can run immediately on Ubuntu 22.04 or 24.04.
Baseline GPU Information
Start by recording your GPU hardware details and driver versions. These form the baseline for all benchmarks.
# Full GPU specifications
nvidia-smi -q | head -60
# Structured GPU info
nvidia-smi --query-gpu=index,name,driver_version,memory.total,compute_cap,power.limit,clocks.max.sm,clocks.max.mem \
--format=csv
# CUDA toolkit version
nvcc --version
# CPU and memory info for context
lscpu | grep -E "Model name|Socket|Core|Thread"
free -h
# NVLink topology (multi-GPU servers)
nvidia-smi topo -m
If nvcc is not found, follow the CUDA installation guide first. For GPU selection guidance, see our best GPU for LLM inference comparison.
CUDA Compute Benchmarks
Measure raw GPU compute performance with the CUDA samples and a matrix multiplication benchmark:
# CUDA samples are distributed via GitHub since CUDA 11.6
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples && git checkout v12.4  # match your installed toolkit version
cd Samples/1_Utilities/bandwidthTest
make
./bandwidthTest
# Device query
cd ../deviceQuery
make
./deviceQuery
Run a custom FP16 matrix multiplication benchmark (relevant for AI workloads):
pip install torch
python3 << 'EOF'
import torch
import time

def benchmark_matmul(size, dtype, iterations=100):
    a = torch.randn(size, size, dtype=dtype, device='cuda')
    b = torch.randn(size, size, dtype=dtype, device='cuda')
    # Warmup
    for _ in range(10):
        torch.mm(a, b)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iterations):
        torch.mm(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * size**3 * iterations / elapsed
    print(f"  {dtype} {size}x{size}: {flops/1e12:.2f} TFLOPS ({elapsed/iterations*1000:.2f} ms/iter)")

print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"CUDA: {torch.version.cuda}")
print()
for size in [1024, 2048, 4096, 8192]:
    benchmark_matmul(size, torch.float32)
print()
for size in [1024, 2048, 4096, 8192]:
    benchmark_matmul(size, torch.float16)
print()
# BF16 benchmark (Ampere and newer)
if torch.cuda.get_device_capability()[0] >= 8:
    for size in [1024, 2048, 4096, 8192]:
        benchmark_matmul(size, torch.bfloat16)
EOF
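To judge whether these numbers are healthy, convert elapsed time to TFLOPS and compare against the peak tensor throughput on your card's datasheet; large FP16 matmuls typically land at 60-90% of peak. A minimal sketch (the 165 TFLOPS peak and 0.8 s timing below are illustrative placeholders, not measured values):

```python
def matmul_tflops(size: int, iterations: int, elapsed_s: float) -> float:
    """A square matmul costs 2*n^3 FLOPs (one multiply plus one add per term)."""
    return 2 * size**3 * iterations / elapsed_s / 1e12

def efficiency(measured_tflops: float, peak_tflops: float) -> float:
    """Measured throughput as a fraction of the datasheet peak."""
    return measured_tflops / peak_tflops

# Example: 100 iterations of an 8192x8192 FP16 matmul finished in 0.8 s,
# on a card with an assumed 165 TFLOPS FP16 tensor peak (placeholder value)
measured = matmul_tflops(8192, 100, 0.8)
print(f"{measured:.1f} TFLOPS, {efficiency(measured, 165):.0%} of peak")
```

If the fraction stays low even at 8192x8192, suspect clocks, cooling, or a non-tensor-core code path rather than the GPU itself.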
Memory Bandwidth Testing
VRAM bandwidth is often the bottleneck for LLM inference. Measure it directly:
python3 << 'EOF'
import torch
import time

def benchmark_bandwidth(size_gb, iterations=50):
    size_bytes = int(size_gb * 1024**3)
    num_elements = size_bytes // 4  # float32
    src = torch.randn(num_elements, dtype=torch.float32, device='cuda')
    dst = torch.empty_like(src)
    # Warmup
    for _ in range(5):
        dst.copy_(src)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iterations):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    bandwidth = (2 * size_bytes * iterations) / elapsed / 1e9  # GB/s (read + write)
    print(f"  {size_gb:.1f} GB transfer: {bandwidth:.1f} GB/s")

print(f"GPU: {torch.cuda.get_device_name(0)}")
print("Memory Bandwidth Test:")
for size in [0.5, 1.0, 2.0, 4.0, 8.0]:
    benchmark_bandwidth(size)
EOF
# Host-to-device bandwidth (pinned memory and a reused buffer give a realistic PCIe number)
python3 -c "
import torch, time
size_bytes = 1024**3  # 1 GiB
data = torch.randn(size_bytes // 4, dtype=torch.float32).pin_memory()
gpu_data = torch.empty_like(data, device='cuda')
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10):
    gpu_data.copy_(data, non_blocking=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f'Host->Device: {10 * size_bytes / elapsed / 1e9:.2f} GB/s (PCIe bandwidth)')
"
LLM Inference Throughput
The most relevant benchmark for production is actual LLM inference throughput. Use vLLM's benchmark scripts; the offline throughput benchmark drives the engine directly and does not need a running server:
# Install vLLM
pip install vllm
# Run the offline throughput benchmark (the script ships in the vLLM repository)
# Note: meta-llama models are gated -- run `huggingface-cli login` first
git clone https://github.com/vllm-project/vllm.git
python3 vllm/benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --input-len 512 \
    --output-len 256 \
    --num-prompts 100 \
    --backend vllm
# Start an OpenAI-compatible server for the latency measurements that follow
python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.95 \
    --port 8000 &
# Wait for the model to load
sleep 60
Measure latency distribution with a custom script:
python3 << 'EOF'
import requests
import time
import statistics

URL = "http://localhost:8000/v1/chat/completions"
HEADERS = {"Content-Type": "application/json"}
PAYLOAD = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain GPU memory bandwidth in two sentences."}],
    "max_tokens": 128,
    "stream": False
}

latencies = []
tokens_list = []
for i in range(50):
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, headers=HEADERS)
    elapsed = time.perf_counter() - start
    data = resp.json()
    completion_tokens = data["usage"]["completion_tokens"]
    latencies.append(elapsed)
    tokens_list.append(completion_tokens)
    tps = completion_tokens / elapsed
    print(f"Request {i+1}: {elapsed:.2f}s, {completion_tokens} tokens, {tps:.1f} tok/s")

print(f"\nResults over {len(latencies)} requests:")
print(f"  Mean latency: {statistics.mean(latencies):.3f}s")
print(f"  P50 latency: {statistics.median(latencies):.3f}s")
print(f"  P95 latency: {sorted(latencies)[int(0.95*len(latencies))]:.3f}s")
print(f"  Mean tok/s: {sum(tokens_list)/sum(latencies):.1f}")
EOF
Compare your results against published numbers using the tokens per second benchmark tool. For optimisation tips based on your results, see the vLLM memory optimisation guide.
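Throughput numbers translate directly into serving economics: cost per million tokens is the hourly server price divided by tokens generated per hour. A quick sketch (the price and throughput below are placeholders, not quotes):

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Server $/hour divided by tokens/hour, scaled to one million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# A hypothetical $1.50/hour server sustaining 120 tok/s
print(f"${cost_per_million_tokens(1.50, 120):.2f} per million tokens")
```

Note that batched serving multiplies aggregate tok/s well beyond the single-stream figure, so measure at your expected concurrency before comparing against per-token API pricing.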
Training Performance Benchmark
Benchmark training throughput with a standard PyTorch workload:
python3 << 'EOF'
import torch
import torch.nn as nn
import time

device = torch.device("cuda")
batch_size = 32
seq_len = 512
hidden_size = 4096
num_layers = 4
iterations = 100

# Simple transformer-like model
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_size, nhead=32, batch_first=True),
    num_layers=num_layers
).to(device).half()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = torch.randn(batch_size, seq_len, hidden_size, dtype=torch.float16, device=device)

# Warmup
for _ in range(5):
    out = model(data)
    loss = out.sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
torch.cuda.synchronize()

# Benchmark
start = time.perf_counter()
for _ in range(iterations):
    out = model(data)
    loss = out.sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

samples_per_sec = batch_size * iterations / elapsed
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Training throughput: {samples_per_sec:.1f} samples/s")
print(f"Time per iteration: {elapsed/iterations*1000:.1f} ms")
print(f"Peak VRAM: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")
EOF
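A useful derived metric here is model FLOPs utilisation (MFU): the fraction of peak compute the training loop actually achieves. For a transformer, forward plus backward costs roughly 6 FLOPs per parameter per token, and tokens/s is samples/s times sequence length. A sketch under those rule-of-thumb assumptions (the parameter count, throughput, and peak TFLOPS are illustrative):

```python
def training_mfu(params: float, tokens_per_second: float, peak_tflops: float) -> float:
    """MFU = achieved FLOPs / peak FLOPs, using the ~6*N FLOPs-per-token rule of thumb."""
    achieved_tflops = 6 * params * tokens_per_second / 1e12
    return achieved_tflops / peak_tflops

# A hypothetical 1B-parameter model at 20,000 tokens/s on a 165 TFLOPS (FP16) card
print(f"MFU: {training_mfu(1e9, 20_000, 165):.1%}")
```

Well-tuned training runs commonly sit in the 30-60% MFU range; a synthetic benchmark like the one above can score higher because it skips data loading and loss computation.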
Multi-GPU Scaling Benchmark
If your server has multiple GPUs in a multi-GPU cluster, measure inter-GPU bandwidth and scaling efficiency:
# Test NVLink bandwidth between GPUs
nvidia-smi nvlink -s
# Benchmark peer-to-peer GPU transfer
python3 << 'PYEOF'
import torch
import time

num_gpus = torch.cuda.device_count()
print(f"Testing {num_gpus} GPUs")
size = 256 * 1024 * 1024  # 256M elements = ~1GB float32
iterations = 50

for src_gpu in range(num_gpus):
    for dst_gpu in range(num_gpus):
        if src_gpu == dst_gpu:
            continue
        src_tensor = torch.randn(size, device=f"cuda:{src_gpu}")
        dst_tensor = torch.empty(size, device=f"cuda:{dst_gpu}")
        # Warmup
        for _ in range(5):
            dst_tensor.copy_(src_tensor)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iterations):
            dst_tensor.copy_(src_tensor)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        bandwidth = (size * 4 * iterations) / elapsed / 1e9
        print(f"  GPU {src_gpu} -> GPU {dst_gpu}: {bandwidth:.1f} GB/s")
PYEOF
For details on setting up multi-GPU inference, see the multi-GPU server setup guide. Monitor GPU metrics during benchmarks using our GPU monitoring tutorial.
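Once you have single-GPU and multi-GPU throughput numbers, scaling efficiency is the aggregate figure divided by n times the single-GPU figure. A minimal sketch (the sample throughputs are illustrative):

```python
def scaling_efficiency(single_gpu_tps: float, multi_gpu_tps: float,
                       num_gpus: int) -> float:
    """1.0 means perfect linear scaling; interconnect and sync overheads pull it lower."""
    return multi_gpu_tps / (single_gpu_tps * num_gpus)

# e.g. one GPU sustains 1200 tok/s while four GPUs together sustain 4100 tok/s
print(f"4-GPU scaling efficiency: {scaling_efficiency(1200, 4100, 4):.0%}")
```

NVLink-connected GPUs generally scale better than PCIe-only topologies, which is why the `nvidia-smi topo -m` output from the baseline step matters when interpreting this number.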
Interpreting Results and Next Steps
Compare your benchmark results against these reference ranges for common GPUs:
| GPU | FP16 TFLOPS (expected) | Memory BW (GB/s) | Llama-3.1-8B tok/s |
|---|---|---|---|
| RTX 4090 | ~165 | ~1008 | ~90-110 |
| A100 80 GB | ~312 | ~2039 | ~120-150 |
| H100 SXM | ~990 | ~3350 | ~200-280 |
| L40S | ~366 | ~864 | ~80-100 |
If your results fall significantly below expected values, check for thermal throttling, PCIe bandwidth limitations, or driver issues. Use the cost comparison tools to understand the economics: the cost per million tokens calculator and the GPU vs OpenAI cost comparison. For the cheapest options, see cheapest GPU for AI inference. More benchmarking content is available in the benchmarks category.
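Thermal throttling is the most common culprit, and nvidia-smi reports it directly via `nvidia-smi -q -d PERFORMANCE`. A small sketch that scans that report for active throttle reasons; the exact field labels vary slightly across driver versions, so treat this as a starting point rather than a definitive parser:

```python
import subprocess

def active_throttle_reasons(report: str) -> list[str]:
    """Collect throttle-reason lines marked Active in an `nvidia-smi -q` report."""
    reasons = []
    for line in report.splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            # "Idle : Active" just means the GPU is idle, not throttled
            if value.strip() == "Active" and "Idle" not in name:
                reasons.append(name.strip())
    return reasons

# On the server itself, feed it live output:
# report = subprocess.run(["nvidia-smi", "-q", "-d", "PERFORMANCE"],
#                         capture_output=True, text=True).stdout
sample = """
    Clocks Throttle Reasons
        Idle                              : Not Active
        SW Power Cap                      : Active
        HW Thermal Slowdown               : Not Active
"""
print(active_throttle_reasons(sample))  # ['SW Power Cap']
```

An active power cap suggests raising the power limit (if the chassis cooling allows it), while an active thermal slowdown points at airflow or ambient temperature.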
Get Benchmark-Verified GPU Servers
Every GigaGPU server is tested for peak performance before delivery. Choose from a range of GPU configurations optimised for AI workloads.
Browse GPU Servers