Before committing a dedicated GPU server to production, you need to know exactly what it can deliver. Benchmarking confirms that your hardware, drivers, and software stack perform as expected. This guide walks through a comprehensive benchmarking methodology for AI workloads — from raw CUDA compute to real-world LLM inference throughput — with actionable commands you can run immediately on Ubuntu 22.04 or 24.04.
Baseline GPU Information
Start by recording your GPU hardware details and driver versions. These form the baseline for all benchmarks.
# Full GPU specifications
nvidia-smi -q | head -60
# Structured GPU info
nvidia-smi --query-gpu=index,name,driver_version,memory.total,compute_cap,power.limit,clocks.max.sm,clocks.max.mem \
--format=csv
# CUDA toolkit version
nvcc --version
# CPU and memory info for context
lscpu | grep -E "Model name|Socket|Core|Thread"
free -h
# NVLink topology (multi-GPU servers)
nvidia-smi topo -m
If nvcc is not found, follow the CUDA installation guide first. For GPU selection guidance, see our best GPU for LLM inference comparison.
CUDA Compute Benchmarks
Measure raw GPU compute performance with the CUDA samples and a matrix multiplication benchmark:
# CUDA samples are distributed via GitHub since CUDA 11.6
git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples && git checkout v12.4  # match your installed toolkit version
cd Samples/1_Utilities/bandwidthTest
make
./bandwidthTest
# Device query
cd ../deviceQuery
make
./deviceQuery
Run a custom FP16 matrix multiplication benchmark (relevant for AI workloads):
pip install torch
python3 << 'EOF'
import torch
import time

def benchmark_matmul(size, dtype, iterations=100):
    a = torch.randn(size, size, dtype=dtype, device='cuda')
    b = torch.randn(size, size, dtype=dtype, device='cuda')
    # Warmup
    for _ in range(10):
        torch.mm(a, b)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iterations):
        torch.mm(a, b)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    flops = 2 * size**3 * iterations / elapsed
    print(f"  {dtype} {size}x{size}: {flops/1e12:.2f} TFLOPS ({elapsed/iterations*1000:.2f} ms/iter)")

print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"CUDA: {torch.version.cuda}")
print()
for size in [1024, 2048, 4096, 8192]:
    benchmark_matmul(size, torch.float32)
print()
for size in [1024, 2048, 4096, 8192]:
    benchmark_matmul(size, torch.float16)
print()
# BF16 benchmark (Ampere and newer)
if torch.cuda.get_device_capability()[0] >= 8:
    for size in [1024, 2048, 4096, 8192]:
        benchmark_matmul(size, torch.bfloat16)
EOF
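To judge whether these numbers are healthy, convert elapsed time to TFLOPS and compare against the peak tensor throughput on your card's datasheet; large FP16 matmuls typically land at 60-90% of peak. A minimal sketch (the 165 TFLOPS peak and 0.8 s timing below are illustrative placeholders, not measured values):

```python
def matmul_tflops(size: int, iterations: int, elapsed_s: float) -> float:
    """A square matmul costs 2*n^3 FLOPs (one multiply plus one add per term)."""
    return 2 * size**3 * iterations / elapsed_s / 1e12

def efficiency(measured_tflops: float, peak_tflops: float) -> float:
    """Measured throughput as a fraction of the datasheet peak."""
    return measured_tflops / peak_tflops

# Example: 100 iterations of an 8192x8192 FP16 matmul finished in 0.8 s,
# on a card with an assumed 165 TFLOPS FP16 tensor peak (placeholder value)
measured = matmul_tflops(8192, 100, 0.8)
print(f"{measured:.1f} TFLOPS, {efficiency(measured, 165):.0%} of peak")
```

If the fraction stays low even at 8192x8192, suspect clocks, cooling, or a non-tensor-core code path rather than the GPU itself.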
Memory Bandwidth Testing
VRAM bandwidth is often the bottleneck for LLM inference. Measure it directly:
python3 << 'EOF'
import torch
import time

def benchmark_bandwidth(size_gb, iterations=50):
    size_bytes = int(size_gb * 1024**3)
    num_elements = size_bytes // 4  # float32
    src = torch.randn(num_elements, dtype=torch.float32, device='cuda')
    dst = torch.empty_like(src)
    # Warmup
    for _ in range(5):
        dst.copy_(src)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iterations):
        dst.copy_(src)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    bandwidth = (2 * size_bytes * iterations) / elapsed / 1e9  # GB/s (read + write)
    print(f"  {size_gb:.1f} GB transfer: {bandwidth:.1f} GB/s")

print(f"GPU: {torch.cuda.get_device_name(0)}")
print("Memory Bandwidth Test:")
for size in [0.5, 1.0, 2.0, 4.0, 8.0]:
    benchmark_bandwidth(size)
EOF
# Host-to-device bandwidth (pinned memory and a reused buffer give a realistic PCIe number)
python3 -c "
import torch, time
size_bytes = 1024**3  # 1 GiB
data = torch.randn(size_bytes // 4, dtype=torch.float32).pin_memory()
gpu_data = torch.empty_like(data, device='cuda')
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10):
    gpu_data.copy_(data, non_blocking=True)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f'Host->Device: {10 * size_bytes / elapsed / 1e9:.2f} GB/s (PCIe bandwidth)')
"
LLM Inference Throughput
The most relevant benchmark for production is actual LLM inference throughput. Use vLLM's benchmark scripts; the offline throughput benchmark drives the engine directly and does not need a running server:
# Install vLLM
pip install vllm
# Run the offline throughput benchmark (the script ships in the vLLM repository)
# Note: meta-llama models are gated -- run `huggingface-cli login` first
git clone https://github.com/vllm-project/vllm.git
python3 vllm/benchmarks/benchmark_throughput.py \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --input-len 512 \
    --output-len 256 \
    --num-prompts 100 \
    --backend vllm
# Start an OpenAI-compatible server for the latency measurements that follow
python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.95 \
    --port 8000 &
# Wait for the model to load
sleep 60
Measure latency distribution with a custom script:
python3 << 'EOF'
import requests
import time
import statistics

URL = "http://localhost:8000/v1/chat/completions"
HEADERS = {"Content-Type": "application/json"}
PAYLOAD = {
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain GPU memory bandwidth in two sentences."}],
    "max_tokens": 128,
    "stream": False
}

latencies = []
tokens_list = []
for i in range(50):
    start = time.perf_counter()
    resp = requests.post(URL, json=PAYLOAD, headers=HEADERS)
    elapsed = time.perf_counter() - start
    data = resp.json()
    completion_tokens = data["usage"]["completion_tokens"]
    latencies.append(elapsed)
    tokens_list.append(completion_tokens)
    tps = completion_tokens / elapsed
    print(f"Request {i+1}: {elapsed:.2f}s, {completion_tokens} tokens, {tps:.1f} tok/s")

print(f"\nResults over {len(latencies)} requests:")
print(f"  Mean latency: {statistics.mean(latencies):.3f}s")
print(f"  P50 latency: {statistics.median(latencies):.3f}s")
print(f"  P95 latency: {sorted(latencies)[int(0.95*len(latencies))]:.3f}s")
print(f"  Mean tok/s: {sum(tokens_list)/sum(latencies):.1f}")
EOF
Compare your results against published numbers using the tokens per second benchmark tool. For optimisation tips based on your results, see the vLLM memory optimisation guide.
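Throughput numbers translate directly into serving economics: cost per million tokens is the hourly server price divided by tokens generated per hour. A quick sketch (the price and throughput below are placeholders, not quotes):

```python
def cost_per_million_tokens(hourly_price_usd: float, tokens_per_second: float) -> float:
    """Server $/hour divided by tokens/hour, scaled to one million tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_price_usd / tokens_per_hour * 1_000_000

# A hypothetical $1.50/hour server sustaining 120 tok/s
print(f"${cost_per_million_tokens(1.50, 120):.2f} per million tokens")
```

Note that batched serving multiplies aggregate tok/s well beyond the single-stream figure, so measure at your expected concurrency before comparing against per-token API pricing.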
Training Performance Benchmark
Benchmark training throughput with a standard PyTorch workload:
python3 << 'EOF'
import torch
import torch.nn as nn
import time

device = torch.device("cuda")
batch_size = 32
seq_len = 512
hidden_size = 4096
num_layers = 4
iterations = 100

# Simple transformer-like model
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_size, nhead=32, batch_first=True),
    num_layers=num_layers
).to(device).half()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
data = torch.randn(batch_size, seq_len, hidden_size, dtype=torch.float16, device=device)

# Warmup
for _ in range(5):
    out = model(data)
    loss = out.sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
torch.cuda.synchronize()

# Benchmark
start = time.perf_counter()
for _ in range(iterations):
    out = model(data)
    loss = out.sum()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

samples_per_sec = batch_size * iterations / elapsed
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"Training throughput: {samples_per_sec:.1f} samples/s")
print(f"Time per iteration: {elapsed/iterations*1000:.1f} ms")
print(f"Peak VRAM: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")
EOF
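A useful derived metric here is model FLOPs utilisation (MFU): the fraction of peak compute the training loop actually achieves. For a transformer, forward plus backward costs roughly 6 FLOPs per parameter per token, and tokens/s is samples/s times sequence length. A sketch under those rule-of-thumb assumptions (the parameter count, throughput, and peak TFLOPS are illustrative):

```python
def training_mfu(params: float, tokens_per_second: float, peak_tflops: float) -> float:
    """MFU = achieved FLOPs / peak FLOPs, using the ~6*N FLOPs-per-token rule of thumb."""
    achieved_tflops = 6 * params * tokens_per_second / 1e12
    return achieved_tflops / peak_tflops

# A hypothetical 1B-parameter model at 20,000 tokens/s on a 165 TFLOPS (FP16) card
print(f"MFU: {training_mfu(1e9, 20_000, 165):.1%}")
```

Well-tuned training runs commonly sit in the 30-60% MFU range; a synthetic benchmark like the one above can score higher because it skips data loading and loss computation.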
Multi-GPU Scaling Benchmark
If your server has multiple GPUs in a multi-GPU cluster, measure inter-GPU bandwidth and scaling efficiency:
# Test NVLink bandwidth between GPUs
nvidia-smi nvlink -s
# Benchmark peer-to-peer GPU transfer
python3 << 'PYEOF'
import torch
import time

num_gpus = torch.cuda.device_count()
print(f"Testing {num_gpus} GPUs")
size = 256 * 1024 * 1024  # 256M elements = ~1GB float32
iterations = 50

for src_gpu in range(num_gpus):
    for dst_gpu in range(num_gpus):
        if src_gpu == dst_gpu:
            continue
        src_tensor = torch.randn(size, device=f"cuda:{src_gpu}")
        dst_tensor = torch.empty(size, device=f"cuda:{dst_gpu}")
        # Warmup
        for _ in range(5):
            dst_tensor.copy_(src_tensor)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iterations):
            dst_tensor.copy_(src_tensor)
        torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
        bandwidth = (size * 4 * iterations) / elapsed / 1e9
        print(f"  GPU {src_gpu} -> GPU {dst_gpu}: {bandwidth:.1f} GB/s")
PYEOF
For details on setting up multi-GPU inference, see the multi-GPU server setup guide. Monitor GPU metrics during benchmarks using our GPU monitoring tutorial.
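Once you have single-GPU and multi-GPU throughput numbers, scaling efficiency is the aggregate figure divided by n times the single-GPU figure. A minimal sketch (the sample throughputs are illustrative):

```python
def scaling_efficiency(single_gpu_tps: float, multi_gpu_tps: float,
                       num_gpus: int) -> float:
    """1.0 means perfect linear scaling; interconnect and sync overheads pull it lower."""
    return multi_gpu_tps / (single_gpu_tps * num_gpus)

# e.g. one GPU sustains 1200 tok/s while four GPUs together sustain 4100 tok/s
print(f"4-GPU scaling efficiency: {scaling_efficiency(1200, 4100, 4):.0%}")
```

NVLink-connected GPUs generally scale better than PCIe-only topologies, which is why the `nvidia-smi topo -m` output from the baseline step matters when interpreting this number.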
Interpreting Results and Next Steps
Compare your benchmark results against these reference ranges for common GPUs:
| GPU | FP16 TFLOPS (expected) | Memory BW (GB/s) | Llama-3.1-8B tok/s |
|---|---|---|---|
| RTX 4090 | ~165 | ~1008 | ~90-110 |
| A100 80 GB | ~312 | ~2039 | ~120-150 |
| H100 SXM | ~990 | ~3350 | ~200-280 |
| L40S | ~366 | ~864 | ~80-100 |
If your results fall significantly below expected values, check for thermal throttling, PCIe bandwidth limitations, or driver issues. Use the cost comparison tools to understand the economics: the cost per million tokens calculator and the GPU vs OpenAI cost comparison. For the cheapest options, see cheapest GPU for AI inference. More benchmarking content is available in the benchmarks category.
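Thermal throttling is the most common culprit, and nvidia-smi reports it directly via `nvidia-smi -q -d PERFORMANCE`. A small sketch that scans that report for active throttle reasons; the exact field labels vary slightly across driver versions, so treat this as a starting point rather than a definitive parser:

```python
import subprocess

def active_throttle_reasons(report: str) -> list[str]:
    """Collect throttle-reason lines marked Active in an `nvidia-smi -q` report."""
    reasons = []
    for line in report.splitlines():
        if ":" in line:
            name, _, value = line.partition(":")
            # "Idle : Active" just means the GPU is idle, not throttled
            if value.strip() == "Active" and "Idle" not in name:
                reasons.append(name.strip())
    return reasons

# On the server itself, feed it live output:
# report = subprocess.run(["nvidia-smi", "-q", "-d", "PERFORMANCE"],
#                         capture_output=True, text=True).stdout
sample = """
    Clocks Throttle Reasons
        Idle                              : Not Active
        SW Power Cap                      : Active
        HW Thermal Slowdown               : Not Active
"""
print(active_throttle_reasons(sample))  # ['SW Power Cap']
```

An active power cap suggests raising the power limit (if the chassis cooling allows it), while an active thermal slowdown points at airflow or ambient temperature.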
Get Benchmark-Verified GPU Servers
Every GigaGPU server is tested for peak performance before delivery. Choose from a range of GPU configurations optimised for AI workloads.
Browse GPU Servers