Step 1: Define Your AI Workload Type
Choosing the right GPU server starts with understanding exactly what you need it to do. AI workloads fall into distinct categories, each with different hardware demands. A dedicated GPU hosting server optimised for inference looks very different from one built for large-scale training. Making the wrong choice means either overspending on hardware you do not need or under-provisioning and hitting performance walls that block production deployment.
The four primary workload categories each stress different parts of the system. Inference workloads prioritise GPU memory and single-stream throughput. Training workloads demand raw compute power and fast GPU interconnects. Fine-tuning sits between the two, requiring significant VRAM but less sustained compute than training from scratch. Batch processing workloads are throughput-oriented and benefit from high parallelism.
| Workload Type | Primary Bottleneck | GPU Priority | Other Critical Specs |
|---|---|---|---|
| LLM inference | VRAM, memory bandwidth | High VRAM, fast memory | Fast storage for model loading |
| Model training | Compute (FLOPS) | High compute + VRAM | Multi-GPU interconnect, large RAM |
| Fine-tuning (LoRA/QLoRA) | VRAM | Sufficient VRAM for model + adapters | Moderate storage |
| Image/video generation | VRAM, compute | Balanced VRAM and FLOPS | Fast storage for outputs |
| Batch embedding/processing | Throughput | Multiple GPUs for parallelism | High storage I/O |
Step 2: Calculate Your VRAM Requirements
VRAM is the single most important specification for AI workloads. If your model does not fit in GPU memory, no amount of compute power will help. The relationship between model parameters and VRAM usage follows predictable patterns that make planning straightforward.
For inference, a model stored in 16-bit (FP16/BF16) precision requires approximately 2 bytes per parameter. A 7B model needs roughly 14 GB, a 13B model needs 26 GB, and a 70B model requires about 140 GB. Quantisation reduces these requirements significantly: 4-bit quantisation cuts memory to approximately 0.5 bytes per parameter, bringing a 7B model down to around 3.5 GB and a 70B model to roughly 35 GB.
| Model Size | FP16 VRAM | 8-bit VRAM | 4-bit VRAM | Minimum GPU |
|---|---|---|---|---|
| 3B parameters | ~6 GB | ~3 GB | ~1.5 GB | RTX 3050 (8 GB) |
| 7B parameters | ~14 GB | ~7 GB | ~3.5 GB | RTX 3090 (24 GB) |
| 13B parameters | ~26 GB | ~13 GB | ~6.5 GB | RTX 3090 (4-bit) or 2x 24 GB GPUs |
| 34B parameters | ~68 GB | ~34 GB | ~17 GB | 2x RTX 5090 or RTX 6000 Pro |
| 70B parameters | ~140 GB | ~70 GB | ~35 GB | 4x RTX 5090 (4-bit) or 2x RTX 6000 Pro |
Remember to add headroom for KV cache during inference, which can consume 2-8 GB depending on batch size and context length. The best GPU for LLM inference guide provides more detailed memory calculations for specific models.
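The arithmetic above is simple enough to script. The sketch below applies the bytes-per-parameter rule from this section plus a flat KV-cache allowance (the 4 GB default is an assumption in the middle of the 2-8 GB range quoted above; tune it for your batch size and context length):

```python
def estimate_vram_gb(params_billion: float, bits: int, kv_cache_gb: float = 4.0) -> float:
    """Estimate inference VRAM: weights at the given precision plus KV-cache headroom."""
    bytes_per_param = bits / 8                     # FP16 -> 2 bytes, 8-bit -> 1, 4-bit -> 0.5
    weights_gb = params_billion * bytes_per_param  # 1e9 params x bytes / 1e9 bytes per GB
    return weights_gb + kv_cache_gb

# 7B at FP16: ~14 GB of weights + 4 GB cache headroom
print(estimate_vram_gb(7, 16))   # 18.0
# 70B at 4-bit: ~35 GB of weights + 4 GB cache headroom
print(estimate_vram_gb(70, 4))   # 39.0
```

If the result exceeds the VRAM of any single card you are considering, that is your cue to look at quantisation or multi-GPU options in the next step.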
Step 3: Select the Right GPU
With VRAM requirements established, narrow down your GPU options. The consumer and professional GPU markets offer cards at different price-to-performance ratios, and the right choice depends on your budget and workload characteristics.
For budget-conscious deployments running smaller models, the RTX 3050 provides an entry point for lightweight inference. The RTX 3090 remains the best value for 24 GB VRAM workloads, while the RTX 5090 offers substantially higher compute throughput and a larger 32 GB frame buffer. See the RTX 3090 vs RTX 5090 comparison for detailed benchmarks.
| GPU | VRAM | FP16 TFLOPS | Memory Bandwidth | Best For |
|---|---|---|---|---|
| RTX 3050 | 8 GB | ~9 | 224 GB/s | Small model inference, testing |
| RTX 4060 | 8 GB | ~15 | 272 GB/s | Efficient small model serving |
| RTX 3090 | 24 GB | ~36 | 936 GB/s | 7B-13B inference, fine-tuning |
| RTX 5090 | 32 GB | ~105 | 1,792 GB/s | High-throughput inference, training |
For workloads requiring more than 24 GB on a single card, multi-GPU configurations using tensor parallelism distribute the model across multiple cards. The single vs multi-GPU scaling guide covers when this transition makes sense.
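A quick way to size a tensor-parallel setup is to divide required VRAM by per-card VRAM and round up. The sketch below does this with a ~10% allowance for framework buffers and uneven sharding (the 10% figure is an assumption; real overhead varies by serving stack):

```python
import math

def cards_needed(required_vram_gb: float, per_card_vram_gb: float,
                 overhead: float = 1.1) -> int:
    """Minimum identical cards to hold a model under tensor parallelism,
    inflating the requirement by ~10% for buffers (assumed, tune per stack)."""
    return math.ceil(required_vram_gb * overhead / per_card_vram_gb)

print(cards_needed(140, 24))  # 70B in FP16 on 24 GB cards -> 7
print(cards_needed(35, 24))   # 70B at 4-bit on 24 GB cards -> 2
```

Note that tensor parallelism usually performs best with power-of-two card counts, so an answer of 7 in practice means provisioning 8.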
Step 4: CPU and System RAM Considerations
While the GPU handles the heavy compute, the CPU and system RAM play supporting roles that can become bottlenecks if under-provisioned. Data preprocessing, tokenisation, and request handling all run on the CPU. For inference servers, a modern 8-16 core processor is typically sufficient. Training workloads with complex data pipelines benefit from higher core counts.
System RAM should be at least 2x your total GPU VRAM to allow comfortable model loading and data staging. A server with 24 GB of GPU VRAM should have 64 GB of system RAM minimum. For deep learning training with large datasets, 128 GB or more prevents data loading from bottlenecking GPU utilisation.
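The 2x-VRAM rule with a 64 GB floor, as described above, reduces to a one-liner (a sketch of this guide's rule of thumb, not a hard requirement):

```python
def system_ram_gb(total_gpu_vram_gb: float) -> int:
    """Size system RAM at 2x total GPU VRAM, with a 64 GB practical floor."""
    return int(max(64, 2 * total_gpu_vram_gb))

print(system_ram_gb(24))  # single 24 GB card -> 64
print(system_ram_gb(96))  # four 24 GB cards -> 192
```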
Step 5: Storage Type and Capacity
Storage affects two critical operations: model loading time and dataset I/O during training. NVMe SSDs are strongly recommended for AI workloads. A 70B model in FP16 occupies roughly 140 GB on disk; loading this from an NVMe drive takes seconds, while a traditional SATA SSD or HDD would take significantly longer. Read the NVMe vs SSD comparison for AI for detailed throughput benchmarks.
Capacity planning should account for model weights (multiple versions if you are experimenting), datasets, checkpoints during training, and output storage. A minimum of 1 TB NVMe is recommended for most AI workloads, with 2-4 TB preferred for training pipelines that generate frequent checkpoints.
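Those components can be summed directly. The sketch below adds a 25% free-space headroom on top (the headroom figure and the example sizes are illustrative assumptions, not measurements):

```python
def storage_gb(model_gb: float, versions: int, dataset_gb: float,
               checkpoint_gb: float, checkpoints_kept: int,
               headroom: float = 1.25) -> float:
    """Sum model versions, datasets and retained checkpoints,
    then add 25% free-space headroom (assumed margin)."""
    raw = model_gb * versions + dataset_gb + checkpoint_gb * checkpoints_kept
    return raw * headroom

# e.g. a 14 GB model in 3 versions, a 200 GB dataset, five 28 GB checkpoints
print(storage_gb(14, 3, 200, 28, 5))  # 477.5 -> a 1 TB drive is comfortable
```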
Step 6: Networking and Bandwidth
For inference servers handling API requests, network latency and bandwidth directly affect end-user experience. A 1 Gbps connection is sufficient for most inference APIs. High-throughput batch processing or serving large model outputs (images, long text) may benefit from 10 Gbps connectivity. The GPU server networking guide covers network architecture in detail.
Multi-GPU setups have additional networking considerations. GPU-to-GPU communication for tensor parallelism benefits from high-speed interconnects. Within a single server, PCIe Gen 4 provides adequate bandwidth for most inference configurations, while NVLink offers far higher GPU-to-GPU throughput for communication-heavy multi-GPU training.
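A back-of-the-envelope transfer-time calculation makes the interconnect differences concrete. The peak bandwidth figures in the comments are assumptions for illustration (real links deliver less than peak, and exact numbers vary by generation):

```python
def transfer_seconds(size_gb: float, bandwidth_gb_s: float) -> float:
    """Idealised time to move data over a link at its peak bandwidth."""
    return size_gb / bandwidth_gb_s

# Moving 14 GB of weights or activations (peak figures assumed; check your hardware):
print(round(transfer_seconds(14, 112.0), 2))  # NVLink bridge, ~112 GB/s: ~0.13 s
print(round(transfer_seconds(14, 31.5), 2))   # PCIe Gen4 x16, ~31.5 GB/s: ~0.44 s
print(round(transfer_seconds(14, 0.125), 0))  # 1 Gbps network, ~0.125 GB/s: ~112 s
```

The three-orders-of-magnitude spread is why model shards stay inside the server while only prompts and completions cross the network.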
Putting It All Together: Configuration Recommendations
| Use Case | GPU | CPU Cores | RAM | Storage |
|---|---|---|---|---|
| 7B model inference API | 1x RTX 3090 | 8-12 | 64 GB | 1 TB NVMe |
| 13B model inference (quantised) | 1x RTX 5090 | 12-16 | 64 GB | 1 TB NVMe |
| 70B model inference | 4x RTX 5090 | 16-32 | 256 GB | 2 TB NVMe |
| 7B model fine-tuning | 1-2x RTX 5090 | 12-16 | 128 GB | 2 TB NVMe |
| Image generation service | 1x RTX 5090 | 8-12 | 64 GB | 2 TB NVMe |
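The sizing rules from the earlier steps can be combined into a quick sanity check for a candidate configuration. This is a sketch of this guide's rules of thumb, not hard limits:

```python
def check_config(model_vram_gb: float, gpu_vram_gb: float, num_gpus: int,
                 ram_gb: float, storage_tb: float) -> list[str]:
    """Flag components that fall below this guide's rules of thumb."""
    issues = []
    total_vram = gpu_vram_gb * num_gpus
    if total_vram < model_vram_gb:
        issues.append("not enough total VRAM for the model")
    if ram_gb < max(64, 2 * total_vram):
        issues.append("system RAM below 2x total VRAM (64 GB floor)")
    if storage_tb < 1:
        issues.append("less than 1 TB NVMe")
    return issues

# 7B FP16 (~18 GB with cache) on 1x 24 GB card, 64 GB RAM, 1 TB NVMe
print(check_config(18, 24, 1, 64, 1))  # [] -> configuration passes
```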
GigaGPU offers all of these configurations as bare-metal servers with fixed monthly pricing and a 99.9% uptime SLA. Every server ships with full root access, allowing you to install any framework, from PyTorch to vLLM, without restrictions. Use the GPU comparisons tool to evaluate specific cards side by side, and check the GPU comparisons blog category for in-depth hardware reviews.
Find Your Ideal GPU Server Configuration
Dedicated bare-metal GPU servers tailored to your AI workload. UK datacentres, fixed pricing, 99.9% SLA, and full root access.
Browse GPU Servers