
vLLM on RTX 5090: Maximum Throughput Configuration

Configure vLLM on the RTX 5090 for maximum throughput. 32GB GDDR7, 1792 GB/s bandwidth, and Blackwell tensor cores enable FP16 serving of 13B+ models at record speeds.

The RTX 5090 Throughput Advantage

The RTX 5090 is the fastest consumer GPU for vLLM inference. With 32GB GDDR7 delivering approximately 1,792 GB/s bandwidth, it nearly doubles the memory throughput of the RTX 3090 while adding 33% more VRAM. On a dedicated GPU server, this means running 13B models in full FP16 with massive KV cache headroom, or 34B models in INT4 with comfortable context lengths.

For production LLM serving, the 5090 eliminates the compromises that smaller GPUs force. No quantisation needed for models up to 13B, long context support without VRAM pressure, and batch throughput that handles dozens of concurrent users. If you are evaluating the upgrade path, see our RTX 3090 to 5090 upgrade guide.
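Single-stream decode is memory-bandwidth bound: every generated token streams the full weight set from VRAM, so a quick roofline estimate shows what the 1,792 GB/s figure buys. This is a sketch with round illustrative numbers, not a measurement:

```python
# Rough bandwidth ceiling for single-stream decode: each token reads
# all model weights once, so tokens/s <= bandwidth / model size.
BANDWIDTH_GBS = 1792  # RTX 5090 GDDR7

def decode_ceiling(params_billions: float, bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed, in tokens/s."""
    model_gb = params_billions * bytes_per_param
    return BANDWIDTH_GBS / model_gb

print(f"13B FP16 ceiling: ~{decode_ceiling(13, 2):.0f} tok/s")
print(f"8B FP16 ceiling:  ~{decode_ceiling(8, 2):.0f} tok/s")
```

Measured single-stream numbers land close to these bounds, which is why batching, amortising each weight read across many sequences, is where the real throughput gains come from.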

Setup and Installation

# Verify RTX 5090 with CUDA 12.8+
nvidia-smi
# NVIDIA GeForce RTX 5090, 32GB, CUDA 12.8

# Install vLLM
pip install vllm --upgrade

# Quick test — Llama 3 13B in full FP16
vllm serve meta-llama/Llama-3-13B-Instruct \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000

A 13B model in FP16 occupies roughly 26GB of weights, and at 90% memory utilisation the 5090 still has close to 3GB left for KV cache, enough for 16K+ context on grouped-query-attention models without any quantisation. For driver setup, see our CUDA installation guide.
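A back-of-envelope KV-cache budget makes the context figure concrete. The architecture numbers below (40 layers, 8 KV heads, head dimension 128) are assumptions for a typical 13B-class model with grouped-query attention, not published specs:

```python
# Estimate KV-cache capacity after loading 13B FP16 weights.
# Assumed architecture: 40 layers, 8 KV heads (GQA), head_dim 128.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 40, 8, 128, 2

kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # K and V
weights_gb = 13 * 2                      # 13B params x 2 bytes
kv_pool_gb = 32 * 0.90 - weights_gb      # --gpu-memory-utilization 0.90

max_context = int(kv_pool_gb * 1024**3 / kv_per_token)
print(f"{kv_per_token // 1024} KiB per token -> ~{max_context:,} tokens of KV cache")
```

Around 18K cached tokens comfortably covers the 16K `--max-model-len` above; an older multi-head-attention design with 40 KV heads would need five times the memory per token.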

Maximum Throughput Configuration

To maximise throughput on the 5090, tune vLLM for high concurrency and large batches:

# High-throughput config for Llama 3 8B
vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --max-model-len 32768 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.93 \
  --host 0.0.0.0 --port 8000

# 13B FP16 production config
vllm serve meta-llama/Llama-3-13B-Instruct \
  --enable-prefix-caching \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92

# 34B INT4 for code or reasoning tasks
vllm serve TheBloke/CodeLlama-34B-GPTQ \
  --quantization gptq \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.93

With 32GB available, the 5090 can run an 8B FP16 model with up to 64 concurrent sequences and a 32K context window; vLLM's paged KV cache is shared across the batch, so sequences borrow space as they grow rather than each reserving the full 32K. This is the kind of headroom that turns a single GPU into a production-grade inference endpoint.
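It helps to see how the shared KV pool divides across `--max-num-seqs`. Because vLLM allocates KV blocks on demand, `--max-model-len` caps each sequence without reserving memory up front. A sketch using Llama 3 8B's attention shape (32 layers, 8 KV heads via GQA, head dimension 128):

```python
# Aggregate KV capacity for Llama 3 8B FP16 at 93% utilisation.
LAYERS, KV_HEADS, HEAD_DIM, BYTES_FP16 = 32, 8, 128, 2

kv_per_token = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_FP16  # 128 KiB
weights_gb = 16                          # ~8B params x 2 bytes
kv_pool_gb = 32 * 0.93 - weights_gb      # ~13.8 GB left for KV cache

total_tokens = int(kv_pool_gb * 1024**3 / kv_per_token)
print(f"~{total_tokens:,} cached tokens shared across the batch")
print(f"~{total_tokens // 64:,} tokens/seq average at --max-num-seqs 64")
```

So the 64-sequence, 32K-context config works because typical chat turns are short: a handful of sequences can stretch toward 32K while the scheduler preempts or queues the rest.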

Benchmark Results by Model

| Model | Precision | VRAM Used | Single (t/s) | Batch 8 (t/s) | Batch 32 (t/s) |
|---|---|---|---|---|---|
| Llama 3 8B | FP16 | 16.2 GB | ~115 | ~440 | ~850 |
| Llama 3 8B | FP4 | 4.5 GB | ~160 | ~580 | ~1050 |
| Llama 3 13B | FP16 | 26 GB | ~68 | ~250 | ~420 |
| Mistral 7B | FP16 | 14.8 GB | ~120 | ~460 | ~880 |
| CodeLlama 34B | INT4 | 20 GB | ~38 | ~130 | ~210 |
| DeepSeek R1 7B | FP16 | 14.5 GB | ~110 | ~420 | ~810 |
| Qwen 2.5 14B | FP16 | 28 GB | ~55 | ~190 | ~310 |

At batch size 32, the RTX 5090 pushes over 850 tokens per second for Llama 3 8B in FP16. That is enough to serve 30+ concurrent chat users at 25+ tokens/s each. Check the tokens-per-second benchmark for cross-GPU comparisons.
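The per-user figure is simple division, under the assumption that aggregate throughput holds roughly steady as the batch fills:

```python
# Convert aggregate batch throughput into per-user token rates.
batch_tps = 850        # Llama 3 8B FP16, batch 32 (table above)
concurrent_users = 32

per_user_tps = batch_tps / concurrent_users
max_users_at_25 = batch_tps // 25   # 25 tok/s ~ comfortable chat speed

print(f"~{per_user_tps:.1f} tok/s per user; "
      f"up to ~{max_users_at_25} users at 25 tok/s each")
```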

Multi-Model Serving on 32GB

The 5090’s 32GB enables running multiple smaller models simultaneously, or one large model alongside a utility model:

# Run both a chat model and an embedding model
# Chat model on port 8000
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.50 \
  --max-model-len 8192 \
  --port 8000 &

# Embedding model on port 8001 (0.25 of 32GB, ~8GB)
vllm serve BAAI/bge-large-en-v1.5 \
  --gpu-memory-utilization 0.25 \
  --port 8001 &

This configuration supports AI search and chatbot workloads from a single GPU. For full RAG pipeline architecture, see the LangChain RAG guide.
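Before launching two servers on one card, it is worth sanity-checking the split: `--gpu-memory-utilization` is a fraction of total VRAM, so the fractions must sum below 1.0 with slack for CUDA context overhead. A minimal check using the fractions above:

```python
# Verify two vLLM instances will not over-commit a 32GB card.
TOTAL_GB = 32
allocations = {"chat (Llama 3 8B)": 0.50, "embeddings (bge-large)": 0.25}

committed = sum(allocations.values())
assert committed < 1.0, "over-committed VRAM"

for name, frac in allocations.items():
    print(f"{name}: {frac * TOTAL_GB:.0f} GB")

slack_gb = TOTAL_GB * (1 - committed)
print(f"slack for CUDA context and fragmentation: {slack_gb:.0f} GB")
```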

When to Choose the 5090

Choose the RTX 5090 for vLLM when you need 13B-14B models without quantisation, 34B models with reasonable context, high-concurrency batch serving, or long context lengths exceeding 16K tokens. If your workloads fit in 24GB, the RTX 3090 remains excellent value. For 16GB workloads at maximum speed, the RTX 5080 offers the best per-token cost.

Explore the full range of deployment options in the tutorials section and calculate your hosting costs with the LLM cost calculator.
