The RTX 5090 Throughput Advantage
The RTX 5090 is the fastest consumer GPU for vLLM inference. With 32GB GDDR7 delivering approximately 1,792 GB/s bandwidth, it nearly doubles the memory throughput of the RTX 3090 while adding 33% more VRAM. On a dedicated GPU server, this means running 13B models in full FP16 with massive KV cache headroom, or 34B models in INT4 with comfortable context lengths.
For production LLM serving, the 5090 eliminates the compromises that smaller GPUs force. No quantisation needed for models up to 13B, long context support without VRAM pressure, and batch throughput that handles dozens of concurrent users. If you are evaluating the upgrade path, see our RTX 3090 to 5090 upgrade guide.
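A quick back-of-envelope check makes that headroom concrete: FP16 weights take about 2 bytes per parameter, and whatever vLLM's memory budget does not spend on weights is available for KV cache. A minimal sketch under those assumptions (the 0.90 utilisation figure matches the config further down; activation and framework overhead are ignored):

```python
def fp16_weight_gb(params_b: float) -> float:
    # FP16 stores 2 bytes per parameter, so a 13B model needs ~26GB
    return params_b * 2

def kv_headroom_gb(vram_gb: float, params_b: float, util: float = 0.90) -> float:
    # vLLM pre-allocates util * vram_gb; weights come out of that budget first,
    # and the remainder is carved into KV cache blocks
    return vram_gb * util - fp16_weight_gb(params_b)

print(fp16_weight_gb(13))                # 26
print(round(kv_headroom_gb(32, 13), 1))  # 2.8
print(round(kv_headroom_gb(32, 8), 1))   # 12.8
```

The same maths explains why an 8B model is the sweet spot for high-concurrency serving on this card: an 8B model leaves several times more KV cache budget than a 13B one.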
Setup and Installation
```bash
# Verify RTX 5090 with CUDA 12.8+
nvidia-smi
# NVIDIA GeForce RTX 5090, 32GB, CUDA 12.8

# Install vLLM
pip install vllm --upgrade

# Quick test: Llama 3 13B in full FP16
vllm serve meta-llama/Llama-3-13B-Instruct \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000
```
A 13B model's FP16 weights occupy roughly 26GB, leaving about 6GB of the 5090's 32GB for KV cache: enough for the 16K context configured above at modest concurrency, with no quantisation needed. For driver setup, see our CUDA installation guide.
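Once the server is up it speaks the OpenAI-compatible HTTP API that vLLM exposes. A minimal stdlib client sketch (the `/v1/chat/completions` path and response shape follow that API; the model name must match whatever you passed to `vllm serve`):

```python
import json
from urllib import request

def chat_payload(model: str, prompt: str, max_tokens: int = 128) -> dict:
    # Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(base_url: str, payload: dict) -> str:
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server from above running:
# print(ask("http://localhost:8000",
#           chat_payload("meta-llama/Llama-3-13B-Instruct", "Hello!")))
```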
Maximum Throughput Configuration
To maximise throughput on the 5090, tune vLLM for high concurrency and large batches:
```bash
# High-throughput config for Llama 3 8B
vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --max-model-len 32768 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.93 \
  --host 0.0.0.0 --port 8000

# 13B FP16 production config
vllm serve meta-llama/Llama-3-13B-Instruct \
  --enable-prefix-caching \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92

# 34B INT4 for code or reasoning tasks
vllm serve TheBloke/CodeLlama-34B-GPTQ \
  --quantization gptq \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.93
```
With 32GB available, the 5090 can run an 8B FP16 model with 64 concurrent sequences and 32K context. This is the kind of headroom that turns a single GPU into a production-grade inference endpoint.
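That headroom can be quantified. Each cached token costs a fixed number of bytes determined by the model architecture, so the VRAM left after weights translates directly into a token budget that vLLM's continuous batching shares across sequences. A rough sketch, assuming the published Llama 3 8B shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and the ~16.2GB FP16 weight figure from the benchmark table:

```python
def kv_bytes_per_token(layers: int = 32, kv_heads: int = 8,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # Keys and values (x2), FP16 (2 bytes), per layer per KV head
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_token_budget(vram_gb: float, weights_gb: float, util: float = 0.93) -> int:
    # VRAM inside vLLM's budget but not used by weights holds KV blocks
    free_bytes = (vram_gb * util - weights_gb) * 1024**3
    return int(free_bytes // kv_bytes_per_token())

print(kv_bytes_per_token())       # 131072 bytes, i.e. 128 KiB per token
print(kv_token_budget(32, 16.2))  # roughly 110k cacheable tokens
```

Roughly 110k cacheable tokens means 64 active sequences average ~1.7k resident tokens each; when long contexts exhaust the pool, vLLM preempts and reschedules sequences rather than crashing, so the 64-sequence, 32K-context config stays safe.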
Benchmark Results by Model
| Model | Precision | VRAM Used | Single (t/s) | Batch 8 (t/s) | Batch 32 (t/s) |
|---|---|---|---|---|---|
| Llama 3 8B | FP16 | 16.2 GB | ~115 | ~440 | ~850 |
| Llama 3 8B | FP4 | 4.5 GB | ~160 | ~580 | ~1050 |
| Llama 3 13B | FP16 | 26 GB | ~68 | ~250 | ~420 |
| Mistral 7B | FP16 | 14.8 GB | ~120 | ~460 | ~880 |
| CodeLlama 34B | INT4 | 20 GB | ~38 | ~130 | ~210 |
| DeepSeek R1 7B | FP16 | 14.5 GB | ~110 | ~420 | ~810 |
| Qwen 2.5 14B | FP16 | 28 GB | ~55 | ~190 | ~310 |
At batch size 32, the RTX 5090 sustains around 850 tokens per second for Llama 3 8B in FP16: enough to serve 30+ concurrent chat users at 25+ tokens/s each. Check the tokens-per-second benchmark for cross-GPU comparisons.
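The per-user figure is just the aggregate decode rate divided across active sequences, which holds approximately under continuous batching when all users are generating at once:

```python
def per_user_tps(aggregate_tps: float, users: int) -> float:
    # Decode throughput is shared roughly evenly across active sequences
    return aggregate_tps / users

print(per_user_tps(850, 32))  # 26.5625, comfortably above human reading speed
```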
Multi-Model Serving on 32GB
The 5090’s 32GB enables running multiple smaller models simultaneously, or one large model alongside a utility model:
```bash
# Run both a chat model and an embedding model
# Chat model on port 8000 (0.50 utilisation, ~16GB)
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.50 \
  --max-model-len 8192 \
  --port 8000 &

# Embedding model on port 8001 (0.25 utilisation, ~8GB)
vllm serve BAAI/bge-large-en-v1.5 \
  --gpu-memory-utilization 0.25 \
  --port 8001 &
```
This configuration supports AI search and chatbot workloads from a single GPU. For full RAG pipeline architecture, see the LangChain RAG guide.
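To consume both endpoints from one process, route embedding calls to port 8001 and generation to port 8000. A stdlib sketch (the `/v1/embeddings` path and response shape follow the OpenAI format vLLM exposes for embedding models; `rag_prompt` is a hypothetical helper for stuffing retrieved chunks into the chat request):

```python
import json
from urllib import request

def _post(url: str, payload: dict) -> dict:
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

def embed(text: str) -> list:
    # Embedding server on port 8001
    out = _post("http://localhost:8001/v1/embeddings",
                {"model": "BAAI/bge-large-en-v1.5", "input": [text]})
    return out["data"][0]["embedding"]

def rag_prompt(question: str, chunks: list) -> str:
    # Hypothetical helper: join retrieved chunks into a grounded prompt
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"

def answer(question: str, chunks: list) -> str:
    # Chat server on port 8000
    out = _post("http://localhost:8000/v1/chat/completions",
                {"model": "meta-llama/Llama-3-8B-Instruct",
                 "messages": [{"role": "user",
                               "content": rag_prompt(question, chunks)}]})
    return out["choices"][0]["message"]["content"]

# With both servers running:
# vec = embed("What VRAM does the 5090 have?")  # search your vector store with vec
# print(answer("What VRAM does the 5090 have?", ["The RTX 5090 has 32GB GDDR7."]))
```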
When to Choose the 5090
Choose the RTX 5090 for vLLM when you need 13B-14B models without quantisation, 34B models with reasonable context, high-concurrency batch serving, or long context lengths exceeding 16K tokens. If your workloads fit in 24GB, the RTX 3090 remains excellent value. For 16GB workloads at maximum speed, the RTX 5080 offers the best per-token cost.
Explore the full range of deployment options in the tutorials section and calculate your hosting costs with the LLM cost calculator.