The RTX 5090 Throughput Advantage
The RTX 5090 is the fastest consumer GPU for vLLM inference. With 32GB GDDR7 delivering approximately 1,792 GB/s bandwidth, it nearly doubles the memory throughput of the RTX 3090 while adding 33% more VRAM. On a dedicated GPU server, this means running 13B models in full FP16 with massive KV cache headroom, or 34B models in INT4 with comfortable context lengths.
For production LLM serving, the 5090 eliminates the compromises that smaller GPUs force. No quantisation needed for models up to 13B, long context support without VRAM pressure, and batch throughput that handles dozens of concurrent users. If you are evaluating the upgrade path, see our RTX 3090 to 5090 upgrade guide.
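A quick back-of-envelope check makes that headroom concrete: FP16 weights take about 2 bytes per parameter, and whatever vLLM's memory budget does not spend on weights is available for KV cache. A minimal sketch under those assumptions (the 0.90 utilisation figure matches the config further down; activation and framework overhead are ignored):

```python
def fp16_weight_gb(params_b: float) -> float:
    # FP16 stores 2 bytes per parameter, so a 13B model needs ~26GB
    return params_b * 2

def kv_headroom_gb(vram_gb: float, params_b: float, util: float = 0.90) -> float:
    # vLLM pre-allocates util * vram_gb; weights come out of that budget first,
    # and the remainder is carved into KV cache blocks
    return vram_gb * util - fp16_weight_gb(params_b)

print(fp16_weight_gb(13))                # 26
print(round(kv_headroom_gb(32, 13), 1))  # 2.8
print(round(kv_headroom_gb(32, 8), 1))   # 12.8
```

The same maths explains why an 8B model is the sweet spot for high-concurrency serving on this card: an 8B model leaves several times more KV cache budget than a 13B one.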
Setup and Installation
```bash
# Verify RTX 5090 with CUDA 12.8+
nvidia-smi
# NVIDIA GeForce RTX 5090, 32GB, CUDA 12.8

# Install vLLM
pip install vllm --upgrade

# Quick test: Llama 3 13B in full FP16
vllm serve meta-llama/Llama-3-13B-Instruct \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000
```
A 13B model's FP16 weights occupy roughly 26GB, leaving about 6GB of the 5090's 32GB for KV cache: enough for the 16K context configured above at modest concurrency, with no quantisation needed. For driver setup, see our CUDA installation guide.
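Once the server is up it speaks the OpenAI-compatible HTTP API that vLLM exposes. A minimal stdlib client sketch (the `/v1/chat/completions` path and response shape follow that API; the model name must match whatever you passed to `vllm serve`):

```python
import json
from urllib import request

def chat_payload(model: str, prompt: str, max_tokens: int = 128) -> dict:
    # Request body for vLLM's OpenAI-compatible /v1/chat/completions endpoint
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def ask(base_url: str, payload: dict) -> str:
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# With the server from above running:
# print(ask("http://localhost:8000",
#           chat_payload("meta-llama/Llama-3-13B-Instruct", "Hello!")))
```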
Maximum Throughput Configuration
To maximise throughput on the 5090, tune vLLM for high concurrency and large batches:
```bash
# High-throughput config for Llama 3 8B
vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --max-model-len 32768 \
  --max-num-seqs 64 \
  --gpu-memory-utilization 0.93 \
  --host 0.0.0.0 --port 8000

# 13B FP16 production config
vllm serve meta-llama/Llama-3-13B-Instruct \
  --enable-prefix-caching \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --gpu-memory-utilization 0.92

# 34B INT4 for code or reasoning tasks
vllm serve TheBloke/CodeLlama-34B-GPTQ \
  --quantization gptq \
  --max-model-len 8192 \
  --max-num-seqs 16 \
  --gpu-memory-utilization 0.93
```
With 32GB available, the 5090 can run an 8B FP16 model with 64 concurrent sequences and 32K context. This is the kind of headroom that turns a single GPU into a production-grade inference endpoint.
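That headroom can be quantified. Each cached token costs a fixed number of bytes determined by the model architecture, so the VRAM left after weights translates directly into a token budget that vLLM's continuous batching shares across sequences. A rough sketch, assuming the published Llama 3 8B shape (32 layers, 8 KV heads via grouped-query attention, head dimension 128) and the ~16.2GB FP16 weight figure from the benchmark table:

```python
def kv_bytes_per_token(layers: int = 32, kv_heads: int = 8,
                       head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # Keys and values (x2), FP16 (2 bytes), per layer per KV head
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def kv_token_budget(vram_gb: float, weights_gb: float, util: float = 0.93) -> int:
    # VRAM inside vLLM's budget but not used by weights holds KV blocks
    free_bytes = (vram_gb * util - weights_gb) * 1024**3
    return int(free_bytes // kv_bytes_per_token())

print(kv_bytes_per_token())       # 131072 bytes, i.e. 128 KiB per token
print(kv_token_budget(32, 16.2))  # roughly 110k cacheable tokens
```

Roughly 110k cacheable tokens means 64 active sequences average ~1.7k resident tokens each; when long contexts exhaust the pool, vLLM preempts and reschedules sequences rather than crashing, so the 64-sequence, 32K-context config stays safe.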
Benchmark Results by Model
| Model | Precision | VRAM Used | Single (t/s) | Batch 8 (t/s) | Batch 32 (t/s) |
|---|---|---|---|---|---|
| Llama 3 8B | FP16 | 16.2 GB | ~115 | ~440 | ~850 |
| Llama 3 8B | FP4 | 4.5 GB | ~160 | ~580 | ~1050 |
| Llama 3 13B | FP16 | 26 GB | ~68 | ~250 | ~420 |
| Mistral 7B | FP16 | 14.8 GB | ~120 | ~460 | ~880 |
| CodeLlama 34B | INT4 | 20 GB | ~38 | ~130 | ~210 |
| DeepSeek R1 7B | FP16 | 14.5 GB | ~110 | ~420 | ~810 |
| Qwen 2.5 14B | FP16 | 28 GB | ~55 | ~190 | ~310 |
At batch size 32, the RTX 5090 sustains around 850 tokens per second for Llama 3 8B in FP16: enough to serve 30+ concurrent chat users at 25+ tokens/s each. Check the tokens-per-second benchmark for cross-GPU comparisons.
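The per-user figure is just the aggregate decode rate divided across active sequences, which holds approximately under continuous batching when all users are generating at once:

```python
def per_user_tps(aggregate_tps: float, users: int) -> float:
    # Decode throughput is shared roughly evenly across active sequences
    return aggregate_tps / users

print(per_user_tps(850, 32))  # 26.5625, comfortably above human reading speed
```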
Multi-Model Serving on 32GB
The 5090’s 32GB enables running multiple smaller models simultaneously, or one large model alongside a utility model:
```bash
# Run both a chat model and an embedding model
# Chat model on port 8000 (0.50 utilisation, ~16GB)
vllm serve meta-llama/Llama-3-8B-Instruct \
  --gpu-memory-utilization 0.50 \
  --max-model-len 8192 \
  --port 8000 &

# Embedding model on port 8001 (0.25 utilisation, ~8GB)
vllm serve BAAI/bge-large-en-v1.5 \
  --gpu-memory-utilization 0.25 \
  --port 8001 &
```
This configuration supports AI search and chatbot workloads from a single GPU. For full RAG pipeline architecture, see the LangChain RAG guide.
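To consume both endpoints from one process, route embedding calls to port 8001 and generation to port 8000. A stdlib sketch (the `/v1/embeddings` path and response shape follow the OpenAI format vLLM exposes for embedding models; `rag_prompt` is a hypothetical helper for stuffing retrieved chunks into the chat request):

```python
import json
from urllib import request

def _post(url: str, payload: dict) -> dict:
    req = request.Request(url, data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)

def embed(text: str) -> list:
    # Embedding server on port 8001
    out = _post("http://localhost:8001/v1/embeddings",
                {"model": "BAAI/bge-large-en-v1.5", "input": [text]})
    return out["data"][0]["embedding"]

def rag_prompt(question: str, chunks: list) -> str:
    # Hypothetical helper: join retrieved chunks into a grounded prompt
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n\n{context}\n\nQuestion: {question}"

def answer(question: str, chunks: list) -> str:
    # Chat server on port 8000
    out = _post("http://localhost:8000/v1/chat/completions",
                {"model": "meta-llama/Llama-3-8B-Instruct",
                 "messages": [{"role": "user",
                               "content": rag_prompt(question, chunks)}]})
    return out["choices"][0]["message"]["content"]

# With both servers running:
# vec = embed("What VRAM does the 5090 have?")  # search your vector store with vec
# print(answer("What VRAM does the 5090 have?", ["The RTX 5090 has 32GB GDDR7."]))
```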
When to Choose the 5090
Choose the RTX 5090 for vLLM when you need 13B-14B models without quantisation, 34B models with reasonable context, high-concurrency batch serving, or long context lengths exceeding 16K tokens. If your workloads fit in 24GB, the RTX 3090 remains excellent value. For 16GB workloads at maximum speed, the RTX 5080 offers the best per-token cost.
Explore the full range of deployment options in the tutorials section and calculate your hosting costs with the LLM cost calculator.