
vLLM on RTX 3090: Setup, Config & Throughput Guide

Step-by-step guide to deploying vLLM on an RTX 3090 dedicated server. Covers installation, configuration tuning, and throughput benchmarks for 7B-34B models in 24GB VRAM.

Why RTX 3090 for vLLM

The RTX 3090 remains one of the best-value GPUs for running vLLM in production. With 24GB GDDR6X and 936 GB/s bandwidth, it handles 7B-8B models in FP16, 13B models comfortably once quantised, and even 34B models with INT4 quantisation. On a dedicated GPU server, you get full root access and no shared-tenancy overhead, which means predictable latency for every request.

vLLM is the go-to serving engine for production LLM inference. Its PagedAttention memory management and continuous batching make it significantly faster than naive HuggingFace serving. Combined with the RTX 3090’s generous VRAM, the pairing handles real workloads cost-effectively. For a broader comparison of inference engines, see our vLLM vs Ollama guide.

Installation and Environment Setup

Start with a clean Ubuntu 22.04 server with NVIDIA drivers and CUDA 12.x installed. If you need help with drivers, follow our CUDA installation guide.

# Create a Python virtual environment
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate

# Install vLLM with CUDA support
pip install vllm

# Verify GPU detection
python -c "import torch; print(torch.cuda.get_device_name(0))"
# Output: NVIDIA GeForce RTX 3090

For Docker deployments, use the official vLLM image:

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92
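Once the container is up, you can smoke-test it with curl against the OpenAI-compatible API that vLLM exposes. This sketch assumes the port mapping above (localhost:8000); the prompt is just a placeholder:

```shell
# Build the request payload; the model name must match the --model flag above
PAYLOAD='{"model":"meta-llama/Llama-3-8B-Instruct","messages":[{"role":"user","content":"Say hello"}],"max_tokens":16}'

# Validate the JSON locally before sending
python3 -c 'import json,sys; json.loads(sys.argv[1])' "$PAYLOAD" && echo "payload OK"

# Query the OpenAI-compatible chat endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```

Because the API follows the OpenAI spec, existing OpenAI client libraries work against this endpoint by pointing their base URL at the server.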

Configuration Tuning for 24GB

The RTX 3090’s 24GB requires careful memory allocation depending on your model size. Here are optimised configurations for common models:

# Llama 3 8B FP16 — leaves room for large KV cache
vllm serve meta-llama/Llama-3-8B-Instruct \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 32 \
  --host 0.0.0.0 --port 8000

# Llama 3 13B GPTQ INT4 — fits in 24GB with context
vllm serve TheBloke/Llama-3-13B-GPTQ \
  --quantization gptq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 16

# CodeLlama 34B INT4 — tight fit, limit context
vllm serve TheBloke/CodeLlama-34B-GPTQ \
  --quantization gptq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 8

Key parameters to tune:

- --gpu-memory-utilization controls how much VRAM vLLM reserves (0.90-0.95 works well on the 3090)
- --max-model-len caps context length, which directly determines KV cache size
- --max-num-seqs limits concurrent sequences to prevent out-of-memory errors under load
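To see why --max-model-len matters so much, you can estimate the KV cache footprint from the model's architecture. This is a back-of-envelope sketch using Llama 3 8B's published config (32 layers, 8 KV heads via grouped-query attention, head dimension 128):

```shell
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes/element
LAYERS=32; KV_HEADS=8; HEAD_DIM=128; FP16_BYTES=2
PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * FP16_BYTES ))
echo "KV cache per token: ${PER_TOKEN} bytes"        # 131072 bytes = 128 KiB

# A single sequence at the full 16384-token context:
PER_SEQ_MIB=$(( PER_TOKEN * 16384 / 1024 / 1024 ))
echo "Per full-length sequence: ${PER_SEQ_MIB} MiB"  # 2048 MiB
```

With roughly 16 GB of the reserved VRAM going to FP16 weights, only a few full-length 16K sequences fit in the remainder, which is why --max-num-seqs also needs a realistic cap.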

Throughput Benchmarks by Model

| Model          | Precision | VRAM Used | Tokens/s (Single) | Tokens/s (Batch 8) | Max Context |
|----------------|-----------|-----------|-------------------|--------------------|-------------|
| Llama 3 8B     | FP16      | 16.2 GB   | ~55               | ~210               | 16384       |
| Mistral 7B     | FP16      | 14.8 GB   | ~58               | ~225               | 16384       |
| Llama 3 13B    | FP16      | 26 GB     | OOM               | OOM                | —           |
| Llama 3 13B    | GPTQ INT4 | 8.5 GB    | ~38               | ~140               | 8192        |
| CodeLlama 34B  | GPTQ INT4 | 20 GB     | ~18               | ~52                | 4096        |
| DeepSeek R1 7B | FP16      | 14.5 GB   | ~52               | ~200               | 16384       |
| Qwen 2.5 7B    | FP16      | 15 GB     | ~54               | ~208               | 16384       |

Batch throughput is where vLLM excels. Continuous batching lets the 3090 serve 8 concurrent users at over 200 tokens per second aggregate for 7B-8B models. Compare these figures with other GPUs on the tokens-per-second benchmark tool.

Production Deployment Tips

For production vLLM on the RTX 3090, set up an nginx reverse proxy with TLS termination. Our nginx reverse proxy guide covers this in detail. Additional recommendations:

# Enable prefix caching for repeated prompts (system prompts, RAG contexts)
vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 --port 8000

Monitor GPU utilisation with nvidia-smi dmon -s u and set up alerting when VRAM usage exceeds 95%. For detailed monitoring configuration, see our GPU monitoring guide. Use the LLM cost calculator to compare your self-hosted costs against API pricing.
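A minimal cron-able alert along those lines can be built from nvidia-smi's query interface; the 95% threshold matches the recommendation above, and the alert action here (a plain echo) is a placeholder for your notifier of choice:

```shell
# Print an alert when VRAM usage crosses a percentage threshold
check_vram() {
  local used=$1 total=$2 threshold=$3
  local pct=$(( used * 100 / total ))
  if [ "$pct" -ge "$threshold" ]; then
    echo "ALERT: VRAM at ${pct}%"
  else
    echo "OK: VRAM at ${pct}%"
  fi
}

# Read current usage from the driver (MiB values, units stripped)
read -r USED TOTAL < <(nvidia-smi --query-gpu=memory.used,memory.total \
  --format=csv,noheader,nounits | head -1 | tr -d ',')
check_vram "$USED" "$TOTAL" 95
```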

Next Steps and Alternatives

If the RTX 3090 cannot fit your target model, consider the RTX 5090 with 32GB for larger models in FP16. For smaller workloads, Ollama on the RTX 3090 offers a simpler deployment path. The full vLLM production guide covers multi-model serving and advanced optimisation.

For a deeper dive into how vLLM compares with other serving frameworks, explore our guides on self-hosting LLMs and check the tutorials section for more deployment walkthroughs.

RTX 3090 Servers Ready for vLLM

24GB VRAM, full root access, pre-installed CUDA. Deploy vLLM in minutes on dedicated hardware.

Browse GPU Servers


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
