
vLLM on RTX 3090: Setup, Config & Throughput Guide

Step-by-step guide to deploying vLLM on an RTX 3090 dedicated server. Covers installation, configuration tuning, and throughput benchmarks for 7B-34B models in 24GB VRAM.

Why RTX 3090 for vLLM

The RTX 3090 remains one of the best-value GPUs for running vLLM in production. With 24GB GDDR6X and 936 GB/s bandwidth, it handles 7B-8B models in FP16, 13B models comfortably once quantised, and even 34B models with INT4 quantisation. On a dedicated GPU server, you get full root access and no shared-tenancy overhead, which means predictable latency for every request.

vLLM is the go-to serving engine for production LLM inference. Its PagedAttention memory management and continuous batching make it significantly faster than naive HuggingFace serving. Combined with the RTX 3090’s generous VRAM, the pairing handles real workloads cost-effectively. For a broader comparison of inference engines, see our vLLM vs Ollama guide.

Installation and Environment Setup

Start with a clean Ubuntu 22.04 server with NVIDIA drivers and CUDA 12.x installed. If you need help with drivers, follow our CUDA installation guide.

# Create a Python virtual environment
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate

# Install vLLM with CUDA support
pip install vllm

# Verify GPU detection
python -c "import torch; print(torch.cuda.get_device_name(0))"
# Output: NVIDIA GeForce RTX 3090

For Docker deployments, use the official vLLM image:

docker run --gpus all -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92
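Once the container is up, you can smoke-test it with curl against the OpenAI-compatible API that vLLM exposes. This sketch assumes the port mapping above (localhost:8000); the prompt is just a placeholder:

```shell
# Build the request payload; the model name must match the --model flag above
PAYLOAD='{"model":"meta-llama/Llama-3-8B-Instruct","messages":[{"role":"user","content":"Say hello"}],"max_tokens":16}'

# Validate the JSON locally before sending
python3 -c 'import json,sys; json.loads(sys.argv[1])' "$PAYLOAD" && echo "payload OK"

# Query the OpenAI-compatible chat endpoint
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD"
```

Because the API follows the OpenAI spec, existing OpenAI client libraries work against this endpoint by pointing their base URL at the server.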

Configuration Tuning for 24GB

The RTX 3090’s 24GB requires careful memory allocation depending on your model size. Here are optimised configurations for common models:

# Llama 3 8B FP16 — leaves room for large KV cache
vllm serve meta-llama/Llama-3-8B-Instruct \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.92 \
  --max-num-seqs 32 \
  --host 0.0.0.0 --port 8000

# Llama 3 13B GPTQ INT4 — fits in 24GB with context
vllm serve TheBloke/Llama-3-13B-GPTQ \
  --quantization gptq \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --max-num-seqs 16

# CodeLlama 34B INT4 — tight fit, limit context
vllm serve TheBloke/CodeLlama-34B-GPTQ \
  --quantization gptq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 8

Key parameters to tune:

- --gpu-memory-utilization controls how much VRAM vLLM reserves (0.90-0.95 works well on the 3090)
- --max-model-len caps context length, which directly determines KV cache size
- --max-num-seqs limits concurrent sequences to prevent out-of-memory errors under load
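To see why --max-model-len matters so much, you can estimate the KV cache footprint from the model's architecture. This is a back-of-envelope sketch using Llama 3 8B's published config (32 layers, 8 KV heads via grouped-query attention, head dimension 128):

```shell
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes/element
LAYERS=32; KV_HEADS=8; HEAD_DIM=128; FP16_BYTES=2
PER_TOKEN=$(( 2 * LAYERS * KV_HEADS * HEAD_DIM * FP16_BYTES ))
echo "KV cache per token: ${PER_TOKEN} bytes"        # 131072 bytes = 128 KiB

# A single sequence at the full 16384-token context:
PER_SEQ_MIB=$(( PER_TOKEN * 16384 / 1024 / 1024 ))
echo "Per full-length sequence: ${PER_SEQ_MIB} MiB"  # 2048 MiB
```

With roughly 16 GB of the reserved VRAM going to FP16 weights, only a few full-length 16K sequences fit in the remainder, which is why --max-num-seqs also needs a realistic cap.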

Throughput Benchmarks by Model

| Model          | Precision | VRAM Used | Tokens/s (Single) | Tokens/s (Batch 8) | Max Context |
|----------------|-----------|-----------|-------------------|--------------------|-------------|
| Llama 3 8B     | FP16      | 16.2 GB   | ~55               | ~210               | 16384       |
| Mistral 7B     | FP16      | 14.8 GB   | ~58               | ~225               | 16384       |
| Llama 3 13B    | FP16      | 26 GB     | OOM               | OOM                | —           |
| Llama 3 13B    | GPTQ INT4 | 8.5 GB    | ~38               | ~140               | 8192        |
| CodeLlama 34B  | GPTQ INT4 | 20 GB     | ~18               | ~52                | 4096        |
| DeepSeek R1 7B | FP16      | 14.5 GB   | ~52               | ~200               | 16384       |
| Qwen 2.5 7B    | FP16      | 15 GB     | ~54               | ~208               | 16384       |

Batch throughput is where vLLM excels. Continuous batching lets the 3090 serve 8 concurrent users at over 200 tokens per second aggregate for 7B-8B models. Compare these figures with other GPUs on the tokens-per-second benchmark tool.

Production Deployment Tips

For production vLLM on the RTX 3090, set up an nginx reverse proxy with TLS termination. Our nginx reverse proxy guide covers this in detail. Additional recommendations:

# Enable prefix caching for repeated prompts (system prompts, RAG contexts)
vllm serve meta-llama/Llama-3-8B-Instruct \
  --enable-prefix-caching \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92 \
  --host 0.0.0.0 --port 8000

Monitor GPU utilisation with nvidia-smi dmon -s u and set up alerting when VRAM usage exceeds 95%. For detailed monitoring configuration, see our GPU monitoring guide. Use the LLM cost calculator to compare your self-hosted costs against API pricing.
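A minimal cron-able alert along those lines can be built from nvidia-smi's query interface; the 95% threshold matches the recommendation above, and the alert action here (a plain echo) is a placeholder for your notifier of choice:

```shell
# Print an alert when VRAM usage crosses a percentage threshold
check_vram() {
  local used=$1 total=$2 threshold=$3
  local pct=$(( used * 100 / total ))
  if [ "$pct" -ge "$threshold" ]; then
    echo "ALERT: VRAM at ${pct}%"
  else
    echo "OK: VRAM at ${pct}%"
  fi
}

# Read current usage from the driver (MiB values, units stripped)
read -r USED TOTAL < <(nvidia-smi --query-gpu=memory.used,memory.total \
  --format=csv,noheader,nounits | head -1 | tr -d ',')
check_vram "$USED" "$TOTAL" 95
```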

Next Steps and Alternatives

If the RTX 3090 cannot fit your target model, consider the RTX 5090 with 32GB for larger models in FP16. For smaller workloads, Ollama on the RTX 3090 offers a simpler deployment path. The full vLLM production guide covers multi-model serving and advanced optimisation.

For a deeper dive into how vLLM compares with other serving frameworks, explore our guides on self-hosting LLMs and check the tutorials section for more deployment walkthroughs.

RTX 3090 Servers Ready for vLLM

24GB VRAM, full root access, pre-installed CUDA. Deploy vLLM in minutes on dedicated hardware.

Browse GPU Servers


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
