Why RTX 3090 for vLLM
The RTX 3090 remains one of the best-value GPUs for running vLLM in production. With 24GB of GDDR6X and 936 GB/s of memory bandwidth, it handles 7B-8B models in FP16, 13B models comfortably in INT4, and even 34B models in INT4 with reduced context. On a dedicated GPU server, you get full root access and no shared-tenancy overhead, which means predictable latency for every request.
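A quick back-of-envelope check makes these fit claims concrete: model weights alone take roughly parameters times bytes per parameter, before any KV cache or runtime overhead. The sketch below uses FP16 at 2 bytes/param and INT4 at roughly 0.5 bytes/param:

```python
def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    # Rough VRAM for model weights alone; ignores KV cache,
    # activations, and CUDA runtime overhead.
    return params_billion * bytes_per_param

print(weight_gb(8, 2.0))    # Llama 3 8B in FP16 -> 16.0 GB (fits in 24 GB)
print(weight_gb(13, 2.0))   # 13B in FP16       -> 26.0 GB (does not fit)
print(weight_gb(34, 0.5))   # 34B in INT4       -> 17.0 GB (fits, context is tight)
```

The remaining headroom after weights is what vLLM carves up for the KV cache, which is why quantised 13B and 34B models leave usable context on 24GB while 13B FP16 does not.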
vLLM is the go-to serving engine for production LLM inference. Its PagedAttention memory management and continuous batching make it significantly faster than naive HuggingFace serving. Combined with the RTX 3090’s generous VRAM, this pairing handles real workloads at cost-effective price points. For a broader comparison of inference engines, see our vLLM vs Ollama guide.
Installation and Environment Setup
Start with a clean Ubuntu 22.04 server with NVIDIA drivers and CUDA 12.x installed. If you need help with drivers, follow our CUDA installation guide.
# Create a Python virtual environment
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
# Install vLLM with CUDA support
pip install vllm
# Verify GPU detection
python -c "import torch; print(torch.cuda.get_device_name(0))"
# Output: NVIDIA GeForce RTX 3090
For Docker deployments, use the official vLLM image. Note that Llama 3 is a gated model, so you need to pass a Hugging Face token:
docker run --gpus all -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=<your-token>" \
  vllm/vllm-openai:latest \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.92
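Once the server is up, you can exercise the OpenAI-compatible endpoint with nothing but the standard library. A minimal sketch; the URL and model name assume the Docker command above, and the payload follows the standard chat-completions schema:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str, max_tokens: int = 128) -> dict:
    # Payload shape follows the OpenAI chat-completions schema
    # accepted by vLLM's OpenAI-compatible server.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }

payload = build_chat_request("meta-llama/Meta-Llama-3-8B-Instruct", "Hello!")
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",  # port mapped in the docker command
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req) returns the completion once the server is running.
```

Any OpenAI SDK client works the same way by pointing its base URL at the server, which is what makes vLLM a drop-in replacement for hosted APIs.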
Configuration Tuning for 24GB
The RTX 3090’s 24GB requires careful memory allocation depending on your model size. Here are optimised configurations for common models:
# Llama 3 8B FP16 — leaves room for large KV cache
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--max-model-len 16384 \
--gpu-memory-utilization 0.92 \
--max-num-seqs 32 \
--host 0.0.0.0 --port 8000
# Llama 2 13B GPTQ INT4 — fits in 24GB with context
vllm serve TheBloke/Llama-2-13B-GPTQ \
--quantization gptq \
--max-model-len 8192 \
--gpu-memory-utilization 0.90 \
--max-num-seqs 16
# CodeLlama 34B INT4 — tight fit, limit context
vllm serve TheBloke/CodeLlama-34B-GPTQ \
--quantization gptq \
--max-model-len 4096 \
--gpu-memory-utilization 0.95 \
--max-num-seqs 8
Key parameters to tune: --gpu-memory-utilization controls how much VRAM vLLM reserves (0.90-0.95 on the 3090), --max-model-len limits context length, which directly affects KV cache size, and --max-num-seqs caps concurrent requests to prevent out-of-memory errors.
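To see why --max-model-len matters so much, you can estimate the KV cache footprint directly. A sketch using Llama 3 8B's published architecture (32 layers, 8 KV heads with GQA, head dimension 128, FP16 cache):

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    # K and V each store layers * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * dtype_bytes

# Llama 3 8B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16
per_token = kv_cache_bytes_per_token(32, 8, 128)
print(per_token)                  # 131072 bytes = 128 KiB per token
print(per_token * 16384 / 2**30)  # 2.0 GiB for one full 16384-token sequence
```

At 128 KiB per token, a single maxed-out 16K sequence consumes 2 GiB of cache, which is why halving --max-model-len or lowering --max-num-seqs is the usual escape hatch when a model almost fits.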
Throughput Benchmarks by Model
| Model | Precision | VRAM Used | Tokens/s (Single) | Tokens/s (Batch 8) | Max Context |
|---|---|---|---|---|---|
| Llama 3 8B | FP16 | 16.2 GB | ~55 | ~210 | 16384 |
| Mistral 7B | FP16 | 14.8 GB | ~58 | ~225 | 16384 |
| Llama 2 13B | FP16 | 26 GB | OOM | OOM | — |
| Llama 2 13B | GPTQ INT4 | 8.5 GB | ~38 | ~140 | 8192 |
| CodeLlama 34B | GPTQ INT4 | 20 GB | ~18 | ~52 | 4096 |
| DeepSeek R1 7B | FP16 | 14.5 GB | ~52 | ~200 | 16384 |
| Qwen 2.5 7B | FP16 | 15 GB | ~54 | ~208 | 16384 |
Batch throughput is where vLLM excels. Continuous batching lets the 3090 serve 8 concurrent users at over 200 tokens per second aggregate for 7B-8B models. Compare these figures with other GPUs on the tokens-per-second benchmark tool.
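Aggregate throughput translates directly into cost per token. A sketch of the arithmetic, using the ~210 tok/s batch-8 figure above and a hypothetical $0.40/hour server rate (substitute your actual rental price):

```python
def cost_per_million_tokens(agg_tokens_per_sec: float, usd_per_hour: float) -> float:
    # Tokens generated per day at sustained aggregate throughput,
    # divided into the daily server cost.
    tokens_per_day = agg_tokens_per_sec * 86_400
    return usd_per_hour * 24 / (tokens_per_day / 1_000_000)

# ~210 tok/s aggregate (Llama 3 8B, batch 8) at a hypothetical $0.40/hr
print(round(cost_per_million_tokens(210, 0.40), 3))  # ~$0.53 per million tokens
```

That assumes the GPU stays saturated around the clock; real utilisation is lower, so treat the result as a floor, not a quote.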
Production Deployment Tips
For production vLLM on the RTX 3090, set up an nginx reverse proxy with TLS termination. Our nginx reverse proxy guide covers this in detail. Additional recommendations:
# Enable prefix caching for repeated prompts (system prompts, RAG contexts)
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
--enable-prefix-caching \
--max-model-len 8192 \
--gpu-memory-utilization 0.92 \
--host 0.0.0.0 --port 8000
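The payoff from prefix caching is easy to estimate: once the shared prefix is cached, later requests only prefill their unique suffix. A sketch with hypothetical numbers (an 800-token shared system prompt and a 200-token user question):

```python
def prefill_saved_fraction(shared_prefix_tokens: int, unique_tokens: int) -> float:
    # With prefix caching, the shared prefix is computed once and reused,
    # so cache hits skip that portion of prefill entirely.
    total = shared_prefix_tokens + unique_tokens
    return shared_prefix_tokens / total

# Hypothetical RAG workload: 800-token shared context, 200-token question
print(prefill_saved_fraction(800, 200))  # 0.8 -> 80% of prefill skipped on cache hits
```

The longer and more static your system prompt or RAG context, the bigger the win, which is why this flag matters most for chat and retrieval workloads.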
Monitor GPU utilisation with nvidia-smi dmon -s u and set up alerting when VRAM usage exceeds 95%. For detailed monitoring configuration, see our GPU monitoring guide. Use the LLM cost calculator to compare your self-hosted costs against API pricing.
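For the VRAM alert itself, a small parser over nvidia-smi's CSV query output is enough. A sketch; the sample line is an illustrative reading for a loaded 3090, not real telemetry:

```python
def vram_alert(csv_line: str, threshold: float = 0.95) -> bool:
    # Parses one line of:
    #   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
    used, total = (float(x) for x in csv_line.split(","))
    return used / total > threshold

sample = "23600, 24576"    # hypothetical reading in MiB (~96% of 24 GB used)
print(vram_alert(sample))  # True -> fire the alert
```

Wire this into cron or your monitoring agent and page when it returns True for several consecutive samples, since a single spike during prefill is normal.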
Next Steps and Alternatives
If the RTX 3090 cannot fit your target model, consider the RTX 5090 with 32GB for larger models in FP16. For smaller workloads, Ollama on the RTX 3090 offers a simpler deployment path. The full vLLM production guide covers multi-model serving and advanced optimisation.
For a deeper dive into how vLLM compares with other serving frameworks, explore our guides on self-hosting LLMs and check the tutorials section for more deployment walkthroughs.
RTX 3090 Servers Ready for vLLM
24GB VRAM, full root access, pre-installed CUDA. Deploy vLLM in minutes on dedicated hardware.
Browse GPU Servers