## VRAM Check: Does LLaMA 3 8B Fit?
The NVIDIA RTX 3090 has 24 GB of GDDR6X VRAM, which is more than enough for LLaMA 3 8B at any precision level. Here is what to expect on a dedicated GPU server:
| Precision | Model VRAM | KV Cache (8K ctx, batch 8) | Total | Fits RTX 3090? |
|---|---|---|---|---|
| FP16 | 16.1 GB | ~4 GB | ~20 GB | Yes (4 GB spare) |
| AWQ 4-bit | 6.5 GB | ~4 GB | ~10.5 GB | Yes (13.5 GB spare) |
| GGUF Q4_K_M | 5.3 GB | ~3 GB | ~8.3 GB | Yes (15.7 GB spare) |
At FP16, you get full model quality with room for concurrent requests. At 4-bit quantisation, you free up enough VRAM to run a second model (such as Faster-Whisper) on the same GPU. For full VRAM sizing, see our LLaMA 3 VRAM requirements guide.
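The KV-cache figures in the table can be sanity-checked from the model architecture. A rough sketch, using values from the public LLaMA 3 8B config (32 layers, 8 grouped-query KV heads, head dimension 128); note that vLLM's paged allocator means real usage depends on how full each sequence's context actually is:

```shell
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes per element
LAYERS=32; KV_HEADS=8; HEAD_DIM=128; BYTES=2   # FP16 = 2 bytes per element
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
echo "Per token: $PER_TOKEN bytes"                                # 131072 bytes (128 KiB)
echo "One full 8K context: $((PER_TOKEN * 8192 / 1024 / 1024)) MiB"  # 1024 MiB
```

The worst case at batch 8 is eight full 8K contexts (~8 GiB FP16); in practice concurrent requests rarely all sit at the context limit, which is presumably why the table budgets a smaller average.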
## Setup with vLLM
vLLM provides the highest throughput for production serving with continuous batching and PagedAttention.
```bash
# Install vLLM
pip install vllm

# Launch an OpenAI-compatible API server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --dtype float16 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --port 8000

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Explain GPU memory hierarchy."}],
    "max_tokens": 512
  }'
```
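The endpoint returns the standard OpenAI chat-completion schema, so `jq` can pull out just the generated text. A sketch against a truncated sample response (the `content` string here is invented for illustration; pipe the real `curl` output the same way):

```shell
# Truncated sample of the JSON the server returns (OpenAI chat-completion schema)
SAMPLE='{"choices":[{"index":0,"message":{"role":"assistant","content":"Registers, shared memory, L2, then HBM."},"finish_reason":"stop"}]}'
# Extract just the assistant's reply
echo "$SAMPLE" | jq -r '.choices[0].message.content'
```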
For a full comparison of serving frameworks, read our vLLM vs Ollama guide.
## Setup with Ollama
Ollama is the fastest path to a running model, ideal for development and testing.
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run LLaMA 3 8B
ollama run llama3:8b-instruct

# Or serve as an API
ollama serve &
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3:8b-instruct", "prompt": "Hello, world!"}'
```
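By default `/api/generate` streams newline-delimited JSON, one fragment per line (add `"stream": false` to the request body if you'd rather receive a single object). A sketch of reassembling the stream with `jq`; the three-line sample below is invented:

```shell
# Invented sample of Ollama's streaming output: one JSON object per line
STREAM='{"response":"Hello","done":false}
{"response":" there","done":false}
{"response":"!","done":true}'
# jq -j prints each fragment with no trailing newline, reassembling the reply
echo "$STREAM" | jq -j '.response'
```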
## RTX 3090 Benchmark Results
Benchmarked with vLLM using a 512-token input prompt and 256-token generation. See the tokens-per-second benchmark tool for current data.
| Configuration | Prompt tok/s | Gen tok/s | Batch 1 Latency (TTFT) | Concurrent Users |
|---|---|---|---|---|
| FP16, batch 1 | 2,410 | 92 | 212 ms | 1 |
| FP16, batch 8 | 8,200 | 68 per user | 340 ms | 8 |
| AWQ 4-bit, batch 1 | 3,680 | 138 | 139 ms | 1 |
| AWQ 4-bit, batch 8 | 12,400 | 102 per user | 225 ms | 8 |
At 4-bit quantisation, the RTX 3090 delivers 138 tokens/second for a single user, which is fast enough for real-time chat applications. With batching, it can serve 8 concurrent users at over 100 tok/s each.
## Optimisation Tips
- Use AWQ 4-bit for production serving. Quality loss is minimal (less than 2 points on MMLU) and throughput increases 50%.
- Enable continuous batching in vLLM (default) to maximise GPU utilisation under concurrent load.
- Set `--gpu-memory-utilization 0.90` to give vLLM headroom for KV cache without OOM errors.
- Use speculative decoding with a smaller draft model for additional speedups on long generations.
- Monitor with `nvidia-smi` to track VRAM usage and GPU utilisation in real time.
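For the monitoring tip, here is a one-liner that logs VRAM and utilisation once a second using standard `nvidia-smi` query flags (it needs an NVIDIA driver, so run it on the GPU server itself):

```shell
# Print timestamp, VRAM used/total and GPU utilisation every second, in CSV
nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu \
           --format=csv -l 1
```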
For cost estimation, use our cost-per-million-tokens calculator. Browse more deployment guides in the model guides section.
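The calculator does this for you, but the arithmetic is easy to sketch. All inputs below are assumptions: the £0.50/hour server price is invented for illustration, and 816 tok/s is the batch-8 AWQ aggregate (102 tok/s × 8 users) from the benchmark table:

```shell
TOK_S=816                  # aggregate generation throughput (102 tok/s * 8 users)
PRICE_PENCE_HR=50          # hypothetical server cost: £0.50/hour, in pence
TOK_HR=$((TOK_S * 3600))   # tokens generated per hour at full load
# cost per million tokens = hourly price / millions of tokens per hour
echo "$((PRICE_PENCE_HR * 1000000 / TOK_HR)) pence per million output tokens"
```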
## Next Steps
The RTX 3090 is an excellent match for LLaMA 3 8B. If you need more quality, consider upgrading to LLaMA 3 70B on a multi-GPU setup. To compare against other models at this tier, see our LLaMA 3 vs DeepSeek comparison. For the full self-hosting walkthrough, read our self-host LLM guide.
## Deploy This Model Now
Get an RTX 3090 dedicated server pre-configured for LLM inference. Full root access and UK data centre hosting.
Browse GPU Servers