Yes, the RTX 3090 can run CodeLlama 34B in INT4 quantisation. With 24GB GDDR6X VRAM, the RTX 3090 fits this large coding model when quantised to 4-bit precision, making it a viable option for code model hosting. FP16 and INT8 require more VRAM than a single 3090 provides, but INT4 delivers surprisingly good code generation quality.
The Short Answer
YES in INT4 quantisation with 4K-8K context. NO in FP16 or INT8.
CodeLlama 34B has 33.7 billion parameters. In FP16, the model weights need approximately 67GB of VRAM, well beyond the RTX 3090’s 24GB. In INT8, it drops to about 34GB, still too large. In INT4 (GPTQ or AWQ), the model compresses to roughly 18-19GB for weights, leaving 5-6GB for KV cache and overhead.
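These figures follow directly from parameter count times bytes per parameter. A quick sketch of the arithmetic (the ~10% INT4 overhead for group scales and zero-points is an assumed round figure, not a measured one):

```python
# Weight-only VRAM estimate: parameters x bits-per-parameter / 8.
# The 10% INT4 overhead (group scales / zero-points) is an assumption.
PARAMS = 33.7e9  # CodeLlama 34B parameter count

def weight_gb(bits_per_param: float, overhead: float = 0.0) -> float:
    return PARAMS * bits_per_param / 8 / 1e9 * (1 + overhead)

print(f"FP16: {weight_gb(16):.1f} GB")               # ~67 GB
print(f"INT8: {weight_gb(8):.1f} GB")                # ~34 GB
print(f"INT4: {weight_gb(4, overhead=0.1):.1f} GB")  # ~18.5 GB
```

KV cache and framework overhead come on top of these weight figures, which is why the totals in the table below are higher.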
For code generation specifically, INT4 quantisation preserves the model’s ability to produce syntactically correct, well-structured code. The quality degradation is more noticeable in natural language tasks than in code output, making CodeLlama 34B in INT4 a practical choice for development workflows.
VRAM Analysis
| Quantisation | Model VRAM | KV Cache (4K ctx) | Total | RTX 3090 (24GB) |
|---|---|---|---|---|
| FP16 | ~67GB | ~3.5GB | ~70.5GB | No |
| INT8 | ~34GB | ~3.5GB | ~37.5GB | No |
| INT4 (GPTQ) | ~19GB | ~3.5GB | ~22.5GB | Fits |
| INT4 (AWQ) | ~18GB | ~3.5GB | ~21.5GB | Fits |
| Q4_K_M (GGUF) | ~18.5GB | ~3.5GB | ~22GB | Fits |
With AWQ quantisation at 4K context, total VRAM usage is around 21.5GB, leaving about 2.5GB of breathing room. You can push to 8K context (CodeLlama supports up to 16K) but VRAM will be tight at around 23.5GB. For fill-in-the-middle (FIM) code completion, 4K context is typically sufficient. See our CodeLlama VRAM requirements guide for all configurations.
Performance Benchmarks
| GPU | Model | Quantisation | Tokens/sec | Context |
|---|---|---|---|---|
| RTX 3090 (24GB) | CodeLlama 34B | Q4_K_M | ~14 tok/s | 4096 |
| RTX 3090 (24GB) | CodeLlama 34B | AWQ | ~16 tok/s | 4096 |
| RTX 5090 (32GB) | CodeLlama 34B | Q4_K_M | ~28 tok/s | 8192 |
| RTX 3090 (24GB) | CodeLlama 7B | FP16 | ~45 tok/s | 16384 |
At 14-16 tok/s, CodeLlama 34B generates code at a comfortable reading pace on the RTX 3090. For code completion tasks where you typically generate 10-50 tokens, response times are under 3 seconds. Function generation and longer code blocks take 5-15 seconds depending on length. Check full benchmarks on our tokens per second benchmark page.
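Those response-time estimates are simple division over the decode throughput. A sketch assuming the ~16 tok/s AWQ figure from the table (decode only; prompt processing adds a small amount on top):

```python
# Approximate generation time = tokens requested / decode throughput.
TOK_PER_SEC = 16  # AWQ on RTX 3090, from the table above

def gen_seconds(tokens: int, tok_per_sec: float = TOK_PER_SEC) -> float:
    return tokens / tok_per_sec

for n in (10, 50, 200):
    print(f"{n:>3} tokens: ~{gen_seconds(n):.1f}s")
# 10 tokens: ~0.6s, 50 tokens: ~3.1s, 200 tokens: ~12.5s
```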
Setup Guide
Deploy CodeLlama 34B with Ollama for the simplest setup:
```shell
# Ollama: CodeLlama 34B in Q4_K_M
ollama run codellama:34b-instruct-q4_K_M

# For code completion (fill-in-the-middle) mode
ollama run codellama:34b-code-q4_K_M
```
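The `:code` variant accepts CodeLlama's infilling prompt format, where the model generates the span that belongs between a prefix and a suffix. A minimal sketch of building such a prompt (the exact token spacing follows the CodeLlama infilling convention; treat it as an assumption and verify against the model card):

```python
# Build a CodeLlama fill-in-the-middle prompt: the model completes the
# text between prefix and suffix, then stops with an <EOT> token.
def fim_prompt(prefix: str, suffix: str) -> str:
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

print(fim_prompt("def add(a, b):\n    ", "\n    return result"))
```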
For production serving with an OpenAI-compatible API:
```shell
# vLLM with AWQ quantisation
pip install vllm

vllm serve TheBloke/CodeLlama-34B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 --port 8000
```
Set `--gpu-memory-utilization 0.95` to maximise the available VRAM for this tight fit. AWQ is the recommended format for vLLM here, as it provides the best INT4 inference speed on Ampere cards like the RTX 3090 (and on newer Ada GPUs).
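Once running, the server speaks the OpenAI completions API. A sketch of a request body for it (the prompt and sampling values are illustrative, not prescriptive):

```python
import json

# JSON body to POST to http://localhost:8000/v1/completions
# (the endpoint the vLLM command above exposes).
payload = {
    "model": "TheBloke/CodeLlama-34B-Instruct-AWQ",
    "prompt": "# Write a Python function that reverses a string\n",
    "max_tokens": 128,
    "temperature": 0.2,  # low temperature -> more deterministic code
}
body = json.dumps(payload)
print(body)
```

Send the printed JSON with any HTTP client, e.g. `curl -H "Content-Type: application/json" -d "$body" http://localhost:8000/v1/completions`.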
Recommended Alternative
If you need CodeLlama 34B with more context or higher precision, the RTX 5090 with 32GB handles it in INT4 with 8K+ context at nearly double the throughput. For the smaller but still capable CodeLlama 13B, the RTX 3090 runs it in INT8 with excellent speed and longer context.
Consider also whether newer code models like DeepSeek Coder or Qwen Coder might serve your needs at smaller sizes. The RTX 3090 runs Llama 3 8B in FP16, which also handles code tasks well. For other workloads, check the Mixtral 8x7B analysis or Qwen 72B analysis. Browse all dedicated GPU servers or read the best GPU for LLM inference guide.
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers