Yes, the RTX 3090 can run CodeLlama 34B in INT4 quantisation. With 24GB GDDR6X VRAM, the RTX 3090 fits this large coding model when quantised to 4-bit precision, making it a viable option for code model hosting. FP16 and INT8 require more VRAM than a single 3090 provides, but INT4 delivers surprisingly good code generation quality.
The Short Answer
YES in INT4 quantisation with 4K-8K context. NO in FP16 or INT8.
CodeLlama 34B has 33.7 billion parameters. In FP16, the model weights need approximately 67GB of VRAM, well beyond the RTX 3090’s 24GB. In INT8, it drops to about 34GB, still too large. In INT4 (GPTQ or AWQ), the model compresses to roughly 18-19GB for weights, leaving 5-6GB for KV cache and overhead.
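These figures follow directly from parameter count times bytes per parameter. A quick sketch of the arithmetic (the ~10% INT4 overhead for group scales and zero-points is an assumed round figure, not a measured one):

```python
# Weight-only VRAM estimate: parameters x bits-per-parameter / 8.
# The 10% INT4 overhead (group scales / zero-points) is an assumption.
PARAMS = 33.7e9  # CodeLlama 34B parameter count

def weight_gb(bits_per_param: float, overhead: float = 0.0) -> float:
    return PARAMS * bits_per_param / 8 / 1e9 * (1 + overhead)

print(f"FP16: {weight_gb(16):.1f} GB")               # ~67 GB
print(f"INT8: {weight_gb(8):.1f} GB")                # ~34 GB
print(f"INT4: {weight_gb(4, overhead=0.1):.1f} GB")  # ~18.5 GB
```

KV cache and framework overhead come on top of these weight figures, which is why the totals in the table below are higher.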
For code generation specifically, INT4 quantisation preserves the model’s ability to produce syntactically correct, well-structured code. The quality degradation is more noticeable in natural language tasks than in code output, making CodeLlama 34B in INT4 a practical choice for development workflows.
VRAM Analysis
| Quantisation | Model VRAM | KV Cache (4K ctx) | Total | RTX 3090 (24GB) |
|---|---|---|---|---|
| FP16 | ~67GB | ~3.5GB | ~70.5GB | No |
| INT8 | ~34GB | ~3.5GB | ~37.5GB | No |
| INT4 (GPTQ) | ~19GB | ~3.5GB | ~22.5GB | Fits |
| INT4 (AWQ) | ~18GB | ~3.5GB | ~21.5GB | Fits |
| Q4_K_M (GGUF) | ~18.5GB | ~3.5GB | ~22GB | Fits |
With AWQ quantisation at 4K context, total VRAM usage is around 21.5GB, leaving about 2.5GB of breathing room. You can push to 8K context (CodeLlama supports up to 16K) but VRAM will be tight at around 23.5GB. For fill-in-the-middle (FIM) code completion, 4K context is typically sufficient. See our CodeLlama VRAM requirements guide for all configurations.
Performance Benchmarks
| GPU | Model | Quantisation | Tokens/sec | Context |
|---|---|---|---|---|
| RTX 3090 (24GB) | CodeLlama 34B | Q4_K_M | ~14 tok/s | 4096 |
| RTX 3090 (24GB) | CodeLlama 34B | AWQ | ~16 tok/s | 4096 |
| RTX 5090 (32GB) | CodeLlama 34B | Q4_K_M | ~28 tok/s | 8192 |
| RTX 3090 (24GB) | CodeLlama 7B | FP16 | ~45 tok/s | 16384 |
At 14-16 tok/s, CodeLlama 34B generates code at a comfortable reading pace on the RTX 3090. For code completion tasks where you typically generate 10-50 tokens, response times are under 3 seconds. Function generation and longer code blocks take 5-15 seconds depending on length. Check full benchmarks on our tokens per second benchmark page.
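Those response-time estimates are simple division over the decode throughput. A sketch assuming the ~16 tok/s AWQ figure from the table (decode only; prompt processing adds a small amount on top):

```python
# Approximate generation time = tokens requested / decode throughput.
TOK_PER_SEC = 16  # AWQ on RTX 3090, from the table above

def gen_seconds(tokens: int, tok_per_sec: float = TOK_PER_SEC) -> float:
    return tokens / tok_per_sec

for n in (10, 50, 200):
    print(f"{n:>3} tokens: ~{gen_seconds(n):.1f}s")
# 10 tokens: ~0.6s, 50 tokens: ~3.1s, 200 tokens: ~12.5s
```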
Setup Guide
Deploy CodeLlama 34B with Ollama for the simplest setup:
```shell
# Ollama: CodeLlama 34B in Q4_K_M
ollama run codellama:34b-instruct-q4_K_M

# For code completion (fill-in-the-middle) mode
ollama run codellama:34b-code-q4_K_M
```
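The `:code` variant accepts CodeLlama's infilling prompt format, where the model generates the span that belongs between a prefix and a suffix. A minimal sketch of building such a prompt (the exact token spacing follows the CodeLlama infilling convention; treat it as an assumption and verify against the model card):

```python
# Build a CodeLlama fill-in-the-middle prompt: the model completes the
# text between prefix and suffix, then stops with an <EOT> token.
def fim_prompt(prefix: str, suffix: str) -> str:
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"

print(fim_prompt("def add(a, b):\n    ", "\n    return result"))
```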
For production serving with an OpenAI-compatible API:
```shell
# vLLM with AWQ quantisation
pip install vllm

vllm serve TheBloke/CodeLlama-34B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95 \
  --host 0.0.0.0 --port 8000
```
Set `--gpu-memory-utilization 0.95` to maximise the available VRAM for this tight fit. AWQ is the recommended format for vLLM here, as it provides the best INT4 inference speed on Ampere cards like the RTX 3090 (and on newer Ada GPUs).
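Once running, the server speaks the OpenAI completions API. A sketch of a request body for it (the prompt and sampling values are illustrative, not prescriptive):

```python
import json

# JSON body to POST to http://localhost:8000/v1/completions
# (the endpoint the vLLM command above exposes).
payload = {
    "model": "TheBloke/CodeLlama-34B-Instruct-AWQ",
    "prompt": "# Write a Python function that reverses a string\n",
    "max_tokens": 128,
    "temperature": 0.2,  # low temperature -> more deterministic code
}
body = json.dumps(payload)
print(body)
```

Send the printed JSON with any HTTP client, e.g. `curl -H "Content-Type: application/json" -d "$body" http://localhost:8000/v1/completions`.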
Recommended Alternative
If you need CodeLlama 34B with more context or higher precision, the RTX 5090 with 32GB handles it in INT4 with 8K+ context at nearly double the throughput. For the smaller but still capable CodeLlama 13B, the RTX 3090 runs it in INT8 with excellent speed and longer context.
Consider also whether newer code models like DeepSeek Coder or Qwen Coder might serve your needs at smaller sizes. The RTX 3090 runs Llama 3 8B in FP16, which also handles code tasks well. For other workloads, check the Mixtral 8x7B analysis or Qwen 72B analysis. Browse all dedicated GPU servers or read the best GPU for LLM inference guide.
Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers