RAG Pipeline VRAM Components
A retrieval-augmented generation (RAG) pipeline using ChromaDB has three main memory-consuming components: the embedding model that encodes documents and queries, ChromaDB’s index (typically held in system RAM, not VRAM), and the LLM that generates answers from retrieved context. When self-hosting on a dedicated GPU server, the embedding model and the LLM compete for GPU memory.
ChromaDB itself is a vector database that runs primarily in system RAM and on disk, consuming minimal VRAM. The VRAM pressure comes from the models that feed into and consume from ChromaDB.
Embedding Model VRAM
| Embedding Model | Parameters | FP16 VRAM | INT8 VRAM |
|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | ~0.1 GB | ~0.05 GB |
| BGE-base-en-v1.5 | 110M | ~0.3 GB | ~0.15 GB |
| BGE-large-en-v1.5 | 335M | ~0.7 GB | ~0.4 GB |
| E5-large-v2 | 335M | ~0.7 GB | ~0.4 GB |
| Nomic-embed-text-v1.5 | 137M | ~0.3 GB | ~0.2 GB |
Embedding models are lightweight, typically using under 1 GB of VRAM. A common default, all-MiniLM-L6-v2, needs roughly 0.1 GB at FP16. The LLM therefore dominates your VRAM budget in a RAG pipeline.
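The table's figures can be approximated from first principles: parameter count times bytes per parameter, plus a little framework overhead. A minimal sketch (the function name is ours, not a library API):

```python
def weight_vram_gb(params_millions: float, bytes_per_param: float) -> float:
    """Weight-only VRAM estimate in GB: parameter count x bytes per parameter.

    bytes_per_param: 2.0 for FP16, 1.0 for INT8.
    Real usage is somewhat higher once activations and framework overhead
    are added, which is why the table's figures sit above these raw numbers.
    """
    return params_millions * 1e6 * bytes_per_param / 1e9

# BGE-large-en-v1.5: 335M parameters
print(f"FP16: {weight_vram_gb(335, 2.0):.2f} GB")  # ~0.67 GB, table rounds to ~0.7
print(f"INT8: {weight_vram_gb(335, 1.0):.2f} GB")  # ~0.34 GB, table rounds to ~0.4
```

The same formula works for LLM weights: an 8B model at FP16 is ~16 GB of weights before KV cache and overhead.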
Total Pipeline VRAM by LLM
| LLM | LLM VRAM (4K ctx) | Embedding VRAM | Total Pipeline |
|---|---|---|---|
| Phi-3 Mini (3.8B, FP16) | ~8 GB | ~0.3 GB | ~8.3 GB |
| LLaMA 3 8B (AWQ 4-bit) | ~7 GB | ~0.3 GB | ~7.3 GB |
| LLaMA 3 8B (FP16) | ~18 GB | ~0.3 GB | ~18.3 GB |
| Mistral 7B (AWQ 4-bit) | ~6 GB | ~0.3 GB | ~6.3 GB |
| Qwen 2.5 14B (AWQ 4-bit) | ~11 GB | ~0.3 GB | ~11.3 GB |
| DeepSeek 16B (INT8) | ~18 GB | ~0.3 GB | ~18.3 GB |
A typical RAG pipeline with a quantised 7-8B LLM and a standard embedding model uses 6-8 GB total, fitting comfortably on an 8 GB GPU. FP16 deployments of larger models require 16-24 GB cards.
GPU Recommendations
| GPU | VRAM | RAG Pipeline Capability |
|---|---|---|
| RTX 4060 | 8 GB | 7B LLM (INT4) + embeddings |
| RTX 4060 Ti | 16 GB | 7-8B LLM (FP16) + embeddings |
| RTX 3090 | 24 GB | 14B LLM (INT4) or 8B (FP16) + embeddings + reranker |
For most RAG deployments, the RTX 4060 at 8 GB provides an excellent cost-to-performance ratio with quantised 7B models. Add a reranker model (another 0.3-0.7 GB) on 16 GB+ cards for improved retrieval quality.
Context Length Impact on VRAM
RAG pipelines inject retrieved documents into the LLM context, so effective context length is critical. Longer context means more retrieved passages but also more VRAM for the KV cache. A typical RAG query might use 2-4K tokens of retrieved context plus the user query.
| Effective Context | Additional KV Cache (LLaMA 3 8B FP16) | Total Pipeline (AWQ 4-bit) |
|---|---|---|
| 2K tokens | ~1 GB | ~7.6 GB |
| 4K tokens | ~2 GB | ~8.6 GB |
| 8K tokens | ~4 GB | ~10.6 GB |
| 16K tokens | ~8 GB | ~14.6 GB |
Keep retrieved context under 4K tokens on 8 GB GPUs to leave headroom for generation. On 16 GB+ cards, you can retrieve up to 8K tokens of context. See our LLaMA 3 VRAM requirements for detailed KV cache analysis.
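The table's cache figures can be reproduced from the standard formula: 2 (K and V) × layers × KV heads × head dimension × bytes per value, per token. Note that the figures above correspond to a full multi-head-attention cache (32 KV heads); LLaMA 3 8B actually uses grouped-query attention with 8 KV heads, which cuts the cache by 4×, so treat the table as a conservative upper bound. A sketch:

```python
def kv_cache_gib(tokens: int, layers: int = 32, kv_heads: int = 32,
                 head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """FP16 KV cache size in GiB.

    Per token: 2 (K and V) x layers x kv_heads x head_dim x bytes_per_val.
    Defaults assume 32 full-attention KV heads, reproducing the table;
    LLaMA 3 8B's grouped-query attention uses kv_heads=8 (4x smaller).
    """
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_val
    return tokens * per_token_bytes / 2**30

print(f"{kv_cache_gib(4096):.1f} GiB")              # 2.0 GiB, matches the ~2 GB row
print(f"{kv_cache_gib(4096, kv_heads=8):.1f} GiB")  # 0.5 GiB with GQA
```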
Deployment Recommendations
For a production RAG stack, use ChromaDB for vector storage (runs in system RAM), a BGE or Nomic embedding model on GPU, and a quantised 7B LLM for generation. This entire stack runs on a single RTX 4060. For higher quality, use a 14B model on an RTX 3090 with a reranker.
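The query path of that stack can be sketched as follows. The ChromaDB query and LLM call are stubbed out with placeholders (function names and the prompt template are illustrative, not a fixed API); only the prompt-assembly step is concrete:

```python
# Skeleton of the RAG query path: retrieve -> build prompt -> generate.

def retrieve(query: str, n_results: int = 4) -> list[str]:
    # In a real stack: embed `query` with the GPU-hosted embedding model,
    # then query a ChromaDB collection (held in system RAM, not VRAM).
    return ["Doc chunk 1 ...", "Doc chunk 2 ..."]  # placeholder results

def build_prompt(chunks: list[str], question: str) -> str:
    # A common RAG prompt pattern: retrieved context first, question last.
    context = "\n\n".join(chunks)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

def answer(question: str) -> str:
    chunks = retrieve(question)
    prompt = build_prompt(chunks, question)
    # In a real stack: send `prompt` to the quantised LLM on the GPU.
    return prompt  # placeholder: return the assembled prompt instead

print(answer("What does the pipeline cost in VRAM?"))
```

Keeping `build_prompt` small and bounded (e.g. capping `n_results` and chunk length) is what enforces the under-4K-token context budget discussed earlier.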
Read the self-host LLM guide for full deployment instructions. Use the LLM cost calculator to estimate per-query costs. Browse all deployment guides in the model guides section.