RAG Pipeline VRAM Components
A retrieval-augmented generation (RAG) pipeline using ChromaDB has three main memory-consuming components: the embedding model that encodes documents and queries, ChromaDB’s index (typically held in system RAM, not VRAM), and the LLM that generates answers from retrieved context. When self-hosting on a dedicated GPU server, the embedding model and the LLM compete for GPU memory.
ChromaDB itself is a vector database that runs primarily in system RAM and on disk, consuming minimal VRAM. The VRAM pressure comes from the models that feed into and consume from ChromaDB.
Embedding Model VRAM
| Embedding Model | Parameters | FP16 VRAM | INT8 VRAM |
|---|---|---|---|
| all-MiniLM-L6-v2 | 22M | ~0.1 GB | ~0.05 GB |
| BGE-base-en-v1.5 | 110M | ~0.3 GB | ~0.15 GB |
| BGE-large-en-v1.5 | 335M | ~0.7 GB | ~0.4 GB |
| E5-large-v2 | 335M | ~0.7 GB | ~0.4 GB |
| Nomic-embed-text-v1.5 | 137M | ~0.3 GB | ~0.2 GB |
Embedding models are lightweight, typically using under 1 GB of VRAM. A common default, all-MiniLM-L6-v2, needs roughly 0.1 GB at FP16. The LLM therefore dominates your VRAM budget in a RAG pipeline.
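The table's figures can be approximated from first principles: parameter count times bytes per parameter, plus a little framework overhead. A minimal sketch (the function name is ours, not a library API):

```python
def weight_vram_gb(params_millions: float, bytes_per_param: float) -> float:
    """Weight-only VRAM estimate in GB: parameter count x bytes per parameter.

    bytes_per_param: 2.0 for FP16, 1.0 for INT8.
    Real usage is somewhat higher once activations and framework overhead
    are added, which is why the table's figures sit above these raw numbers.
    """
    return params_millions * 1e6 * bytes_per_param / 1e9

# BGE-large-en-v1.5: 335M parameters
print(f"FP16: {weight_vram_gb(335, 2.0):.2f} GB")  # ~0.67 GB, table rounds to ~0.7
print(f"INT8: {weight_vram_gb(335, 1.0):.2f} GB")  # ~0.34 GB, table rounds to ~0.4
```

The same formula works for LLM weights: an 8B model at FP16 is ~16 GB of weights before KV cache and overhead.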
Total Pipeline VRAM by LLM
| LLM | LLM VRAM (4K ctx) | Embedding VRAM | Total Pipeline |
|---|---|---|---|
| Phi-3 Mini (3.8B, FP16) | ~8 GB | ~0.3 GB | ~8.3 GB |
| LLaMA 3 8B (AWQ 4-bit) | ~7 GB | ~0.3 GB | ~7.3 GB |
| LLaMA 3 8B (FP16) | ~18 GB | ~0.3 GB | ~18.3 GB |
| Mistral 7B (AWQ 4-bit) | ~6 GB | ~0.3 GB | ~6.3 GB |
| Qwen 2.5 14B (AWQ 4-bit) | ~11 GB | ~0.3 GB | ~11.3 GB |
| DeepSeek 16B (INT8) | ~18 GB | ~0.3 GB | ~18.3 GB |
A typical RAG pipeline with a quantised 7-8B LLM and a standard embedding model uses 6-8 GB total, fitting comfortably on an 8 GB GPU. FP16 deployments of larger models require 16-24 GB cards.
GPU Recommendations
| GPU | VRAM | RAG Pipeline Capability |
|---|---|---|
| RTX 4060 | 8 GB | 7B LLM (INT4) + embeddings |
| RTX 4060 Ti | 16 GB | 7-8B LLM (FP16) + embeddings |
| RTX 3090 | 24 GB | 14B LLM (INT4) or 8B (FP16) + embeddings + reranker |
For most RAG deployments, the RTX 4060 at 8 GB provides an excellent cost-to-performance ratio with quantised 7B models. Add a reranker model (another 0.3-0.7 GB) on 16 GB+ cards for improved retrieval quality.
Context Length Impact on VRAM
RAG pipelines inject retrieved documents into the LLM context, so effective context length is critical. Longer context means more retrieved passages but also more VRAM for the KV cache. A typical RAG query might use 2-4K tokens of retrieved context plus the user query.
| Effective Context | Additional KV Cache (LLaMA 3 8B FP16) | Total Pipeline (AWQ 4-bit) |
|---|---|---|
| 2K tokens | ~1 GB | ~7.6 GB |
| 4K tokens | ~2 GB | ~8.6 GB |
| 8K tokens | ~4 GB | ~10.6 GB |
| 16K tokens | ~8 GB | ~14.6 GB |
Keep retrieved context under 4K tokens on 8 GB GPUs to leave headroom for generation. On 16 GB+ cards, you can retrieve up to 8K tokens of context. See our LLaMA 3 VRAM requirements for detailed KV cache analysis.
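The table's cache figures can be reproduced from the standard formula: 2 (K and V) × layers × KV heads × head dimension × bytes per value, per token. Note that the figures above correspond to a full multi-head-attention cache (32 KV heads); LLaMA 3 8B actually uses grouped-query attention with 8 KV heads, which cuts the cache by 4×, so treat the table as a conservative upper bound. A sketch:

```python
def kv_cache_gib(tokens: int, layers: int = 32, kv_heads: int = 32,
                 head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """FP16 KV cache size in GiB.

    Per token: 2 (K and V) x layers x kv_heads x head_dim x bytes_per_val.
    Defaults assume 32 full-attention KV heads, reproducing the table;
    LLaMA 3 8B's grouped-query attention uses kv_heads=8 (4x smaller).
    """
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_val
    return tokens * per_token_bytes / 2**30

print(f"{kv_cache_gib(4096):.1f} GiB")              # 2.0 GiB, matches the ~2 GB row
print(f"{kv_cache_gib(4096, kv_heads=8):.1f} GiB")  # 0.5 GiB with GQA
```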
Deployment Recommendations
For a production RAG stack, use ChromaDB for vector storage (runs in system RAM), a BGE or Nomic embedding model on GPU, and a quantised 7B LLM for generation. This entire stack runs on a single RTX 4060. For higher quality, use a 14B model on an RTX 3090 with a reranker.
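The query path of that stack can be sketched as follows. The ChromaDB query and LLM call are stubbed out with placeholders (function names and the prompt template are illustrative, not a fixed API); only the prompt-assembly step is concrete:

```python
# Skeleton of the RAG query path: retrieve -> build prompt -> generate.

def retrieve(query: str, n_results: int = 4) -> list[str]:
    # In a real stack: embed `query` with the GPU-hosted embedding model,
    # then query a ChromaDB collection (held in system RAM, not VRAM).
    return ["Doc chunk 1 ...", "Doc chunk 2 ..."]  # placeholder results

def build_prompt(chunks: list[str], question: str) -> str:
    # A common RAG prompt pattern: retrieved context first, question last.
    context = "\n\n".join(chunks)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:")

def answer(question: str) -> str:
    chunks = retrieve(question)
    prompt = build_prompt(chunks, question)
    # In a real stack: send `prompt` to the quantised LLM on the GPU.
    return prompt  # placeholder: return the assembled prompt instead

print(answer("What does the pipeline cost in VRAM?"))
```

Keeping `build_prompt` small and bounded (e.g. capping `n_results` and chunk length) is what enforces the under-4K-token context budget discussed earlier.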
Read the self-host LLM guide for full deployment instructions. Use the LLM cost calculator to estimate per-query costs. Browse all deployment guides in the model guides section.