Multi-modal RAG (retrieval over documents containing both text and images) is a real production need in 2026: technical documentation with diagrams, financial reports with charts, marketing materials with screenshots. There are three approaches; pick one based on how image-heavy the corpus is.
The three approaches: (1) image-to-text at ingest: a VLM extracts text descriptions of each image, and only text is embedded; (2) multi-modal embeddings (CLIP, BGE-VL): text and images are embedded into one joint space, queryable with either modality; (3) VLM at query time: relevant page images are passed to a vision-language model for answer generation. In practice, (1) is the most practical choice for image-light corpora and (3) for image-heavy ones.
Approaches
- Approach 1: Image-to-text at ingest: a VLM (Qwen2-VL 7B / Pixtral) describes each image; descriptions are added to the chunks; standard text RAG from there. Simple but lossy: a text description rarely captures everything in a dense chart.
- Approach 2: Multi-modal embeddings: CLIP / BGE-VL produce joint text+image embeddings. Single vector space; query in either modality.
- Approach 3: VLM at query time: retrieve relevant pages (from text + image embeddings); pass page images directly to VLM for final answer. Highest quality.
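Approach 1 can be sketched in a few lines. This is a minimal pipeline shape, not a real implementation: `describe_image` and `embed_text` are hypothetical stand-ins for an actual VLM call (e.g. Qwen2-VL 7B) and a text embedding model.

```python
# Approach 1 sketch: describe images at ingest, then treat everything as text.
# describe_image() and embed_text() are placeholder stubs for real models.

def describe_image(image_path: str) -> str:
    """Stand-in for a VLM call that returns a text description of an image."""
    return f"[image description of {image_path}]"

def embed_text(text: str) -> list[float]:
    """Stand-in for a text embedding model; returns a dummy vector."""
    return [float(len(text))]

def ingest_chunk(text: str, image_paths: list[str]) -> dict:
    """Append VLM descriptions to the chunk text, then embed as plain text."""
    descriptions = [describe_image(p) for p in image_paths]
    full_text = "\n".join([text, *descriptions])
    return {"text": full_text, "vector": embed_text(full_text)}

chunk = ingest_chunk("Q3 revenue grew 12%.", ["charts/q3_revenue.png"])
```

After this step the corpus is pure text, so any existing text-RAG stack (chunking, embedding, reranking) works unchanged; the lossiness is confined to the quality of the descriptions.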
Models
- VLM for image-to-text: Qwen2-VL 7B (best quality), Pixtral 12B, Llama 3.2 Vision 11B
- Multi-modal embeddings: BGE-VL, CLIP variants, JinaCLIP
- Final answer VLM: Qwen2-VL 72B for premium, Pixtral 12B for cost
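What the multi-modal embedding models buy you is one retrieval index for both modalities. The sketch below shows only that retrieval logic; the vectors are toy values standing in for what a real joint encoder such as CLIP or BGE-VL would produce, and the item ids are made up.

```python
# Approach 2 sketch: text chunks and images live in one vector space, so a
# text query can retrieve images directly. Vectors here are hand-picked toys.
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Pretend index: items of either modality, already embedded by a joint encoder.
index = [
    {"id": "diagram_arch.png", "vector": [0.9, 0.1]},
    {"id": "pricing_table.md", "vector": [0.1, 0.9]},
]

def search(query_vector: list[float], index: list[dict], top_k: int = 1) -> list[str]:
    """Rank items by cosine similarity to the query vector."""
    ranked = sorted(index, key=lambda item: -cosine(query_vector, item["vector"]))
    return [item["id"] for item in ranked[:top_k]]

# A text query whose embedding lands near the diagram retrieves the image.
hits = search([0.8, 0.2], index)
```

The design point: there is no translation step at query time; whichever modality the query arrives in, it is embedded once and compared against everything.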
Setup
For Approach 3 (VLM at query):
- Render PDF pages to images at ingest; store
- OCR + text extraction; standard text embeddings to Qdrant
- At query time: retrieve relevant pages by text similarity
- Pass page images + text to VLM for final answer
- VRAM: 4090 / 5090 + Qwen2-VL 7B for SMB; 6000 Pro + 72B for premium
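The query-time steps above can be sketched end to end. Everything here is illustrative: `retrieve_pages` uses toy term-overlap ranking where production would use the Qdrant text index, the message shape follows the common OpenAI-style multimodal chat format, and the final VLM call (to Qwen2-VL or Pixtral) is omitted.

```python
# Approach 3 sketch: retrieve relevant pages by text similarity, then hand the
# stored page images to a VLM for the final answer.
import base64

def retrieve_pages(query: str, page_index: list[dict], top_k: int = 2) -> list[dict]:
    """Toy lexical retrieval: rank pages by query-term overlap with OCR text."""
    terms = set(query.lower().split())
    def score(page: dict) -> int:
        return len(terms & set(page["ocr_text"].lower().split()))
    return sorted(page_index, key=score, reverse=True)[:top_k]

def build_vlm_messages(query: str, pages: list[dict]) -> list[dict]:
    """Assemble an OpenAI-style multimodal message: the question plus page images."""
    content = [{"type": "text", "text": query}]
    for page in pages:
        b64 = base64.b64encode(page["image_bytes"]).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}})
    return [{"role": "user", "content": content}]

# Pretend page store: rendered page images plus their OCR text, from ingest.
pages = [
    {"page": 1, "ocr_text": "revenue growth chart q3", "image_bytes": b"\x89PNG..."},
    {"page": 2, "ocr_text": "board meeting minutes", "image_bytes": b"\x89PNG..."},
]
msgs = build_vlm_messages("What does the Q3 revenue chart show?",
                          retrieve_pages("q3 revenue chart", pages, top_k=1))
```

Because retrieval stays text-only, the only multi-modal component is the answering VLM, which is what keeps this approach operationally simple despite being the highest-quality option.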
Verdict
For image-heavy documents (charts, diagrams, screenshots), pass page images directly to a VLM at query time — Approach 3. For mostly-text documents with occasional images, Approach 1 (description at ingest) is simpler and cheaper. Don't default to multi-modal RAG without measuring — many corpora benefit more from better text RAG than from image handling.
Bottom line
Approach 3 for image-heavy; Approach 1 otherwise. See Qwen2-VL.