Idefics3 from Hugging Face is an 8B vision-language model built on Llama 3, with strong document understanding and multi-image reasoning. On our dedicated GPU hosting it fits on a 16 GB card at FP16, though a 24 GB card gives comfortable headroom.
VRAM
| Precision | Weights | Fits On |
|---|---|---|
| FP16 | ~16 GB | 16 GB card tight, 24 GB comfortable |
| FP8 | ~8 GB | 16 GB card with room |
| INT4 | ~5 GB | Any 8 GB+ card |
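These weight figures are roughly parameter count × bytes per parameter; a quick sanity-check sketch (the ~8.5B parameter count is approximate, and real usage adds activation and KV-cache overhead on top):

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 2**30 bytes)."""
    return n_params * bytes_per_param / 2**30

PARAMS = 8.5e9  # Idefics3-8B; exact count varies with the vision tower

for name, bpp in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: ~{weight_memory_gb(PARAMS, bpp):.1f} GB")
```

FP16 lands at roughly 15.8 GB of weights alone, which is why a 16 GB card is a tight fit once activations are counted.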
Deployment
Idefics3 works with Transformers' pipeline rather than vLLM's default multimodal path (as of 2026, vLLM support is experimental). For production, use Transformers:
```python
from transformers import AutoProcessor, AutoModelForVision2Seq
import torch

# The processor handles image preprocessing and chat templating
processor = AutoProcessor.from_pretrained("HuggingFaceM4/Idefics3-8B-Llama3")

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/Idefics3-8B-Llama3",
    torch_dtype=torch.bfloat16,  # half-precision weights, ~16 GB
    device_map="cuda",
)
```
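For inference, the processor expects chat-style messages with one image placeholder per input image. A minimal sketch; `build_messages` is an illustrative helper (only `apply_chat_template` and `generate` are real Transformers APIs, and the commented lines require the model loaded above on a GPU):

```python
from typing import Dict, List

def build_messages(question: str, n_images: int) -> List[Dict]:
    """Build an Idefics3-style chat turn: one {"type": "image"} slot per page."""
    content = [{"type": "image"} for _ in range(n_images)] + [
        {"type": "text", "text": question}
    ]
    return [{"role": "user", "content": content}]

messages = build_messages("What is the invoice total?", n_images=2)

# With model and processor loaded (requires a GPU):
# prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
# inputs = processor(text=prompt, images=page_images, return_tensors="pt").to("cuda")
# out = model.generate(**inputs, max_new_tokens=256)
# print(processor.batch_decode(out, skip_special_tokens=True)[0])
```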
Wrap it in a FastAPI server to expose a custom HTTP endpoint; see the OpenAI-compatible API guide for the wrapping pattern.
Documents
Idefics3 is particularly strong at:
- Reading scanned invoices, receipts, forms
- Tables and charts with mixed text and numbers
- Multi-page document Q&A with image inputs per page
- Hand-drawn diagrams
For OCR-heavy pipelines where you need raw text extraction, pair with PaddleOCR as a preprocessor and use Idefics3 for semantic understanding of the extracted text and layout.
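One way to wire that up: run PaddleOCR first, then feed both the page image and the extracted text to Idefics3 so the model can cross-check the OCR against the layout. A sketch of the prompt-assembly step; `compose_prompt` is a hypothetical helper, and the PaddleOCR calls are shown as comments:

```python
def compose_prompt(ocr_lines: list, question: str) -> str:
    """Inline raw OCR output so the VLM can verify it against the image."""
    ocr_block = "\n".join(ocr_lines)
    return (
        "Extracted OCR text (may contain errors):\n"
        f"{ocr_block}\n\n"
        f"Using the page image and the text above, answer: {question}"
    )

# With PaddleOCR as the preprocessor (pip install paddleocr):
# from paddleocr import PaddleOCR
# ocr = PaddleOCR(lang="en")
# result = ocr.ocr("invoice.png")
# lines = [entry[1][0] for entry in result[0]]  # recognized text per box

prompt = compose_prompt(["Invoice #1042", "Total: £318.40"], "What is the total?")
```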
Document AI Hosting
Idefics3 preconfigured for document Q&A on UK dedicated GPU servers.
Browse GPU Servers.
Compare against Llama 3.2 Vision and Pixtral 12B.