Llama 3.2 Vision 11B is an instruction-tuned vision-language model that accepts images plus text and produces text. On our dedicated GPU hosting it sits in a sweet spot: good enough for document Q&A, captioning, and visual reasoning, yet small enough to fit on a single mid-tier card.
## VRAM
| Precision | Weights | Notes |
|---|---|---|
| FP16 | ~22 GB | Fits a 24 GB+ card |
| FP8 | ~11 GB | Fits a 16 GB card |
| AWQ INT4 | ~7 GB | Fits any 8 GB+ card |
Vision models have a second memory consumer: image features computed during inference. Budget an extra 1-2 GB beyond the weight VRAM for image processing.
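The figures above follow a simple rule of thumb: weight VRAM is roughly parameter count times bytes per parameter, plus the image-feature budget. A minimal sketch (the helper name and the flat 2 GB default are our assumptions, not vendor numbers):

```python
def estimate_vram_gb(params_b: float, bytes_per_param: float,
                     image_overhead_gb: float = 2.0) -> float:
    """Rough VRAM estimate: weights plus image-feature overhead.

    params_b: parameter count in billions (1 B params at 1 byte/param ~ 1 GB).
    """
    weights_gb = params_b * bytes_per_param
    return weights_gb + image_overhead_gb

# 11B parameters at FP16 (2 bytes/param) plus the ~2 GB image budget
print(round(estimate_vram_gb(11, 2.0), 1))  # prints 24.0
```

This is why FP16 is tight on a 24 GB card once KV cache and batching are added on top.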
## GPU Options
- RTX 4060 Ti 16GB: FP8 fits with room for small batches
- RTX 3090 24GB: FP16 comfortable
- RTX 5090 32GB: FP16 with high concurrency
## Deployment
vLLM supports Llama 3.2 Vision via the OpenAI-compatible multimodal API:
```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-11B-Vision-Instruct \
  --max-model-len 8192 \
  --max-num-seqs 8 \
  --limit-mm-per-prompt 'image=1'
```
Client usage (vLLM serves the model under its Hugging Face ID unless you pass `--served-model-name`):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
        ]
    }]
)
print(response.choices[0].message.content)
```
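The `image_url` field carries the image inline as a base64 data URL. A small stdlib-only helper for building one from a local file (the function name `to_data_url` is ours):

```python
import base64
import mimetypes

def to_data_url(path: str) -> str:
    """Read an image file and encode it as a base64 data URL."""
    mime = mimetypes.guess_type(path)[0] or "image/png"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{b64}"
```

Pass the result straight into the `"url"` field of the `image_url` content part.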
## Use Cases
Llama 3.2 Vision 11B performs well on:
- Document Q&A on scanned pages
- Image captioning
- Visual reasoning about charts, diagrams, screenshots
- UI automation support (screenshot-to-action)
It is weaker at fine-grained OCR and detailed counting; for OCR-heavy workloads, pair it with PaddleOCR as a preprocessor.
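One way to wire that pairing is to prepend the OCR transcript to the question, so the model reasons over exact strings instead of re-reading pixels. A hedged sketch assuming the OCR lines have already been extracted (e.g. by PaddleOCR); the helper name `build_ocr_prompt` is hypothetical:

```python
def build_ocr_prompt(question: str, ocr_lines: list[str]) -> str:
    """Prepend OCR-extracted text to the user question so the VLM
    can answer from exact strings rather than fine-grained pixels."""
    transcript = "\n".join(ocr_lines)
    return (
        "OCR transcript of the attached image:\n"
        f"{transcript}\n\n"
        f"Question: {question}"
    )

prompt = build_ocr_prompt(
    "What is the invoice total?",
    ["Invoice #1042", "Total: £1,250.00"],
)
```

Send `prompt` as the `"text"` content part alongside the image, keeping the image attached so layout cues still reach the model.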
## Self-Hosted Vision-Language Model
Llama 3.2 Vision 11B preconfigured on UK dedicated GPUs.
Browse GPU Servers

For alternative VLM options, see Pixtral 12B, Qwen2-VL, and Molmo 7B.