Llama 3.2 Vision adds image input to the Llama 3 architecture. The 11B variant is the one that fits in 16 GB of VRAM. Numbers below were measured on the RTX 5060 Ti 16GB on our hosting:
Setup
- Model: meta-llama/Llama-3.2-11B-Vision-Instruct
- vLLM 0.6.4 with --trust-remote-code and vision enabled
- Input: 1024×1024 image + text query
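A launch for this setup might look like the following sketch. The quantization flag, context length, and image limit are assumptions for illustration, not the exact command used for these measurements; adjust them for your deployment:

```shell
# Serve the vision model on the 16 GB card. FP8 weights plus a 4k context
# keep total VRAM in budget (see the VRAM table below).
vllm serve meta-llama/Llama-3.2-11B-Vision-Instruct \
  --trust-remote-code \
  --quantization fp8 \
  --max-model-len 4096 \
  --limit-mm-per-prompt image=1
```

The `--limit-mm-per-prompt` value caps images per request; raise it if you batch multiple images (see below).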
VRAM
| Precision | Weights | Total with KV |
|---|---|---|
| FP16 | 22 GB | Does not fit |
| FP8 | 11 GB | ~13 GB at 4k context |
| AWQ INT4 | 7.2 GB | ~9 GB |
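The weight figures in the table follow from simple bytes-per-parameter arithmetic. A sketch, with the INT4 factor of ~0.65 bytes/param an assumption covering 4-bit weights plus scales and layers kept in higher precision:

```python
# Back-of-envelope weight memory for an 11B-parameter model.
PARAMS = 11e9

def weight_gb(bytes_per_param: float) -> float:
    """Weight memory in GB (decimal) at a given bytes-per-parameter."""
    return PARAMS * bytes_per_param / 1e9

print(f"FP16: {weight_gb(2.0):.1f} GB")   # ~22 GB: over 16 GB before KV cache
print(f"FP8:  {weight_gb(1.0):.1f} GB")   # ~11 GB
print(f"INT4: {weight_gb(0.65):.1f} GB")  # ~7.2 GB, close to the table's AWQ figure
```

KV cache and activations come on top of these weights, which is the gap between the two columns above.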
Image-Q&A Latency
| Precision | Image encode | Prefill (text) | Decode (t/s) |
|---|---|---|---|
| FP8 | 280 ms | 160 ms | 72 |
| AWQ INT4 | 290 ms | 190 ms | 88 |
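To turn the table into an end-to-end number, a rough model is: time to first token ≈ image encode + text prefill, then steady decode at the measured rate. A sketch using the FP8 and INT4 rows above, with a 200-token answer as an assumed workload:

```python
def turn_ms(encode_ms: float, prefill_ms: float,
            decode_tps: float, out_tokens: int) -> float:
    """Approximate one image-Q&A turn: TTFT plus decode time, in ms."""
    return encode_ms + prefill_ms + out_tokens / decode_tps * 1000

fp8_ms = turn_ms(280, 160, 72, 200)   # ~3.2 s for a 200-token answer
int4_ms = turn_ms(290, 190, 88, 200)  # ~2.8 s: faster decode outweighs slower prefill
```

This ignores scheduling overhead, so treat it as a lower bound per turn.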
Typical “describe this image” latency: ~300 ms to first token, decode at 70+ t/s. Acceptable for interactive VLM applications.
Batch Images
Processing multiple images in one request:
- 2 images: 550 ms encode time, similar decode
- 4 images: 1,100 ms encode time; at this point image encoding dominates total prefill cost for short text prompts
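The measurements above suggest encode time scales roughly linearly with image count, around 275 ms per image (an inferred per-image figure, not a separate measurement):

```python
def encode_ms(n_images: int, per_image_ms: float = 275.0) -> float:
    """Estimated image-encode time: roughly linear in image count."""
    return n_images * per_image_ms

for n in (1, 2, 4):
    print(n, encode_ms(n))  # 275, 550, 1100 ms: close to the measured values
```

Batching images therefore saves per-request overhead but not vision-encoder compute.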
Verdict
Llama 3.2 11B Vision FP8 is the default multimodal LLM for this card. Qwen 2.5-VL 7B is a faster alternative with similar quality – see Qwen-VL benchmark.
Llama Vision on Blackwell 16GB
11B multimodal, 72 t/s decode at FP8. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: Qwen-VL benchmark, PaddleOCR, document Q&A, computer vision, multimodal workloads.