Llama 3.2 11B Vision Instruct is Meta’s first natively multimodal Llama: a 9B Llama text decoder married to a roughly 2B vision tower through a cross-attention adapter design rather than the more common token-concatenation route. On a single RTX 4090 24GB dedicated host from Gigagpu UK hosting, the FP8 build occupies roughly 13.5 GB total, encodes a 1120 x 1120 image in 220 ms, and decodes at ~115 tokens per second at batch 1. That leaves enough VRAM to run multi-image conversations with a 32k-token KV cache, batch up to four parallel OCR jobs, or pin a Whisper sidecar in the same process. This guide walks through the architecture, the latency breakdown by image resolution, OCR quality on UK documents, and the operational gotchas that bite people deploying it for the first time.
Contents
- Why Llama Vision 11B on a 4090
- Architecture and VRAM map
- Image encode latency
- Decode throughput and concurrency
- OCR and reasoning quality
- Deployment configuration
- Production gotchas
- Cost per million pages
- Verdict and alternatives
Why Llama Vision 11B on a 4090
What Meta built and why the architecture matters
Llama 3.2 11B Vision uses a cross-attention adapter instead of the now-standard token-concatenation approach. Where Qwen 2.5-VL prepends image tokens directly into the text stream (so a high-res image with 4 tiles consumes ~6,400 tokens of your KV budget), Llama 3.2 Vision injects image features through dedicated cross-attention layers that sit between the standard self-attention and feedforward blocks. The text decoder is structurally a 9B Llama variant (32 layers, 8 KV heads under GQA, 128k native context). The vision tower is a ~2B ViT with cross-attention projection heads. The Llama 3.2 Community License covers commercial use under standard restrictions.
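To make the KV-budget difference concrete, here is a back-of-the-envelope sketch. It treats both designs as producing the same per-image token count (the ~6,400-token 4-tile figure from the encode table further down) and is illustrative rather than a model of either library's internals.

# Illustrative only: how much of a 32k context is left for text when
# four 4-tile images are attached, under each design.
CONTEXT = 32_768
TOKENS_PER_4_TILE_IMAGE = 6_404   # from the encode table below

def text_budget_token_concat(n_images):
    # Token-concatenation VLMs spend KV-cache slots on image tokens.
    return CONTEXT - n_images * TOKENS_PER_4_TILE_IMAGE

def text_budget_cross_attention(n_images):
    # Cross-attention keeps image features outside the self-attention cache.
    return CONTEXT

print(text_budget_token_concat(4))     # 7,152 tokens left for text
print(text_budget_cross_attention(4))  # still 32,768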
Why the 4090 is the right card
The cross-attention design has a major operational consequence on the RTX 4090's 24 GB pool: image features do not pollute the autoregressive KV cache. A 32k-token text window stays a 32k-token text window even when you attach four images. The cost is that the cross-attention layers themselves are extra weights (~2.5 GB FP8) that you cannot quantise as aggressively as the text decoder. Ada's native FP8 tensor cores at ~660 TFLOPS handle both the ViT prefill and the cross-attention projections without bandwidth contention; the GDDR6X 1,008 GB/s bus is the dominant decode bottleneck. The 24 GB pool is what makes 4-image batching at 32k context viable; the 16 GB cards (5080, 5060 Ti) cannot do this.
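The KV-cache rows in the table below follow directly from that decoder shape. A quick sanity check, assuming the usual Llama head dimension of 128 (not quoted in this guide):

# KV cache size = 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes
layers, kv_heads, head_dim = 32, 8, 128   # head_dim assumed, not stated above

def kv_cache_gib(seq_len, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

print(kv_cache_gib(32_768, 2))   # FP16 at 32k -> 4.0 GiB
print(kv_cache_gib(32_768, 1))   # FP8  at 32k -> 2.0 GiB
print(kv_cache_gib(8_192, 1))    # FP8  at 8k  -> 0.5 GiB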
Architecture and VRAM map
Format-by-format footprint
| Component | FP16 | FP8 | AWQ INT4 (text only, vision FP8) |
|---|---|---|---|
| Text decoder (9B) | 18.0 GB | 9.0 GB | 5.6 GB |
| Vision tower + cross-attn adapter (~2B) | 4.0 GB | 2.5 GB | 2.5 GB (kept FP8) |
| KV @ 8k | 1.0 GB | 0.5 GB | 0.5 GB |
| KV @ 32k | 4.0 GB | 2.0 GB | 2.0 GB |
| Activations + scratch | 1.5 GB | 1.5 GB | 1.5 GB |
| Image features (1 x 1120 px) | 0.4 GB | 0.2 GB | 0.2 GB |
| Peak VRAM @ 8k, 1 image | ~25 GB (overflows) | ~13.7 GB | ~10.3 GB |
| Peak VRAM @ 32k, 4 images | does not fit | ~17.0 GB | ~13.6 GB |
Plain FP16 will not fit on a 4090 once even one image is folded in. FP8 is the sensible default, and AWQ-quantising only the decoder buys roughly 3.5 GB of additional headroom for batching or longer contexts. See our FP8 tensor core notes for why the FP8 path is cheap on Ada.
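As a sanity check, the single-image FP8 peak is simply the component rows summed. The helper below reproduces the quoted figures rather than predicting anything new; the numbers are lifted straight from the table.

# Sum the FP8 rows from the table to reproduce the quoted peak and see
# what headroom is left on the 24 GB card.
FP8 = {"decoder": 9.0, "vision": 2.5, "activations": 1.5}
KV = {"8k": 0.5, "32k": 2.0}
PER_IMAGE = 0.2   # 1120 px single-tile image features

def peak_gib(context, n_images):
    return sum(FP8.values()) + KV[context] + n_images * PER_IMAGE

print(peak_gib("8k", 1))          # 13.7 GB, matches the table
print(24.0 - peak_gib("32k", 4))  # ~8.2 GB spare by component sum
                                  # (the table's measured peak is ~17 GB)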
Image encode latency
Encode by resolution
| Image size | FP16 encode | FP8 encode | Tokens emitted | Notes |
|---|---|---|---|---|
| 560 x 560 (single tile) | 140 ms | 120 ms | 1,601 | Avatars, thumbnails |
| 1120 x 1120 (single tile) | 270 ms | 220 ms | 1,601 | Default OCR resolution |
| 1120 x 1120 (4 tiles) | 720 ms | 560 ms | 6,404 | Dense documents, charts |
| 1120 x 1680 (3 tiles) | 510 ms | 410 ms | 4,803 | Portrait scans (A5, receipts) |
| 1680 x 2240 (6 tiles) | 1,080 ms | 860 ms | 9,606 | A4 invoice, Companies House filing |
Batched encode
| Batch (1120 px single-tile) | FP8 encode | ms / image | VRAM peak |
|---|---|---|---|
| 1 | 220 ms | 220 | 13.7 GB |
| 2 | 340 ms | 170 | 14.4 GB |
| 4 | 560 ms | 140 | 15.8 GB |
| 8 | 980 ms | 123 | 18.6 GB |
The vision tower is comparatively small, so batching scales well – the decode side is the bottleneck for a typical OCR pipeline, not the encoder. For sustained throughput, group images of similar resolution before encoding.
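If you want to implement that grouping in the queue worker, a minimal sketch is below. The job dict shape (width/height keys) and the 560 px bucket step are assumptions for illustration, not part of any particular queue library.

from itertools import groupby

def bucket(job, step=560):
    # Round each dimension up to the 560 px tile grid so near-identical
    # scans land in the same encode batch.
    return (-(-job["width"] // step), -(-job["height"] // step))

def batches_by_resolution(jobs, max_batch=4):
    # groupby only merges adjacent items, so sort by the same key first.
    jobs = sorted(jobs, key=bucket)
    for _, group in groupby(jobs, key=bucket):
        group = list(group)
        for i in range(0, len(group), max_batch):
            yield group[i:i + max_batch]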
Decode throughput and concurrency
Decode by batch size, FP8 weights
| Batch | Per-stream t/s | Aggregate t/s | p50 TTFT (1 image + 200 tok) | p99 TTFT |
|---|---|---|---|---|
| 1 | 115 | 115 | 340 ms | 420 ms |
| 2 | 96 | 192 | 410 ms | 520 ms |
| 4 | 72 | 288 | 590 ms | 820 ms |
| 8 | 48 | 384 | 1,020 ms | 1,560 ms |
| 16 | OOM at 32k | — | — | — |
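To turn that table into a latency budget: end-to-end response time is roughly TTFT plus output tokens divided by the per-stream decode rate, ignoring queueing and scheduler overhead. A quick estimate using the batch-4 row:

# Rough per-request latency from the batch-4 row, ignoring queueing delay.
def response_seconds(out_tokens, ttft_s=0.59, per_stream_tps=72):
    return ttft_s + out_tokens / per_stream_tps

print(round(response_seconds(300), 2))   # ~4.76 s for a 300-token extraction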
Cross-card decode comparison
| GPU | FP8 decode b=1 | Aggregate b=4 | Encode 1120 px | Max concurrent at 32k |
|---|---|---|---|---|
| RTX 5090 32GB | 168 t/s | 440 t/s | 140 ms | 16 |
| RTX 4090 24GB | 115 t/s | 288 t/s | 220 ms | 8 |
| RTX 5080 16GB | 92 t/s | OOM at 4 imgs | 240 ms | 2 |
| RTX 5060 Ti 16GB | 62 t/s | OOM at 4 imgs | 340 ms | 2 |
| RTX 3090 24GB | 54 t/s (no FP8) | 140 t/s | 320 ms | 4 |
| H100 80GB | 220 t/s | 720 t/s | 110 ms | 32+ |
The 3090 lacks native FP8, which is the headline gap on multimodal workloads. See 4090 vs 3090 and 4090 vs 5090 for the wider patterns.
OCR and reasoning quality
Public benchmark scores
| Benchmark | Llama 3.2 11B Vision | Qwen 2.5-VL 7B | GPT-4o-mini | Pixtral 12B |
|---|---|---|---|---|
| DocVQA (val, ANLS) | 88.4 | 93.0 | 89.6 | 90.7 |
| TextVQA | 75.2 | 84.9 | 78.0 | 74.8 |
| ChartQA | 83.4 | 84.5 | 81.0 | 81.8 |
| OCRBench | 782 | 864 | 805 | 754 |
| MMMU (validation) | 50.7 | 52.0 | 59.4 | 52.5 |
| MathVista | 51.5 | 62.3 | 56.7 | 56.9 |
Field extraction on 200 sampled UK documents
| Document type | Header field F1 | Line item F1 | Total amount accuracy |
|---|---|---|---|
| NHS letters | 0.96 | n/a | n/a |
| Trade invoices | 0.94 | 0.87 | 97.5% |
| Companies House filings | 0.93 | 0.84 | n/a |
| HMRC SA302 statements | 0.95 | 0.89 | 98.0% |
| Receipts (mixed quality) | 0.88 | 0.81 | 92.5% |
Strong enough to deploy with a light human-in-the-loop verification stage on edge cases. For OCR-dominant farms Qwen 2.5-VL is denser per pound; Llama 3.2 Vision wins on natural-image reasoning and on workloads where text context dominates over image count.
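One workable shape for that verification stage is to ask for strict JSON and route anything malformed or incomplete to a human queue. A sketch follows; the prompt wording, field list and review rule are illustrative choices, not the setup used for the table above.

import json

EXTRACTION_PROMPT = (
    "Extract the following fields from the attached invoice and reply with "
    "JSON only: supplier_name, invoice_number, invoice_date, total_amount. "
    "Use null for anything you cannot read."
)

REQUIRED = {"supplier_name", "invoice_number", "invoice_date", "total_amount"}

def needs_human_review(raw_reply: str) -> bool:
    # Send to the review queue on malformed JSON or any missing/null field.
    try:
        fields = json.loads(raw_reply)
    except json.JSONDecodeError:
        return True
    return any(fields.get(k) in (None, "") for k in REQUIRED)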
Deployment configuration
vLLM launch (FP8, 8k context, 4-image batching)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-11B-Vision-Instruct \
--quantization fp8 --kv-cache-dtype fp8_e4m3 \
--max-model-len 8192 --max-num-seqs 8 \
--limit-mm-per-prompt image=4 \
--enable-chunked-prefill \
--gpu-memory-utilization 0.92
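Requests against that server must use the chat-completions message structure with an explicit image_url part; a plain URL pasted into the text is silently ignored (see Production gotchas). A minimal client call, assuming the default port and a placeholder image URL:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/invoice.jpg"}},
            {"type": "text",
             "text": "List the line items and the total amount on this invoice."},
        ],
    }],
    max_tokens=300,
)
print(resp.choices[0].message.content)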
Pre-processing helper for production OCR queues
from PIL import Image
import io, requests

def fetch_for_llama_vision(url, max_dim=1120):
    # Cap the longest side at max_dim so the preprocessor emits a single tile.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    img = Image.open(io.BytesIO(resp.content))
    img.thumbnail((max_dim, max_dim))  # in place; preserves aspect ratio
    return img.convert("RGB")
Test rig and methodology
All numbers above were captured on a single-tenant Gigagpu node: RTX 4090 24GB Founders Edition (450 W stock), Ryzen 9 7950X with 64 GB DDR5-5600, Samsung 990 Pro 2TB Gen 4 NVMe; Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6, vLLM 0.6.4 (cross-attention vision support is in 0.6.3+), PyTorch 2.5, FlashAttention 2.6. Decode throughput is the sustained mean over 60-second windows; image encode is the mean of 50 samples after warm-up. The UK document corpus was 200 sampled documents, anonymised and labelled by hand. See our vLLM setup guide for installation steps.
Production gotchas
- The chat template is fiddly. Llama 3.2 Vision uses a structured message format with explicit image placeholders. If you pass a plain string with a URL, vLLM will silently ignore the image. Use the chat-completions endpoint with proper type: image_url content parts and confirm with a probe request before going live (a probe sketch follows this list).
- Tile count balloons VRAM unexpectedly. Auto-resize to 1120 px max dimension; an unconstrained 3000 x 4000 px scan will produce 12 tiles and ~19k image tokens, blowing past your KV budget and OOMing the worker.
- FP16 does not fit, full stop. Always use FP8 weights or AWQ-on-text + FP8-on-vision. People try BF16 because it is the default and hit OOM after the first image.
- Cross-attention prefill is more expensive than text-only. A 200-token text prompt with a single image takes ~340 ms TTFT, not the ~80 ms a text-only Llama 3.1 8B would. Plan SLOs accordingly.
- Multi-image batching is brittle. Different images produce different cross-attention key/value sizes; vLLM pads to the max, so a batch with one A4 scan and one avatar wastes most of the compute on the avatar slot. Group similar-resolution requests where possible.
- The vision tower is sensitive to JPEG artefacts. Sub-50% JPEG quality measurably degrades OCR; if you control the upload pipeline keep clients at quality 80+.
- Bounding-box outputs are not native. Unlike Qwen 2.5-VL, Llama 3.2 Vision does not produce grounded bbox tokens. If you need layout-aware downstream work, pair it with a separate detector or use Qwen.
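The probe mentioned in the first bullet can be a generated solid-colour image sent as a base64 data URL: if the image part is wired up wrongly, the model answers from text alone and the colour check fails. A sketch, assuming the launch command above and that your vLLM build accepts data URLs (recent versions do); the red-square test is just an illustrative choice.

import base64, io, requests
from PIL import Image

def probe_vision_endpoint(base_url="http://localhost:8000/v1"):
    # Send a solid red square as a data URL and check the model actually saw it.
    buf = io.BytesIO()
    Image.new("RGB", (560, 560), (220, 30, 30)).save(buf, format="JPEG")
    data_url = "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()
    r = requests.post(f"{base_url}/chat/completions", json={
        "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "messages": [{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": "In one word, what colour is this image?"},
        ]}],
        "max_tokens": 10,
    }, timeout=60)
    r.raise_for_status()
    reply = r.json()["choices"][0]["message"]["content"].lower()
    assert "red" in reply, f"image appears to have been ignored: {reply!r}"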
Cost per million pages
| Workload | Pages / hour | Pages / day | Cost / 1k pages (UK power) |
|---|---|---|---|
| Single-stream OCR (b=1, full A4) | 900 | 21,600 | £0.40 |
| Batched OCR (b=4, single tile) | 3,600 | 86,400 | £0.10 |
| Batched OCR (b=8, mixed) | 5,400 | 129,600 | £0.07 |
That undercuts AWS Textract by roughly 90% and Azure Document Intelligence by ~85% for steady-state workloads. See monthly hosting cost and vs OpenAI API cost for the broader API-vs-self-hosted picture.
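Those per-1k-page figures reduce to hourly running cost divided by pages per hour. The ~£0.36/hour all-in figure below is inferred from the table rather than a quoted rate, so treat it as an assumption.

# cost per 1,000 pages = hourly running cost / pages per hour * 1,000
HOURLY_COST_GBP = 0.36   # assumed all-in cost (power + amortised hosting)

def cost_per_1k_pages(pages_per_hour):
    return HOURLY_COST_GBP / pages_per_hour * 1_000

print(round(cost_per_1k_pages(900), 2))    # 0.40 -> single-stream row
print(round(cost_per_1k_pages(3_600), 2))  # 0.10 -> batch-4 row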
Verdict and alternatives
Pick Llama 3.2 Vision 11B on a 4090 when you want strong general visual reasoning, are happy with cross-attention’s smaller KV footprint for long text contexts, and your dataset skews toward Western-language documents and natural images. If you need maximum OCR throughput, especially on Asian scripts or table-heavy layouts, Qwen 2.5-VL 7B is denser. For pure chart understanding both are close; for video-conditioned reasoning Qwen wins because Llama 3.2 Vision does not ingest video frames natively. For text-only workloads at higher throughput, drop to Llama 3 8B.
Vision LLMs on a single 4090
OCR, document Q&A and chart understanding at 100+ tok/s with multi-image batching. UK dedicated hosting.
Order the RTX 4090 24GB
See also: Qwen 2.5-VL guide, multimodal use cases, vLLM setup, FP8 deployment, prefill/decode, 4090 vs 5090, tokens per watt, concurrent users.