Llama 3.2 11B Vision Instruct is Meta’s first natively multimodal Llama: a 9B Llama text decoder married to a roughly 2B vision tower through a cross-attention adapter design rather than the more common token-concatenation route. On a single RTX 4090 24GB dedicated host from Gigagpu UK hosting, the FP8 build occupies roughly 13.5 GB total, encodes a 1120 x 1120 image in 220 ms, and decodes at ~115 tokens per second at batch 1. That leaves enough VRAM to run multi-image conversations with a 32k-token KV cache, batch up to four parallel OCR jobs, or pin a Whisper sidecar in the same process. This guide walks through the architecture, the latency breakdown by image resolution, OCR quality on UK documents, and the operational gotchas that bite people deploying it for the first time.
Contents
- Why Llama Vision 11B on a 4090
- Architecture and VRAM map
- Image encode latency
- Decode throughput and concurrency
- OCR and reasoning quality
- Deployment configuration
- Production gotchas
- Cost per million pages
- Verdict and alternatives
Why Llama Vision 11B on a 4090
What Meta built and why the architecture matters
Llama 3.2 11B Vision uses a cross-attention adapter instead of the now-standard token-concatenation approach. Where Qwen 2.5-VL prepends image tokens directly into the text stream (so a high-res image with 4 tiles consumes ~6,400 tokens of your KV budget), Llama 3.2 Vision injects image features through dedicated cross-attention layers that sit between the standard self-attention and feedforward blocks. The text decoder is structurally a 9B Llama variant (32 layers, 8 KV heads under GQA, 128k native context). The vision tower is a ~2B ViT with cross-attention projection heads. The Llama 3.2 Community License covers commercial use under standard restrictions.
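To make the KV-budget difference concrete, here is a back-of-the-envelope sketch. It treats both designs as producing the same per-image token count (the ~6,400-token 4-tile figure from the encode table further down) and is illustrative rather than a model of either library's internals.

# Illustrative only: how much of a 32k context is left for text when
# four 4-tile images are attached, under each design.
CONTEXT = 32_768
TOKENS_PER_4_TILE_IMAGE = 6_404   # from the encode table below

def text_budget_token_concat(n_images):
    # Token-concatenation VLMs spend KV-cache slots on image tokens.
    return CONTEXT - n_images * TOKENS_PER_4_TILE_IMAGE

def text_budget_cross_attention(n_images):
    # Cross-attention keeps image features outside the self-attention cache.
    return CONTEXT

print(text_budget_token_concat(4))     # 7,152 tokens left for text
print(text_budget_cross_attention(4))  # still 32,768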
Why the 4090 is the right card
The cross-attention design has a major operational consequence on the RTX 4090's 24 GB pool: image features do not pollute the autoregressive KV cache. A 32k-token text window stays a 32k-token text window even when you attach four images. The cost is that the cross-attention layers themselves are extra weights (~2.5 GB FP8) that you cannot quantise as aggressively as the text decoder. Ada's native FP8 tensor cores at ~660 TFLOPS handle both the ViT prefill and the cross-attention projections without bandwidth contention; the GDDR6X 1,008 GB/s bus is the dominant decode bottleneck. The 24 GB pool is what makes 4-image batching at 32k context viable; the 16 GB cards (5080, 5060 Ti) cannot do this.
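The KV-cache rows in the table below follow directly from that decoder shape. A quick sanity check, assuming the usual Llama head dimension of 128 (not quoted in this guide):

# KV cache size = 2 (K and V) x layers x kv_heads x head_dim x seq_len x bytes
layers, kv_heads, head_dim = 32, 8, 128   # head_dim assumed, not stated above

def kv_cache_gib(seq_len, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

print(kv_cache_gib(32_768, 2))   # FP16 at 32k -> 4.0 GiB
print(kv_cache_gib(32_768, 1))   # FP8  at 32k -> 2.0 GiB
print(kv_cache_gib(8_192, 1))    # FP8  at 8k  -> 0.5 GiB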
Architecture and VRAM map
Format-by-format footprint
| Component | FP16 | FP8 | AWQ INT4 (text only, vision FP8) |
|---|---|---|---|
| Text decoder (9B) | 18.0 GB | 9.0 GB | 5.6 GB |
| Vision tower + cross-attn adapter (~2B) | 4.0 GB | 2.5 GB | 2.5 GB (kept FP8) |
| KV @ 8k | 1.0 GB | 0.5 GB | 0.5 GB |
| KV @ 32k | 4.0 GB | 2.0 GB | 2.0 GB |
| Activations + scratch | 1.5 GB | 1.5 GB | 1.5 GB |
| Image features (1 x 1120 px) | 0.4 GB | 0.2 GB | 0.2 GB |
| Peak VRAM @ 8k, 1 image | ~25 GB (overflows) | ~13.7 GB | ~10.3 GB |
| Peak VRAM @ 32k, 4 images | does not fit | ~17.0 GB | ~13.6 GB |
Plain FP16 will not fit on a 4090 once even one image is folded in. FP8 is the sensible default, and AWQ-quantising only the decoder buys roughly 3.5 GB of additional headroom for batching or longer contexts. See our FP8 tensor core notes for why the FP8 path is cheap on Ada.
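As a sanity check, the single-image FP8 peak is simply the component rows summed. The helper below reproduces the quoted figures rather than predicting anything new; the numbers are lifted straight from the table.

# Sum the FP8 rows from the table to reproduce the quoted peak and see
# what headroom is left on the 24 GB card.
FP8 = {"decoder": 9.0, "vision": 2.5, "activations": 1.5}
KV = {"8k": 0.5, "32k": 2.0}
PER_IMAGE = 0.2   # 1120 px single-tile image features

def peak_gib(context, n_images):
    return sum(FP8.values()) + KV[context] + n_images * PER_IMAGE

print(peak_gib("8k", 1))          # 13.7 GB, matches the table
print(24.0 - peak_gib("32k", 4))  # ~8.2 GB spare by component sum
                                  # (the table's measured peak is ~17 GB)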
Image encode latency
Encode by resolution
| Image size | FP16 encode | FP8 encode | Tokens emitted | Notes |
|---|---|---|---|---|
| 560 x 560 (single tile) | 140 ms | 120 ms | 1,601 | Avatars, thumbnails |
| 1120 x 1120 (single tile) | 270 ms | 220 ms | 1,601 | Default OCR resolution |
| 1120 x 1120 (4 tiles) | 720 ms | 560 ms | 6,404 | Dense documents, charts |
| 1120 x 1680 (3 tiles) | 510 ms | 410 ms | 4,803 | Portrait scans (A5, receipts) |
| 1680 x 2240 (6 tiles) | 1,080 ms | 860 ms | 9,606 | A4 invoice, Companies House filing |
Batched encode
| Batch (1120 px single-tile) | FP8 encode | ms / image | VRAM peak |
|---|---|---|---|
| 1 | 220 ms | 220 | 13.7 GB |
| 2 | 340 ms | 170 | 14.4 GB |
| 4 | 560 ms | 140 | 15.8 GB |
| 8 | 980 ms | 123 | 18.6 GB |
The vision tower is comparatively small, so batching scales well – the decode side is the bottleneck for a typical OCR pipeline, not the encoder. For sustained throughput, group images of similar resolution before encoding.
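If you want to implement that grouping in the queue worker, a minimal sketch is below. The job dict shape (width/height keys) and the 560 px bucket step are assumptions for illustration, not part of any particular queue library.

from itertools import groupby

def bucket(job, step=560):
    # Round each dimension up to the 560 px tile grid so near-identical
    # scans land in the same encode batch.
    return (-(-job["width"] // step), -(-job["height"] // step))

def batches_by_resolution(jobs, max_batch=4):
    # groupby only merges adjacent items, so sort by the same key first.
    jobs = sorted(jobs, key=bucket)
    for _, group in groupby(jobs, key=bucket):
        group = list(group)
        for i in range(0, len(group), max_batch):
            yield group[i:i + max_batch]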
Decode throughput and concurrency
Decode by batch size, FP8 weights
| Batch | Per-stream t/s | Aggregate t/s | p50 TTFT (1 image + 200 tok) | p99 TTFT |
|---|---|---|---|---|
| 1 | 115 | 115 | 340 ms | 420 ms |
| 2 | 96 | 192 | 410 ms | 520 ms |
| 4 | 72 | 288 | 590 ms | 820 ms |
| 8 | 48 | 384 | 1,020 ms | 1,560 ms |
| 16 | OOM at 32k | — | — | — |
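To turn that table into a latency budget: end-to-end response time is roughly TTFT plus output tokens divided by the per-stream decode rate, ignoring queueing and scheduler overhead. A quick estimate using the batch-4 row:

# Rough per-request latency from the batch-4 row, ignoring queueing delay.
def response_seconds(out_tokens, ttft_s=0.59, per_stream_tps=72):
    return ttft_s + out_tokens / per_stream_tps

print(round(response_seconds(300), 2))   # ~4.76 s for a 300-token extraction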
Cross-card decode comparison
| GPU | FP8 decode b=1 | Aggregate b=4 | Encode 1120 px | Max concurrent at 32k |
|---|---|---|---|---|
| RTX 5090 32GB | 168 t/s | 440 t/s | 140 ms | 16 |
| RTX 4090 24GB | 115 t/s | 288 t/s | 220 ms | 8 |
| RTX 5080 16GB | 92 t/s | OOM at 4 imgs | 240 ms | 2 |
| RTX 5060 Ti 16GB | 62 t/s | OOM at 4 imgs | 340 ms | 2 |
| RTX 3090 24GB | 54 t/s (no FP8) | 140 t/s | 320 ms | 4 |
| H100 80GB | 220 t/s | 720 t/s | 110 ms | 32+ |
The 3090 lacks native FP8, which is the headline gap on multimodal workloads. See 4090 vs 3090 and 4090 vs 5090 for the wider patterns.
OCR and reasoning quality
Public benchmark scores
| Benchmark | Llama 3.2 11B Vision | Qwen 2.5-VL 7B | GPT-4o-mini | Pixtral 12B |
|---|---|---|---|---|
| DocVQA (val, ANLS) | 88.4 | 93.0 | 89.6 | 90.7 |
| TextVQA | 75.2 | 84.9 | 78.0 | 74.8 |
| ChartQA | 83.4 | 84.5 | 81.0 | 81.8 |
| OCRBench | 782 | 864 | 805 | 754 |
| MMMU (validation) | 50.7 | 52.0 | 59.4 | 52.5 |
| MathVista | 51.5 | 62.3 | 56.7 | 56.9 |
Field extraction on 200 sampled UK documents
| Document type | Header field F1 | Line item F1 | Total amount accuracy |
|---|---|---|---|
| NHS letters | 0.96 | n/a | n/a |
| Trade invoices | 0.94 | 0.87 | 97.5% |
| Companies House filings | 0.93 | 0.84 | n/a |
| HMRC SA302 statements | 0.95 | 0.89 | 98.0% |
| Receipts (mixed quality) | 0.88 | 0.81 | 92.5% |
Strong enough to deploy with a light human-in-the-loop verification stage on edge cases. For OCR-dominant farms Qwen 2.5-VL is denser per pound; Llama 3.2 Vision wins on natural-image reasoning and on workloads where text context dominates over image count.
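One workable shape for that verification stage is to ask for strict JSON and route anything malformed or incomplete to a human queue. A sketch follows; the prompt wording, field list and review rule are illustrative choices, not the setup used for the table above.

import json

EXTRACTION_PROMPT = (
    "Extract the following fields from the attached invoice and reply with "
    "JSON only: supplier_name, invoice_number, invoice_date, total_amount. "
    "Use null for anything you cannot read."
)

REQUIRED = {"supplier_name", "invoice_number", "invoice_date", "total_amount"}

def needs_human_review(raw_reply: str) -> bool:
    # Send to the review queue on malformed JSON or any missing/null field.
    try:
        fields = json.loads(raw_reply)
    except json.JSONDecodeError:
        return True
    return any(fields.get(k) in (None, "") for k in REQUIRED)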
Deployment configuration
vLLM launch (FP8, 8k context, 4-image batching)
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-11B-Vision-Instruct \
--quantization fp8 --kv-cache-dtype fp8_e4m3 \
--max-model-len 8192 --max-num-seqs 8 \
--limit-mm-per-prompt image=4 \
--enable-chunked-prefill \
--gpu-memory-utilization 0.92
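Requests against that server must use the chat-completions message structure with an explicit image_url part; a plain URL pasted into the text is silently ignored (see Production gotchas). A minimal client call, assuming the default port and a placeholder image URL:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/invoice.jpg"}},
            {"type": "text",
             "text": "List the line items and the total amount on this invoice."},
        ],
    }],
    max_tokens=300,
)
print(resp.choices[0].message.content)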
Pre-processing helper for production OCR queues
from PIL import Image
import io, requests

def fetch_for_llama_vision(url, max_dim=1120):
    # Cap the longest side at max_dim so the preprocessor emits a single tile.
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    img = Image.open(io.BytesIO(resp.content))
    img.thumbnail((max_dim, max_dim))  # in place; preserves aspect ratio
    return img.convert("RGB")
Test rig and methodology
All numbers above were captured on a single-tenant Gigagpu node: RTX 4090 24GB Founders Edition (450 W stock), Ryzen 9 7950X with 64 GB DDR5-5600, Samsung 990 Pro 2TB Gen 4 NVMe; Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6, vLLM 0.6.4 (cross-attention vision support is in 0.6.3+), PyTorch 2.5, FlashAttention 2.6. Decode throughput is the sustained mean over 60-second windows; image encode is the mean of 50 samples after warm-up. The UK document corpus was 200 sampled documents, anonymised and labelled by hand. See our vLLM setup guide for installation steps.
Production gotchas
- The chat template is fiddly. Llama 3.2 Vision uses a structured message format with explicit image placeholders. If you pass a plain string with a URL, vLLM will silently ignore the image. Use the chat-completions endpoint with proper type: image_url content parts and confirm with a probe request before going live (a probe sketch follows this list).
- Tile count balloons VRAM unexpectedly. Auto-resize to 1120 px max dimension; an unconstrained 3000 x 4000 px scan will produce 12 tiles and ~19k image tokens, blowing past your KV budget and OOMing the worker.
- FP16 does not fit, full stop. Always use FP8 weights or AWQ-on-text + FP8-on-vision. People try BF16 because it is the default and hit OOM after the first image.
- Cross-attention prefill is more expensive than text-only. A 200-token text prompt with a single image takes ~340 ms TTFT, not the ~80 ms a text-only Llama 3.1 8B would. Plan SLOs accordingly.
- Multi-image batching is brittle. Different images produce different cross-attention key/value sizes; vLLM pads to the max, so a batch with one A4 scan and one avatar wastes most of the compute on the avatar slot. Group similar-resolution requests where possible.
- The vision tower is sensitive to JPEG artefacts. Sub-50% JPEG quality measurably degrades OCR; if you control the upload pipeline keep clients at quality 80+.
- Bounding-box outputs are not native. Unlike Qwen 2.5-VL, Llama 3.2 Vision does not produce grounded bbox tokens. If you need layout-aware downstream work, pair it with a separate detector or use Qwen.
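The probe mentioned in the first bullet can be a generated solid-colour image sent as a base64 data URL: if the image part is wired up wrongly, the model answers from text alone and the colour check fails. A sketch, assuming the launch command above and that your vLLM build accepts data URLs (recent versions do); the red-square test is just an illustrative choice.

import base64, io, requests
from PIL import Image

def probe_vision_endpoint(base_url="http://localhost:8000/v1"):
    # Send a solid red square as a data URL and check the model actually saw it.
    buf = io.BytesIO()
    Image.new("RGB", (560, 560), (220, 30, 30)).save(buf, format="JPEG")
    data_url = "data:image/jpeg;base64," + base64.b64encode(buf.getvalue()).decode()
    r = requests.post(f"{base_url}/chat/completions", json={
        "model": "meta-llama/Llama-3.2-11B-Vision-Instruct",
        "messages": [{"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": "In one word, what colour is this image?"},
        ]}],
        "max_tokens": 10,
    }, timeout=60)
    r.raise_for_status()
    reply = r.json()["choices"][0]["message"]["content"].lower()
    assert "red" in reply, f"image appears to have been ignored: {reply!r}"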
Cost per million pages
| Workload | Pages / hour | Pages / day | Cost / 1k pages (UK power) |
|---|---|---|---|
| Single-stream OCR (b=1, full A4) | 900 | 21,600 | £0.40 |
| Batched OCR (b=4, single tile) | 3,600 | 86,400 | £0.10 |
| Batched OCR (b=8, mixed) | 5,400 | 129,600 | £0.07 |
That undercuts AWS Textract by roughly 90% and Azure Document Intelligence by ~85% for steady-state workloads. See monthly hosting cost and vs OpenAI API cost for the broader API-vs-self-hosted picture.
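Those per-1k-page figures reduce to hourly running cost divided by pages per hour. The ~£0.36/hour all-in figure below is inferred from the table rather than a quoted rate, so treat it as an assumption.

# cost per 1,000 pages = hourly running cost / pages per hour * 1,000
HOURLY_COST_GBP = 0.36   # assumed all-in cost (power + amortised hosting)

def cost_per_1k_pages(pages_per_hour):
    return HOURLY_COST_GBP / pages_per_hour * 1_000

print(round(cost_per_1k_pages(900), 2))    # 0.40 -> single-stream row
print(round(cost_per_1k_pages(3_600), 2))  # 0.10 -> batch-4 row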
Verdict and alternatives
Pick Llama 3.2 Vision 11B on a 4090 when you want strong general visual reasoning, are happy with cross-attention’s smaller KV footprint for long text contexts, and your dataset skews toward Western-language documents and natural images. If you need maximum OCR throughput, especially on Asian scripts or table-heavy layouts, Qwen 2.5-VL 7B is denser. For pure chart understanding both are close; for video-conditioned reasoning Qwen wins because Llama 3.2 Vision does not ingest video frames natively. For text-only workloads at higher throughput, drop to Llama 3 8B.
Vision LLMs on a single 4090
OCR, document Q&A and chart understanding at 100+ tok/s with multi-image batching. UK dedicated hosting.
Order the RTX 4090 24GB
See also: Qwen 2.5-VL guide, multimodal use cases, vLLM setup, FP8 deployment, prefill/decode, 4090 vs 5090, tokens per watt, concurrent users.