
RTX 4090 24GB for Llama 3.2 Vision 11B: OCR, Multi-Image Reasoning and Latency

Llama 3.2 Vision 11B on the RTX 4090 24GB - 13.5GB FP8 footprint, 220ms image encode, 115 t/s decode, multi-image batching, OCR quality on UK documents and operational gotchas for production multimodal pipelines.

Llama 3.2 11B Vision Instruct is Meta’s first natively multimodal Llama: a 9B Llama text decoder married to a roughly 2B vision tower through a cross-attention adapter design rather than the more common token-concatenation route. On a single RTX 4090 24GB dedicated host from Gigagpu UK hosting, the FP8 build occupies roughly 13.5 GB total, encodes a 1120 x 1120 image in 220 ms, and decodes at ~115 tokens per second at batch 1. That leaves enough VRAM to run multi-image conversations with a 32k-token KV cache, batch up to four parallel OCR jobs, or pin a Whisper sidecar in the same process. This guide walks through the architecture, the latency breakdown by image resolution, OCR quality on UK documents, and the operational gotchas that bite people deploying it for the first time.

Why Llama Vision 11B on a 4090

What Meta built and why the architecture matters

Llama 3.2 11B Vision uses a cross-attention adapter instead of the now-standard token-concatenation approach. Where Qwen 2.5-VL prepends image tokens directly into the text stream (so a high-res image with 4 tiles consumes ~6,400 tokens of your KV budget), Llama 3.2 Vision injects image features through dedicated cross-attention layers that sit between the standard self-attention and feedforward blocks. The text decoder is structurally a 9B Llama variant (32 layers, 8 KV heads under GQA, 128k native context). The vision tower is a ~2B ViT with cross-attention projection heads. The Llama 3.2 Community License covers commercial use under standard restrictions.
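
To make the wiring concrete, here is a deliberately simplified PyTorch sketch of a decoder block with the cross-attention adapter slotted between the self-attention and feedforward paths, as described above. It is illustrative only: the layer sizes are placeholders, and the production model's RMSNorm, grouped-query attention, rotary embeddings and causal masking are omitted.

import torch
import torch.nn as nn

class DecoderBlockWithVisionXAttn(nn.Module):
    """Toy decoder block: self-attention over text, then gated cross-attention
    into vision features, then the feedforward path."""

    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        # Adapter: queries come from text states, keys/values from the vision tower.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.xattn_gate = nn.Parameter(torch.zeros(1))  # learned during multimodal training
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.SiLU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x, vision_features=None):
        # Self-attention over text tokens only (causal mask omitted for brevity);
        # the KV cache grows with text length, never with image count.
        h = self.norm1(x)
        attn_out, _ = self.self_attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Image features enter here, outside the autoregressive KV cache.
        if vision_features is not None:
            h = self.norm2(x)
            xattn_out, _ = self.cross_attn(h, vision_features, vision_features,
                                           need_weights=False)
            x = x + torch.tanh(self.xattn_gate) * xattn_out
        return x + self.ffn(self.norm3(x))

block = DecoderBlockWithVisionXAttn()
text_states = torch.randn(1, 16, 1024)    # 16 text tokens
image_feats = torch.randn(1, 1601, 1024)  # one tile's worth of image features
out = block(text_states, vision_features=image_feats)  # shape matches the text input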

Why the 4090 is the right card

The cross-attention design has a major operational consequence on the RTX 4090's 24 GB pool: image features do not pollute the autoregressive KV cache. A 32k-token text window stays a 32k-token text window even when you attach four images. The cost is that the cross-attention layers themselves are extra weights (~2.5 GB FP8) that you cannot quantise as aggressively as the text decoder. Ada's native FP8 tensor cores at ~660 TFLOPS handle both the ViT prefill and the cross-attention projections without bandwidth contention; the GDDR6X 1,008 GB/s bus is the dominant decode bottleneck. The 24 GB pool is what makes 4-image batching at 32k context viable; the 16 GB cards (5080, 5060 Ti) cannot do this.
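
A quick back-of-envelope check of that claim, using only the decoder geometry quoted in the previous section:

# KV maths from the geometry above (32 layers, 8 KV heads under GQA,
# 128-dim heads) with 1-byte FP8 cache entries.
layers, kv_heads, head_dim, bytes_per_entry = 32, 8, 128, 1
kv_per_token = 2 * layers * kv_heads * head_dim * bytes_per_entry  # K and V
print(kv_per_token // 1024)                 # 64 KiB of cache per text token
print(32_768 * kv_per_token / 1024**3)      # 2.0 GiB at the full 32k window

# Under a token-concatenation design, a 4-tile image's ~6,400 tokens would
# spend part of that budget; with cross-attention they cost the KV cache nothing.
print(round(6_404 * kv_per_token / 1024**3, 2))  # ~0.39 GiB of KV avoided per image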

Architecture and VRAM map

Format-by-format footprint

| Component | FP16 | FP8 | AWQ INT4 (text only, vision FP8) |
|---|---|---|---|
| Text decoder (9B) | 18.0 GB | 9.0 GB | 5.6 GB |
| Vision tower + cross-attn adapter (~2B) | 4.0 GB | 2.5 GB | 2.5 GB (kept FP8) |
| KV @ 8k | 1.0 GB | 0.5 GB | 0.5 GB |
| KV @ 32k | 4.0 GB | 2.0 GB | 2.0 GB |
| Activations + scratch | 1.5 GB | 1.5 GB | 1.5 GB |
| Image features (1 x 1120 px) | 0.4 GB | 0.2 GB | 0.2 GB |
| Peak VRAM @ 8k, 1 image | ~25 GB (overflows) | ~13.7 GB | ~10.3 GB |
| Peak VRAM @ 32k, 4 images | does not fit | ~17.0 GB | ~13.6 GB |

Plain FP16 will not fit on a 4090 once even one image is folded in. FP8 is the sensible default and AWQ-quantising only the decoder buys roughly 3.5 GB of additional headroom for batching or longer contexts. See our FP8 tensor core notes for why the FP8 path is cheap on Ada.
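
For planning other configurations, the peak figures can be approximated by summing the component rows. A rough estimator under that assumption (the measured 32k, 4-image peaks in the table land roughly 1.2 GB above this simple sum, presumably extra multi-image scratch):

# Rough peak-VRAM estimator assembled from the component rows above (GB).
WEIGHTS_GB = {"fp8": 9.0 + 2.5, "awq_int4": 5.6 + 2.5}  # decoder + vision/adapter
KV_GB_PER_TOKEN = 2.0 / 32_768                           # FP8 KV cache
IMAGE_GB = 0.2                                           # features per 1120 px tile
SCRATCH_GB = 1.5

def peak_vram_gb(fmt, context_tokens, images):
    return (WEIGHTS_GB[fmt] + context_tokens * KV_GB_PER_TOKEN
            + images * IMAGE_GB + SCRATCH_GB)

print(round(peak_vram_gb("fp8", 8_192, 1), 1))       # 13.7 - matches the table
print(round(peak_vram_gb("awq_int4", 8_192, 1), 1))  # 10.3 - matches the table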

Image encode latency

Encode by resolution

| Image size | FP16 encode | FP8 encode | Tokens emitted | Notes |
|---|---|---|---|---|
| 560 x 560 (single tile) | 140 ms | 120 ms | 1,601 | Avatars, thumbnails |
| 1120 x 1120 (single tile) | 270 ms | 220 ms | 1,601 | Default OCR resolution |
| 1120 x 1120 (4 tiles) | 720 ms | 560 ms | 6,404 | Dense documents, charts |
| 1120 x 1680 (3 tiles) | 510 ms | 410 ms | 4,803 | Portrait scans (A5, receipts) |
| 1680 x 2240 (6 tiles) | 1,080 ms | 860 ms | 9,606 | A4 invoice, Companies House filing |
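
A rule of thumb that falls out of those measurements: each tile emits 1,601 image tokens, and multi-tile FP8 encode averages roughly 140 ms per tile. The helper below is a hypothetical planning aid built on that rule, not a measured formula.

TOKENS_PER_TILE = 1_601
FP8_MS_PER_TILE = 140   # multi-tile average; a lone 1120 px tile measures ~220 ms

def encode_estimate(tiles):
    """Return (image tokens, approximate FP8 encode ms) for a given tile count."""
    return tiles * TOKENS_PER_TILE, tiles * FP8_MS_PER_TILE

print(encode_estimate(6))  # (9606, 840) - close to the measured 860 ms for an A4 scan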

Batched encode

| Batch (1120 px single-tile) | FP8 encode | ms / image | VRAM peak |
|---|---|---|---|
| 1 | 220 ms | 220 | 13.7 GB |
| 2 | 340 ms | 170 | 14.4 GB |
| 4 | 560 ms | 140 | 15.8 GB |
| 8 | 980 ms | 123 | 18.6 GB |

The vision tower is comparatively small, so batching scales well; for a typical OCR pipeline the bottleneck is the decode side, not the encoder. For sustained throughput, group images of similar resolution before encoding.
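
One simple way to do that grouping is to bucket incoming images by their approximate tile-grid footprint before dispatching encode batches; the helper below is illustrative and not tied to any particular serving stack.

from collections import defaultdict
from math import ceil

def bucket_by_resolution(images, tile=560):
    """Group PIL images whose approximate tile-grid footprints match, so each
    encode batch pads to roughly the same shape."""
    buckets = defaultdict(list)
    for img in images:
        key = (ceil(img.width / tile), ceil(img.height / tile))
        buckets[key].append(img)
    return buckets

# Dispatch one encode batch per bucket, largest buckets first.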

Decode throughput and concurrency

Decode by batch size, FP8 weights

| Batch | Per-stream t/s | Aggregate t/s | p50 TTFT (1 image + 200 tok) | p99 TTFT |
|---|---|---|---|---|
| 1 | 115 | 115 | 340 ms | 420 ms |
| 2 | 96 | 192 | 410 ms | 520 ms |
| 4 | 72 | 288 | 590 ms | 820 ms |
| 8 | 48 | 384 | 1,020 ms | 1,560 ms |
| 16 | OOM at 32k | | | |
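
To turn the table into an SLO estimate: end-to-end latency is TTFT plus output tokens divided by the per-stream rate. The 500-token response length below is illustrative.

# Batch-4 row: 590 ms p50 TTFT, 72 t/s per stream.
ttft_s, per_stream_tps, output_tokens = 0.59, 72, 500
print(round(ttft_s + output_tokens / per_stream_tps, 1))  # ~7.5 s per document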

Cross-card decode comparison

| GPU | FP8 decode b=1 | Aggregate b=4 | Encode 1120 px | Max concurrent at 32k |
|---|---|---|---|---|
| RTX 5090 32GB | 168 t/s | 440 t/s | 140 ms | 16 |
| RTX 4090 24GB | 115 t/s | 288 t/s | 220 ms | 8 |
| RTX 5080 16GB | 92 t/s | OOM at 4 imgs | 240 ms | 2 |
| RTX 5060 Ti 16GB | 62 t/s | OOM at 4 imgs | 340 ms | 2 |
| RTX 3090 24GB | 54 t/s (no FP8) | 140 t/s | 320 ms | 4 |
| H100 80GB | 220 t/s | 720 t/s | 110 ms | 32+ |

The 3090 lacks native FP8, which accounts for the headline gap on multimodal workloads. See 4090 vs 3090 and 4090 vs 5090 for the wider patterns.

OCR and reasoning quality

Public benchmark scores

| Benchmark | Llama 3.2 11B Vision | Qwen 2.5-VL 7B | GPT-4o-mini | Pixtral 12B |
|---|---|---|---|---|
| DocVQA (val, ANLS) | 88.4 | 93.0 | 89.6 | 90.7 |
| TextVQA | 75.2 | 84.9 | 78.0 | 74.8 |
| ChartQA | 83.4 | 84.5 | 81.0 | 81.8 |
| OCRBench | 782 | 864 | 805 | 754 |
| MMMU (validation) | 50.7 | 52.0 | 59.4 | 52.5 |
| MathVista | 51.5 | 62.3 | 56.7 | 56.9 |

Field extraction on 200 sampled UK documents

| Document type | Header field F1 | Line item F1 | Total amount accuracy |
|---|---|---|---|
| NHS letters | 0.96 | n/a | n/a |
| Trade invoices | 0.94 | 0.87 | 97.5% |
| Companies House filings | 0.93 | 0.84 | n/a |
| HMRC SA302 statements | 0.95 | 0.89 | 98.0% |
| Receipts (mixed quality) | 0.88 | 0.81 | 92.5% |

These results are strong enough to deploy with a light human-in-the-loop verification stage for edge cases. For OCR-dominant farms, Qwen 2.5-VL delivers more accuracy per pound; Llama 3.2 Vision wins on natural-image reasoning and on workloads where text context dominates over image count.

Deployment configuration

vLLM launch (FP8, 8k context, 4-image batching)

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.2-11B-Vision-Instruct \
  --quantization fp8 --kv-cache-dtype fp8_e4m3 \
  --max-model-len 8192 --max-num-seqs 8 \
  --limit-mm-per-prompt image=4 \
  --enable-chunked-prefill \
  --gpu-memory-utilization 0.92
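
Once the server is up, send a probe request through the OpenAI-compatible endpoint and check the reply actually reflects the image rather than the text alone (see gotcha 1 below). A minimal sketch, assuming the default port 8000 and a placeholder image URL:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.2-11B-Vision-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/sample-invoice.png"}},
            {"type": "text", "text": "List the invoice number and the total amount."},
        ],
    }],
    max_tokens=200,
)
print(resp.choices[0].message.content)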

Pre-processing helper for production OCR queues

from PIL import Image
import io, requests

def fetch_for_llama_vision(url, max_dim=1120):
    """Fetch an image and cap its longest edge so it encodes as a single
    1120 px tile instead of ballooning into a multi-tile prompt."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    img = Image.open(io.BytesIO(resp.content))
    img.thumbnail((max_dim, max_dim))  # in-place resize, preserves aspect ratio
    return img.convert("RGB")

Test rig and methodology

All numbers above were captured on a single-tenant Gigagpu node: RTX 4090 24GB Founders Edition (450 W stock), Ryzen 9 7950X with 64 GB DDR5-5600, Samsung 990 Pro 2TB Gen 4 NVMe; Ubuntu 24.04 LTS, NVIDIA driver 560.x, CUDA 12.6, vLLM 0.6.4 (cross-attention vision support is in 0.6.3+), PyTorch 2.5, FlashAttention 2.6. Decode throughput is the sustained mean over 60-second windows; image encode is the mean of 50 samples after warm-up. The UK document corpus was 200 sampled documents, anonymised and labelled by hand. See our vLLM setup guide for installation steps.

Production gotchas

  1. The chat template is fiddly. Llama 3.2 Vision uses a structured message format with explicit image placeholders. If you pass a plain string with a URL embedded in it, vLLM will silently ignore the image. Use the chat-completions endpoint with proper type: image_url content parts (the probe request in the deployment section shows the shape) and confirm the image is actually being read before going live.
  2. Tile count balloons VRAM unexpectedly. Auto-resize to 1120 px max dimension; an unconstrained 3000 x 4000 px scan will produce 12 tiles and ~19 k image tokens, blowing past your KV budget and OOMing the worker.
  3. FP16 does not fit, full stop. Always use FP8 weights or AWQ-on-text + FP8-on-vision. People try BF16 because it is the default and hit OOM after the first image.
  4. Cross-attention prefill is more expensive than text-only. A 200-token text prompt with a single image takes ~340 ms TTFT, not the ~80 ms a text-only Llama 3.1 8B would. Plan SLOs accordingly.
  5. Multi-image batching is brittle. Different images produce different cross-attention key/value sizes; vLLM pads to the max, so a batch with one A4 scan and one avatar wastes most of the compute on the avatar slot. Group similar-resolution requests where possible.
  6. The vision tower is sensitive to JPEG artefacts. Sub-50% JPEG quality measurably degrades OCR; if you control the upload pipeline keep clients at quality 80+.
  7. Bounding-box outputs are not native. Unlike Qwen 2.5-VL, Llama 3.2 Vision does not produce grounded bbox tokens. If you need layout-aware downstream work, pair it with a separate detector or use Qwen.

Cost per million pages

| Workload | Pages / hour | Pages / day | Cost / 1k pages (UK power) |
|---|---|---|---|
| Single-stream OCR (b=1, full A4) | 900 | 21,600 | £0.40 |
| Batched OCR (b=4, single tile) | 3,600 | 86,400 | £0.10 |
| Batched OCR (b=8, mixed) | 5,400 | 129,600 | £0.07 |

That undercuts AWS Textract by roughly 90% and Azure Document Intelligence by ~85% for steady-state workloads. See monthly hosting cost and vs OpenAI API cost for the broader API-vs-self-hosted picture.
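
For other throughputs the cost column reduces to a one-line formula. The £0.36 hourly figure below is illustrative only: it is the all-in hourly cost implied by the table, not a quoted price.

def cost_per_1k_pages(hourly_cost_gbp, pages_per_hour):
    return hourly_cost_gbp / pages_per_hour * 1000

for pages_per_hour in (900, 3_600, 5_400):
    print(round(cost_per_1k_pages(0.36, pages_per_hour), 2))  # 0.4, 0.1, 0.07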

Verdict and alternatives

Pick Llama 3.2 Vision 11B on a 4090 when you want strong general visual reasoning, are happy with cross-attention’s smaller KV footprint for long text contexts, and your dataset skews toward Western-language documents and natural images. If you need maximum OCR throughput, especially on Asian scripts or table-heavy layouts, Qwen 2.5-VL 7B is denser. For pure chart understanding both are close; for video-conditioned reasoning Qwen wins because Llama 3.2 Vision does not ingest video frames natively. For text-only workloads at higher throughput, drop to Llama 3 8B.

Vision LLMs on a single 4090

OCR, document Q&A and chart understanding at 100+ tok/s with multi-image batching. UK dedicated hosting.

Order the RTX 4090 24GB

See also: Qwen 2.5-VL guide, multimodal use cases, vLLM setup, FP8 deployment, prefill/decode, 4090 vs 5090, tokens per watt, concurrent users.
