Pixtral 12B from Mistral is a vision-language model built on top of Mistral Nemo 12B, supporting variable image resolutions rather than fixed tiling. On our dedicated GPU hosting, FP16 weights alone occupy roughly 24 GB, so a 24 GB card is a tight fit; a 32 GB card leaves room for reasonable concurrency, and quantised builds fit smaller cards.
VRAM
| Precision | Weights | Fits On |
|---|---|---|
| FP16 | ~24 GB | 24 GB card tight, 32 GB comfortable |
| FP8 | ~12 GB | 16 GB+ card |
| AWQ INT4 | ~7 GB | Any 12 GB+ card |
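The weight figures above follow directly from parameter count times bytes per parameter. A minimal sketch of that arithmetic (weight-only; it ignores activations, KV cache, and runtime overhead, which add several GB on top):

```python
# Rough weight-only VRAM estimate: parameters x bits per parameter.
# Ignores activations, KV cache, and framework overhead.
def weight_vram_gb(params_billion: float, bits_per_param: float) -> float:
    return params_billion * bits_per_param / 8  # 1B params at 8 bits ~= 1 GB

# AWQ INT4 stores ~4.5 effective bits per weight once scales/zeros are counted.
for name, bits in [("FP16", 16), ("FP8", 8), ("AWQ INT4", 4.5)]:
    print(f"{name}: ~{weight_vram_gb(12, bits):.0f} GB")
```

For 12B parameters this reproduces the table: ~24 GB at FP16, ~12 GB at FP8, ~7 GB at INT4.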
Deployment
python -m vllm.entrypoints.openai.api_server \
--model mistralai/Pixtral-12B-2409 \
--max-model-len 32768 \
--limit-mm-per-prompt 'image=4' \
--tokenizer-mode mistral \
--config-format mistral
--limit-mm-per-prompt 'image=4' allows up to four images per request – Pixtral handles multi-image reasoning well.
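Multi-image requests use the standard OpenAI chat format, with each image as a separate content part alongside the text. A sketch of the payload shape (the URLs are placeholders, and the model value must match whatever --model the server was launched with; actually sending it requires the server above to be running):

```python
# Sketch of a multi-image chat payload for the OpenAI-compatible endpoint.
# The --limit-mm-per-prompt flag above caps image_urls at four entries.
def build_request(model: str, question: str, image_urls: list[str]) -> dict:
    content = [{"type": "text", "text": question}]
    content += [{"type": "image_url", "image_url": {"url": u}} for u in image_urls]
    return {
        "model": model,  # must match the server's --model value
        "messages": [{"role": "user", "content": content}],
    }

req = build_request(
    "mistralai/Pixtral-12B-2409",  # placeholder; use your served model id
    "What differs between these photos?",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],  # placeholders
)
print(len(req["messages"][0]["content"]))  # 1 text part + 2 image parts
```

POST this body to /v1/chat/completions with any OpenAI-compatible client.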
Variable Resolution
Unlike many VLMs that downsample images to a fixed tile grid, Pixtral handles the input at its native resolution (up to the model’s context budget). Small images stay small; large images get more visual tokens. Practical impact:
- Better detail recognition on high-resolution photos
- Lower cost on simple images (no forced 336×336 tiling)
- Context usage varies with image size – budget KV cache accordingly
A 1024×1024 image consumes roughly 4× the visual tokens of a 512×512 image, since token count scales with pixel count. For high-volume deployments, normalise input resolution before submission.
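That scaling can be sketched directly. Assuming one visual token per 16×16-pixel patch (Pixtral's patch size; real counts also add a few row-break and end tokens), a rough estimator plus a resolution-normalisation step:

```python
# Rough visual-token estimate: one token per 16x16 pixel patch (assumption).
import math

PATCH = 16

def visual_tokens(width: int, height: int) -> int:
    return math.ceil(width / PATCH) * math.ceil(height / PATCH)

def clamp_resolution(width: int, height: int, max_side: int = 1024) -> tuple[int, int]:
    # Downscale proportionally so the longer side is at most max_side.
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)

print(visual_tokens(512, 512))     # 1024 tokens
print(visual_tokens(1024, 1024))   # 4096 tokens -> ~4x the 512x512 cost
print(clamp_resolution(4000, 3000))  # (1024, 768)
```

Clamping inputs this way keeps per-request KV-cache usage predictable at high volume.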
Vision-Language Model Hosting
Pixtral 12B preconfigured on UK dedicated GPUs with appropriate VRAM sizing.
Browse GPU Servers
Compare against Llama 3.2 Vision 11B and Qwen2-VL.