Qwen2-VL is Alibaba’s vision-language model family, available in three sizes. On our dedicated GPU hosting, each variant has a natural GPU home and a distinct set of use cases.
2B
~4 GB FP16. Runs on any GPU. Quality is limited but fine for straightforward captioning and simple visual Q&A. Useful as a cheap preprocessor before a larger model.
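The preprocessor pattern is easy to sketch. Assuming a 2B instance is already serving on localhost:8000 (the same vLLM launch shown in the 7B section below, with the model name swapped), a cheap yes/no question to the 2B decides whether a page is worth sending to a larger model. The image URL, prompt, and routing logic here are illustrative, and the pipeline assumes jq is installed:

```bash
# Sketch of the preprocessor pattern. Image URL and yes/no routing
# are illustrative; assumes a 2B instance served at localhost:8000.
ANSWER=$(curl -s http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2-VL-2B-Instruct",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/page.png"}},
      {"type": "text", "text": "Does this page contain a table or chart? Answer yes or no."}
    ]}],
    "max_tokens": 4
  }' | jq -r '.choices[0].message.content')

# Route only the interesting pages to the 7B or 72B.
if echo "$ANSWER" | grep -qi '^yes'; then
  echo "escalate to larger model"
fi
```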
7B
~14 GB FP16. Fits a 16 GB card. A good generalist VLM, strong on charts, documents, and multi-image reasoning. Best quality-to-cost ratio in the family. A typical vLLM launch:
```bash
# Serve Qwen2-VL-7B behind vLLM's OpenAI-compatible API.
# --max-model-len caps the context window (reduce to save KV-cache memory);
# --limit-mm-per-prompt allows up to 4 images per request.
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2-VL-7B-Instruct \
  --max-model-len 32768 \
  --limit-mm-per-prompt 'image=4'
```
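Once up, the server speaks the standard OpenAI chat completions API, so any OpenAI-compatible client works. A minimal multi-image request looks like this (the image URLs are placeholders):

```bash
# Two-image request; up to 4 images per prompt are allowed by the
# --limit-mm-per-prompt setting above. URLs are placeholders.
curl http://localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "Qwen/Qwen2-VL-7B-Instruct",
    "messages": [{"role": "user", "content": [
      {"type": "image_url", "image_url": {"url": "https://example.com/q1-chart.png"}},
      {"type": "image_url", "image_url": {"url": "https://example.com/q2-chart.png"}},
      {"type": "text", "text": "Compare the trends across these two charts."}
    ]}],
    "max_tokens": 256
  }'
```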
72B
~144 GB FP16, ~72 GB FP8, ~42 GB INT4. Flagship vision performance. At FP8 it fits a 6000 Pro 96GB with headroom for KV cache. Use it only when the 7B’s quality is demonstrably insufficient.
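For reference, a sketch of the FP8 launch on a single 96 GB card. `--quantization fp8` asks vLLM to quantize the weights on the fly (serving a pre-quantized FP8 checkpoint avoids that conversion step); exact flags depend on your vLLM version. The commented-out variant covers the dual-5090 INT4 route using Qwen’s published GPTQ-Int4 checkpoint:

```bash
# Sketch: 72B at FP8 on one 96 GB GPU. --quantization fp8 quantizes
# weights on the fly; assumes enough host RAM to load FP16 during conversion.
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2-VL-72B-Instruct \
  --quantization fp8 \
  --max-model-len 32768 \
  --limit-mm-per-prompt 'image=4'

# Dual-GPU INT4 alternative, splitting the model across two 5090s:
# python -m vllm.entrypoints.openai.api_server \
#   --model Qwen/Qwen2-VL-72B-Instruct-GPTQ-Int4 \
#   --tensor-parallel-size 2
```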
| Variant | Recommended GPU |
|---|---|
| 2B | Any (e.g., 3050 or 4060) |
| 7B | 4060 Ti 16GB or 5080 |
| 72B | 6000 Pro (FP8) or dual 5090 (INT4) |
Which to Pick
Start with the 7B. It covers the vast majority of VLM needs at reasonable hosting cost. Move up to the 72B only after you have measured the 7B as insufficient on your specific task, and drop down to the 2B only for edge deployments where cost per query is the primary constraint.
See Llama 3.2 Vision and Pixtral 12B for alternatives.