Molmo 7B from the Allen Institute for AI is a vision-language model trained with AI2's fully open PixMo data pipeline rather than distilled from a proprietary VLM, with a focus on spatial reasoning: pointing at things in images, counting objects, and describing exact locations. On our dedicated GPU hosting it fits a 16 GB card at FP16.
VRAM
| Precision | Weights | Fits On |
|---|---|---|
| FP16 | ~14 GB | 16 GB card tight, 24 GB+ comfortable |
| FP8 | ~7 GB | 8 GB+ card |
| INT4 (if supported) | ~4 GB | Any 8 GB+ card |
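The table figures follow from simple arithmetic: weight memory is roughly parameter count times bytes per parameter. A minimal sketch (decimal GB; actual VRAM usage adds KV cache, activations, and vision-encoder overhead on top):

```python
# Back-of-the-envelope weight memory for a ~7B-parameter model.
# Real deployments need headroom beyond this for KV cache and activations.

def weight_gb(params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in decimal GB."""
    return params_billion * bytes_per_param

for name, bpp in [("FP16", 2.0), ("FP8", 1.0), ("INT4", 0.5)]:
    print(f"{name}: ~{weight_gb(7, bpp):.1f} GB")  # FP16 -> ~14.0 GB
```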
Deployment
```bash
python -m vllm.entrypoints.openai.api_server \
  --model allenai/Molmo-7B-D-0924 \
  --dtype bfloat16 \
  --trust-remote-code \
  --max-num-seqs 4 \
  --limit-mm-per-prompt 'image=1'
```
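Once the server is up, clients talk to it through the standard OpenAI-compatible chat completions API, passing the image as a base64 data URL. A sketch of building such a request (the base URL, image bytes, and question are placeholder assumptions):

```python
import base64
import json

def build_payload(image_bytes: bytes, question: str) -> dict:
    """Build an OpenAI-style chat completions payload with one inline image.

    POST this as JSON to the vLLM server, e.g.
    http://localhost:8000/v1/chat/completions (host/port are assumptions).
    """
    b64 = base64.b64encode(image_bytes).decode()
    return {
        "model": "allenai/Molmo-7B-D-0924",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                {"type": "text", "text": question},
            ],
        }],
        "max_tokens": 128,
    }

# Placeholder bytes stand in for a real JPEG here.
payload = build_payload(b"\xff\xd8placeholder", "Point to the cat.")
print(json.dumps(payload)[:80])
```

Because the server exposes the OpenAI schema, any OpenAI SDK pointed at the server's base URL works the same way.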
Molmo's custom architecture requires --trust-remote-code, which executes model code pulled from the Hugging Face repository. Review the model card and that code before production deployment.
Strengths
Molmo excels at:
- Pointing: “where is the cat?” returns coordinates
- Counting: accurate object counting in crowded scenes
- Precise spatial descriptions
- UI element identification
It is weaker than Llama 3.2 Vision and Qwen VL on general Q&A and long reasoning. Use Molmo for specific spatial tasks, not as a generalist VLM.
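For pointing prompts, Molmo typically answers with inline XML-like point tags carrying coordinates on a 0-100 scale relative to the image (our reading of the model card; verify against real output). A minimal parser for that format, converting to pixel positions:

```python
import re

# Parses Molmo-style point tags, e.g.
#   <point x="61.5" y="40.2" alt="cat">cat</point>
# The 0-100 percentage scale is an assumption to confirm against
# actual model output.
POINT_RE = re.compile(r'<point\s+x="([\d.]+)"\s+y="([\d.]+)"')

def parse_points(text: str, width: int, height: int) -> list[tuple[int, int]]:
    """Convert percentage coordinates to (x, y) pixel positions."""
    return [(round(float(x) / 100 * width), round(float(y) / 100 * height))
            for x, y in POINT_RE.findall(text)]

pts = parse_points('<point x="61.5" y="40.2" alt="cat">cat</point>', 1000, 800)
print(pts)  # [(615, 322)]
```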
Spatial Reasoning VLM Hosting
Molmo or Llama 3.2 Vision on UK dedicated GPUs tuned for your workload.
Browse GPU Servers

For generalist VLMs see Llama 3.2 Vision 11B, Pixtral 12B, and Qwen VL 2.