CogVLM2 from THUDM is a 19B-parameter vision-language model that pairs a Llama-3-8B language backbone with a dedicated visual expert. It is particularly strong at visual grounding (pointing to specific regions) and OCR. On our dedicated GPU hosting it needs a 24 GB+ card.
VRAM
| Precision | Weights | Fits On |
|---|---|---|
| FP16 | ~38 GB | 48 GB+ card |
| FP8 | ~19 GB | 24 GB card |
| INT4 | ~11 GB | 16 GB+ card |
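The weight figures above follow from a simple bytes-per-parameter calculation; a quick sketch (ignoring activation memory, KV cache, and framework overhead, which add a few GB at inference time):

```python
def weight_vram_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate VRAM needed for model weights alone, in GB.

    Weights only: activations, KV cache, and framework overhead
    typically add a few extra GB on top of this at inference time.
    """
    return n_params * bits_per_param / 8 / 1e9

params = 19e9  # CogVLM2 has ~19B parameters
print(round(weight_vram_gb(params, 16)))  # FP16 -> 38 (GB)
print(round(weight_vram_gb(params, 8)))   # FP8  -> 19 (GB)
# INT4 computes to ~9.5 GB by this formula; the ~11 GB in the table
# reflects layers kept at higher precision plus quantization overhead.
```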
Deployment
CogVLM2 is not yet fully supported in vLLM’s multimodal path, so production deployments typically load it through Hugging Face Transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_ID = "THUDM/cogvlm2-llama3-chat-19B"

# trust_remote_code is required: CogVLM2 ships custom modeling code
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # ~38 GB of weights; quantize for smaller cards
    trust_remote_code=True,
    device_map="cuda",
).eval()  # inference mode
```
Wrap the model in FastAPI to expose an HTTP endpoint. See the OpenAI-compatible API guide.
Use Cases
CogVLM2 is strong on:
- Dense visual scenes with many objects
- Chinese-language document Q&A
- Bounding-box grounding (“point to the red car”)
- Medical image interpretation (with appropriate caveats)
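For grounding queries, CogVLM2 embeds bounding boxes directly in its text output as `[[x0,y0,x1,y1]]` with coordinates normalized to a 0-999 range. The exact formatting can vary with the prompt and chat template, so treat this parser as a sketch rather than a canonical decoder:

```python
import re

def parse_boxes(text: str, width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Extract [[x0,y0,x1,y1]] boxes (coords normalized to 0-999)
    and scale them to pixel coordinates for the given image size."""
    boxes = []
    for m in re.finditer(r"\[\[(\d+),(\d+),(\d+),(\d+)\]\]", text):
        x0, y0, x1, y1 = (int(v) for v in m.groups())
        boxes.append((
            x0 * width // 1000, y0 * height // 1000,
            x1 * width // 1000, y1 * height // 1000,
        ))
    return boxes

print(parse_boxes("The red car is at [[120,340,560,780]]", 1000, 1000))
```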
For general-purpose VLM tasks, Qwen2-VL 7B is usually easier to deploy. CogVLM2 shines when visual grounding or bilingual OCR matters.
Visual Grounding VLM Hosting
Host CogVLM2 or similar grounded VLMs on UK dedicated GPU servers. Compare Molmo 7B for similar pointing capabilities in a smaller package.