RTX 3050 - Order Now
Home / Blog / Model Guides / CogVLM2 Self-Hosted Deployment
Model Guides

CogVLM2 Self-Hosted Deployment

THUDM's CogVLM2 is a 19B-parameter vision-language model with strong visual grounding and OCR - a less-common but capable choice.

CogVLM2 from THUDM is a 19B-parameter vision-language model combining a 7B LLM with a dedicated visual expert. It is particularly strong at visual grounding (pointing to specific regions) and OCR. On our dedicated GPU hosting it needs a 24 GB+ card.

Contents

VRAM

PrecisionWeightsFits On
FP16~38 GB48 GB+ card
FP8~19 GB24 GB card
INT4~11 GB16 GB+ card

Deployment

CogVLM2 is not yet fully supported in vLLM’s multimodal path. Production deployment via Transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("THUDM/cogvlm2-llama3-chat-19B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
  "THUDM/cogvlm2-llama3-chat-19B",
  torch_dtype=torch.bfloat16,
  trust_remote_code=True,
  device_map="cuda"
)

Wrap in FastAPI for an HTTP endpoint. See OpenAI-compatible API guide.

Use Cases

CogVLM2 is strong on:

  • Dense visual scenes with many objects
  • Chinese-language document Q&A
  • Bounding-box grounding (“point to the red car”)
  • Medical image interpretation (with appropriate caveats)

For general-purpose VLM tasks Qwen VL 2 7B is usually easier to deploy. CogVLM2 shines when visual grounding or bilingual OCR matters.

Visual Grounding VLM Hosting

CogVLM2 or similar grounded VLMs on UK dedicated GPU servers.

Browse GPU Servers

Compare Molmo 7B for similar pointing capabilities in a smaller package.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.

Ready to deploy your AI workload?

Dedicated GPU servers from our UK datacenter. NVMe storage, 1Gbps networking, full root access.

Browse GPU Servers Contact Sales

Have a question? Need help?