CogVLM2 from THUDM is a 19B-parameter vision-language model that pairs a Llama-3-8B language backbone with a dedicated visual expert. It is particularly strong at visual grounding (pointing to specific regions) and OCR. On our dedicated GPU hosting it needs a 24 GB+ card.
VRAM
| Precision | Weights | Fits On |
|---|---|---|
| FP16 | ~38 GB | 48 GB+ card |
| FP8 | ~19 GB | 24 GB card |
| INT4 | ~11 GB | 16 GB+ card |
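The weight figures above follow from a simple bytes-per-parameter calculation; a quick sketch (ignoring activation memory, KV cache, and framework overhead, which add a few GB at inference time):

```python
def weight_vram_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate VRAM needed for model weights alone, in GB.

    Weights only: activations, KV cache, and framework overhead
    typically add a few extra GB on top of this at inference time.
    """
    return n_params * bits_per_param / 8 / 1e9

params = 19e9  # CogVLM2 has ~19B parameters
print(round(weight_vram_gb(params, 16)))  # FP16 -> 38 (GB)
print(round(weight_vram_gb(params, 8)))   # FP8  -> 19 (GB)
# INT4 computes to ~9.5 GB by this formula; the ~11 GB in the table
# reflects layers kept at higher precision plus quantization overhead.
```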
Deployment
CogVLM2 is not yet fully supported in vLLM’s multimodal path, so production deployments typically load it through Hugging Face Transformers:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_ID = "THUDM/cogvlm2-llama3-chat-19B"

# trust_remote_code is required: CogVLM2 ships custom modeling code
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # ~38 GB of weights; quantize for smaller cards
    trust_remote_code=True,
    device_map="cuda",
).eval()  # inference mode
```
Wrap the model in FastAPI to expose an HTTP endpoint. See the OpenAI-compatible API guide.
Use Cases
CogVLM2 is strong on:
- Dense visual scenes with many objects
- Chinese-language document Q&A
- Bounding-box grounding (“point to the red car”)
- Medical image interpretation (with appropriate caveats)
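For grounding queries, CogVLM2 embeds bounding boxes directly in its text output as `[[x0,y0,x1,y1]]` with coordinates normalized to a 0-999 range. The exact formatting can vary with the prompt and chat template, so treat this parser as a sketch rather than a canonical decoder:

```python
import re

def parse_boxes(text: str, width: int, height: int) -> list[tuple[int, int, int, int]]:
    """Extract [[x0,y0,x1,y1]] boxes (coords normalized to 0-999)
    and scale them to pixel coordinates for the given image size."""
    boxes = []
    for m in re.finditer(r"\[\[(\d+),(\d+),(\d+),(\d+)\]\]", text):
        x0, y0, x1, y1 = (int(v) for v in m.groups())
        boxes.append((
            x0 * width // 1000, y0 * height // 1000,
            x1 * width // 1000, y1 * height // 1000,
        ))
    return boxes

print(parse_boxes("The red car is at [[120,340,560,780]]", 1000, 1000))
```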
For general-purpose VLM tasks, Qwen2-VL 7B is usually easier to deploy. CogVLM2 shines when visual grounding or bilingual OCR matters.
Visual Grounding VLM Hosting
Host CogVLM2 or similar grounded VLMs on UK dedicated GPU servers. Compare Molmo 7B for similar pointing capabilities in a smaller package.