GLM-4 9B Chat is Zhipu AI’s compact open-weights model with native function-calling and strong bilingual performance. On our dedicated GPU hosting it is a reasonable Llama 3 8B alternative with slightly different strengths.
## VRAM
| Precision | Weights | Fits On |
|---|---|---|
| FP16 | ~18 GB | 24 GB card (weights alone exceed 16 GB) |
| FP8 | ~9 GB | 16 GB card comfortable |
| AWQ INT4 | ~5.5 GB | 8 GB+ card |
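The table's figures follow from simple arithmetic: parameter count times bytes per parameter. A rough sketch, assuming ~9.4B parameters (approximate; check the model card for the exact count) and counting weights only:

```python
# Rough VRAM estimate for GLM-4 9B weights at various precisions.
# Assumes ~9.4e9 parameters; KV cache and activations come on top.
PARAMS = 9.4e9

def weights_gb(bits_per_param: float) -> float:
    """Memory for weights only, in GB."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"FP16: {weights_gb(16):.1f} GB")   # ~18.8 GB
print(f"FP8:  {weights_gb(8):.1f} GB")    # ~9.4 GB
print(f"INT4: {weights_gb(4.5):.1f} GB")  # ~4.5 bits/param with AWQ overhead
```

KV cache grows with context length and batch size, which is why the "Fits On" column leaves headroom above the raw weight size.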
## Deployment
```shell
python -m vllm.entrypoints.openai.api_server \
  --model THUDM/glm-4-9b-chat \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --trust-remote-code \
  --gpu-memory-utilization 0.92
```
GLM-4 9B Chat supports up to 128k context (a separate 1M-context variant also exists). For most practical use, capping the server at 32k keeps KV-cache memory in check without limiting real workloads.
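Once the server is up, it speaks the OpenAI-compatible API on vLLM's default port 8000. A minimal stdlib smoke test, assuming the launch command above with its defaults:

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Standard OpenAI chat-completions request body."""
    return {
        "model": "THUDM/glm-4-9b-chat",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    # POST to vLLM's OpenAI-compatible chat endpoint.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client works the same way; the stdlib version just avoids extra dependencies for a quick check.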
## Function Calling
GLM-4’s function-calling template differs from OpenAI’s. The model expects tool descriptions as part of the system prompt and emits tool calls in its own format. vLLM’s function-calling support for GLM-4 requires a specific tool-parser:
```shell
  --enable-auto-tool-choice \
  --tool-call-parser glm4
```
Once configured, standard OpenAI SDK tool-calling code works.
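A sketch of that standard flow, assuming the server above is running; `get_weather` is an illustrative tool name, not part of GLM-4 or vLLM:

```python
import json

# Tool schema in standard OpenAI function-calling format.
# get_weather is a hypothetical example tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def ask_with_tools(prompt: str):
    # OpenAI SDK pointed at the local vLLM server; api_key is unused
    # by vLLM but required by the client constructor.
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    resp = client.chat.completions.create(
        model="THUDM/glm-4-9b-chat",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = resp.choices[0].message
    if msg.tool_calls:
        call = msg.tool_calls[0]
        return call.function.name, json.loads(call.function.arguments)
    return None, msg.content
```

With the `glm4` parser enabled, the server translates the model's native tool-call format into the `tool_calls` field the SDK expects, so no GLM-specific client code is needed.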
## Self-Hosted Tool-Use LLM
GLM-4 9B preconfigured for function calling on UK dedicated GPUs.
Browse GPU Servers

Compare against function calling with Llama 3.3 and Qwen Coder tool use.