GLM-4 9B from Zhipu AI offers native function calling and strong bilingual (Chinese/English) performance. On the RTX 5060 Ti 16GB it runs comfortably at FP8, and it is available via our dedicated hosting.
Fit
- FP16: ~18 GB – exceeds the card's 16 GB before any KV cache
- FP8: ~9 GB – comfortable
- AWQ INT4: ~5.5 GB – very comfortable
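The figures above follow directly from parameter count times bits per weight. A quick sketch of the arithmetic (the 4.5 bits/param figure for AWQ is an assumption covering group-wise scales; real checkpoints also keep some layers unquantized, which is where the extra few hundred MB comes from):

```python
# Rough weight-memory math for a 9B-parameter model (sketch only;
# ignores activation memory and unquantized embedding/output layers).
PARAMS = 9e9

def weight_gb(bits_per_param: float) -> float:
    """Weight footprint in GB for a given effective bits-per-parameter."""
    return PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)    # ~18 GB
fp8 = weight_gb(8)      # ~9 GB
int4 = weight_gb(4.5)   # ~5 GB; assumed ~0.5 extra bit/param for AWQ scales

print(f"FP16 {fp16:.1f} GB, FP8 {fp8:.1f} GB, AWQ INT4 {int4:.1f} GB")
```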
Deployment
```bash
python -m vllm.entrypoints.openai.api_server \
  --model THUDM/glm-4-9b-chat \
  --quantization fp8 \
  --trust-remote-code \
  --max-model-len 32768 \
  --enable-auto-tool-choice \
  --tool-call-parser glm4 \
  --gpu-memory-utilization 0.92
```
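A back-of-envelope check on what `--gpu-memory-utilization 0.92` leaves for the KV cache on this card (a sketch; vLLM also reserves some of this budget for activations and CUDA graphs, so the usable cache is slightly smaller):

```python
# VRAM budget implied by --gpu-memory-utilization 0.92 on a 16 GB card.
total_gb = 16.0
budget_gb = total_gb * 0.92           # memory vLLM is allowed to manage
weights_gb = 9.0                      # FP8 weights, per the fit table
kv_cache_gb = budget_gb - weights_gb  # headroom for the paged KV cache

print(f"~{kv_cache_gb:.1f} GB available for KV cache")
```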
Function Calling
GLM-4 emits tool calls in its own format; vLLM's `--tool-call-parser glm4` handles the parsing, so once the server is configured, standard OpenAI-style tool calling works from the client side:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="THUDM/glm-4-9b-chat",
    messages=[...],
    tools=[{"type": "function", "function": {...}}],
)
```
Tool use quality is strong – comparable to Qwen 14B on function calling benchmarks despite smaller size.
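To make the client side concrete, here is a minimal sketch of a tool schema and a dispatch step. The `get_weather` function and its stub implementation are hypothetical, purely for illustration; the point is that the server returns standard OpenAI-style `tool_calls`, so the handling is model-agnostic:

```python
import json

# Hypothetical tool schema in the OpenAI function-calling format.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def dispatch(name: str, arguments: str) -> str:
    """Route a parsed tool call (name + JSON argument string) to a local
    implementation. Stubbed here; a real agent would call an actual API."""
    impls = {"get_weather": lambda city: f"18C and cloudy in {city}"}
    return impls[name](**json.loads(arguments))

# A tool_call from the response carries a name and a JSON argument string:
result = dispatch("get_weather", '{"city": "London"}')
print(result)
```

The tool result would then be appended to `messages` with `role="tool"` and the conversation continued for the model's final answer.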
Long Context
GLM-4 9B supports 128K context in the base chat model, and a separate 1M-token variant (glm-4-9b-chat-1m) is also available. KV cache math at this tier is identical to Mistral Nemo: 1 concurrent 128k sequence, or 4-6 concurrent 32k sequences. See the long context guide.
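The "1 concurrent 128k sequence" figure can be sanity-checked from the architecture. Assuming GLM-4 9B's published config (40 layers, 2 KV heads via multi-query grouping, head dim 128) and an FP16 KV cache:

```python
# Per-token KV cache size for GLM-4 9B, under assumed config values:
# 40 layers, 2 KV heads (multi_query_group_num=2), head_dim 128, FP16 cache.
layers, kv_heads, head_dim, bytes_per = 40, 2, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per  # x2 for K and V
gb_128k = per_token * 131_072 / 1e9

print(f"{per_token} B/token, {gb_128k:.2f} GB for one 128k sequence")
```

At roughly 5.4 GB per 128k sequence against the ~5-6 GB of cache headroom left after FP8 weights, one full-length sequence is indeed the ceiling.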
Decode speed: ~75 t/s at FP8, comfortable for agent workloads.
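What ~75 t/s means for agent turnaround, as a quick sanity check (response lengths are illustrative):

```python
# Wall-clock decode time at ~75 t/s for typical agent response sizes.
decode_tps = 75
for out_tokens in (128, 512, 1024):
    print(f"{out_tokens} tokens ≈ {out_tokens / decode_tps:.1f} s")
```

A typical tool-call turn (a few hundred output tokens) lands in single-digit seconds, which is why this tier is comfortable for agent loops.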
Tool-Use LLM on Blackwell
GLM-4 9B with native function calling. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: full GLM-4 guide, function calling with Llama 3.3.