
GLM-4 9B Chat Self-Hosted

Zhipu AI's GLM-4 9B is a compact model with strong tool-use and function-calling support - a practical alternative to Llama 3 8B.

GLM-4 9B Chat is Zhipu AI’s compact open-weights model with native function calling and strong bilingual (Chinese/English) performance. On our dedicated GPU hosting it is a solid Llama 3 8B alternative, with tool use and bilingual work as its standout strengths.


VRAM

| Precision | Weights | Fits On |
|-----------|---------|---------|
| FP16 | ~18 GB | 24 GB card (16 GB is tight) |
| FP8 | ~9 GB | 16 GB card, comfortably |
| AWQ INT4 | ~5.5 GB | 8 GB+ card |
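The weight figures above follow directly from parameter count times bytes per parameter. A minimal sketch, assuming roughly 9.4B parameters for GLM-4 9B (the INT4 result comes out a little under the table's ~5.5 GB because AWQ also stores quantization scales and zero points on top of the packed weights, and real usage adds KV cache and activation overhead):

```python
# Rough weight-memory estimate: parameter count * bytes per parameter.
# 9.4e9 parameters is an assumption for GLM-4 9B, not an official figure.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(params: float, precision: str) -> float:
    """Approximate VRAM taken by the weights alone, in GB."""
    return params * BYTES_PER_PARAM[precision] / 1e9

for p in ("fp16", "fp8", "int4"):
    print(f"{p}: ~{weight_gb(9.4e9, p):.1f} GB")
# fp16: ~18.8 GB, fp8: ~9.4 GB, int4: ~4.7 GB
```

The same arithmetic generalizes to any dense model: multiply the parameter count by the storage width of the chosen precision.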

Deployment

python -m vllm.entrypoints.openai.api_server \
  --model THUDM/glm-4-9b-chat \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --trust-remote-code \
  --gpu-memory-utilization 0.92

GLM-4 supports 128k context in its extended variant. For the base 9B chat model, 32k covers most practical use.
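Context length matters for VRAM beyond the weights, because KV cache grows linearly with tokens. A back-of-envelope sketch, where the layer count, KV head count, and head dimension are assumptions about GLM-4 9B's architecture (it uses grouped-query attention with few KV heads, which keeps the cache small; verify against the checkpoint's config.json):

```python
def kv_cache_gb(tokens: int, layers: int = 40, kv_heads: int = 2,
                head_dim: int = 128, bytes_per_val: int = 2) -> float:
    """KV cache size in GB at FP16: 2 tensors (K and V) per layer,
    kv_heads * head_dim values each, bytes_per_val bytes per value."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_val
    return tokens * per_token_bytes / 1e9

print(f"32k context: ~{kv_cache_gb(32768):.2f} GB")
# 32k context: ~1.34 GB
```

Under these assumptions a full 32k context adds only about 1.3 GB on top of the weights, which is why `--max-model-len 32768` fits comfortably alongside FP16 weights on a 24 GB card.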

Function Calling

GLM-4’s function-calling template differs from OpenAI’s. The model expects tool descriptions as part of the system prompt and emits tool calls in its own format. vLLM’s function-calling support for GLM-4 requires a specific tool-parser:

--enable-auto-tool-choice \
--tool-call-parser glm4

Once configured, standard OpenAI SDK tool-calling code works.
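As a client-side sketch of what that looks like (the `localhost:8000` endpoint and the `get_weather` tool are hypothetical examples; this uses stdlib `urllib` so it has no dependencies, but the official `openai` SDK pointed at the same `base_url` works identically):

```python
import json
import urllib.request

# Standard OpenAI-style tool schema. With --tool-call-parser set, vLLM
# renders this into GLM-4's native tool format in the system prompt and
# parses the model's tool-call output back into OpenAI-shaped responses.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical example tool
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> dict:
    """POST a tool-enabled chat completion to the vLLM server."""
    body = json.dumps({
        "model": "THUDM/glm-4-9b-chat",
        "messages": [{"role": "user", "content": prompt}],
        "tools": tools,
        "tool_choice": "auto",
    }).encode()
    req = urllib.request.Request(
        f"{base_url}/chat/completions", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

When the model decides to call the tool, the response's `choices[0].message.tool_calls` carries the function name and JSON arguments, exactly as with OpenAI's hosted API.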

Self-Hosted Tool-Use LLM

GLM-4 9B preconfigured for function calling on UK dedicated GPUs.

Browse GPU Servers

Compare against function calling with Llama 3.3 and Qwen Coder tool use.
