GLM-4 9B Chat is Zhipu AI’s compact open-weights model with native function-calling and strong bilingual performance. On our dedicated GPU hosting it is a reasonable Llama 3 8B alternative with slightly different strengths.
## VRAM
| Precision | Weights | Fits On |
|---|---|---|
| FP16 | ~18 GB | 24 GB card (weights alone exceed 16 GB) |
| FP8 | ~9 GB | 16 GB card comfortable |
| AWQ INT4 | ~5.5 GB | 8 GB+ card |
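The table's figures follow from simple arithmetic: parameter count times bytes per parameter. A rough sketch, assuming ~9.4B parameters (approximate; check the model card for the exact count) and counting weights only:

```python
# Rough VRAM estimate for GLM-4 9B weights at various precisions.
# Assumes ~9.4e9 parameters; KV cache and activations come on top.
PARAMS = 9.4e9

def weights_gb(bits_per_param: float) -> float:
    """Memory for weights only, in GB."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"FP16: {weights_gb(16):.1f} GB")   # ~18.8 GB
print(f"FP8:  {weights_gb(8):.1f} GB")    # ~9.4 GB
print(f"INT4: {weights_gb(4.5):.1f} GB")  # ~4.5 bits/param with AWQ overhead
```

KV cache grows with context length and batch size, which is why the "Fits On" column leaves headroom above the raw weight size.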
## Deployment
```shell
python -m vllm.entrypoints.openai.api_server \
  --model THUDM/glm-4-9b-chat \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --trust-remote-code \
  --gpu-memory-utilization 0.92
```
GLM-4 9B Chat supports up to 128k context (a separate 1M-context variant also exists). For most practical use, capping the server at 32k keeps KV-cache memory in check without limiting real workloads.
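Once the server is up, it speaks the OpenAI-compatible API on vLLM's default port 8000. A minimal stdlib smoke test, assuming the launch command above with its defaults:

```python
import json
import urllib.request

def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Standard OpenAI chat-completions request body."""
    return {
        "model": "THUDM/glm-4-9b-chat",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    # POST to vLLM's OpenAI-compatible chat endpoint.
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client works the same way; the stdlib version just avoids extra dependencies for a quick check.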
## Function Calling
GLM-4’s function-calling template differs from OpenAI’s. The model expects tool descriptions as part of the system prompt and emits tool calls in its own format. vLLM’s function-calling support for GLM-4 requires a specific tool-parser:
```shell
  --enable-auto-tool-choice \
  --tool-call-parser glm4
```
Once configured, standard OpenAI SDK tool-calling code works.
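A sketch of that standard flow, assuming the server above is running; `get_weather` is an illustrative tool name, not part of GLM-4 or vLLM:

```python
import json

# Tool schema in standard OpenAI function-calling format.
# get_weather is a hypothetical example tool.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def ask_with_tools(prompt: str):
    # OpenAI SDK pointed at the local vLLM server; api_key is unused
    # by vLLM but required by the client constructor.
    from openai import OpenAI
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
    resp = client.chat.completions.create(
        model="THUDM/glm-4-9b-chat",
        messages=[{"role": "user", "content": prompt}],
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = resp.choices[0].message
    if msg.tool_calls:
        call = msg.tool_calls[0]
        return call.function.name, json.loads(call.function.arguments)
    return None, msg.content
```

With the `glm4` parser enabled, the server translates the model's native tool-call format into the `tool_calls` field the SDK expects, so no GLM-specific client code is needed.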
## Self-Hosted Tool-Use LLM
GLM-4 9B preconfigured for function calling on UK dedicated GPUs.
Browse GPU Servers

Compare against function calling with Llama 3.3 and Qwen Coder tool use.