
Tool Use with Qwen Coder Self-Hosted

Qwen Coder handles tool use reliably and benefits from a larger tool vocabulary than Llama. Here is how to configure it on a dedicated GPU.

Qwen Coder 32B (and Qwen 2.5 72B) are among the best open-weights models for structured tool use in 2026: reliable JSON emission, fewer hallucinated tool names, and better handling of larger tool catalogs. On our dedicated GPU hosting, setup follows a similar pattern to Llama 3.3, with a Qwen-specific parser.

vLLM

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --quantization awq \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

--tool-call-parser hermes works with Qwen's tool-call format. Recent vLLM releases also provide a Qwen-specific parser as an alternative.
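With the server up, any OpenAI-compatible client works. A minimal stdlib-only sketch (the port, base URL, and tool schema here are illustrative assumptions; the model name is taken from the launch command above):

```python
# Sketch: a tool-use request against vLLM's OpenAI-compatible endpoint.
# BASE_URL and the example tool are assumptions; adjust to your setup.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ"

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

def build_payload(user_message: str) -> dict:
    """Assemble a chat-completions body with the tool catalog attached."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }

def chat(user_message: str) -> dict:
    """POST the request and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires the server from the command above to be running:
# chat("What's the weather in London?")["choices"][0]["message"]["tool_calls"]
```

The official openai Python package works the same way: point base_url at the server and pass any string as the API key.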

Format

Qwen emits tool calls as:

<tool_call>
{"name": "get_weather", "arguments": {"location": "London"}}
</tool_call>

vLLM parses these into OpenAI-format tool calls automatically, so client code written against an OpenAI integration needs no changes.
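If you ever consume raw completions instead (for example from a backend without tool-call parsing), the <tool_call> blocks are easy to extract yourself. A stdlib-only sketch:

```python
# Sketch: extracting Qwen's raw <tool_call> blocks from generated text.
# Only needed when your serving stack does not parse tool calls for you.
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return every well-formed tool call as a {'name', 'arguments'} dict."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # skip malformed emissions rather than crash
    return calls

sample = (
    '<tool_call>\n'
    '{"name": "get_weather", "arguments": {"location": "London"}}\n'
    '</tool_call>'
)
print(extract_tool_calls(sample))
# → [{'name': 'get_weather', 'arguments': {'location': 'London'}}]
```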

Many Tools

Qwen Coder handles 20-40 tools in a single request more reliably than Llama or Mistral. For larger catalogs, a two-stage pattern works well:

  1. First LLM call with all tool names + 1-line descriptions. Ask which 3-5 are relevant.
  2. Second LLM call with just those tools’ full schemas. Actual call.

This preserves context budget and improves call accuracy once the tool catalog exceeds roughly 50 tools.
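The two stages above can be sketched as follows. The chat parameter stands in for whatever chat-completion call you use and is an assumption, not a fixed API; the shortlist prompt wording is likewise illustrative:

```python
# Sketch of the two-stage pattern: stage 1 shortlists by name + one-line
# description, stage 2 sends full schemas for the shortlist only.
import json

def summarise_catalog(tools: list[dict]) -> str:
    """One line per tool: its name plus its short description."""
    return "\n".join(
        f"- {t['function']['name']}: {t['function']['description']}"
        for t in tools
    )

def select_tools(tools: list[dict], names: list[str]) -> list[dict]:
    """Keep only the full schemas whose names made the stage-1 shortlist."""
    wanted = set(names)
    return [t for t in tools if t["function"]["name"] in wanted]

def two_stage_call(chat, tools: list[dict], user_message: str):
    # Stage 1: ask for the 3-5 most relevant tool names as a JSON array.
    shortlist_prompt = (
        "Available tools:\n" + summarise_catalog(tools) +
        f"\n\nUser request: {user_message}\n"
        "Reply with a JSON array of the 3-5 most relevant tool names."
    )
    names = json.loads(chat(shortlist_prompt))
    # Stage 2: the real call, with full schemas for the shortlist only.
    return chat(user_message, tools=select_tools(tools, names))
```

Stage 1 needs no tool schemas at all, so its prompt stays small even with hundreds of tools.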

Tips

  • Keep tool names in snake_case – Qwen tokenises them better
  • Avoid deeply-nested JSON schemas – flatten to maybe 2 levels
  • For coding tool use (file operations, shell), Qwen Coder is strongest
  • For pure tool routing (which of 30 tools to call), Qwen 2.5 72B Instruct is slightly better than the Coder variant

Production Tool-Use LLM Hosting

Qwen Coder 32B on UK dedicated GPUs with tool parsing enabled.

Browse GPU Servers

See function calling with Llama 3.3.


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
