
Tool Use with Qwen Coder Self-Hosted

Qwen Coder handles tool use reliably and benefits from a larger tool vocabulary than Llama. Here is how to configure it on a dedicated GPU.

Qwen Coder 32B (and Qwen 2.5 72B) are among the best open-weights models for structured tool use in 2026: reliable JSON emission, fewer hallucinated tool names, and better handling of larger tool catalogs. On our dedicated GPU hosting, setup follows a similar pattern to Llama 3.3, with a Qwen-specific parser.

vLLM

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --quantization awq \
  --enable-auto-tool-choice \
  --tool-call-parser hermes

--tool-call-parser hermes works with Qwen's tool-call format. Recent vLLM releases also provide a Qwen-specific parser as an alternative.
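With the server up, any OpenAI-compatible client works. A minimal stdlib-only sketch (the port, base URL, and tool schema here are illustrative assumptions; the model name is taken from the launch command above):

```python
# Sketch: a tool-use request against vLLM's OpenAI-compatible endpoint.
# BASE_URL and the example tool are assumptions; adjust to your setup.
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"
MODEL = "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ"

WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}

def build_payload(user_message: str) -> dict:
    """Assemble a chat-completions body with the tool catalog attached."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "tools": [WEATHER_TOOL],
        "tool_choice": "auto",
    }

def chat(user_message: str) -> dict:
    """POST the request and return the parsed JSON response."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(user_message)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Requires the server from the command above to be running:
# chat("What's the weather in London?")["choices"][0]["message"]["tool_calls"]
```

The official openai Python package works the same way: point base_url at the server and pass any string as the API key.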

Format

Qwen emits tool calls as:

<tool_call>
{"name": "get_weather", "arguments": {"location": "London"}}
</tool_call>

vLLM parses these into OpenAI-format tool calls automatically, so client code written against an OpenAI integration needs no changes.
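If you ever consume raw completions instead (for example from a backend without tool-call parsing), the <tool_call> blocks are easy to extract yourself. A stdlib-only sketch:

```python
# Sketch: extracting Qwen's raw <tool_call> blocks from generated text.
# Only needed when your serving stack does not parse tool calls for you.
import json
import re

TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Return every well-formed tool call as a {'name', 'arguments'} dict."""
    calls = []
    for block in TOOL_CALL_RE.findall(text):
        try:
            calls.append(json.loads(block))
        except json.JSONDecodeError:
            continue  # skip malformed emissions rather than crash
    return calls

sample = (
    '<tool_call>\n'
    '{"name": "get_weather", "arguments": {"location": "London"}}\n'
    '</tool_call>'
)
print(extract_tool_calls(sample))
# → [{'name': 'get_weather', 'arguments': {'location': 'London'}}]
```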

Many Tools

Qwen Coder handles 20-40 tools in a single request more reliably than Llama or Mistral. For larger catalogs, a two-stage pattern works well:

  1. First LLM call with all tool names + 1-line descriptions. Ask which 3-5 are relevant.
  2. Second LLM call with just those tools’ full schemas. Actual call.

This preserves context budget and improves call accuracy once the tool catalog exceeds roughly 50 tools.
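The two stages above can be sketched as follows. The chat parameter stands in for whatever chat-completion call you use and is an assumption, not a fixed API; the shortlist prompt wording is likewise illustrative:

```python
# Sketch of the two-stage pattern: stage 1 shortlists by name + one-line
# description, stage 2 sends full schemas for the shortlist only.
import json

def summarise_catalog(tools: list[dict]) -> str:
    """One line per tool: its name plus its short description."""
    return "\n".join(
        f"- {t['function']['name']}: {t['function']['description']}"
        for t in tools
    )

def select_tools(tools: list[dict], names: list[str]) -> list[dict]:
    """Keep only the full schemas whose names made the stage-1 shortlist."""
    wanted = set(names)
    return [t for t in tools if t["function"]["name"] in wanted]

def two_stage_call(chat, tools: list[dict], user_message: str):
    # Stage 1: ask for the 3-5 most relevant tool names as a JSON array.
    shortlist_prompt = (
        "Available tools:\n" + summarise_catalog(tools) +
        f"\n\nUser request: {user_message}\n"
        "Reply with a JSON array of the 3-5 most relevant tool names."
    )
    names = json.loads(chat(shortlist_prompt))
    # Stage 2: the real call, with full schemas for the shortlist only.
    return chat(user_message, tools=select_tools(tools, names))
```

Stage 1 needs no tool schemas at all, so its prompt stays small even with hundreds of tools.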

Tips

  • Keep tool names in snake_case – Qwen tokenises them better
  • Avoid deeply-nested JSON schemas – flatten to maybe 2 levels
  • For coding tool use (file operations, shell), Qwen Coder is strongest
  • For pure tool routing (which of 30 tools to call), Qwen 2.5 72B Instruct is slightly better than the Coder variant

Production Tool-Use LLM Hosting

Qwen Coder 32B on UK dedicated GPUs with tool parsing enabled.

Browse GPU Servers

See function calling with Llama 3.3.


admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
