Qwen Coder 32B (and Qwen 2.5 72B) are among the best open-weight models for structured tool use in 2026: reliable JSON emission, fewer hallucinated tool names, and better handling of large tool catalogs. On our dedicated GPU hosting, the setup follows a similar pattern to Llama 3.3, with a Qwen-specific parser.
vLLM
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
--quantization awq \
--enable-auto-tool-choice \
--tool-call-parser hermes
--tool-call-parser hermes works because Qwen emits Hermes-style tool calls. Recent vLLM releases also ship a dedicated qwen parser as an alternative.
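Assuming the server above is running on localhost:8000, a request carries tool schemas in the standard OpenAI chat format. A minimal sketch (the get_weather tool and its parameters are illustrative, not a real API):

```python
# Build an OpenAI-format chat request with one tool schema attached.
# get_weather is a hypothetical example tool.

def build_request(user_message: str) -> dict:
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # snake_case names tokenise well for Qwen
            "description": "Get current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]
    return {
        "model": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
        "messages": [{"role": "user", "content": user_message}],
        "tools": tools,
        "tool_choice": "auto",
    }

payload = build_request("What's the weather in London?")
```

POST this payload to http://localhost:8000/v1/chat/completions, or pass the same fields through the openai Python client pointed at that base URL.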
Format
Qwen emits tool calls as:
<tool_call>
{"name": "get_weather", "arguments": {"location": "London"}}
</tool_call>
vLLM parses these into OpenAI-format tool calls automatically, so an existing OpenAI client integration needs no changes.
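If you serve without the tool-call parser enabled and receive the raw tags in the completion text, a fallback parser is straightforward. A sketch; production code would want stricter error handling:

```python
import json
import re

# Matches Qwen-style <tool_call>{...}</tool_call> blocks in completion text.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(text: str) -> list[dict]:
    """Extract tool-call dicts ({"name": ..., "arguments": ...}) from raw output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip a malformed block rather than failing the whole turn
    return calls

raw = '<tool_call>\n{"name": "get_weather", "arguments": {"location": "London"}}\n</tool_call>'
calls = parse_tool_calls(raw)
# → [{'name': 'get_weather', 'arguments': {'location': 'London'}}]
```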
Many Tools
Qwen Coder handles 20-40 tools in a single call more reliably than Llama or Mistral. For larger catalogs, a two-stage pattern works well:
- First LLM call with all tool names + 1-line descriptions. Ask which 3-5 are relevant.
- Second LLM call with just those tools’ full schemas. Actual call.
This preserves context budget and improves call accuracy once the catalog exceeds roughly 50 tools.
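The two-stage pattern above can be sketched as follows. Here llm stands for any callable that sends a chat request and returns the model's text (hypothetical, e.g. a thin wrapper over the chat endpoint), and the shortlist prompt wording is an assumption:

```python
def shortlist_tools(llm, catalog: dict[str, dict], task: str, k: int = 5) -> list[dict]:
    """Stage 1: ask the model which tools are relevant, sending only names
    plus one-line descriptions. Stage 2 input: the selected full schemas.

    catalog maps tool name -> full OpenAI-format tool schema.
    """
    # Stage 1: compact listing keeps the prompt small for large catalogs.
    listing = "\n".join(
        f"- {name}: {schema['function']['description']}"
        for name, schema in catalog.items()
    )
    prompt = (
        f"Task: {task}\nAvailable tools:\n{listing}\n"
        f"Reply with the names of up to {k} relevant tools, one per line."
    )
    chosen = [line.strip("- ").strip() for line in llm(prompt).splitlines()]
    # Stage 2: only these tools' full schemas go into the actual tool call.
    return [catalog[name] for name in chosen if name in catalog][:k]
```

The second LLM call then includes just the returned schemas in its tools field and makes the real tool call.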
Tips
- Keep tool names in snake_case – Qwen tokenises them better
- Avoid deeply-nested JSON schemas – flatten to maybe 2 levels
- For coding tool use (file operations, shell), Qwen Coder is strongest
- For pure tool routing (which of 30 tools to call), Qwen 2.5 72B Instruct is slightly better than the Coder variant
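As an illustration of the schema-flattening tip, nested argument objects can be lifted into prefixed top-level properties. A sketch; the underscore naming convention is an assumption:

```python
def flatten_properties(properties: dict, prefix: str = "") -> dict:
    """Lift nested object properties to the top level, joining names with '_'."""
    flat = {}
    for name, spec in properties.items():
        key = f"{prefix}{name}"
        if spec.get("type") == "object" and "properties" in spec:
            # Recurse into nested objects instead of keeping them nested.
            flat.update(flatten_properties(spec["properties"], prefix=f"{key}_"))
        else:
            flat[key] = spec
    return flat

nested = {
    "query": {"type": "string"},
    "options": {
        "type": "object",
        "properties": {"limit": {"type": "integer"}, "sort": {"type": "string"}},
    },
}
flat = flatten_properties(nested)
# → keys: "query", "options_limit", "options_sort"
```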
Production Tool-Use LLM Hosting
Qwen Coder 32B on UK dedicated GPUs with tool parsing enabled.
Browse GPU Servers