Llama 3.3 was trained with function-calling support, but its native format differs from OpenAI's. On our dedicated GPU hosting, getting reliable tool use requires both the right vLLM flags and correctly formatted messages.
vLLM Flags
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.3-70B-Instruct \
--enable-auto-tool-choice \
  --tool-call-parser llama3_json \
--chat-template /path/to/llama3_chat_template.jinja
The --tool-call-parser llama3_json flag tells vLLM how to extract tool calls from Llama's native JSON output. Without it, tool calls arrive as free text in the message content rather than in the tool_calls field.
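If the parser flag is missing (or parsing fails), the raw content typically contains Llama's native {"name": ..., "parameters": ...} JSON object. A minimal fallback sketch, assuming that format; the function name here is ours, not part of vLLM:

```python
import json

def parse_native_tool_call(text: str):
    """Fallback parser for Llama 3.3's native tool-call JSON.

    Assumes the message content is a bare {"name": ..., "parameters": ...}
    object (what you see without --tool-call-parser).
    Returns (name, args) or None if the text isn't a tool call.
    """
    try:
        call = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    if isinstance(call, dict) and "name" in call:
        return call["name"], call.get("parameters", {})
    return None

raw = '{"name": "get_weather", "parameters": {"location": "London"}}'
name, args = parse_native_tool_call(raw)
```

Treat this as a safety net only; with the flag set, vLLM's own parser should handle extraction.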
Request
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # must match the name vLLM serves
    messages=[{"role": "user", "content": "What's the weather in London?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    }],
    tool_choice="auto"
)
Parsing
Llama 3.3 emits tool calls as JSON, and with the parser flag set, vLLM converts them into the standard OpenAI structure:
import json

if response.choices[0].message.tool_calls:
    for tc in response.choices[0].message.tool_calls:
        name = tc.function.name
        args = json.loads(tc.function.arguments)
        result = execute(name, **args)  # execute() is your own dispatch function
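After running the tool, the result goes back to the model as a role "tool" message referencing the call ID, followed by a second completions call. A sketch of building that follow-up messages list (the helper name and dict shapes for tool_calls/results are our own; only the message structure follows the OpenAI format):

```python
import json

def build_tool_followup(messages, tool_calls, results):
    """Append the assistant's tool calls and their results to the
    conversation, ready to send back via chat.completions.create.

    tool_calls: list of dicts like {"id": ..., "name": ..., "arguments": ...}
    results:    dict mapping call id -> tool output (JSON-serialisable)
    """
    followup = list(messages)
    # Echo the assistant turn that requested the tool calls.
    followup.append({
        "role": "assistant",
        "tool_calls": [{
            "id": tc["id"],
            "type": "function",
            "function": {"name": tc["name"], "arguments": tc["arguments"]},
        } for tc in tool_calls],
    })
    # One "tool" message per call, matched by tool_call_id.
    for tc in tool_calls:
        followup.append({
            "role": "tool",
            "tool_call_id": tc["id"],
            "content": json.dumps(results[tc["id"]]),
        })
    return followup

msgs = build_tool_followup(
    [{"role": "user", "content": "What's the weather in London?"}],
    [{"id": "call_1", "name": "get_weather",
      "arguments": '{"location": "London"}'}],
    {"call_1": {"temp_c": 14, "conditions": "cloudy"}},
)
```

Pass the returned list as messages in a second create() call to get the model's natural-language answer.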
Reliability
- Describe tools precisely – Llama follows descriptions closely
- Keep tool schemas small – 5-10 tools per call is the sweet spot
- Include an explicit example in the system prompt for rare tools
- At 70B you get near-OpenAI-level reliability. At 8B, function calling is noticeably weaker – keep tools simple
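For the system-prompt tip above, a sketch of what an explicit example might look like. The tool name, wording, and call format shown to the model are illustrative, not a required template:

```python
def system_prompt_with_example() -> str:
    # Illustrative only: showing the model one concrete call for a
    # rarely-used tool tends to anchor its output format.
    return (
        "You can call tools. For the rare tool `archive_lookup`, "
        "call it like this example:\n"
        '{"name": "archive_lookup", "parameters": '
        '{"query": "2019 pricing", "limit": 5}}\n'
        "Only call it when the user asks about historical records."
    )

prompt = system_prompt_with_example()
```

Put this string in a {"role": "system", ...} message at the start of the conversation.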
Self-Hosted Function Calling
Llama 3.3 with vLLM tool parsing preconfigured on UK dedicated GPUs.
Browse GPU Servers