Chat Template Errors in vLLM
You send a chat completions request to your vLLM server and get:
ValueError: As of transformers v4.44, default chat template is no longer allowed.
Please use the chat template defined in the model's tokenizer config.
jinja2.exceptions.TemplateSyntaxError: unexpected '}' at line 3
KeyError: 'system' role not supported in chat template
vLLM’s /v1/chat/completions endpoint converts the OpenAI-format messages array into a single string using a Jinja2 chat template. If the template is missing, malformed, or incompatible with your message format, the request fails before the model even sees it.
How Chat Templates Work
When you send a chat request with messages like [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}], vLLM uses a Jinja2 template to convert these into the token format the model expects. Different models expect different formats:
- Llama 3 uses <|begin_of_text|><|start_header_id|>system<|end_header_id|>
- ChatML-based models use <|im_start|>system\n
- Mistral uses [INST] tokens with no explicit system role.
The template is typically embedded in the model’s tokenizer_config.json under the chat_template key.
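Since it is just a JSON field, you can inspect it directly from a locally downloaded checkpoint. A minimal sketch (the config path is whatever your download directory contains):

```python
import json

def get_chat_template(config_path):
    """Return the chat_template string from a tokenizer_config.json,
    or None if the model ships without one."""
    with open(config_path) as f:
        return json.load(f).get("chat_template")
```

If this returns None, the model has no embedded template and you will need to supply one explicitly, as described in Fix 1.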
Fix 1: Model Has No Chat Template
Some base models (not instruct variants) lack a chat template entirely. Provide one explicitly:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3.1-8B \
--chat-template ./llama3_template.jinja
Create a Llama 3-style template file (note that every header token needs both pipes, e.g. <|start_header_id|>, not <|start_header_id>, and <|begin_of_text|> must be emitted once at the start, not only when a system message is present):
{{- '<|begin_of_text|>' -}}
{%- for message in messages %}
{%- if message['role'] == 'system' -%}
<|start_header_id|>system<|end_header_id|>

{{ message['content'] }}<|eot_id|>
{%- elif message['role'] == 'user' -%}
<|start_header_id|>user<|end_header_id|>

{{ message['content'] }}<|eot_id|>
{%- elif message['role'] == 'assistant' -%}
<|start_header_id|>assistant<|end_header_id|>

{{ message['content'] }}<|eot_id|>
{%- endif %}
{%- endfor %}
<|start_header_id|>assistant<|end_header_id|>
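Before pointing vLLM at the file, you can sanity-check that it renders by loading it with Jinja2 directly. A minimal sketch (vLLM passes extra context such as bos_token and add_generation_prompt, omitted here):

```python
from jinja2 import Template

def render_chat(template_path, messages):
    """Render a chat template file roughly the way vLLM would,
    minus tokenizer-specific variables."""
    with open(template_path) as f:
        tmpl = Template(f.read())
    return tmpl.render(messages=messages)
```

Call it with a small messages list, e.g. render_chat("llama3_template.jinja", [{"role": "user", "content": "Hello"}]), and eyeball the output against the model's documented prompt format.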
Fix 2: Jinja2 Syntax Errors
If the template embedded in the tokenizer config has syntax errors, you can override it. First, extract the existing template to see what is wrong:
python -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('your-model-name')
print(tok.chat_template)
"
Copy the output, fix the syntax, save it to a file, and pass it with --chat-template. Common issues: unescaped braces, missing endif tags, and incorrect Jinja2 filter names.
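Rather than round-tripping through the server to rediscover the error, you can parse the template with Jinja2 yourself and get the offending line number directly. A sketch:

```python
from jinja2 import Environment
from jinja2.exceptions import TemplateSyntaxError

def check_template_syntax(source):
    """Return None if the template parses cleanly,
    else a 'line N: message' string describing the syntax error."""
    try:
        Environment().parse(source)
        return None
    except TemplateSyntaxError as e:
        return f"line {e.lineno}: {e.message}"
```

Run it on the extracted template string before saving your fixed version to a file.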
Fix 3: System Role Not Supported
Some models like older Mistral variants do not support a system role. Two options:
- Prepend the system message to the first user message in your application code before sending to vLLM.
- Write a custom template that folds the system message into the first user turn. Note that a plain {% set %} inside a Jinja2 for loop does not persist across iterations, so the template must use a namespace to carry the system message from the system turn to the user turn:
{%- set ns = namespace(system_message='') -%}
{%- for message in messages -%}
{%- if message['role'] == 'system' -%}
{#- Store for prepending to the first user message -#}
{%- set ns.system_message = message['content'] -%}
{%- elif message['role'] == 'user' -%}
[INST] {% if ns.system_message %}{{ ns.system_message }}{{ '\n\n' }}{% set ns.system_message = '' %}{% endif %}{{ message['content'] }} [/INST]
{%- elif message['role'] == 'assistant' -%}
{{ message['content'] }}
{%- endif -%}
{%- endfor -%}
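The first option, folding at the application layer, can be sketched as follows (the function name is my own):

```python
def fold_system_message(messages):
    """Merge a leading system message into the first user turn,
    for models whose chat template rejects the 'system' role."""
    if not messages or messages[0]["role"] != "system":
        return messages
    system = messages[0]["content"]
    folded = [dict(m) for m in messages[1:]]  # shallow copies; leave input intact
    for m in folded:
        if m["role"] == "user":
            m["content"] = f"{system}\n\n{m['content']}"
            break
    return folded
```

Apply it to the messages list just before sending the request, so the rest of your application keeps working with the standard OpenAI message format.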
Fix 4: Incorrect Special Token Handling
If the model generates garbage or cuts off immediately, special tokens may be misconfigured. Check that the tokenizer’s special tokens match what the template produces:
python -c "
from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained('your-model-name')
print(f'BOS: {tok.bos_token} ({tok.bos_token_id})')
print(f'EOS: {tok.eos_token} ({tok.eos_token_id})')
print(f'PAD: {tok.pad_token} ({tok.pad_token_id})')
# Test template application
messages = [{'role': 'user', 'content': 'Hello'}]
text = tok.apply_chat_template(messages, tokenize=False)
print(f'Template output: {repr(text)}')
"
Testing the Chat Endpoint
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"}
],
"max_tokens": 50
}'
The response should contain a coherent answer. If the model produces nonsensical output, the template is formatting the prompt incorrectly even though it does not throw an error. Compare the tokenized output against the model's documented training format.
For production GPU server deployments, document which template you use and pin your vLLM and tokenizer versions. Follow our vLLM production setup for the complete deployment checklist, and see our API security guide for protecting chat endpoints.
Serve LLMs with Confidence
GigaGPU dedicated GPU servers provide the reliable infrastructure your vLLM deployment needs for production chat applications.
Browse GPU Servers