Prompts tuned on GPT-4o or Claude often underperform when moved to Llama 3.1 8B or Mistral 7B: smaller open-weight models respond differently to the same prompt patterns.
Open-weight 7B-class models tend to do better with more explicit instructions, shorter chain-of-thought, fewer few-shot examples, and simpler structured output. Test and iterate; do not assume GPT-4o prompts transfer.
How open models differ
- Less robust to ambiguous instructions: be explicit
- Less consistent on long chain-of-thought: cap reasoning at roughly five steps
- Better with structured output (JSON mode) than free-form; see the sketch after this list
- Tool use varies: Mistral, Llama 3.1+, and Qwen 2.5 support native function calling; older models don't
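To make the JSON-mode point concrete, here is a minimal sketch that requests structured output from a local model through an OpenAI-compatible endpoint. The base URL, model name, and invoice example are assumptions; servers such as vLLM accept the OpenAI-style `response_format` flag, but confirm support in your server's docs.

```python
# Hedged sketch: JSON-constrained output from a local OpenAI-compatible
# server. The base_url and model name below are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    # Many local servers (e.g. vLLM) honour this OpenAI-style flag;
    # check your server before relying on it.
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": "Extract invoice fields. Reply with JSON only."},
        {"role": "user",
         "content": "Invoice 1042, due 2024-07-01, total $312.50"},
    ],
)
print(resp.choices[0].message.content)  # e.g. {"invoice": 1042, ...}
```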
Patterns that work
- Explicit role + clear task statement
- Output schema given upfront
- Few-shot examples (1-3, not 10)
- Step-by-step instructions for multi-step tasks
- Constrained output via JSON mode or a library such as Outlines or Instructor (sketched below)
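Putting the list together, the sketch below combines an explicit role, a schema declared upfront as a Pydantic model, and a single few-shot example, using the Instructor library over an OpenAI-compatible local endpoint. The endpoint URL, model name, and `Ticket` schema are illustrative assumptions, not a fixed recipe.

```python
# Hedged sketch: explicit role, schema upfront, one few-shot example.
# Endpoint URL, model name, and the Ticket schema are assumptions.
import instructor
from openai import OpenAI
from pydantic import BaseModel

class Ticket(BaseModel):
    category: str  # e.g. "billing", "bug", "feature"
    priority: int  # 1 (low) to 3 (high)

# Mode.JSON avoids relying on native tool calling, which many
# smaller open models lack.
client = instructor.from_openai(
    OpenAI(base_url="http://localhost:8000/v1", api_key="unused"),
    mode=instructor.Mode.JSON,
)

ticket = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    response_model=Ticket,  # Instructor validates the reply against this schema
    messages=[
        {"role": "system",
         "content": "You triage support tickets into category and priority."},
        # Single few-shot example: 7B-class models do better with 1-3 than 10.
        {"role": "user", "content": "App crashes when I upload a photo."},
        {"role": "assistant", "content": '{"category": "bug", "priority": 3}'},
        {"role": "user", "content": "Please add dark mode."},
    ],
)
print(ticket)  # e.g. Ticket(category='feature', priority=1)
```

Outlines takes a stricter route than validate-and-retry: it constrains generation at the decoding level, so invalid tokens are never sampled in the first place.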
Verdict
Don't copy GPT-4o prompts to open-weight models. Re-tune for the smaller model; the eval harness will tell you when a prompt variant is working. A minimal version of that loop is sketched below.
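This is only the shape of the loop: `ask()` is a stub for your inference call, and the prompts and labelled cases are made up for illustration.

```python
# Hedged sketch of the measure-before-shipping loop. ask() is a stub for
# your inference call; the labelled cases are placeholders.
cases = [  # (input text, expected answer)
    ("Invoice 1042, due 2024-07-01", "1042"),
    ("Invoice 7, due 2023-01-15", "7"),
]

def ask(prompt: str, text: str) -> str:
    """Stub: send `prompt` plus `text` to the model, return its answer."""
    raise NotImplementedError("wire this to your inference client")

def accuracy(prompt: str) -> float:
    hits = sum(ask(prompt, text).strip() == expected for text, expected in cases)
    return hits / len(cases)

# Compare prompt variants head to head, e.g.:
# print(accuracy("Return only the invoice number."))
# print(accuracy("Invoice number?"))
```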
Bottom line
Prompt engineering for open-weight models is a real engineering activity. See the eval pipeline.