For production LLM workloads needing structured output, the choice is between asking nicely (prompt engineering) and forcing the issue (constrained decoding via vLLM's guided_json or OpenAI's response_format). Constrained decoding wins on reliability; the trade-off is small.
Prompt engineering asks the LLM to output JSON and yields roughly 95-98% valid output; constrained decoding forces valid output via token-level masking and is 100% valid by construction, at roughly a 5% throughput cost. For production, use constrained decoding whenever the output format matters. Prompt engineering remains useful for the quality of the content within the schema.
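A minimal sketch of what the prompt-engineering path looks like in practice, assuming an OpenAI-compatible client and a hypothetical retry budget; the ~2-5% of responses that fail to parse are the retries counted in the table below.

```python
import json

def get_json_with_retries(client, prompt: str, max_retries: int = 3) -> dict:
    """Ask the model for JSON in the prompt, re-asking on parse failures (~2-5% of calls)."""
    for _ in range(max_retries):
        resp = client.chat.completions.create(
            model="my-model",  # hypothetical model name
            messages=[{"role": "user", "content": prompt}],
        )
        text = resp.choices[0].message.content
        try:
            return json.loads(text)  # succeeds ~95-98% of the time with a good prompt
        except json.JSONDecodeError:
            continue  # the failure mode constrained decoding eliminates
    raise ValueError(f"No valid JSON after {max_retries} attempts")
```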
Comparison
| Aspect | Prompt engineering | Constrained decoding |
|---|---|---|
| Output validity | ~95-98% | 100% by construction |
| Throughput cost | None | ~5% |
| Implementation | Prompt template | Schema in API call |
| Schema flexibility | Free-form | JSON schema / regex / grammar |
| Retry rate on parse failures | ~2-5% | 0% |
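Concretely, "schema in API call" looks something like the following: a sketch, assuming an OpenAI-compatible endpoint that supports the json_schema response format, with a hypothetical schema and model name.

```python
from openai import OpenAI

client = OpenAI()  # or OpenAI(base_url="http://localhost:8000/v1") for a vLLM server

schema = {
    "type": "object",
    "properties": {
        "sentiment": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["sentiment", "confidence"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="my-model",  # hypothetical
    messages=[{"role": "user", "content": "Classify: 'Great product, fast shipping.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "sentiment", "schema": schema, "strict": True},
    },
)
# Decoding was constrained token by token, so this is valid JSON by construction.
result = resp.choices[0].message.content
```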
When prompting still matters
Prompt engineering remains the right tool for:
- Output content quality within a structured shell (the what, not the how)
- Few-shot examples of good outputs
- Explaining the task to the model
- Format guidance for fields the schema can't fully constrain (free-text fields with style preferences)
Use both: constrained decoding for guaranteed format, prompt engineering for content quality.
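A sketch of that "use both" pattern, assuming the same OpenAI-compatible client as above: the schema guarantees the shape, while the system prompt and a few-shot example steer the content of the free-text field. The task, prompts, and schema here are illustrative.

```python
SYSTEM = (
    "You extract action items from meeting notes. "
    "Write each 'summary' as one imperative sentence under 15 words."  # content guidance the schema can't express
)
FEW_SHOT_USER = "Notes: Alice will send the Q3 deck to finance by Friday."
FEW_SHOT_ASSISTANT = '{"items": [{"owner": "Alice", "summary": "Send Q3 deck to finance by Friday."}]}'

action_item_schema = {
    "type": "object",
    "properties": {
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {"owner": {"type": "string"}, "summary": {"type": "string"}},
                "required": ["owner", "summary"],
                "additionalProperties": False,
            },
        }
    },
    "required": ["items"],
    "additionalProperties": False,
}

resp = client.chat.completions.create(
    model="my-model",  # hypothetical
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": FEW_SHOT_USER},           # few-shot example of a good output
        {"role": "assistant", "content": FEW_SHOT_ASSISTANT},
        {"role": "user", "content": "Notes: Bob to draft the incident postmortem next week."},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "action_items", "schema": action_item_schema, "strict": True},
    },
)
```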
Verdict
For production structured outputs, constrained decoding is the right default. Use vLLM's response_format={"type":"json_schema",...} for OpenAI compatibility, or guided_json / guided_choice / guided_regex for finer control. Prompt engineering supplements it but doesn't replace it.
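With vLLM's OpenAI-compatible server, those finer-grained controls are passed through extra_body. A sketch, assuming a recent vLLM version serving at localhost:8000 (parameter support varies across versions):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# guided_choice: constrain the output to an exact set of strings
resp = client.chat.completions.create(
    model="my-model",  # whatever model the vLLM server is serving
    messages=[{"role": "user", "content": "Is this review positive or negative? 'Broke after one day.'"}],
    extra_body={"guided_choice": ["positive", "negative"]},
)

# guided_regex: constrain the output to a pattern, e.g. an ISO date
resp = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "When is the next leap day? Answer as YYYY-MM-DD."}],
    extra_body={"guided_regex": r"\d{4}-\d{2}-\d{2}"},
)

# guided_json: constrain the output to a JSON schema
resp = client.chat.completions.create(
    model="my-model",
    messages=[{"role": "user", "content": "Extract the city and country from: 'I flew to Lyon, France.'"}],
    extra_body={"guided_json": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "country": {"type": "string"}},
        "required": ["city", "country"],
    }},
)
```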
Bottom line
Constrained decoding for format; prompts for content. See guided decoding.