
LLM Output: Structured JSON Responses

Get reliable structured JSON output from self-hosted LLMs. Covers guided generation, output parsing, schema enforcement, error recovery, and vLLM structured output configuration on GPU servers.

Your LLM Returns Broken JSON or Free-Form Text

You ask the model for JSON and get a response wrapped in markdown code fences, or a preamble like “Sure! Here is the JSON:” followed by a mangled object with trailing commas and unquoted keys. Your downstream parser crashes because the LLM does not reliably produce valid JSON. This problem plagues every team building LLM-powered applications, but self-hosting on your own GPU server gives you control over the decoding loop itself, including guided generation with guaranteed schema-valid output.

Guided Generation with vLLM

vLLM supports structured output generation that constrains the model to only produce valid tokens for your schema:

# vLLM guided decoding — guarantees valid JSON matching your schema
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [
      {"role": "system", "content": "Extract entities from the text."},
      {"role": "user", "content": "John Smith called from London about invoice #4521"}
    ],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "entity_extraction",
        "schema": {
          "type": "object",
          "properties": {
            "person": {"type": "string"},
            "location": {"type": "string"},
            "reference": {"type": "string"},
            "intent": {"type": "string", "enum": ["inquiry", "complaint", "request"]}
          },
          "required": ["person", "location", "reference", "intent"]
        }
      }
    }
  }'

# Response is GUARANTEED to be valid JSON matching the schema
# No parsing errors, no cleanup needed
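The same request is easy to build programmatically. A minimal Python sketch (the `build_guided_request` helper is illustrative, not part of vLLM; the payload fields mirror the curl example above):

```python
import json

def build_guided_request(model, messages, schema, schema_name="extraction"):
    """Build a chat-completions payload with a json_schema response_format,
    matching the OpenAI-compatible API that vLLM exposes."""
    return {
        "model": model,
        "messages": messages,
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": schema_name, "schema": schema},
        },
        "temperature": 0.0,  # deterministic decoding suits extraction tasks
    }

schema = {
    "type": "object",
    "properties": {
        "person": {"type": "string"},
        "location": {"type": "string"},
    },
    "required": ["person", "location"],
}

payload = build_guided_request(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    [{"role": "user", "content": "John Smith called from London"}],
    schema,
)
# POST json.dumps(payload) to http://localhost:8000/v1/chat/completions
```

Keeping payload construction in one helper also makes it trivial to swap schemas per task without touching the request plumbing.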

Prompt-Based JSON Extraction

When guided generation is not available, structure the prompt to maximise JSON compliance:

SYSTEM_PROMPT = """You are a data extraction API. You ONLY output valid JSON.
NEVER include explanations, markdown formatting, or text outside the JSON object.

Output schema:
{
  "person": "string — full name of the person",
  "location": "string — city or country mentioned",
  "reference": "string — any ID or reference number",
  "intent": "inquiry" | "complaint" | "request"
}

Respond with ONLY the JSON object. No other text."""

# Reinforcement in the user message helps:
user_msg = """Extract entities from this text. Output ONLY valid JSON, nothing else.

Text: "John Smith called from London about invoice #4521"

JSON:"""
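Prompt-only compliance is probabilistic, so measure it before trusting it. A minimal sketch (the sample outputs are illustrative) that reports what fraction of raw responses parse cleanly:

```python
import json

def json_compliance_rate(outputs):
    """Fraction of raw LLM outputs that parse as JSON with no cleanup."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs) if outputs else 0.0

# Illustrative outputs: two clean, one wrapped in a preamble
samples = [
    '{"person": "John Smith", "location": "London"}',
    'Sure! Here is the JSON: {"person": "John Smith"}',
    '{"person": "Jane Doe", "location": "Paris"}',
]
rate = json_compliance_rate(samples)  # 2 of 3 parse cleanly
```

Run this over a few hundred real responses; if the rate is below your tolerance, tighten the prompt or switch to guided generation.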

Robust JSON Parsing with Fallbacks

Even with strong prompting, LLMs occasionally produce slightly malformed JSON. Build resilient parsing:

import json, re

def parse_llm_json(text):
    """Parse JSON from LLM output with multiple fallback strategies."""

    # Strategy 1: direct parse
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass

    # Strategy 2: extract JSON from markdown code block
    match = re.search(r'```(?:json)?\s*\n?(.*?)\n?```', text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass

    # Strategy 3: find the first { ... } object (handles one level of nesting)
    match = re.search(r'(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})', text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(1))
        except json.JSONDecodeError:
            pass

    # Strategy 4: fix common LLM JSON mistakes
    cleaned = text.strip()
    cleaned = re.sub(r',\s*}', '}', cleaned)  # trailing comma before }
    cleaned = re.sub(r',\s*]', ']', cleaned)  # trailing comma before ]
    cleaned = cleaned.replace("'", '"')       # naive quote fix; can corrupt apostrophes
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        raise ValueError(f"Cannot parse JSON from LLM output: {text[:200]}")

# Usage
response = call_llm(prompt)
data = parse_llm_json(response)

Schema Validation After Parsing

Validate that parsed JSON matches your expected structure before passing it downstream:

from pydantic import BaseModel, Field, field_validator
from typing import Literal

class EntityExtraction(BaseModel):
    person: str = Field(min_length=1)
    location: str = Field(min_length=1)
    reference: str
    intent: Literal["inquiry", "complaint", "request"]

    @field_validator("person")
    @classmethod
    def person_not_empty(cls, v):
        if not v.strip():
            raise ValueError("Person name cannot be empty")
        return v.strip()

# Parse and validate in one step
def extract_entities(llm_response):
    data = parse_llm_json(llm_response)
    return EntityExtraction(**data)

# With automatic retry on validation failure
def extract_with_retry(prompt, max_retries=3):
    for attempt in range(max_retries):
        response = call_llm(prompt)
        try:
            return extract_entities(response)
        except ValueError as e:  # pydantic ValidationError subclasses ValueError
            if attempt == max_retries - 1:
                raise
            prompt += f"\n\nPrevious output was invalid: {e}. Try again."
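If you are using guided generation, the same Pydantic model can produce the JSON Schema you send to vLLM, keeping the decoding constraint and the post-hoc validation in sync. A sketch assuming Pydantic v2 (`model_json_schema` / `model_validate_json`):

```python
from typing import Literal
from pydantic import BaseModel, Field

class EntityExtraction(BaseModel):
    person: str = Field(min_length=1)
    location: str = Field(min_length=1)
    reference: str
    intent: Literal["inquiry", "complaint", "request"]

# Derive the schema from the model instead of writing it by hand
schema = EntityExtraction.model_json_schema()
response_format = {
    "type": "json_schema",
    "json_schema": {"name": "entity_extraction", "schema": schema},
}
# Send response_format in the request payload; validate the reply with
# EntityExtraction.model_validate_json(content) afterwards.
```

One model, one source of truth: change a field and both the server-side constraint and the client-side validation update together.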

Batch JSON Extraction

For high-volume extraction, batch requests to maximise GPU throughput:

import asyncio, aiohttp

async def batch_extract(texts, concurrency=10):
    sem = asyncio.Semaphore(concurrency)
    async with aiohttp.ClientSession() as session:
        async def process_one(text):
            async with sem:
                payload = {
                    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
                    "messages": [
                        {"role": "system", "content": SYSTEM_PROMPT},
                        {"role": "user", "content": f"Extract from: {text}\n\nJSON:"}
                    ],
                    "response_format": {"type": "json_object"},
                    "temperature": 0.0
                }
                async with session.post(
                    "http://localhost:8000/v1/chat/completions",
                    json=payload
                ) as resp:
                    result = await resp.json()
                    return parse_llm_json(
                        result["choices"][0]["message"]["content"])

        return await asyncio.gather(*[process_one(t) for t in texts])
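If asyncio is not a fit for your codebase, a thread pool gives similar concurrency for I/O-bound HTTP calls. A sketch with a stand-in `call_llm_json` (replace it with a blocking request to your vLLM endpoint):

```python
import json
from concurrent.futures import ThreadPoolExecutor

def call_llm_json(text):
    # Stand-in for a blocking HTTP call to /v1/chat/completions;
    # returns the raw message content string.
    return json.dumps({"source": text})

def batch_extract_threaded(texts, concurrency=10):
    """Run extraction calls concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        raw = pool.map(call_llm_json, texts)  # map preserves input order
        return [json.loads(r) for r in raw]

results = batch_extract_threaded(["a", "b", "c"])
```

Threads work well here because the workload is network-bound; the GIL is released while waiting on the socket, so ten workers keep ten requests in flight.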

For vLLM deployments on your GPU server, guided generation is the most reliable approach. Ollama supports JSON mode as well. The vLLM production guide has API configuration details, the LLM hosting section covers deployment, and the tutorials walk through parsing patterns. See benchmarks for throughput data.

Structured LLM Output on GPU

vLLM guided generation on GigaGPU dedicated servers. Guaranteed valid JSON, every request.

Browse GPU Servers



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
