What You’ll Connect
After this guide, your Sentry project will capture and organise every error from your AI inference pipeline — giving you stack traces, GPU context, and request details for every failure on your dedicated GPU server. Model timeouts, out-of-memory crashes, and malformed requests get tracked with full diagnostic context.
The integration adds the Sentry SDK to your FastAPI inference wrapper or middleware layer. Each error is enriched with GPU state (memory usage, utilisation) and request metadata (model name, prompt length, token count), making debugging AI-specific failures straightforward.
FastAPI Middleware  -->  vLLM Inference  -->  GPU Server
        |                      |                   |
   Sentry SDK            Model errors         OOM, CUDA
   initialised,         (timeout, OOM)      hardware faults
   captures errors,            |                   |
   adds GPU context            |                   |
        |                      |                   |
Sentry Platform   <--   Error event    <--   Exception caught with
(dashboard,             with GPU state,      stack trace + context
 alerts, issues)        request metadata

Prerequisites
- A GigaGPU server running an LLM behind a FastAPI inference server or similar wrapper
- A Sentry account with a project created for your inference service
- Python 3.10+ with the Sentry SDK: pip install "sentry-sdk[fastapi]"
- The Sentry DSN from your project settings
- GPU monitoring tools: nvidia-smi or the pynvml library
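Before wiring up the integration, it is worth confirming that NVML is reachable from Python. A minimal check (nvml_available is an illustrative helper; it assumes the NVIDIA driver is installed and returns False cleanly on machines without a GPU):

```python
def nvml_available() -> bool:
    """Return True if the NVIDIA driver's NVML interface can be initialised."""
    try:
        import pynvml  # thin Python bindings over libnvidia-ml
        pynvml.nvmlInit()
        pynvml.nvmlShutdown()
        return True
    except Exception:
        # Covers ImportError (pynvml not installed) and NVMLError (no driver/GPU)
        return False
```

Call this once at startup and log a warning when it returns False, so GPU context is simply skipped rather than crashing error reporting on GPU-less environments.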
Integration Steps
Install the Sentry SDK with the FastAPI integration: pip install "sentry-sdk[fastapi]". Initialise Sentry at the top of your inference server’s entry point with your project DSN. The FastAPI integration automatically captures unhandled exceptions from any route.
Add a custom Sentry scope processor that attaches GPU state to every error event. Use pynvml to read current GPU memory usage, utilisation, and temperature at the moment of the error. This context appears alongside the stack trace in the Sentry issue view.
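One way to apply this globally is Sentry’s before_send hook, which runs on every outgoing event. A minimal sketch — make_before_send and read_gpu_state are illustrative names, not SDK APIs; in production you would pass a pynvml-backed reader such as the get_gpu_context() shown later:

```python
def make_before_send(read_gpu_state):
    """Build a before_send hook that attaches GPU state to every event."""
    def before_send(event, hint):
        try:
            event.setdefault("contexts", {})["gpu"] = read_gpu_state()
        except Exception:
            # Never let GPU probing break error delivery itself
            pass
        return event
    return before_send

# Wired up at init time:
# sentry_sdk.init(dsn=..., before_send=make_before_send(get_gpu_context))
```

Injecting the reader as a parameter keeps the hook testable without a GPU, and the broad except ensures a failed NVML call never drops the original error.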
Wrap your inference calls with try/except blocks that capture specific AI failure modes: CUDA out-of-memory errors, model loading failures, generation timeouts, and invalid token counts. Each exception type gets a distinct Sentry fingerprint so they group into separate issues rather than one noisy bucket.
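A sketch of such fingerprinting, assuming CUDA OOM errors surface with the usual "out of memory" message strings (fingerprint_for is an illustrative helper, not an SDK API — set scope.fingerprint = fingerprint_for(e) before calling capture_exception):

```python
def fingerprint_for(exc: BaseException) -> list[str]:
    """Map known AI failure modes to distinct Sentry grouping keys."""
    msg = str(exc).lower()
    if "out of memory" in msg or "cublas" in msg:
        return ["inference", "cuda-oom"]
    if isinstance(exc, TimeoutError):
        return ["inference", "generation-timeout"]
    if isinstance(exc, ValueError):
        return ["inference", "bad-request"]
    # Fall back to grouping by exception class
    return ["inference", type(exc).__name__]
```

The exact message substrings and exception classes depend on your stack (vLLM, PyTorch, your request validation), so match whatever your deployment actually raises.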
Code Example
Sentry integration for a FastAPI inference server on your vLLM deployment:
import os

import pynvml
import sentry_sdk
from fastapi import FastAPI, Request
from openai import OpenAI
from sentry_sdk.integrations.fastapi import FastApiIntegration

pynvml.nvmlInit()

sentry_sdk.init(
    dsn=os.environ["SENTRY_DSN"],
    integrations=[FastApiIntegration()],
    traces_sample_rate=0.1,  # sample 10% of requests for performance traces
    environment="production",
    release="inference-server@1.0.0",
)

app = FastAPI()
llm = OpenAI(base_url="http://localhost:8000/v1", api_key=os.environ["GPU_API_KEY"])


def get_gpu_context() -> dict:
    """Snapshot GPU state so it can be attached to Sentry events."""
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    return {
        "gpu_memory_used_mb": mem.used // (1024 * 1024),
        "gpu_memory_total_mb": mem.total // (1024 * 1024),
        "gpu_utilisation_pct": util.gpu,
        "gpu_temperature_c": temp,
    }


@app.post("/v1/chat/completions")
async def chat(request: Request):
    body = await request.json()
    with sentry_sdk.push_scope() as scope:
        scope.set_context("gpu", get_gpu_context())
        scope.set_tag("model", body.get("model", "unknown"))
        # Rough proxy for prompt size: character count of the serialised
        # messages, not a true token count
        scope.set_tag("prompt_chars", str(len(str(body.get("messages", "")))))
        try:
            completion = llm.chat.completions.create(**body)
            return completion.model_dump()
        except Exception as e:
            # Re-read GPU state at the moment of failure (e.g. post-OOM)
            scope.set_context("gpu", get_gpu_context())
            scope.set_tag("error_type", type(e).__name__)
            sentry_sdk.capture_exception(e)
            raise
Testing Your Integration
Send a request with an invalid model name to trigger an error. Check the Sentry dashboard — within seconds, a new issue should appear with the stack trace, GPU context (memory, utilisation, temperature), and request tags. Verify the GPU context section shows realistic values from your server.
Simulate an OOM condition by sending a request with an extremely long prompt or high max_tokens on a memory-constrained model. The Sentry event should capture the CUDA error with the GPU memory state at the moment of failure — this is the most valuable diagnostic data for inference OOM issues.
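A small helper for generating such a probe request — the default sizes are illustrative, so tune prompt_chars and max_tokens to the memory headroom of your model:

```python
def oom_probe_payload(model: str, prompt_chars: int = 200_000,
                      max_tokens: int = 32_000) -> dict:
    """Build a chat payload sized to stress GPU memory on a constrained model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "x" * prompt_chars}],
        "max_tokens": max_tokens,
    }
```

POST this payload to your /v1/chat/completions endpoint and confirm the resulting Sentry issue shows near-capacity gpu_memory_used_mb in its GPU context.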
Production Tips
Configure Sentry’s alerting to notify you when new error types appear or when error rates exceed a threshold. Set up a Slack or PagerDuty integration within Sentry so inference failures reach your on-call team immediately. Use Sentry’s performance monitoring (traces) to track slow inference requests without waiting for them to fail.
Use Sentry’s release tracking to correlate errors with model or code deployments. Tag each deployment with a release version so you can identify which change introduced a new class of errors. This is especially useful when swapping models or updating vLLM versions.
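One way to automate release tagging, assuming your deploy pipeline exposes the commit — the GIT_SHA variable and the inference-server prefix are assumptions for this sketch, not Sentry requirements:

```python
import os
import subprocess

def current_release(default: str = "inference-server@dev") -> str:
    """Derive a Sentry release string from the deployed git commit."""
    sha = os.environ.get("GIT_SHA")
    if not sha:
        try:
            # Fall back to asking git directly when running from a checkout
            sha = subprocess.check_output(
                ["git", "rev-parse", "--short", "HEAD"], text=True
            ).strip()
        except Exception:
            return default  # not a git checkout and no env var set
    return f"inference-server@{sha}"

# sentry_sdk.init(..., release=current_release())
```

Passing the result as the release argument to sentry_sdk.init means every event is stamped with the exact deployment that produced it.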
For teams running production AI workloads on open-source models, Sentry provides the error visibility needed to maintain reliability. Pair with our API security guide to protect your endpoints while tracking failures. Browse more tutorials or get started with GigaGPU to build observable AI infrastructure.