A financial regulator asks your company to demonstrate exactly which AI model version processed a specific customer complaint, what data was sent, and what response was generated — for a transaction that occurred eight months ago. Without a comprehensive audit trail, you cannot answer. With one, you pull the record in seconds. Audit logging is not optional for production AI systems. This guide covers how to build inference audit trails on self-hosted GPU infrastructure that satisfy regulators, auditors, and your own debugging needs.
## What to Log for Every Inference
Each inference request generates a chain of events that must be captured. The minimum viable audit record includes:
| Field | Example | Purpose |
|---|---|---|
| Request ID | uuid-v4 | Unique correlation identifier |
| Timestamp | ISO 8601 with timezone | Precise event ordering |
| Model ID | llama-3-8b-instruct-v2.1 | Exact model version |
| Model checksum | SHA-256 of weights file | Prove model integrity |
| Input hash | SHA-256 of prompt | Prove input unchanged |
| Output hash | SHA-256 of response | Prove output unchanged |
| User/API key ID | api-key-hash or user-id | Attribution |
| Token count | Input: 342, Output: 156 | Usage tracking, cost allocation |
| Latency | 1,240 ms | Performance monitoring |
| Status | 200 / 500 / timeout | Reliability tracking |
For regulated industries, also log the full input prompt and output response (encrypted). This enables complete reconstruction of any inference event. On private infrastructure, storing this data carries no third-party risk.
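The minimum viable record above can be assembled as a single JSON document per request. A minimal sketch in Python — field names follow the table, while `build_audit_record` and its sample arguments are illustrative, not a fixed API:

```python
import hashlib
import json
import uuid
from datetime import datetime, timezone


def build_audit_record(model_id, model_checksum, prompt, response,
                       api_key_hash, input_tokens, output_tokens,
                       latency_ms, status):
    """Assemble one audit record with the minimum fields from the table."""
    sha = lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest()
    return {
        "request_id": str(uuid.uuid4()),                      # unique correlation ID
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601 with timezone
        "model_id": model_id,
        "model_checksum": model_checksum,                     # SHA-256 of weights file
        "input_hash": sha(prompt),                            # prove input unchanged
        "output_hash": sha(response),                         # prove output unchanged
        "api_key_id": api_key_hash,
        "tokens": {"input": input_tokens, "output": output_tokens},
        "latency_ms": latency_ms,
        "status": status,
    }


# Hypothetical values for illustration only
record = build_audit_record(
    "llama-3-8b-instruct-v2.1", "e3b0c4...", "What is my balance?",
    "Your balance is...", "key-7f3a", 342, 156, 1240, 200)
print(json.dumps(record, indent=2))
```

Hashing the prompt and response (rather than storing them in every log line) keeps the core record small; the encrypted full payloads can live in a separate store keyed by `request_id`.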
## Logging Architecture
Never store audit logs on the same server running inference. A compromised GPU server should not be able to modify its own audit trail. The recommended architecture places vLLM on the GPU server, ships structured logs via Fluent Bit to a separate log aggregation server, and stores logs in append-only storage.
Use structured JSON logging. Each inference event produces a JSON document that is machine-parseable and human-readable. Avoid unstructured text logs — they are difficult to query and unreliable for compliance evidence. Ship logs over a TLS-encrypted connection to the aggregation server within 5 seconds of the event.
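A Fluent Bit configuration for this shipping step might look like the sketch below. The file path, hostname, and port are assumptions for illustration; the `Refresh_Interval` of 5 seconds matches the shipping window described above:

```ini
# /etc/fluent-bit/fluent-bit.conf -- sketch; paths and hostname are assumptions

[INPUT]
    Name              tail
    Path              /var/log/vllm/audit-*.jsonl
    Parser            json
    Refresh_Interval  5

[OUTPUT]
    Name        forward
    Match       *
    Host        logs.internal.example
    Port        24224
    tls         on
    tls.verify  on
```

The `forward` output with TLS enabled satisfies the encrypted-transport requirement; the GPU server retains no queryable copy of the shipped logs.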
## Tamper-Evident Storage
Auditors need confidence that logs have not been modified after the fact. Implement tamper-evidence through hash chaining: each log entry includes a hash of the previous entry, creating a blockchain-like chain. Any modification to a historical entry breaks the chain. Store daily chain-head hashes in a separate, independently secured system.
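The hash-chaining scheme can be sketched in a few lines. This is a minimal illustration, not a production implementation — `chain_append` and `chain_verify` are hypothetical names, and a real system would persist entries to append-only storage rather than a Python list:

```python
import hashlib
import json


def chain_append(log, entry):
    """Append an entry, linking it to the hash of the previous entry."""
    prev_hash = log[-1]["entry_hash"] if log else "0" * 64  # genesis sentinel
    body = {"prev_hash": prev_hash, **entry}
    # Hash over canonical (sorted-key) JSON so verification is deterministic
    body["entry_hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    log.append(body)
    return body


def chain_verify(log):
    """Recompute every link; any modification to a historical entry breaks the chain."""
    prev = "0" * 64
    for e in log:
        if e["prev_hash"] != prev:
            return False
        expected = dict(e)
        claimed = expected.pop("entry_hash")
        recomputed = hashlib.sha256(
            json.dumps(expected, sort_keys=True).encode()).hexdigest()
        if claimed != recomputed:
            return False
        prev = claimed
    return True


log = []
chain_append(log, {"request_id": "a1", "status": 200})
chain_append(log, {"request_id": "a2", "status": 500})
assert chain_verify(log)
log[0]["status"] = 404        # tamper with a historical entry
assert not chain_verify(log)  # the chain detects it
```

Storing the day's final `entry_hash` in a separately secured system is what makes the scheme auditable: an attacker who controls the log store cannot rewrite history without also compromising the chain-head archive.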
For strongest tamper-evidence, use write-once storage: Amazon S3 Object Lock with Compliance mode (for off-site backup), local WORM-configured ZFS datasets, or append-only PostgreSQL tables with row-level cryptographic signatures. Even on self-hosted infrastructure, you can implement tamper-evidence without third-party services.
## Retention Policies
Different frameworks require different retention periods. GDPR mandates that personal data be kept no longer than necessary for its processing purpose. PCI DSS requires a 12-month minimum. NHS DSPT follows NHS records management policy (potentially 7+ years for clinical AI). Financial services under FCA oversight typically retain records for 5-7 years. Define your retention policy based on the most demanding applicable framework, then implement automated deletion for logs beyond that period.
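The "most demanding framework wins" rule reduces to a simple maximum. The day counts below are illustrative interpretations of the periods mentioned above, not legal advice:

```python
# Illustrative retention periods in days -- confirm against your own legal guidance
FRAMEWORK_RETENTION_DAYS = {
    "pci_dss": 365,       # 12 months minimum
    "fca": 7 * 365,       # 5-7 years typical; take the upper bound
    "nhs_dspt": 7 * 365,  # potentially 7+ years for clinical AI
}


def required_retention_days(applicable):
    """Return the longest retention period among the applicable frameworks."""
    return max(FRAMEWORK_RETENTION_DAYS[f] for f in applicable)


print(required_retention_days(["pci_dss", "fca"]))
```

Automated deletion then becomes a scheduled job that drops (or crypto-shreds) any record older than this single computed threshold.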
Separate retention into tiers: hot storage (last 90 days, fast query), warm storage (90 days to 1 year, slower query), and cold storage (1+ years, archive retrieval). GPU inference logs generate substantial volume — a busy inference server producing 10,000 requests per day generates approximately 2 GB of structured logs daily. Plan storage accordingly.
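Routing a record to a tier is a pure function of its age. A minimal sketch of the tiering rule above (`storage_tier` is an illustrative name):

```python
from datetime import datetime, timedelta, timezone


def storage_tier(event_time, now=None):
    """Map a log entry's age onto the hot/warm/cold tiers described above."""
    now = now or datetime.now(timezone.utc)
    age = now - event_time
    if age <= timedelta(days=90):
        return "hot"    # fast query
    if age <= timedelta(days=365):
        return "warm"   # slower query
    return "cold"       # archive retrieval


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
assert storage_tier(datetime(2025, 5, 1, tzinfo=timezone.utc), now) == "hot"
assert storage_tier(datetime(2024, 12, 1, tzinfo=timezone.utc), now) == "warm"
assert storage_tier(datetime(2023, 1, 1, tzinfo=timezone.utc), now) == "cold"
```

At the quoted volume (roughly 2 GB per 10,000 requests, about 200 KB per record including encrypted payloads), hot storage for 90 days of a busy server needs on the order of 180 GB of fast disk.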
## Querying Audit Trails
Logs are useless if you cannot search them. Implement a query layer that supports finding all inferences for a specific user or API key, finding all inferences processed by a specific model version, reconstructing the full request-response pair for a given request ID, and aggregating statistics (error rates, latency percentiles) over time ranges.
Elasticsearch and Loki both provide these query capabilities. For compliance queries, pre-build saved searches that answer common auditor questions. Infrastructure monitoring can integrate with the same logging stack. Review GDPR compliance requirements for guidance on logging personal data.
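If you choose Elasticsearch, the saved searches reduce to a handful of Query DSL bodies. The sketches below assume field names matching the audit record table (`request_id`, `api_key_id`, `timestamp`, `status`); they build query bodies only and do not execute against a cluster:

```python
def query_by_request_id(request_id):
    """Full request-response reconstruction for one request ID."""
    return {"query": {"term": {"request_id": request_id}}}


def query_user_inferences(api_key_id, start, end):
    """All inferences for one API key within a time range."""
    return {"query": {"bool": {"filter": [
        {"term": {"api_key_id": api_key_id}},
        {"range": {"timestamp": {"gte": start, "lte": end}}},
    ]}}}


def error_rate_aggregation(interval="1d"):
    """Error counts per interval, for reliability reporting."""
    return {"size": 0, "aggs": {"over_time": {
        "date_histogram": {"field": "timestamp", "fixed_interval": interval},
        "aggs": {"errors": {"filter": {"range": {"status": {"gte": 500}}}}},
    }}}
```

Keeping these as code (rather than ad-hoc dashboard queries) means the compliance queries themselves are version-controlled and reviewable.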
## Implementation Steps
Start by instrumenting your vLLM deployment with structured logging middleware. Then:

1. Configure Fluent Bit on the GPU server to ship logs to your aggregation server.
2. Enable hash chaining on the aggregation server.
3. Set up automated retention enforcement.
4. Build a compliance dashboard showing log completeness (the percentage of inference requests with complete audit records) and chain integrity status.
5. Test your audit trail with a mock regulatory query: pick a random inference from 6 months ago and reconstruct the complete event.

If you can do that in under 5 minutes, your audit trail is production-ready. Open-source model deployments benefit from logging model provenance as well, and this same logging architecture serves use cases across regulated sectors.
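The mock regulatory query can be rehearsed in miniature. Here a plain dict stands in for the real query layer, and all record values are hypothetical; the point is that reconstruction re-verifies the stored hashes rather than trusting the payload store:

```python
import hashlib

sha = lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest()

# Stand-in for the real query layer; contents are illustrative
log_store = {
    "req-123": {
        "prompt": "Why was my claim rejected?",
        "response": "Your claim was rejected because...",
        "input_hash": sha("Why was my claim rejected?"),
        "output_hash": sha("Your claim was rejected because..."),
        "model_id": "llama-3-8b-instruct-v2.1",
    },
}


def reconstruct_event(store, request_id):
    """Fetch the full record and prove input/output are unchanged via their hashes."""
    record = store[request_id]
    assert sha(record["prompt"]) == record["input_hash"], "input tampered"
    assert sha(record["response"]) == record["output_hash"], "output tampered"
    return record


event = reconstruct_event(log_store, "req-123")
print(event["model_id"])
```

If the encrypted payload store and the hash in the audit record disagree, reconstruction fails loudly, which is exactly what you want an auditor to see you checking.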