Your team deploys AI models in Docker containers for reproducibility and isolation. But the default Docker configuration runs containers as root, mounts the Docker socket, and gives the container full access to all GPUs on the host. A prompt injection exploit that escapes the inference process now has root privileges inside the container, access to every GPU (including those serving other models), and potentially a path to the host system via the mounted socket. Container security for AI workloads requires deliberate hardening. This guide covers Docker security for inference on dedicated GPU servers.
Non-Root Container Execution
Never run inference containers as root. Create a dedicated user in your Dockerfile:
```dockerfile
FROM nvidia/cuda:12.4.1-runtime-ubuntu22.04

# Create non-root user for inference
RUN groupadd -r inference && useradd -r -g inference -d /app -s /sbin/nologin inference

# Install dependencies as root
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*

WORKDIR /app
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt
COPY --chown=inference:inference . .

# Switch to non-root user
USER inference
CMD ["python3", "serve.py"]
```
If the inference process is compromised, the attacker operates as an unprivileged user. They cannot install packages, modify system files, or access other containers’ data.
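The `USER` directive can be reinforced at runtime. A minimal sketch, assuming the image above; the container name is illustrative. Dropping all Linux capabilities and blocking privilege re-escalation gives defence in depth even if the Dockerfile is ever rebuilt without `USER`:

```shell
# Run with no Linux capabilities and no privilege escalation via
# setuid binaries (--security-opt no-new-privileges).
docker run -d \
  --name llm-inference \
  --cap-drop=ALL \
  --security-opt no-new-privileges \
  my-inference-image:latest

# Verify the process is not running as root (UID 0)
docker exec llm-inference id -u
```

If `id -u` prints `0`, the image is still running as root and needs fixing before deployment.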
GPU Device Isolation
The NVIDIA Container Toolkit exposes GPUs to containers. By default, all GPUs are visible. Restrict each container to only the GPUs it needs:
| Setting | Effect | Use case |
|---|---|---|
| `--gpus '"device=0"'` | Single GPU access | One model per GPU |
| `--gpus '"device=0,1"'` | Specific GPU pair | Tensor-parallel model |
| `--gpus all` | All GPUs visible | Avoid: no isolation |
| `NVIDIA_VISIBLE_DEVICES=none` | No GPU access | CPU-only preprocessing |
For multi-tenant vLLM deployments, assign each model’s container a specific GPU. A compromised container cannot access another model’s GPU memory or weights.
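A sketch of the one-model-per-GPU pattern using the public vLLM OpenAI-compatible image; model names and host ports here are illustrative:

```shell
# Model A pinned to GPU 0 — inside the container it appears as device 0
docker run -d --name llama-8b \
  --gpus '"device=0"' \
  -p 8000:8000 \
  vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct

# Model B pinned to GPU 1 — it cannot see or address GPU 0
docker run -d --name qwen-7b \
  --gpus '"device=1"' \
  -p 8001:8000 \
  vllm/vllm-openai:latest --model Qwen/Qwen2.5-7B-Instruct
```

Each container enumerates only its assigned device, so a compromise of one serving process cannot read the other model's GPU memory.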
Read-Only Filesystem and Volumes
Run containers with a read-only root filesystem. Mount writable volumes only where needed:
```shell
docker run -d \
  --name llm-inference \
  --gpus '"device=0"' \
  --read-only \
  --tmpfs /tmp:size=1G \
  -v /data/models:/models:ro \
  -v /data/logs:/app/logs:rw \
  --memory=32g \
  --memory-swap=32g \
  --pids-limit=256 \
  my-inference-image:latest
```
Model weights mount as read-only (:ro). Only the logs directory is writable. The --tmpfs provides scratch space that disappears when the container stops. Memory limits prevent a runaway process from consuming all host RAM, and --pids-limit blocks fork bombs. This applies equally to Ollama and other model serving containers.
Image Scanning and Supply Chain
AI Docker images pull from multiple sources: NVIDIA base images, PyPI packages, Hugging Face model weights. Each is an attack vector. Scan images before deployment with Trivy or Grype. Pin base image digests — not just tags — so rebuilds produce identical images. Verify model weight checksums after download. For private deployments, run a local container registry so production images never pull from public sources at runtime.
Avoid installing unnecessary packages. Every additional package increases the attack surface. A minimal inference container needs Python, the serving framework, and model dependencies — not build tools, editors, or debugging utilities.
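The supply-chain steps above can be sketched as a pre-deployment check. Trivy's `--exit-code` and `--severity` flags make the scan pipeline-friendly; the expected checksum placeholder would come from the model publisher:

```shell
# Resolve the base image tag to an immutable digest, then reference
# that digest in the Dockerfile (FROM nvidia/cuda@sha256:...)
docker pull nvidia/cuda:12.4.1-runtime-ubuntu22.04
docker inspect --format '{{index .RepoDigests 0}}' \
  nvidia/cuda:12.4.1-runtime-ubuntu22.04

# Scan the built image; fail the pipeline on high/critical findings
trivy image --exit-code 1 --severity HIGH,CRITICAL my-inference-image:latest

# Verify model weights against a known-good checksum after download
echo "<expected-sha256>  model.safetensors" | sha256sum --check -
```

Run these in CI so an image that fails scanning or checksum verification never reaches the registry.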
Container Network Security
Isolate inference containers on dedicated Docker networks. Do not use the default bridge network. Create purpose-specific networks:
```shell
# Create an isolated network for inference (no external routing)
docker network create --driver bridge --internal inference-net

# Create the public-facing network for the API gateway
docker network create --driver bridge public-net

# Inference container: internal only, no internet access
docker run -d --network inference-net --name llm my-inference-image

# API gateway: docker run accepts a single --network, so start it on
# the public network and attach the inference network afterwards
docker run -d --network public-net --name gateway my-gateway-image
docker network connect inference-net gateway
```
The --internal flag prevents containers on that network from reaching the internet. The inference container communicates only with the API gateway. This pattern protects models serving chatbots, document processing, and vision workloads.
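A quick way to confirm the isolation actually holds, using a throwaway busybox container on the internal network (the URL is just a reachability probe):

```shell
# Egress from the internal network should fail; a short timeout keeps
# the check fast. Expected result: the fallback "OK" branch runs.
docker run --rm --network inference-net busybox \
  wget -q -T 5 -O /dev/null https://example.com \
  && echo "UNEXPECTED: egress allowed from inference-net" \
  || echo "OK: no internet access from inference-net"
```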
Runtime Security Monitoring
Deploy runtime security monitoring that detects anomalous container behaviour: unexpected process execution (shells spawning inside inference containers), file writes to read-only paths (attempted exploits), network connections to unexpected destinations, and GPU utilisation patterns inconsistent with inference (cryptocurrency mining). Tools like Falco provide runtime threat detection with GPU-aware rules. Integrate alerts with your incident response plan. Review infrastructure security practices and GDPR compliance requirements for comprehensive container hardening.
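As an illustration of the shell-detection case, a Falco rule might look like the following sketch. The `llm-` container naming convention is an assumption for this example; adapt the condition to your own naming or labels:

```yaml
# Illustrative Falco rule: alert when an interactive shell starts
# inside a container whose name suggests it serves inference.
- rule: Shell spawned in inference container
  desc: Detect bash/sh execution inside inference containers
  condition: >
    spawned_process and container
    and container.name startswith "llm-"
    and proc.name in (bash, sh, dash)
  output: >
    Shell in inference container
    (user=%user.name container=%container.name command=%proc.cmdline)
  priority: WARNING
  tags: [container, shell]
```

Legitimate debugging sessions will trigger this rule, which is the point: each alert should map to a known operator action or an incident.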
Secure GPU Container Hosting
Dedicated GPU servers with NVIDIA Container Toolkit, full root access for Docker hardening, and network isolation. UK data centres.
Browse GPU Servers