You will configure Kubernetes to schedule AI inference workloads on GPU nodes with proper resource requests, device plugins, and autoscaling. By the end, you will have a vLLM deployment running on GPU pods in your cluster with health checks, resource limits, and horizontal scaling based on queue depth.
Prerequisites
Your Kubernetes cluster needs the NVIDIA device plugin to expose GPUs as schedulable resources. On dedicated GPU servers, install the NVIDIA drivers, container toolkit, and device plugin.
# Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
# Verify GPU nodes
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'
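Before deploying the real workload, it is worth confirming that scheduling works end to end with a throwaway pod that requests a GPU and prints the nvidia-smi table (the CUDA image tag here is an example; pick one compatible with your driver version):

```yaml
# gpu-smoke-test.yaml — delete the pod after checking its logs
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.2.0-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
```

After `kubectl apply -f gpu-smoke-test.yaml`, `kubectl logs gpu-smoke-test` should show the driver version and the attached GPU.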
GPU Pod Configuration
Request GPU resources in your pod spec. Kubernetes schedules the pod only on nodes with available GPUs.
# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.90"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
              cpu: "4"
          readinessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
      nodeSelector:
        nvidia.com/gpu.present: "true"
For details on the vLLM flags used here, see the production deployment guide.
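One practical detail: meta-llama models are gated on Hugging Face, so the container needs an access token to download the weights. A common approach (the Secret name and key below are illustrative) is to store the token in a Secret and inject it via the HF_TOKEN environment variable, which the Hugging Face client libraries recognise:

```yaml
# hf-token.yaml — secret name and key are placeholders
apiVersion: v1
kind: Secret
metadata:
  name: hf-token
type: Opaque
stringData:
  token: "<your-hugging-face-token>"
```

Then reference it from the vllm container spec:

```yaml
env:
  - name: HF_TOKEN
    valueFrom:
      secretKeyRef:
        name: hf-token
        key: token
```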
Service and Ingress
Expose the inference deployment as a Kubernetes Service with an Ingress for external access.
# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  rules:
    - host: inference.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 8000
Multi-GPU Pods
For larger models that require multiple GPUs, request more than one GPU resource and enable tensor parallelism. A 70B-parameter model in 16-bit precision needs roughly 140 GB for the weights alone, so two 80 GB GPUs are a practical minimum before accounting for KV cache.
containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model"
      - "meta-llama/Llama-3.1-70B-Instruct"
      - "--tensor-parallel-size"
      - "2"
      - "--gpu-memory-utilization"
      - "0.90"
    resources:
      limits:
        nvidia.com/gpu: 2
        memory: "200Gi"
      requests:
        nvidia.com/gpu: 2
        memory: "180Gi"
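Tensor parallelism communicates between GPUs via NCCL, which typically needs more shared memory than the 64 MB container default. A common fix is a memory-backed emptyDir mounted at /dev/shm (the sizeLimit below is an example value, not a requirement):

```yaml
# Fragment to merge into the multi-GPU pod spec above
containers:
  - name: vllm
    volumeMounts:
      - name: shm
        mountPath: /dev/shm
volumes:
  - name: shm
    emptyDir:
      medium: Memory
      sizeLimit: 16Gi
```

Without this, NCCL initialisation can fail or fall back to slower transports.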
GPU-Aware Autoscaling
Scale inference pods based on custom metrics like request queue depth or GPU utilisation.
# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
Export custom metrics from your inference server over a Prometheus /metrics endpoint; the Prometheus adapter then surfaces them to the HPA controller through the custom metrics API.
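As a sketch, an adapter rule for the queue-depth metric might look like the following. It assumes vLLM's built-in gauge vllm:num_requests_waiting as the source series and renames it to the inference_queue_depth name the HPA expects; adjust the series name to whatever your server actually exports:

```yaml
# prometheus-adapter configuration fragment (rules section)
rules:
  - seriesQuery: 'vllm:num_requests_waiting{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "vllm:num_requests_waiting"
      as: "inference_queue_depth"
    metricsQuery: 'sum(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

You can verify the wiring with `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1"` before relying on the HPA.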
Model Storage
Use PersistentVolumeClaims to cache model weights and avoid re-downloading them on pod restarts. Note that a ReadWriteOnce volume can only be attached to one node at a time, so if the HPA scales you beyond a single node, you will need a ReadWriteMany storage class or a separate cache per pod.
# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd
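To avoid the first pod spending its readiness window on a download, you can warm the cache with a one-off Job against the same PVC. This is a sketch: it assumes huggingface-cli is available in the vllm image (it ships with the huggingface_hub library) and that the gated model's token is already configured:

```yaml
# model-prefetch-job.yaml — optional cache-warming sketch
apiVersion: batch/v1
kind: Job
metadata:
  name: model-prefetch
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: prefetch
          image: vllm/vllm-openai:latest
          command: ["huggingface-cli", "download", "meta-llama/Llama-3.1-8B-Instruct"]
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
```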
For model versioning and deployment strategies, see the model versioning guide and the blue-green deployment guide. The CI/CD pipeline guide covers automated model deployments. The self-hosting guide covers base infrastructure, and our tutorials section has more orchestration patterns.