
Kubernetes for AI: GPU Pod Config

Complete guide to configuring Kubernetes GPU pods for AI inference, covering the NVIDIA device plugin, resource requests, node affinity, autoscaling, and deploying vLLM on dedicated GPU servers.

You will configure Kubernetes to schedule AI inference workloads on GPU nodes with proper resource requests, device plugins, and autoscaling. By the end, you will have a vLLM deployment running on GPU pods in your cluster with health checks, resource limits, and horizontal scaling based on queue depth.

Prerequisites

Your Kubernetes cluster needs the NVIDIA device plugin to expose GPUs as schedulable resources. On dedicated GPU servers, install the NVIDIA drivers, the NVIDIA Container Toolkit, and then the device plugin on each GPU node.
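
On Ubuntu nodes running containerd, the toolkit setup looks roughly like this; the commands assume NVIDIA's apt repository is already configured on the node, so adjust for your distribution and container runtime:

```shell
# Install the NVIDIA Container Toolkit (assumes the NVIDIA apt repo is
# already set up) and wire it into containerd, then restart the runtime.
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=containerd
sudo systemctl restart containerd
```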

# Install NVIDIA device plugin
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify GPU nodes
kubectl get nodes -o json | jq '.items[].status.capacity["nvidia.com/gpu"]'

GPU Pod Configuration

Request GPU resources in your pod spec. Kubernetes schedules the pod only on nodes with available GPUs.

# vllm-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest  # pin a specific release tag in production
          args:
            - "--model"
            - "meta-llama/Llama-3.1-8B-Instruct"
            - "--host"
            - "0.0.0.0"
            - "--port"
            - "8000"
            - "--max-model-len"
            - "8192"
            - "--gpu-memory-utilization"
            - "0.90"
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
              cpu: "8"
            requests:
              nvidia.com/gpu: 1
              memory: "24Gi"
              cpu: "4"
          readinessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 120
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /v1/models
              port: 8000
            initialDelaySeconds: 180
            periodSeconds: 30
          volumeMounts:
            - name: model-cache
              mountPath: /root/.cache/huggingface
      volumes:
        - name: model-cache
          persistentVolumeClaim:
            claimName: model-cache-pvc
      nodeSelector:
        nvidia.com/gpu.present: "true"

For more vLLM configuration details, see the production deployment guide.
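
The nodeSelector above matches any GPU node. On clusters that mix GPU models, node affinity can pin pods to a specific card; the nvidia.com/gpu.product label used here is applied by NVIDIA's GPU Feature Discovery, and the product value is a placeholder for whatever your nodes actually report:

```yaml
# Optional: pin pods to one GPU model via node affinity.
# Assumes GPU Feature Discovery labels nodes with nvidia.com/gpu.product;
# replace the value with the label reported by your nodes.
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: nvidia.com/gpu.product
                operator: In
                values:
                  - NVIDIA-GeForce-RTX-5090
```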

Service and Ingress

Expose the inference deployment as a Kubernetes Service with an Ingress for external access.

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-service
spec:
  selector:
    app: vllm
  ports:
    - port: 8000
      targetPort: 8000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
spec:
  rules:
    - host: inference.yourdomain.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-service
                port:
                  number: 8000
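
Once the Service is up, a quick smoke test over a port-forward confirms the OpenAI-compatible endpoint responds; the model name in the completion request must match the one passed to --model:

```shell
# Forward the Service locally and query the OpenAI-compatible API.
kubectl port-forward svc/vllm-service 8000:8000 &
curl -s http://localhost:8000/v1/models

curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 16}'
```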

Multi-GPU Pods

For larger models that require multiple GPUs, request more than one GPU resource and enable tensor parallelism. Size the GPU count so the combined VRAM fits the model: a 70B model in 16-bit precision needs roughly 140 GB for the weights alone, so plan on two 80 GB-class GPUs at minimum.

containers:
  - name: vllm
    image: vllm/vllm-openai:latest
    args:
      - "--model"
      - "meta-llama/Llama-3.1-70B-Instruct"
      - "--tensor-parallel-size"
      - "2"
      - "--gpu-memory-utilization"
      - "0.90"
    resources:
      limits:
        nvidia.com/gpu: 2
        memory: "200Gi"
      requests:
        nvidia.com/gpu: 2
        memory: "180Gi"
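
Tensor parallelism coordinates GPUs over NCCL, which typically needs more shared memory than the container runtime's default /dev/shm allocation. A memory-backed emptyDir mounted at /dev/shm avoids NCCL shared-memory failures; the 16Gi sizeLimit here is an assumption, so size it for your workload:

```yaml
# Mount a memory-backed emptyDir at /dev/shm for NCCL.
# sizeLimit is an assumption; tune it for your model and GPU count.
spec:
  containers:
    - name: vllm
      volumeMounts:
        - name: shm
          mountPath: /dev/shm
  volumes:
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: 16Gi
```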

GPU-Aware Autoscaling

Scale inference pods based on custom metrics such as request queue depth or GPU utilisation.

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-inference
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_queue_depth
        target:
          type: AverageValue
          averageValue: "10"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300

Export custom metrics from your inference server in Prometheus format. The Prometheus adapter then exposes them through the Kubernetes custom metrics API, where the HPA controller reads them.
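
A minimal adapter rule for the queue-depth metric might look like the following (a prometheus-adapter Helm values.yaml fragment); it assumes your inference server exports an inference_queue_depth gauge labelled with pod and namespace:

```yaml
# prometheus-adapter rule translating the scraped inference_queue_depth
# gauge into a per-pod custom metric the HPA can target.
# Assumes the metric carries namespace and pod labels.
rules:
  custom:
    - seriesQuery: 'inference_queue_depth{namespace!="",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "inference_queue_depth"
        as: "inference_queue_depth"
      metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```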

Model Storage

Use PersistentVolumeClaims to cache model weights and avoid re-downloading them on every pod restart. Note that a ReadWriteOnce claim can only be mounted by pods on a single node, so if the HPA scales the deployment across nodes, use a ReadWriteMany-capable storage class or a per-node cache instead.

# pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 100Gi
  storageClassName: fast-ssd

For model versioning and deployment strategies, see the model versioning guide and the blue-green deployment guide. The CI/CD pipeline guide covers automated model deployments. The self-hosting guide covers base infrastructure, and our tutorials section has more orchestration patterns.
