Your Inference API Is Open to the Entire Internet
You deployed vLLM on port 8000, opened the port, and walked away. Anyone who discovers the endpoint can drain your GPU compute for free, flood the server with requests, or probe the model for exploitable behavior. A GPU server running AI inference needs firewall rules that lock down the API while still permitting legitimate client traffic, monitoring, and internal multi-GPU communication.
UFW: Simple Firewall for AI Servers
UFW provides readable firewall management on Ubuntu:
# Install and enable UFW
sudo apt install -y ufw
# Default policy: deny incoming, allow outgoing
sudo ufw default deny incoming
sudo ufw default allow outgoing
# Allow SSH (always do this BEFORE enabling UFW)
sudo ufw allow 22/tcp comment 'SSH'
# Allow inference API from specific IPs only
sudo ufw allow from 10.0.0.0/8 to any port 8000 proto tcp \
comment 'vLLM API - internal'
sudo ufw allow from 203.0.113.50 to any port 8000 proto tcp \
comment 'vLLM API - app server'
# Allow Ollama API (internal only)
sudo ufw allow from 10.0.0.0/8 to any port 11434 proto tcp \
comment 'Ollama API - internal'
# Allow HTTPS reverse proxy
sudo ufw allow 443/tcp comment 'HTTPS'
# Allow monitoring (Prometheus, Grafana)
sudo ufw allow from 10.0.0.0/8 to any port 9090 proto tcp \
comment 'Prometheus'
# Enable firewall
sudo ufw enable
sudo ufw status verbose
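On a long ruleset it is easy to miss a rule that exposes an internal service. A quick sanity check is to parse the `ufw status` output and flag any sensitive port allowed from `Anywhere`. A minimal sketch, assuming the standard `ufw status` table layout (the sample output and port set below are illustrative):

```python
SENSITIVE_PORTS = {"8000", "11434", "9090"}  # internal-only services in this setup

def exposed_rules(status_text: str) -> list[str]:
    """Return 'ufw status' rows that allow a sensitive port from Anywhere."""
    flagged = []
    for line in status_text.splitlines():
        parts = line.split()
        # Typical row: "8000/tcp   ALLOW   10.0.0.0/8"
        if len(parts) >= 3 and "ALLOW" in parts and "Anywhere" in parts:
            port = parts[0].split("/")[0]
            if port in SENSITIVE_PORTS:
                flagged.append(line.strip())
    return flagged

# In practice, feed it the output of: sudo ufw status
SAMPLE = """To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
8000/tcp                   ALLOW       Anywhere
11434/tcp                  ALLOW       10.0.0.0/8"""
for row in exposed_rules(SAMPLE):
    print("EXPOSED:", row)
```

In the sample, only the `8000/tcp` row is flagged: SSH is expected to be public, and the Ollama rule is already scoped to the internal range.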
Multi-GPU and NCCL Traffic Rules
Distributed training and tensor parallelism require NCCL communication between GPUs:
# NCCL bootstraps over the launcher's rendezvous connection, then opens
# additional data connections on ephemeral ports, so multi-node setups
# typically trust the whole GPU subnet rather than pinning every port
# Launcher/rendezvous ports (torchrun defaults: 29400 rendezvous, 29500 master)
sudo ufw allow from 10.0.1.0/24 to any port 29400:29500 proto tcp \
comment 'torchrun rendezvous / distributed master'
# For InfiniBand/RoCE (RDMA)
sudo ufw allow from 10.0.1.0/24 to any port 4791 proto udp \
comment 'RoCE v2'
# If using NVLink within a single server, no firewall rules needed
# NVLink bypasses the network stack entirely
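Every subnet-scoped rule above reduces to one question: is the peer inside the trusted range? When an application needs to mirror that policy (for example, to reject non-cluster peers before doing any work), Python's standard `ipaddress` module expresses the same check. A sketch using the 10.0.1.0/24 range from the rules above:

```python
from ipaddress import ip_address, ip_network

TRUSTED_GPU_SUBNET = ip_network("10.0.1.0/24")  # same range as the UFW rules

def is_trusted_peer(addr: str) -> bool:
    """True if addr falls inside the trusted GPU subnet."""
    return ip_address(addr) in TRUSTED_GPU_SUBNET

print(is_trusted_peer("10.0.1.17"))    # True: another GPU node
print(is_trusted_peer("203.0.113.9"))  # False: outside the cluster
```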
Rate Limiting with iptables
Prevent API abuse without a separate rate limiter. One caveat: UFW's chains run first, and traffic UFW has already accepted never reaches rules appended to INPUT, so on a UFW-managed host place equivalent rules in /etc/ufw/before.rules instead.
# Rate limit inference API: max 30 new connections per minute per IP
# xt_recent only remembers 20 packets per source by default; raise the
# limit before using --hitcount 30 (takes effect when the module loads)
echo 'options xt_recent ip_pkt_list_tot=30' | \
sudo tee /etc/modprobe.d/xt_recent.conf
sudo iptables -A INPUT -p tcp --dport 8000 -m state --state NEW \
-m recent --set --name INFERENCE
sudo iptables -A INPUT -p tcp --dport 8000 -m state --state NEW \
-m recent --update --seconds 60 --hitcount 30 --name INFERENCE \
-j DROP
# Connection limit: max 50 simultaneous connections per IP
sudo iptables -A INPUT -p tcp --dport 8000 \
-m connlimit --connlimit-above 50 --connlimit-mask 32 \
-j REJECT --reject-with tcp-reset
# Log rate-limited clients; this rule must run BEFORE the DROP rule,
# so insert it at the top of the chain (--rcheck tests without updating)
sudo iptables -I INPUT -p tcp --dport 8000 -m state --state NEW \
-m recent --rcheck --seconds 60 --hitcount 30 --name INFERENCE \
-j LOG --log-prefix "INFERENCE-DROPPED: " --log-level 4
# Save iptables rules to persist across reboots
sudo apt install -y iptables-persistent
sudo netfilter-persistent save
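The `recent` match can be hard to reason about. The sliding window it implements is roughly: record every new connection's timestamp, then drop once 30 or more fall within the last 60 seconds. A small Python model of that behavior (an illustration, not the kernel code):

```python
from collections import defaultdict, deque

WINDOW, LIMIT = 60.0, 30  # mirrors --seconds 60 --hitcount 30

class RecentMatch:
    """Rough model of the xt_recent sliding-window match."""
    def __init__(self):
        self.seen = defaultdict(deque)  # source IP -> connection timestamps

    def allow(self, ip: str, now: float) -> bool:
        hits = self.seen[ip]
        hits.append(now)                      # the --set rule records this packet
        while hits and now - hits[0] > WINDOW:
            hits.popleft()                    # entries age out of the window
        return len(hits) < LIMIT              # --update ... -j DROP at the limit

fw = RecentMatch()
results = [fw.allow("198.51.100.7", float(t)) for t in range(30)]
print(results.count(True), "allowed,", results.count(False), "dropped")  # 29 allowed, 1 dropped
```

The 30th connection inside the window is dropped; once 60 seconds pass, the counter effectively resets because old timestamps slide out of the window.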
Nginx as a Security Layer
Place Nginx in front of the inference API for authentication and TLS:
# /etc/nginx/sites-available/inference
upstream vllm_backend {
server 127.0.0.1:8000;
keepalive 32;
}
server {
listen 443 ssl;
server_name inference.example.com;
ssl_certificate /etc/letsencrypt/live/inference.example.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/inference.example.com/privkey.pem;
# API key authentication
location /v1/ {
if ($http_authorization = "") { return 401; }
if ($http_authorization != "Bearer YOUR_API_KEY") { return 403; }
proxy_pass http://vllm_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_read_timeout 120s;
}
# Block direct model access
location / { return 404; }
}
# With Nginx in front, bind vLLM to localhost only:
# vllm serve ... --host 127.0.0.1
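The two `if` checks in the config form a simple decision chain: no credentials, wrong credentials, or proxy the request. Modelled in Python for clarity (a sketch of the config's logic, keeping the same `YOUR_API_KEY` placeholder):

```python
VALID_KEY = "Bearer YOUR_API_KEY"  # placeholder, as in the nginx config

def auth_status(authorization_header: str) -> int:
    """Mirror the nginx if-chain: 401 missing, 403 wrong, 200 proxied."""
    if authorization_header == "":
        return 401  # no credentials supplied
    if authorization_header != VALID_KEY:
        return 403  # credentials supplied but wrong
    return 200      # request is proxied to vllm_backend

print(auth_status(""))                     # 401
print(auth_status("Bearer wrong-key"))     # 403
print(auth_status("Bearer YOUR_API_KEY"))  # 200
```

Distinguishing 401 from 403 matters operationally: a spike in 401s means unauthenticated scanning, while 403s suggest someone is guessing keys.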
Audit and Verify Firewall Rules
# List all active rules with numbers
sudo ufw status numbered
# Scan from outside to verify closed ports
# (run from a different machine)
nmap -p 1-65535 your-gpu-server-ip
# Check for listening ports that should not be exposed
sudo ss -tlnp | grep -v '127.0.0.1'
# Monitor connection attempts
sudo tail -f /var/log/ufw.log
# Test that API works from allowed IPs
curl -s -o /dev/null -w "%{http_code}" \
-H "Authorization: Bearer YOUR_API_KEY" \
https://inference.example.com/v1/models
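nmap needs a second machine; for a quick self-test of individual ports, a plain TCP connect is enough. A minimal sketch using Python's standard library (replace the host and ports with your own):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

# Run on the server itself: with vLLM bound to 127.0.0.1 behind Nginx,
# 8000 should answer on loopback but not on the public interface.
print("loopback 8000:", port_open("127.0.0.1", 8000))
```

Running the same check against the server's public IP from an outside machine confirms the firewall is actually dropping what you think it is.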
Proper firewall configuration protects your GPU server without blocking legitimate AI traffic. For vLLM API setup, see the production guide; secure Ollama endpoints the same way, and watch access patterns with the monitoring guide.