You will build a gRPC inference service that streams AI model responses with lower latency and higher throughput than REST APIs. By the end, you will have a working gRPC server on your GPU server with streaming, health checks, and client libraries in Python and Node.js.
## Why gRPC for Inference
gRPC uses HTTP/2 with binary Protocol Buffers instead of JSON over HTTP/1.1. For AI inference, this delivers measurable advantages, summarised in the table below; the snippet after the table lets you verify the payload-size difference yourself.
| Metric | REST/JSON | gRPC/Protobuf |
|---|---|---|
| Serialisation overhead | Higher (text JSON) | Lower (binary protobuf) |
| Streaming | SSE (server only) | Bidirectional native |
| Connection reuse | Keep-alive | Multiplexed streams |
| Type safety | Runtime validation | Compile-time from proto |
| Latency per request | Baseline | Typically 30-50% lower |
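The serialisation gap is easy to check. The sketch below assumes you have already generated the `inference_pb2` bindings shown later in this guide; it encodes the same request both ways and compares payload sizes (exact numbers vary with message content):

```python
import json

import inference_pb2  # generated from inference.proto (see below)

msg = {"role": "user", "content": "Explain GPU compute units."}

# The same request as JSON...
json_bytes = json.dumps(
    {"messages": [msg], "max_tokens": 256, "temperature": 0.7, "top_p": 0.9}
).encode("utf-8")

# ...and as binary protobuf, which replaces field names with one-byte tags
proto_bytes = inference_pb2.GenerateRequest(
    messages=[inference_pb2.ChatMessage(role="user", content=msg["content"])],
    max_tokens=256,
    temperature=0.7,
    top_p=0.9,
).SerializeToString()

print(f"JSON: {len(json_bytes)} bytes, protobuf: {len(proto_bytes)} bytes")
```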
## Protocol Buffer Definition
Define the inference service contract in a `.proto` file. The protobuf compiler then generates typed client and server code in any supported language.
```protobuf
// inference.proto
syntax = "proto3";

package inference;

service InferenceService {
  rpc Generate (GenerateRequest) returns (GenerateResponse);
  rpc GenerateStream (GenerateRequest) returns (stream TokenResponse);
  rpc Health (HealthRequest) returns (HealthResponse);
}

message GenerateRequest {
  repeated ChatMessage messages = 1;
  int32 max_tokens = 2;
  float temperature = 3;
  float top_p = 4;
}

message ChatMessage {
  string role = 1;
  string content = 2;
}

message GenerateResponse {
  string content = 1;
  string model = 2;
  int32 tokens_used = 3;
}

message TokenResponse {
  string token = 1;
  bool done = 2;
}

message HealthRequest {}

message HealthResponse {
  string status = 1;
  string model = 2;
}
```
```bash
# Generate Python code
pip install grpcio grpcio-tools
python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. inference.proto
```

This produces `inference_pb2.py` (message classes) and `inference_pb2_grpc.py` (service stubs), which the server and clients below import.
## gRPC Server Implementation
Build the server that connects to a vLLM backend and serves inference via gRPC. Because proto3 scalar fields default to zero when unset, the handlers below fall back to sensible defaults with `or`.
```python
# server.py
import grpc
from concurrent import futures
from openai import OpenAI

import inference_pb2
import inference_pb2_grpc

# vLLM exposes an OpenAI-compatible API; no real key is needed locally
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

class InferenceServicer(inference_pb2_grpc.InferenceServiceServicer):
    def Generate(self, request, context):
        messages = [{"role": m.role, "content": m.content} for m in request.messages]
        response = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            max_tokens=request.max_tokens or 256,
            temperature=request.temperature or 0.7,
            top_p=request.top_p or 1.0,
        )
        return inference_pb2.GenerateResponse(
            content=response.choices[0].message.content,
            model=response.model,
            tokens_used=response.usage.total_tokens,
        )

    def GenerateStream(self, request, context):
        messages = [{"role": m.role, "content": m.content} for m in request.messages]
        stream = client.chat.completions.create(
            model=MODEL,
            messages=messages,
            max_tokens=request.max_tokens or 1024,
            temperature=request.temperature or 0.7,
            top_p=request.top_p or 1.0,
            stream=True,
        )
        # Relay each delta from the backend as a TokenResponse, then signal completion
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                yield inference_pb2.TokenResponse(token=content, done=False)
        yield inference_pb2.TokenResponse(token="", done=True)

    def Health(self, request, context):
        # A cheap round-trip to the backend doubles as a liveness probe
        try:
            client.models.list()
            return inference_pb2.HealthResponse(status="healthy", model=MODEL)
        except Exception:
            return inference_pb2.HealthResponse(status="unhealthy", model=MODEL)

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    inference_pb2_grpc.add_InferenceServiceServicer_to_server(InferenceServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    print("gRPC server running on port 50051")
    server.wait_for_termination()

if __name__ == "__main__":
    serve()
```
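`add_insecure_port` is fine on a trusted private network, but anything that crosses machine boundaries should use TLS. A minimal sketch, assuming you have a certificate pair at the placeholder paths `key.pem` and `cert.pem`:

```python
# Swap add_insecure_port for a TLS-terminated port (paths are placeholders)
with open("key.pem", "rb") as f:
    private_key = f.read()
with open("cert.pem", "rb") as f:
    certificate_chain = f.read()

credentials = grpc.ssl_server_credentials([(private_key, certificate_chain)])
server.add_secure_port("[::]:50051", credentials)
```

Clients then connect with `grpc.secure_channel("host:50051", grpc.ssl_channel_credentials())` instead of `grpc.insecure_channel`.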
For the REST alternative, see the FastAPI server guide. For WebSocket streaming, check WebSockets for real-time AI.
## Python Client
```python
# client.py
import grpc

import inference_pb2
import inference_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")
stub = inference_pb2_grpc.InferenceServiceStub(channel)

# Unary call
request = inference_pb2.GenerateRequest(
    messages=[inference_pb2.ChatMessage(role="user", content="Explain GPU compute units.")],
    max_tokens=256,
    temperature=0.7,
)
response = stub.Generate(request)  # stub calls also accept a timeout= kwarg (seconds)
print(response.content)

# Streaming call: iterate tokens as the server yields them
for token_response in stub.GenerateStream(request):
    if not token_response.done:
        print(token_response.token, end="", flush=True)
```
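The synchronous stub blocks a thread per call. If your application is async (a FastAPI gateway, for example), `grpcio` ships an asyncio API that works with the same generated stubs; a minimal sketch:

```python
# async_client.py — same generated modules, non-blocking via grpc.aio
import asyncio

import grpc
import inference_pb2
import inference_pb2_grpc

async def main():
    async with grpc.aio.insecure_channel("localhost:50051") as channel:
        stub = inference_pb2_grpc.InferenceServiceStub(channel)
        request = inference_pb2.GenerateRequest(
            messages=[inference_pb2.ChatMessage(role="user", content="Explain GPU compute units.")],
            max_tokens=256,
            temperature=0.7,
        )
        # The stream is an async iterator, so tokens arrive without blocking the event loop
        async for token_response in stub.GenerateStream(request):
            if not token_response.done:
                print(token_response.token, end="", flush=True)

asyncio.run(main())
```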
## Load Balancing and Scaling
gRPC’s HTTP/2 multiplexing requires connection-aware (L7) load balancing: a standard TCP (L4) load balancer balances connections rather than requests, so every stream on a long-lived gRPC connection lands on the same backend.
```yaml
# Use Envoy or Nginx with gRPC support
# Envoy configuration snippet
clusters:
  - name: inference_cluster
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    http2_protocol_options: {}
    load_assignment:
      cluster_name: inference_cluster
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: gpu-server-1
                    port_value: 50051
            - endpoint:
                address:
                  socket_address:
                    address: gpu-server-2
                    port_value: 50051
```
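An alternative to a proxy is client-side load balancing: gRPC's built-in `round_robin` policy spreads calls across every address a DNS name resolves to. A sketch, with `inference.internal` as a hypothetical DNS name returning one A record per GPU server:

```python
import grpc

# round_robin balances per-call across all resolved backends, instead of
# pinning every stream to one connection (the default pick_first behaviour)
channel = grpc.insecure_channel(
    "dns:///inference.internal:50051",  # hypothetical internal DNS name
    options=[("grpc.lb_policy_name", "round_robin")],
)
```

This keeps the hot path proxy-free, at the cost of pushing balancing logic into every client.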
For Kubernetes-based scaling, see the GPU pod configuration guide. Add Prometheus monitoring with gRPC interceptors for latency tracking (a minimal sketch follows). The self-hosting guide covers infrastructure planning, and our tutorials section has more API patterns. For API gateway integration, see Kong/Traefik for AI.
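As a starting point for that latency tracking, here is a minimal server-side interceptor sketch using `prometheus_client`. It only wraps unary-unary handlers; streaming methods would need the analogous `unary_stream` treatment:

```python
import time

import grpc
from prometheus_client import Histogram

RPC_LATENCY = Histogram("grpc_rpc_latency_seconds", "RPC latency", ["method"])

class LatencyInterceptor(grpc.ServerInterceptor):
    def intercept_service(self, continuation, handler_call_details):
        handler = continuation(handler_call_details)
        # Pass streaming handlers through untouched in this sketch
        if handler is None or handler.unary_unary is None:
            return handler

        method = handler_call_details.method
        inner = handler.unary_unary

        def timed(request, context):
            # Time the wrapped handler and record the observation per method
            start = time.perf_counter()
            try:
                return inner(request, context)
            finally:
                RPC_LATENCY.labels(method=method).observe(time.perf_counter() - start)

        return grpc.unary_unary_rpc_method_handler(
            timed,
            request_deserializer=handler.request_deserializer,
            response_serializer=handler.response_serializer,
        )
```

Register it with `grpc.server(..., interceptors=[LatencyInterceptor()])` and expose the metrics endpoint with `prometheus_client.start_http_server`.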
## Deploy gRPC AI Services on Dedicated GPUs

Run high-performance gRPC inference servers on bare-metal GPU hardware. Minimal latency, maximum throughput.

Browse GPU Servers