
gRPC for AI Inference: High-Performance API

Complete guide to building a gRPC AI inference service on GPU servers covering protobuf definitions, server-side streaming, bidirectional communication, load balancing, and performance comparison with REST.

You will build a gRPC inference service that streams AI model responses with lower latency and higher throughput than REST APIs. By the end, you will have a working gRPC server on your GPU server with streaming, health checks, and a typed Python client; the same proto generates client stubs for Node.js and other languages.

Why gRPC for Inference

gRPC uses HTTP/2 with binary Protocol Buffers instead of JSON over HTTP/1.1. For AI inference, this delivers measurable advantages.

Metric | REST/JSON | gRPC/Protobuf
Serialisation overhead | Higher (text JSON) | Lower (binary protobuf)
Streaming | SSE (server only) | Bidirectional native
Connection reuse | Keep-alive | Multiplexed streams
Type safety | Runtime validation | Compile-time from proto
Latency per request | Baseline | 30-50% lower

Protocol Buffer Definition

Define the inference service contract. This generates typed client and server code in any language.

// inference.proto
syntax = "proto3";
package inference;

service InferenceService {
  rpc Generate (GenerateRequest) returns (GenerateResponse);
  rpc GenerateStream (GenerateRequest) returns (stream TokenResponse);
  rpc Health (HealthRequest) returns (HealthResponse);
}

message GenerateRequest {
  repeated ChatMessage messages = 1;
  int32 max_tokens = 2;
  float temperature = 3;
  float top_p = 4;
}

message ChatMessage {
  string role = 1;
  string content = 2;
}

message GenerateResponse {
  string content = 1;
  string model = 2;
  int32 tokens_used = 3;
}

message TokenResponse {
  string token = 1;
  bool done = 2;
}

message HealthRequest {}
message HealthResponse {
  string status = 1;
  string model = 2;
}

# Generate Python code
pip install grpcio grpcio-tools
python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. inference.proto
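
To confirm the generated stubs import correctly, and to see the payload-size difference behind the table above, a quick check can serialise the same request as protobuf and as JSON. This is a minimal sketch; the script name and prompt are illustrative.

# check_stubs.py - hypothetical sanity check; assumes the protoc command above was run
import json
import inference_pb2

req = inference_pb2.GenerateRequest(
    messages=[inference_pb2.ChatMessage(role="user", content="Explain GPU compute units.")],
    max_tokens=256,
    temperature=0.7,
)

proto_bytes = req.SerializeToString()  # binary protobuf wire format
json_bytes = json.dumps({              # equivalent REST/JSON body
    "messages": [{"role": "user", "content": "Explain GPU compute units."}],
    "max_tokens": 256,
    "temperature": 0.7,
}).encode()

print(f"protobuf: {len(proto_bytes)} bytes, JSON: {len(json_bytes)} bytes")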

gRPC Server Implementation

Build the server that connects to a vLLM backend and serves inference via gRPC.

import grpc
from concurrent import futures
from openai import OpenAI
import inference_pb2
import inference_pb2_grpc

# vLLM exposes an OpenAI-compatible HTTP API locally; the gRPC layer wraps it with a typed contract
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
MODEL = "meta-llama/Llama-3.1-8B-Instruct"

class InferenceServicer(inference_pb2_grpc.InferenceServiceServicer):
    # Unary RPC: return the full completion in one response (proto3 scalars default to 0, hence the fallbacks)
    def Generate(self, request, context):
        messages = [{"role": m.role, "content": m.content} for m in request.messages]
        response = client.chat.completions.create(
            model=MODEL, messages=messages,
            max_tokens=request.max_tokens or 256,
            temperature=request.temperature or 0.7
        )
        return inference_pb2.GenerateResponse(
            content=response.choices[0].message.content,
            model=response.model,
            tokens_used=response.usage.total_tokens
        )

    # Server-streaming RPC: forward each vLLM delta as a TokenResponse, then signal completion
    def GenerateStream(self, request, context):
        messages = [{"role": m.role, "content": m.content} for m in request.messages]
        stream = client.chat.completions.create(
            model=MODEL, messages=messages,
            max_tokens=request.max_tokens or 1024,
            temperature=request.temperature or 0.7,
            stream=True
        )
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                yield inference_pb2.TokenResponse(token=content, done=False)
        yield inference_pb2.TokenResponse(token="", done=True)

    # Health RPC: reports whether the vLLM backend is reachable
    def Health(self, request, context):
        try:
            client.models.list()
            return inference_pb2.HealthResponse(status="healthy", model=MODEL)
        except Exception:
            return inference_pb2.HealthResponse(status="unhealthy", model=MODEL)

def serve():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    inference_pb2_grpc.add_InferenceServiceServicer_to_server(InferenceServicer(), server)
    server.add_insecure_port("[::]:50051")
    server.start()
    print("gRPC server running on port 50051")
    server.wait_for_termination()

if __name__ == "__main__":
    serve()
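
Load balancers and Kubernetes probes usually speak the standard gRPC health-checking protocol rather than a custom Health RPC. The sketch below shows how the standard service could be registered alongside the servicer above; it assumes the grpcio-health-checking package is installed and reuses the imports from the server file.

# Optional: standard gRPC health checking (pip install grpcio-health-checking)
from grpc_health.v1 import health, health_pb2, health_pb2_grpc

def serve_with_health():
    server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
    inference_pb2_grpc.add_InferenceServiceServicer_to_server(InferenceServicer(), server)

    # Register the standard health service and mark our service as SERVING
    health_servicer = health.HealthServicer()
    health_pb2_grpc.add_HealthServicer_to_server(health_servicer, server)
    health_servicer.set("inference.InferenceService", health_pb2.HealthCheckResponse.SERVING)

    server.add_insecure_port("[::]:50051")
    server.start()
    server.wait_for_termination()

Tools such as grpc_health_probe, Envoy health checks, or Kubernetes gRPC probes can then query the service without knowing anything about the custom Health RPC.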

For the REST alternative, see the FastAPI server guide. For WebSocket streaming, check WebSockets for real-time AI.

Python Client

import grpc
import inference_pb2
import inference_pb2_grpc

channel = grpc.insecure_channel("localhost:50051")
stub = inference_pb2_grpc.InferenceServiceStub(channel)

# Unary call
request = inference_pb2.GenerateRequest(
    messages=[inference_pb2.ChatMessage(role="user", content="Explain GPU compute units.")],
    max_tokens=256, temperature=0.7
)
response = stub.Generate(request)
print(response.content)

# Streaming call
for token_response in stub.GenerateStream(request):
    if not token_response.done:
        print(token_response.token, end="", flush=True)
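
Two client-side details are worth adding in practice, shown here as a sketch building on the stub above: call the Health RPC before sending traffic, and set a per-call deadline so a stalled backend fails fast instead of hanging.

# Health check before sending traffic
health = stub.Health(inference_pb2.HealthRequest(), timeout=5)
print(health.status, health.model)

# Per-call deadline: raises grpc.RpcError with DEADLINE_EXCEEDED if the server is too slow
try:
    response = stub.Generate(request, timeout=30)
    print(response.content)
except grpc.RpcError as err:
    print("RPC failed:", err.code(), err.details())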

Load Balancing and Scaling

gRPC’s HTTP/2 multiplexing requires connection-aware (Layer 7) load balancing. A standard Layer 4 TCP load balancer pins each long-lived connection to one backend, so every multiplexed stream from that client lands on the same GPU server while the others sit idle.

# Use Envoy or Nginx with gRPC support
# Envoy configuration snippet
clusters:
  - name: inference_cluster
    type: STRICT_DNS
    lb_policy: ROUND_ROBIN
    http2_protocol_options: {}
    load_assignment:
      cluster_name: inference_cluster
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: gpu-server-1
                    port_value: 50051
            - endpoint:
                address:
                  socket_address:
                    address: gpu-server-2
                    port_value: 50051
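
If a proxy is not an option, gRPC clients can balance across backends themselves. The sketch below uses client-side round-robin over a DNS name that resolves to all GPU servers; the hostname is illustrative.

# Client-side load balancing: the DNS name should resolve to every backend IP
channel = grpc.insecure_channel(
    "dns:///inference.internal:50051",
    options=[("grpc.lb_policy_name", "round_robin")],
)
stub = inference_pb2_grpc.InferenceServiceStub(channel)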

For Kubernetes-based scaling, see the GPU pod configuration guide. Add Prometheus monitoring with gRPC interceptors for latency tracking, as sketched below. The self-hosting guide covers infrastructure planning, and our tutorials section has more API patterns. For API gateway integration, see Kong/Traefik for AI.
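
A latency-tracking interceptor can be as small as the sketch below. The packages and gRPC interceptor APIs are real, but the metric name and port are placeholders, and this version only wraps unary handlers; streaming RPCs pass through untouched.

# Prometheus latency interceptor (pip install prometheus-client)
import time
from concurrent import futures
import grpc
from prometheus_client import Histogram, start_http_server

GRPC_LATENCY = Histogram("grpc_handler_latency_seconds", "gRPC handler latency", ["method"])

class LatencyInterceptor(grpc.ServerInterceptor):
    def intercept_service(self, continuation, handler_call_details):
        handler = continuation(handler_call_details)
        # Only unary-unary handlers are timed in this sketch
        if handler is None or handler.unary_unary is None:
            return handler
        method = handler_call_details.method

        def timed(request, context):
            start = time.perf_counter()
            try:
                return handler.unary_unary(request, context)
            finally:
                GRPC_LATENCY.labels(method=method).observe(time.perf_counter() - start)

        return grpc.unary_unary_rpc_method_handler(
            timed,
            request_deserializer=handler.request_deserializer,
            response_serializer=handler.response_serializer,
        )

# In serve(): expose the metrics endpoint and register the interceptor
start_http_server(9100)
server = grpc.server(futures.ThreadPoolExecutor(max_workers=10),
                     interceptors=[LatencyInterceptor()])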

Deploy gRPC AI Services on Dedicated GPUs

Run high-performance gRPC inference servers on bare-metal GPU hardware. Minimal latency, maximum throughput.

Browse GPU Servers
