What You’ll Connect
After this guide, your Vercel-deployed application will stream AI responses from your own GPU server — no API costs, no rate limits. The Vercel AI SDK handles the frontend streaming logic while your vLLM or Ollama backend on dedicated GPU hardware generates the completions.
This integration is ideal for building AI-powered SaaS products, chatbots, or content tools. Vercel handles global edge deployment of your frontend while your GPU server handles the compute-intensive inference — giving you the best of both worlds without paying per-token to a third-party AI provider.
Vercel Edge                 API Route (serverless)      GPU Server (vLLM)
(React/Next.js chat    -->  /api/chat proxies      -->  LLM inference on
UI, CDN + SSR globally)     requests to GPU             dedicated GPU
         |                         |                          |
Token-by-token render  <--  Vercel AI SDK handles  <--  SSE stream of
in UI                       the stream                  generated tokens

Prerequisites
- A GigaGPU server with a running LLM behind an OpenAI-compatible API (vLLM production guide)
- A Vercel account with a Next.js project deployed (or ready to deploy)
- Node.js 18+ for local development
- HTTPS access to your GPU server (Nginx proxy guide)
- GPU API key stored as a Vercel environment variable (security guide)
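Before wiring anything into Vercel, it is worth confirming that the GPU server's OpenAI-compatible API actually responds. A minimal sketch (Node.js 18+, assuming the GPU_API_URL and GPU_API_KEY environment variable names used throughout this guide):

```typescript
// Join a base URL and a path without doubling or dropping the slash.
function apiUrl(base: string, path: string): string {
  return base.replace(/\/+$/, '') + '/' + path.replace(/^\/+/, '');
}

// Hit the standard OpenAI-compatible /v1/models endpoint to verify
// the server is reachable and the API key is accepted.
async function checkModels(): Promise<void> {
  const url = apiUrl(process.env.GPU_API_URL ?? '', 'v1/models');
  const res = await fetch(url, {
    headers: { Authorization: `Bearer ${process.env.GPU_API_KEY}` },
  });
  if (!res.ok) throw new Error(`GPU endpoint returned ${res.status}`);
  console.log(await res.json()); // should list the served model(s)
}
```

If this fails, fix DNS, TLS, or firewall issues on the GPU server before touching the Next.js code — the Vercel side cannot work around an unreachable backend.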
Integration Steps
Install the Vercel AI SDK in your Next.js project: npm install ai openai. The AI SDK provides React hooks for streaming chat interfaces and server-side utilities for proxying to OpenAI-compatible endpoints. Note that this guide uses the OpenAIStream and StreamingTextResponse helpers from AI SDK v3; they were removed in v4, so pin ai@3 if you follow the code below verbatim.
Create an API route at app/api/chat/route.ts that initialises an OpenAI client pointed at your GPU server’s base URL. The route receives messages from the frontend, forwards them to your inference endpoint, and streams the response back using the AI SDK’s StreamingTextResponse helper.
On the frontend, use the useChat hook from the AI SDK. It manages message state, handles streaming, and provides a ready-made interface for building chat UIs. The hook calls your API route, which in turn calls your GPU server — keeping the API key server-side and never exposing it to the browser.
Code Example
API route and frontend component connecting to your GPU inference server:
// app/api/chat/route.ts
import OpenAI from 'openai';
import { OpenAIStream, StreamingTextResponse } from 'ai';

// The OpenAI client points at your GPU server, not api.openai.com.
const client = new OpenAI({
  baseURL: process.env.GPU_API_URL + '/v1',
  apiKey: process.env.GPU_API_KEY,
});

export async function POST(req: Request) {
  const { messages } = await req.json();

  // Forward the chat history to the OpenAI-compatible vLLM endpoint.
  const response = await client.chat.completions.create({
    model: 'meta-llama/Llama-3-70b-chat-hf',
    messages,
    stream: true,
    max_tokens: 1024,
  });

  // Convert the token stream into a streaming HTTP response for the browser.
  const stream = OpenAIStream(response);
  return new StreamingTextResponse(stream);
}
// app/page.tsx
'use client';
import { useChat } from 'ai/react';

export default function Chat() {
  // useChat manages message state and streams tokens from /api/chat.
  const { messages, input, handleInputChange, handleSubmit } = useChat();
  return (
    <div>
      {messages.map(m => (
        <div key={m.id}><b>{m.role}:</b> {m.content}</div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} placeholder="Ask anything..." />
        <button type="submit">Send</button>
      </form>
    </div>
  );
}
Testing Your Integration
Run npm run dev locally to test the chat interface. Type a message and verify that tokens stream into the UI in real time rather than appearing all at once. Check the browser’s Network tab to confirm the API route returns a chunked streaming response rather than a single buffered payload — the GPU server speaks SSE (text/event-stream), but the AI SDK’s StreamingTextResponse re-emits it as a plain-text stream.
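You can also verify streaming from a terminal instead of the browser. The sketch below reads a response body chunk by chunk the way the browser does; the fetch call to the local dev server is illustrative (the exact request payload depends on your AI SDK version):

```typescript
// Read a streaming response body chunk by chunk. If the route streams
// correctly, you should see many small chunks instead of one large one.
async function readChunks(body: ReadableStream<Uint8Array>): Promise<string[]> {
  const decoder = new TextDecoder();
  const chunks: string[] = [];
  const reader = body.getReader();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(decoder.decode(value, { stream: true }));
  }
  return chunks;
}

// Example usage against the local dev server (payload shape is illustrative):
// const res = await fetch('http://localhost:3000/api/chat', {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify({ messages: [{ role: 'user', content: 'Hi' }] }),
// });
// const chunks = await readChunks(res.body!);
// console.log(`received ${chunks.length} chunks`);
```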
Deploy to Vercel with vercel deploy. Set GPU_API_URL and GPU_API_KEY as environment variables in the Vercel dashboard. Test the deployed version to confirm Vercel’s serverless functions can reach your GPU server over the public internet.
Production Tips
Vercel serverless functions have a default maximum duration of 10 seconds on the Hobby plan (60 seconds on Pro). For longer model responses, use streaming responses, which keep the connection open as long as data keeps flowing, and raise the route’s duration limit to your plan’s ceiling. The AI SDK’s streaming response format is designed to work within these constraints.
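The duration limit can be raised per route with Next.js route segment config. A minimal sketch for the chat route (the 60-second value assumes a Pro plan):

```typescript
// app/api/chat/route.ts — route segment config.
// Allow this function to run for up to 60 seconds while the stream is open.
export const maxDuration = 60;
```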
Add rate limiting in your API route to prevent abuse of your GPU endpoint from the public-facing app. Use Vercel’s Edge Config or a simple in-memory counter to limit requests per IP. For authenticated apps, tie rate limits to user accounts instead.
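A minimal in-memory counter along those lines might look like this — a sketch only, since each serverless instance keeps its own counters, so enforcement is best-effort (use a shared store for hard limits):

```typescript
// Fixed-window rate limiter keyed by client IP.
const WINDOW_MS = 60_000; // 1-minute window
const MAX_REQUESTS = 20;  // per IP per window

const hits = new Map<string, { count: number; windowStart: number }>();

function allowRequest(ip: string, now: number = Date.now()): boolean {
  const entry = hits.get(ip);
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    // First request in a fresh window: reset the counter.
    hits.set(ip, { count: 1, windowStart: now });
    return true;
  }
  entry.count += 1;
  return entry.count <= MAX_REQUESTS;
}

// In the route handler (Vercel sets the x-forwarded-for header):
// const ip = req.headers.get('x-forwarded-for') ?? 'unknown';
// if (!allowRequest(ip)) {
//   return new Response('Too many requests', { status: 429 });
// }
```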
This pattern — Vercel frontend, self-hosted GPU backend — is the foundation for building AI SaaS products with predictable infrastructure costs. Your open-source model handles the inference while Vercel handles global distribution. Build an AI chatbot, explore more tutorials, or get started with GigaGPU to power your Vercel AI application.