What You’ll Connect
After this guide, your Next.js application will stream AI responses from your own GPU server through a server-side API route — keeping your API key secure on the server while delivering real-time token streaming to the browser. Your vLLM or Ollama endpoint on dedicated GPU hardware powers the AI features, and the Next.js API route acts as a secure proxy between your frontend and the GPU backend.
The integration uses Next.js App Router API routes with the Vercel AI SDK pattern for streaming. Your GPU endpoint serves an OpenAI-compatible API, and the Next.js backend streams completions to the client using the standard ReadableStream interface.
Prerequisites
- A GigaGPU server running a self-hosted LLM (setup guide)
- Network access from your Next.js server to the GPU endpoint
- Next.js 14+ with App Router enabled
- API key for your GPU inference server stored in environment variables
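The GPU_API_URL and GPU_API_KEY variables used throughout this guide live in .env.local, which Next.js loads automatically at startup. A sketch with placeholder values:

```shell
# .env.local (keep this file out of git)
# Base URL of your vLLM or Ollama endpoint (placeholder value)
GPU_API_URL=https://gpu.example.com
# No NEXT_PUBLIC_ prefix, so the key stays server-side and is never bundled for the browser
GPU_API_KEY=replace-with-your-key
```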
Integration Steps
Create an API route in your Next.js App Router that accepts chat messages from the frontend. The route calls your GPU server’s completion endpoint with streaming enabled, then pipes the response stream directly back to the client. This keeps the GPU API key server-side while giving the frontend real-time token streaming.
Build the client-side hook that calls your Next.js API route and parses the streaming response. Use the useChat pattern — manage conversation history in state, append user messages, stream assistant responses, and handle loading and error states. The Vercel AI SDK provides this hook out of the box. Note that useChat expects the SDK's own streaming wire format rather than raw OpenAI-style SSE, so the API route should convert the GPU server's stream (the SDK ships helpers for OpenAI-compatible backends) instead of forwarding it verbatim.
For server-rendered pages, use React Server Components to fetch AI-generated content at request time. The server component calls your GPU endpoint directly — no streaming needed for pre-rendered content — and includes the result in the initial HTML response.
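The Server Component call can be sketched as a plain async helper, assuming the same GPU_API_URL and GPU_API_KEY environment variables and the model name used later in this guide (the prompt and function names are illustrative):

```typescript
// Request body for a non-streaming completion (model name illustrative)
export function buildSummaryBody(topic: string) {
  return {
    model: "meta-llama/Llama-3-70b-chat-hf",
    messages: [
      { role: "user", content: `Write a two-sentence summary of ${topic}.` },
    ],
    stream: false, // full response in one shot, since it goes straight into the HTML
    max_tokens: 256,
  };
}

// Called from an async Server Component, e.g.:
//   export default async function Page() {
//     const summary = await generateSummary("GPU inference");
//     return <p>{summary}</p>;
//   }
export async function generateSummary(topic: string): Promise<string> {
  const res = await fetch(`${process.env.GPU_API_URL}/v1/chat/completions`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.GPU_API_KEY}`,
    },
    body: JSON.stringify(buildSummaryBody(topic)),
    cache: "no-store", // render per request; adjust if the output can be cached
  });
  const data = await res.json();
  return data.choices[0].message.content;
}
```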
Code Example
Next.js API route and client hook for streaming from your self-hosted LLM:
// app/api/chat/route.ts — Server-side API route
import { NextRequest } from "next/server";
import { OpenAIStream, StreamingTextResponse } from "ai";

export async function POST(req: NextRequest) {
  const { messages } = await req.json();

  const response = await fetch(
    `${process.env.GPU_API_URL}/v1/chat/completions`,
    {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.GPU_API_KEY}`,
      },
      body: JSON.stringify({
        model: "meta-llama/Llama-3-70b-chat-hf",
        messages,
        stream: true,
        max_tokens: 1024,
      }),
    }
  );

  if (!response.ok) {
    return new Response("Upstream GPU server error", { status: 502 });
  }

  // Convert the GPU server's OpenAI-style SSE stream into the
  // AI SDK wire format that useChat on the client knows how to parse
  const stream = OpenAIStream(response);
  return new StreamingTextResponse(stream);
}
// app/chat/page.tsx — Client component with streaming
"use client";
import { useChat } from "ai/react"; // Vercel AI SDK

export default function ChatPage() {
  const { messages, input, handleInputChange, handleSubmit, isLoading } =
    useChat({ api: "/api/chat" });

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          <strong>{m.role}:</strong> {m.content}
        </div>
      ))}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask something..."
        />
        <button type="submit" disabled={isLoading}>Send</button>
      </form>
    </div>
  );
}
Testing Your Integration
Start your Next.js dev server and open the chat page. Send a test message and verify tokens stream progressively. Check the Network tab to confirm requests go to your /api/chat route, not directly to the GPU server. Verify that the GPU_API_KEY environment variable is not exposed in the client bundle by searching the browser source.
Test edge cases: rapid message sending, very long responses, network interruptions, and concurrent users. The API route should handle each request independently with proper stream cleanup on client disconnection.
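Stream cleanup on disconnection can be handled by forwarding the route's AbortSignal to the upstream fetch. The helper below is a sketch (using the same model and env-var names as the Code Example): when the browser drops the connection, Next.js aborts req.signal, and passing that signal to fetch cancels token generation on the GPU server.

```typescript
// Builds the upstream fetch options, wiring the client's AbortSignal
// through so a browser disconnect cancels generation on the GPU server.
export function buildUpstreamInit(
  messages: unknown[],
  signal: AbortSignal,
  apiKey: string | undefined
) {
  return {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify({
      model: "meta-llama/Llama-3-70b-chat-hf",
      messages,
      stream: true,
      max_tokens: 1024,
    }),
    // When the client disconnects, the route's req.signal aborts;
    // attaching it here propagates the abort to the GPU connection.
    signal,
  };
}
```

Inside the route handler this becomes `fetch(url, buildUpstreamInit(messages, req.signal, process.env.GPU_API_KEY))`.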
Production Tips
Add rate limiting to your API route using middleware to prevent abuse. Implement user authentication so each chat session is tied to a verified user. Store conversation history in a database rather than client-side state for persistence across sessions. Add request logging to track token usage and response latency per user.
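The rate-limiting middleware can be sketched as a minimal in-memory fixed-window counter. The limit, window size, and keying by IP are assumptions, and the Map lives in a single server process; multi-instance deployments would back this with Redis or a hosted limiter instead.

```typescript
// Fixed-window rate limiter: at most MAX_REQUESTS per key per window.
const WINDOW_MS = 60_000; // 1-minute window (assumption)
const MAX_REQUESTS = 20;  // per key per window (assumption)

const hits = new Map<string, { count: number; windowStart: number }>();

export function isRateLimited(key: string, now: number = Date.now()): boolean {
  const entry = hits.get(key);
  // No entry yet, or the window has elapsed: start a fresh window.
  if (!entry || now - entry.windowStart >= WINDOW_MS) {
    hits.set(key, { count: 1, windowStart: now });
    return false;
  }
  entry.count += 1;
  return entry.count > MAX_REQUESTS;
}
```

In middleware.ts, call isRateLimited with the client IP (e.g. taken from the x-forwarded-for header) and return a 429 response when it reports true.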
For SEO-critical pages, use React Server Components to pre-render AI-generated summaries, descriptions, or metadata at request time. The server component calls the GPU endpoint, embeds the result in the HTML response, and search engines see the full page without executing any JavaScript.
With these pieces in place you can build a complete AI chatbot product on Next.js. Explore more tutorials or get started with GigaGPU to power your Next.js apps.