Tutorials

Connect React App to Self-Hosted AI

Stream AI responses from your GPU server directly into a React application. This guide covers fetching completions from your self-hosted LLM, implementing streaming with server-sent events, managing conversation state, and building a responsive chat UI.

What You’ll Connect

After this guide, your React app will stream AI responses from your own GPU server in real time — tokens appearing word by word as the model generates them. The frontend calls your vLLM or Ollama endpoint on dedicated GPU hardware through an OpenAI-compatible API, giving your React application the same streaming chat experience as ChatGPT, powered entirely by your self-hosted infrastructure.

The integration uses the standard Fetch API with streaming to consume server-sent events from your GPU endpoint. No special SDK is required — your React app talks to the same OpenAI-compatible API that any OpenAI client library uses, pointed at your own server.

Prerequisites

  • A GigaGPU server running a self-hosted LLM (setup guide)
  • HTTPS access to your inference endpoint with CORS configured
  • A React 18+ application (Create React App, Vite, or Next.js)
  • API key for your GPU inference server

Integration Steps

Configure CORS on your GPU server’s reverse proxy to allow requests from your React app’s domain. Create an environment variable for the API base URL so you can switch between development and production endpoints. Build a service module that handles streaming fetch requests to the chat completions endpoint.
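The service module described above can be sketched as a small JavaScript file. This is a minimal sketch, not a definitive implementation: the helper name `buildChatRequest` is illustrative, and `REACT_APP_GPU_API_URL` / `REACT_APP_GPU_API_KEY` match the environment variables used in the hook later in this guide.

```javascript
// chatService.js — sketch of a service module for the OpenAI-compatible
// chat completions endpoint exposed by vLLM or Ollama.
// buildChatRequest is an illustrative helper name.
export function buildChatRequest(baseUrl, apiKey, messages, options = {}) {
  return {
    url: `${baseUrl}/v1/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({
        model: options.model || "meta-llama/Llama-3-70b-chat-hf",
        messages,
        stream: true,
        max_tokens: options.maxTokens || 1024,
      }),
    },
  };
}

// Thin wrapper so components never touch fetch options directly.
export function streamChat(baseUrl, apiKey, messages, signal) {
  const { url, init } = buildChatRequest(baseUrl, apiKey, messages);
  return fetch(url, { ...init, signal });
}
```

Keeping request construction in one place means the model name, token limits, and endpoint path change in a single file when you switch inference servers.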

Implement a streaming parser that reads server-sent events from the response body. Each chunk contains a delta token that you append to the conversation state. Use React state management to update the UI as tokens arrive, creating the progressive text appearance that users expect from AI chat interfaces.
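The parsing step is easier to test in isolation as a pure function. The sketch below (the name `parseSSELines` is illustrative) handles the one detail that trips up most implementations: a network chunk can end mid-line, so the incomplete tail must be carried over to the next chunk rather than parsed immediately.

```javascript
// Parse one decoded text chunk of server-sent events into delta tokens.
// Returns extracted tokens, the trailing partial line to carry into the
// next call, and whether the [DONE] sentinel was seen.
// parseSSELines is an illustrative name.
export function parseSSELines(buffer, chunk) {
  const lines = (buffer + chunk).split("\n");
  const remainder = lines.pop(); // last element may be an incomplete line
  const tokens = [];
  let done = false;
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const data = line.slice(6).trim();
    if (data === "[DONE]") {
      done = true;
      break;
    }
    const token = JSON.parse(data).choices[0]?.delta?.content;
    if (token) tokens.push(token);
  }
  return { tokens, remainder, done };
}
```

A unit test can feed this function chunks split at awkward boundaries without any network involved.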

Build a conversation manager that tracks message history, handles system prompts, and manages the request lifecycle — loading states, error handling, and request cancellation via AbortController when users navigate away or start a new message.
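One way to structure that conversation manager is a reducer you can drop into React's useReducer. The sketch below is an assumption about shape rather than a fixed API: `chatReducer`, the action names, and the placeholder system prompt are all illustrative.

```javascript
// Reducer sketch for conversation state: keeps the system prompt first,
// appends user turns with an empty assistant slot, and streams deltas
// into the last assistant message. Names here are illustrative.
const SYSTEM_PROMPT = { role: "system", content: "You are a helpful assistant." };

export function chatReducer(state, action) {
  switch (action.type) {
    case "user_message":
      return [
        ...state,
        { role: "user", content: action.content },
        { role: "assistant", content: "" },
      ];
    case "assistant_delta": {
      const next = [...state];
      const last = next[next.length - 1];
      next[next.length - 1] = { ...last, content: last.content + action.token };
      return next;
    }
    case "reset":
      return [SYSTEM_PROMPT];
    default:
      return state;
  }
}
```

Because each transition is a pure function, loading states and cancellation can live in the component while message history stays deterministic and easy to persist.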

Code Example

React hook for streaming chat completions from your self-hosted LLM:

import { useState, useCallback, useRef } from "react";

// Create React App reads env vars prefixed with REACT_APP_;
// Vite uses import.meta.env.VITE_*, Next.js uses NEXT_PUBLIC_*.
const API_URL = process.env.REACT_APP_GPU_API_URL + "/v1/chat/completions";
const API_KEY = process.env.REACT_APP_GPU_API_KEY;

export function useChat() {
  const [messages, setMessages] = useState([]);
  const [isStreaming, setIsStreaming] = useState(false);
  const [error, setError] = useState(null);
  const abortRef = useRef(null);

  const sendMessage = useCallback(async (userMessage) => {
    const newMessages = [...messages, { role: "user", content: userMessage }];
    setMessages([...newMessages, { role: "assistant", content: "" }]);
    setIsStreaming(true);
    setError(null);

    abortRef.current = new AbortController();

    try {
      const response = await fetch(API_URL, {
        method: "POST",
        headers: {
          "Content-Type": "application/json",
          Authorization: `Bearer ${API_KEY}`,
        },
        body: JSON.stringify({
          model: "meta-llama/Llama-3-70b-chat-hf",
          messages: newMessages,
          stream: true,
          max_tokens: 1024,
        }),
        signal: abortRef.current.signal,
      });
      if (!response.ok) throw new Error(`HTTP ${response.status}`);

      const reader = response.body.getReader();
      const decoder = new TextDecoder();
      let assistantContent = "";
      let buffer = ""; // carries partial SSE lines split across chunks

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        // { stream: true } keeps multi-byte characters intact across chunks
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split("\n");
        buffer = lines.pop(); // last element may be an incomplete line

        for (const line of lines) {
          if (!line.startsWith("data: ")) continue;
          const data = line.slice(6).trim();
          if (data === "[DONE]") return; // finally block resets isStreaming
          const token = JSON.parse(data).choices[0]?.delta?.content || "";
          assistantContent += token;
          setMessages((prev) => {
            const updated = [...prev];
            updated[updated.length - 1] = {
              role: "assistant",
              content: assistantContent,
            };
            return updated;
          });
        }
      }
    } catch (err) {
      // User-initiated cancellation is not an error state
      if (err.name !== "AbortError") setError(err);
    } finally {
      setIsStreaming(false);
    }
  }, [messages]);

  const cancel = () => abortRef.current?.abort();
  return { messages, sendMessage, isStreaming, error, cancel };
}

Testing Your Integration

Start your React dev server and send a test message through the chat interface. Verify tokens stream progressively rather than arriving in a single block. Test error handling by stopping your GPU server mid-response — the UI should show an error state and allow retrying. Test the cancel function by sending a long request and clicking stop before it completes.

Check the browser’s Network tab to confirm requests go to your GPU server, not a third-party API. Verify CORS headers are present on responses. Test with multiple concurrent conversations to confirm state isolation between components.

Production Tips

Route API requests through your own backend rather than calling the GPU server directly from the browser — this hides your API key and lets you add rate limiting, user authentication, and request logging. Your backend proxies requests to the GPU endpoint, streaming the response back to the React client.
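A backend proxy along those lines can be sketched with nothing but Node 18+ built-ins. This is a minimal sketch, not production code: the `/api/chat` route, the `GPU_API_URL`/`GPU_API_KEY` environment variables, and the helper names are assumptions, and real deployments would add authentication and rate limiting where the comment indicates.

```javascript
// Minimal Node.js (18+) proxy sketch: the browser posts to /api/chat,
// the server forwards to the GPU endpoint with its own credentials and
// pipes the SSE stream back. The API key never reaches the browser.
import http from "node:http";

// Replace whatever the client sent with server-side credentials.
export function upstreamHeaders(serverApiKey) {
  return {
    "Content-Type": "application/json",
    Authorization: `Bearer ${serverApiKey}`,
  };
}

export function createProxy({ gpuUrl, apiKey }) {
  return http.createServer(async (req, res) => {
    if (req.method !== "POST" || req.url !== "/api/chat") {
      res.writeHead(404);
      res.end();
      return;
    }
    let body = "";
    for await (const chunk of req) body += chunk;

    // This is where user authentication, rate limiting, and request
    // logging would go before anything reaches the GPU server.
    const upstream = await fetch(`${gpuUrl}/v1/chat/completions`, {
      method: "POST",
      headers: upstreamHeaders(apiKey),
      body,
    });
    res.writeHead(upstream.status, { "Content-Type": "text/event-stream" });
    for await (const chunk of upstream.body) res.write(chunk);
    res.end();
  });
}

// Example startup (env var names are assumptions):
// createProxy({ gpuUrl: process.env.GPU_API_URL,
//               apiKey: process.env.GPU_API_KEY }).listen(3001);
```

With this in place, the React hook's API_URL points at your backend route instead of the GPU server, and REACT_APP_GPU_API_KEY can be dropped from the frontend entirely.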

Wrap individual message components in React.memo (and memoize derived values with useMemo) so that only the actively streaming message re-renders as tokens arrive. For long conversations, implement message pagination that loads history on demand rather than keeping the entire thread in state. Build an AI chatbot experience with conversation persistence, user accounts, and model selection. Explore more tutorials or get started with GigaGPU to power your React apps with GPU inference.
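The on-demand history loading mentioned above reduces to a small windowing helper. A sketch, with `pageOfMessages` as an illustrative name: page 0 is the newest messages, and each higher page steps further back as the user scrolls up.

```javascript
// Return one page of messages, newest-first pagination over an array
// stored oldest-to-newest. pageOfMessages is an illustrative name.
export function pageOfMessages(allMessages, page, pageSize = 20) {
  const end = allMessages.length - page * pageSize;
  const start = Math.max(0, end - pageSize);
  return end <= 0 ? [] : allMessages.slice(start, end);
}
```

Render only the pages the user has scrolled to, keeping the rest in IndexedDB or on your backend.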
