
Connect Android App to Self-Hosted AI

Stream AI responses from your GPU server into a native Android application using Kotlin. This guide covers OkHttp streaming, Kotlin coroutines integration, Jetpack Compose state management, and building a native AI chat experience with your self-hosted LLM.

What You’ll Connect

After this guide, your native Android app will stream AI responses from your own GPU server using OkHttp and Kotlin coroutines — tokens flowing into a Jetpack Compose UI in real time. Your vLLM or Ollama endpoint on dedicated GPU hardware serves the OpenAI-compatible API, and OkHttp’s streaming response body feeds tokens through a Flow that Compose observes and renders instantly.

The integration uses OkHttp for HTTP streaming and Kotlin Flow for reactive token delivery. A ViewModel holds conversation state using Compose’s mutableStateListOf, and the UI recomposes only the active message bubble as new tokens arrive.

Prerequisites

  • A GigaGPU server running a self-hosted LLM (setup guide)
  • HTTPS access to your inference endpoint
  • Android Studio with Kotlin 1.9+ and Jetpack Compose
  • API key for your GPU inference server

Integration Steps

Add OkHttp and the JSON serialisation library to your Gradle dependencies. Create a service class that builds an HTTP POST request to your GPU endpoint with stream: true, reads the response body line by line using OkHttp’s streaming API, and emits tokens through a Kotlin Flow. The Flow runs on Dispatchers.IO to keep the main thread free.
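The endpoint streams Server-Sent Events: each chunk arrives as a "data: {json}" line, and "data: [DONE]" marks the end of the stream. The line handling can be sketched with two small helpers — the names here are our own, not part of the service class:

```kotlin
// Hypothetical helpers (our names): classify one line of an
// OpenAI-compatible SSE stream.

// True when the line is the end-of-stream sentinel.
fun isDone(line: String): Boolean = line.trim() == "data: [DONE]"

// The JSON payload of a data line, or null for comments, keep-alives
// and the [DONE] sentinel.
fun ssePayload(line: String): String? =
    line.takeIf { it.startsWith("data: ") && !isDone(it) }
        ?.removePrefix("data: ")
```

Lines that yield a payload are parsed as JSON and their delta content emitted into the Flow; everything else is skipped.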

Build a ViewModel that holds the message list in a mutableStateListOf for Compose observation. The sendMessage() function launches a coroutine that collects tokens from the service Flow and appends them to the current assistant message. Compose detects the state change and recomposes the affected message composable.

Create the chat screen with a LazyColumn for messages, a TextField for input, and a send IconButton. Use LaunchedEffect with a key on the message count to auto-scroll to the bottom. The LazyColumn efficiently renders only visible messages, handling long conversations without performance issues.
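The screen described above translates roughly into the sketch below. MessageBubble is a minimal hypothetical composable (style it to taste), and ChatViewModel and ChatMessage are the types from the Code Example section:

```kotlin
import androidx.compose.foundation.layout.*
import androidx.compose.foundation.lazy.LazyColumn
import androidx.compose.foundation.lazy.itemsIndexed
import androidx.compose.foundation.lazy.rememberLazyListState
import androidx.compose.material.icons.Icons
import androidx.compose.material.icons.filled.Send
import androidx.compose.material3.*
import androidx.compose.runtime.*
import androidx.compose.ui.Modifier
import androidx.compose.ui.unit.dp

@Composable
fun ChatScreen(viewModel: ChatViewModel) {
    val listState = rememberLazyListState()
    var input by remember { mutableStateOf("") }

    // Auto-scroll on every new message; keying on size re-runs the effect.
    LaunchedEffect(viewModel.messages.size) {
        if (viewModel.messages.isNotEmpty()) {
            listState.animateScrollToItem(viewModel.messages.lastIndex)
        }
    }

    Column(Modifier.fillMaxSize()) {
        LazyColumn(state = listState, modifier = Modifier.weight(1f)) {
            // Stable keys and contentType help Compose reuse bubbles
            // during rapid streaming recompositions.
            itemsIndexed(
                viewModel.messages,
                key = { index, _ -> index },
                contentType = { _, msg -> msg.role }
            ) { _, msg -> MessageBubble(msg) }
        }
        Row {
            TextField(
                value = input,
                onValueChange = { input = it },
                modifier = Modifier.weight(1f)
            )
            IconButton(
                onClick = { viewModel.send(input); input = "" },
                enabled = !viewModel.isStreaming
            ) {
                Icon(Icons.Filled.Send, contentDescription = "Send")
            }
        }
    }
}

// Hypothetical minimal bubble — replace with your own styled version.
@Composable
fun MessageBubble(msg: ChatMessage) {
    Text(text = msg.content, modifier = Modifier.padding(8.dp))
}
```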

Code Example

Kotlin streaming service and ViewModel for your self-hosted LLM:

import androidx.compose.runtime.getValue
import androidx.compose.runtime.mutableStateListOf
import androidx.compose.runtime.mutableStateOf
import androidx.compose.runtime.setValue
import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.Job
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.catch
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn
import kotlinx.coroutines.launch
import kotlinx.coroutines.withContext
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject
import java.io.IOException
import java.util.concurrent.TimeUnit

data class ChatMessage(val role: String, val content: String)

class AiService(private val apiUrl: String, private val apiKey: String) {
    // Streamed responses stay open far longer than OkHttp's 10-second
    // default read timeout, so disable it for this client.
    private val client = OkHttpClient.Builder()
        .readTimeout(0, TimeUnit.SECONDS)
        .build()

    fun streamCompletion(messages: List<ChatMessage>): Flow<String> = flow {
        val body = JSONObject().apply {
            put("model", "meta-llama/Llama-3-70b-chat-hf")
            put("stream", true)
            put("max_tokens", 1024)
            put("messages", JSONArray().apply {
                messages.forEach { put(JSONObject().apply {
                    put("role", it.role); put("content", it.content)
                })}
            })
        }

        val request = Request.Builder()
            .url("$apiUrl/v1/chat/completions")
            .addHeader("Authorization", "Bearer $apiKey")
            .post(body.toString().toRequestBody("application/json".toMediaType()))
            .build()

        client.newCall(request).execute().use { response ->
            if (!response.isSuccessful) {
                throw IOException("Inference server returned HTTP ${response.code}")
            }
            val source = response.body?.source() ?: return@use
            while (!source.exhausted()) {
                val line = source.readUtf8Line() ?: break
                if (!line.startsWith("data: ")) continue
                val payload = line.removePrefix("data: ")
                if (payload == "[DONE]") break // end-of-stream sentinel
                val token = JSONObject(payload).getJSONArray("choices")
                    .getJSONObject(0).getJSONObject("delta")
                    .optString("content", "")
                if (token.isNotEmpty()) emit(token)
            }
        }
    }.flowOn(Dispatchers.IO) // keep network I/O off the main thread
}

// ViewModel with Compose state
class ChatViewModel(private val service: AiService) : ViewModel() {
    val messages = mutableStateListOf<ChatMessage>()
    var isStreaming by mutableStateOf(false)
    private var job: Job? = null

    fun send(text: String) {
        messages.add(ChatMessage("user", text))
        messages.add(ChatMessage("assistant", ""))
        isStreaming = true

        job = viewModelScope.launch(Dispatchers.IO) {
            // dropLast(1) excludes the empty assistant placeholder from the prompt
            service.streamCompletion(messages.dropLast(1))
                .catch { /* e.g. IOException on a network change — surface an error/retry state */ }
                .collect { token ->
                    withContext(Dispatchers.Main) {
                        val last = messages.last()
                        messages[messages.lastIndex] =
                            last.copy(content = last.content + token)
                    }
                }
            withContext(Dispatchers.Main) { isStreaming = false }
        }
    }

    fun cancel() { job?.cancel(); isStreaming = false }
}

Testing Your Integration

Run the app on an emulator and a physical device. Send test messages and verify tokens stream in real time on both. Test with the Android Profiler open to confirm the main thread stays responsive during streaming — all network I/O should happen on the IO dispatcher. Test rotation and configuration changes to verify the ViewModel survives and streaming continues.

Test network transitions: switch from Wi-Fi to cellular mid-stream. OkHttp will throw an IOException; make sure your Flow's error handling catches it and shows an error state with a retry button. Also exercise the cancel function and verify that cancelling the coroutine stops the stream and closes the OkHttp call cleanly.
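A generic retry wrapper along these lines can back the retry button — a sketch with a hypothetical helper name, not part of the article's code:

```kotlin
import java.io.IOException

// Hypothetical helper: run a block, retrying on IOException
// (e.g. a Wi-Fi-to-cellular handover) up to `attempts` times.
fun <T> withRetry(attempts: Int, block: () -> T): T {
    var lastError: IOException? = null
    repeat(attempts) {
        try {
            return block()      // success: return immediately
        } catch (e: IOException) {
            lastError = e       // remember the failure and try again
        }
    }
    throw lastError ?: IllegalArgumentException("attempts must be >= 1")
}
```

In practice you would wrap the call that restarts the stream, resetting the partial assistant message before each attempt.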

Production Tips

Route requests through your own backend to keep the API key off the device. Use OkHttp certificate pinning for additional transport security. Store conversation history in a Room database for offline access and fast app restarts. Use Android's WorkManager for long-running AI tasks that should complete even if the user backgrounds the app.
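Certificate pinning with OkHttp's CertificatePinner looks roughly like this — the hostname and pin below are placeholders; generate the real sha256 pin from your server's certificate:

```kotlin
import okhttp3.CertificatePinner
import okhttp3.OkHttpClient

// Placeholder hostname and pin — replace with your endpoint's real values.
val pinnedClient = OkHttpClient.Builder()
    .certificatePinner(
        CertificatePinner.Builder()
            .add("inference.example.com", "sha256/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=")
            .build()
    )
    .build()
```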

Optimise the LazyColumn with stable keys and contentType parameters so Compose reuses composables efficiently during rapid streaming updates. Add Material 3 theming with dynamic colour for a native Android look, and you have a full AI chatbot experience on Android. Explore more tutorials or get started with GigaGPU to power your Android apps.
