What You’ll Connect
By the end of this guide, your native Android app will stream AI responses from your own GPU server using OkHttp and Kotlin coroutines, with tokens flowing into a Jetpack Compose UI in real time. Your vLLM or Ollama endpoint on dedicated GPU hardware serves the OpenAI-compatible API, and OkHttp’s streaming response body feeds tokens through a Flow that Compose observes and renders instantly.
The integration uses OkHttp for HTTP streaming and Kotlin Flow for reactive token delivery. A ViewModel holds conversation state using Compose’s mutableStateListOf, and the UI recomposes only the active message bubble as new tokens arrive.
Prerequisites
- A GigaGPU server running a self-hosted LLM (setup guide)
- HTTPS access to your inference endpoint
- Android Studio with Kotlin 1.9+ and Jetpack Compose
- API key for your GPU inference server
Integration Steps
Add OkHttp and the JSON serialisation library to your Gradle dependencies. Create a service class that builds an HTTP POST request to your GPU endpoint with stream: true, reads the response body line by line using OkHttp’s streaming API, and emits tokens through a Kotlin Flow. The Flow runs on Dispatchers.IO to keep the main thread free.
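The dependency block might look like this in a module-level build.gradle.kts (version numbers are illustrative, so pin whatever your project already uses; note that org.json ships with Android, so no separate JSON dependency is strictly required):

```kotlin
// app/build.gradle.kts — versions shown are examples
dependencies {
    implementation("com.squareup.okhttp3:okhttp:4.12.0")
    implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.8.0")
    implementation("androidx.lifecycle:lifecycle-viewmodel-compose:2.7.0")
}
```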
Build a ViewModel that holds the message list in a mutableStateListOf for Compose observation. The sendMessage() function launches a coroutine that collects tokens from the service Flow and appends them to the current assistant message. Compose detects the state change and recomposes the affected message composable.
Create the chat screen with a LazyColumn for messages, a TextField for input, and a send IconButton. Use LaunchedEffect with a key on the message count to auto-scroll to the bottom. The LazyColumn efficiently renders only visible messages, handling long conversations without performance issues.
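A minimal sketch of that screen, assuming Material 3 and plain Text in place of a styled message bubble. The key and contentType parameters on itemsIndexed are the ones the production tips rely on; using the index as a key is safe here because the message list is append-only:

```kotlin
import androidx.compose.foundation.layout.*
import androidx.compose.foundation.lazy.LazyColumn
import androidx.compose.foundation.lazy.itemsIndexed
import androidx.compose.foundation.lazy.rememberLazyListState
import androidx.compose.material.icons.Icons
import androidx.compose.material.icons.filled.Send
import androidx.compose.material3.*
import androidx.compose.runtime.*
import androidx.compose.ui.Modifier
import androidx.compose.ui.unit.dp

@Composable
fun ChatScreen(viewModel: ChatViewModel) {
    val listState = rememberLazyListState()
    var input by remember { mutableStateOf("") }

    // Auto-scroll to the newest message whenever the count changes.
    LaunchedEffect(viewModel.messages.size) {
        if (viewModel.messages.isNotEmpty()) {
            listState.animateScrollToItem(viewModel.messages.lastIndex)
        }
    }

    Column(Modifier.fillMaxSize()) {
        LazyColumn(state = listState, modifier = Modifier.weight(1f)) {
            itemsIndexed(
                items = viewModel.messages,
                key = { index, _ -> index },          // stable for an append-only list
                contentType = { _, msg -> msg.role }  // lets Compose reuse bubbles per role
            ) { _, msg ->
                Text("${msg.role}: ${msg.content}", Modifier.padding(8.dp))
            }
        }
        Row(Modifier.fillMaxWidth()) {
            TextField(
                value = input,
                onValueChange = { input = it },
                modifier = Modifier.weight(1f)
            )
            IconButton(
                onClick = { viewModel.send(input); input = "" },
                enabled = input.isNotBlank() && !viewModel.isStreaming
            ) {
                Icon(Icons.Filled.Send, contentDescription = "Send")
            }
        }
    }
}
```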
Code Example
Kotlin streaming service and ViewModel for your self-hosted LLM:
```kotlin
import androidx.compose.runtime.getValue
import androidx.compose.runtime.mutableStateListOf
import androidx.compose.runtime.mutableStateOf
import androidx.compose.runtime.setValue
import androidx.lifecycle.ViewModel
import androidx.lifecycle.viewModelScope
import kotlinx.coroutines.Dispatchers
import kotlinx.coroutines.Job
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.flow
import kotlinx.coroutines.flow.flowOn
import kotlinx.coroutines.launch
import okhttp3.MediaType.Companion.toMediaType
import okhttp3.OkHttpClient
import okhttp3.Request
import okhttp3.RequestBody.Companion.toRequestBody
import org.json.JSONArray
import org.json.JSONObject
import java.io.IOException
import java.util.concurrent.TimeUnit

data class ChatMessage(val role: String, val content: String)

class AiService(private val apiUrl: String, private val apiKey: String) {
    // Disable the read timeout so long-running token streams aren't cut off.
    private val client = OkHttpClient.Builder()
        .readTimeout(0, TimeUnit.SECONDS)
        .build()

    fun streamCompletion(messages: List<ChatMessage>): Flow<String> = flow {
        val body = JSONObject().apply {
            put("model", "meta-llama/Llama-3-70b-chat-hf")
            put("stream", true)
            put("max_tokens", 1024)
            put("messages", JSONArray().apply {
                messages.forEach {
                    put(JSONObject().apply {
                        put("role", it.role)
                        put("content", it.content)
                    })
                }
            })
        }
        val request = Request.Builder()
            .url("$apiUrl/v1/chat/completions")
            .addHeader("Authorization", "Bearer $apiKey")
            .post(body.toString().toRequestBody("application/json".toMediaType()))
            .build()
        client.newCall(request).execute().use { response ->
            if (!response.isSuccessful) throw IOException("HTTP ${response.code}")
            val source = response.body?.source() ?: return@flow
            while (!source.exhausted()) {
                val line = source.readUtf8Line() ?: break
                if (!line.startsWith("data: ")) continue
                val payload = line.removePrefix("data: ")
                if (payload == "[DONE]") break   // end-of-stream sentinel
                val token = JSONObject(payload)
                    .getJSONArray("choices")
                    .getJSONObject(0)
                    .getJSONObject("delta")
                    .optString("content", "")
                if (token.isNotEmpty()) emit(token)
            }
        }
    }.flowOn(Dispatchers.IO)   // blocking network read stays off the main thread
}

// ViewModel with Compose state
class ChatViewModel(private val service: AiService) : ViewModel() {
    val messages = mutableStateListOf<ChatMessage>()
    var isStreaming by mutableStateOf(false)
        private set
    var error by mutableStateOf<String?>(null)
        private set
    private var job: Job? = null

    fun send(text: String) {
        messages.add(ChatMessage("user", text))
        messages.add(ChatMessage("assistant", ""))   // placeholder the stream fills in
        isStreaming = true
        error = null
        job = viewModelScope.launch {
            try {
                // Drop the empty placeholder from the history sent to the model.
                service.streamCompletion(messages.dropLast(1)).collect { token ->
                    val last = messages.last()
                    messages[messages.lastIndex] = last.copy(content = last.content + token)
                }
            } catch (e: IOException) {
                error = e.message ?: "Network error"   // surface a retryable error state
            } finally {
                isStreaming = false
            }
        }
    }

    fun cancel() {
        job?.cancel()   // cancelling the coroutine also cancels the OkHttp call
        isStreaming = false
    }
}
```
Testing Your Integration
Run the app on an emulator and a physical device. Send test messages and verify tokens stream in real time on both. Test with the Android Profiler open to confirm the main thread stays responsive during streaming — all network I/O should happen on the IO dispatcher. Test rotation and configuration changes to verify the ViewModel survives and streaming continues.
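One way to enforce that main-thread check automatically during development is a StrictMode policy in a debug Application class (DebugApp is an illustrative name, not part of the code above):

```kotlin
import android.app.Application
import android.os.StrictMode

class DebugApp : Application() {
    override fun onCreate() {
        super.onCreate()
        // Log a stack trace whenever network I/O runs on the main thread.
        StrictMode.setThreadPolicy(
            StrictMode.ThreadPolicy.Builder()
                .detectNetwork()
                .penaltyLog()
                .build()
        )
    }
}
```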
Test network transitions: switch from WiFi to cellular mid-stream. OkHttp should throw an IOException that your coroutine catches, showing an error state with a retry button. Test the cancel function and verify the coroutine cancels the OkHttp call cleanly.
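One sketch of that error-handling path uses the retryWhen operator from kotlinx.coroutines on the service Flow (streamWithRetry is an illustrative helper, not part of the service above). Note that a retry restarts the whole completion, so clear the partial assistant bubble before collecting again:

```kotlin
import java.io.IOException
import kotlinx.coroutines.delay
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.retryWhen

// Retry transient network failures up to three times with linear backoff.
fun streamWithRetry(service: AiService, history: List<ChatMessage>): Flow<String> =
    service.streamCompletion(history).retryWhen { cause, attempt ->
        if (cause is IOException && attempt < 3) {
            delay(500L * (attempt + 1))   // wait longer after each failed attempt
            true                          // resubscribe to the upstream flow
        } else {
            false                         // give up; the exception propagates
        }
    }
```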
Production Tips
Route requests through your own backend to keep the API key off the device. Use OkHttp certificate pinning for additional transport security. Store conversation history in a Room database for offline access and fast app restarts. Use Android’s WorkManager for long-running AI tasks that should complete even if the user backgrounds the app.
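Certificate pinning is a small addition when building the OkHttpClient. The hostname and pin below are placeholders; generate the real sha256 pin from your server’s certificate:

```kotlin
import okhttp3.CertificatePinner
import okhttp3.OkHttpClient

// Reject TLS handshakes unless the server's certificate matches the pinned hash.
val pinnedClient = OkHttpClient.Builder()
    .certificatePinner(
        CertificatePinner.Builder()
            .add("inference.example.com", "sha256/AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=")
            .build()
    )
    .build()
```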
Optimise the LazyColumn with stable keys and contentType parameters so Compose reuses composables efficiently during rapid streaming updates. Add Material 3 theming with dynamic colour for a native Android look. From there you can build out a full AI chatbot experience on Android. Explore more tutorials or get started with GigaGPU to power your Android apps.