The Challenge: Overwhelmed Phone Lines, Inconsistent Triage
An NHS 111 service provider covering the West Midlands handles approximately 1,200 calls every 24 hours. Each call follows a clinical decision support algorithm: a trained health advisor works through symptom questions to determine whether the caller needs a 999 ambulance, A&E attendance, GP appointment within hours, or simple self-care advice. The problem is variability. Triage outcomes depend heavily on individual advisor experience, and during winter peaks when agency staff supplement the workforce, disposition accuracy drops measurably. The provider wants an AI co-pilot that suggests the correct triage pathway in real time, reducing under- and over-triage rates simultaneously.
Commercial symptom-checking APIs exist, but routing live patient conversations — complete with names, dates of birth, and detailed symptom descriptions — through third-party servers creates GDPR compliance exposure the provider cannot accept. The AI must run on infrastructure the organisation controls.
AI Solution: Fine-Tuned LLM as Clinical Co-Pilot
A large language model fine-tuned on NHS Pathways clinical content and historical triage call transcripts can serve as a real-time co-pilot. As the health advisor speaks with the caller, the system captures the conversation (via Whisper-based transcription), extracts symptom mentions, and queries the LLM to suggest the next clinical question and a preliminary disposition category.
The architecture is deliberately advisory: the LLM never communicates directly with the patient. It presents suggestions to the human advisor on a secondary screen, preserving clinical accountability while accelerating decision-making. Models like Mistral 7B or LLaMA 3 8B, fine-tuned on triage protocols, achieve this without the massive compute footprint of larger models. Serving through vLLM ensures sub-second response latency even under heavy concurrent load.
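The advisory loop described above can be sketched as follows. This is a simplified illustration, not the provider's implementation: keyword matching stands in for a real clinical entity-extraction step, and `SYMPTOM_TERMS`, `extract_symptoms`, and `build_triage_prompt` are hypothetical names.

```python
# Minimal sketch of the advisory loop: transcribed utterance in, prompt for
# the triage LLM out. Keyword matching here is a placeholder for a proper
# clinical NER component.
SYMPTOM_TERMS = {"chest pain", "shortness of breath", "dizziness", "nausea"}

def extract_symptoms(utterance: str) -> list[str]:
    """Return symptom terms mentioned in one transcribed utterance."""
    text = utterance.lower()
    return sorted(term for term in SYMPTOM_TERMS if term in text)

def build_triage_prompt(symptoms: list[str], history: list[str]) -> str:
    """Assemble the prompt the co-pilot sends to the fine-tuned LLM."""
    return (
        "You are an NHS 111 triage assistant. Suggest the next clinical "
        "question and a preliminary disposition category.\n"
        f"Symptoms so far: {', '.join(symptoms) or 'none reported'}\n"
        "Conversation so far:\n" + "\n".join(history)
    )

symptoms = extract_symptoms("I've had chest pain and some nausea since lunch")
prompt = build_triage_prompt(symptoms, ["Caller: I've had chest pain since lunch"])
```

The LLM's response is rendered on the advisor's secondary screen only; nothing in this loop speaks to the patient.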
GPU Requirements: Real-Time Inference at Scale
The critical metric is tokens-per-second across concurrent sessions. During peak hours, 60-80 advisors may be on calls simultaneously, each generating LLM queries every 20-30 seconds. The system must sustain 80+ concurrent inference streams with p95 latency under 800 milliseconds.
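A back-of-envelope check shows what those figures imply for the inference server. The tokens-per-response value is an assumption (a short next-question plus disposition suggestion), not a measured number.

```python
# Worst-case load implied by the figures above: 80 advisors, one query
# every 20 seconds, ~150 generated tokens per response (assumed).
advisors = 80
query_interval_s = 20
tokens_per_response = 150

queries_per_second = advisors / query_interval_s
required_tokens_per_second = queries_per_second * tokens_per_response

print(queries_per_second, required_tokens_per_second)  # 4.0 600.0
```

A sustained 4 queries per second at ~600 generated tokens per second is comfortably within a single modern GPU's capability for a 7B-8B model served with continuous batching; the p95 latency target, not raw throughput, is the binding constraint.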
| GPU Model | VRAM | Concurrent Triage Sessions | p95 Latency (Mistral 7B) |
|---|---|---|---|
| NVIDIA RTX 5090 | 32 GB | ~25 | ~600 ms |
| NVIDIA RTX 6000 Pro | 48 GB | ~50 | ~500 ms |
| NVIDIA RTX 6000 Pro 96 GB | 96 GB | ~90 | ~350 ms |
For 80 concurrent advisors, the RTX 6000 Pro 96 GB provides comfortable headroom. Smaller services covering 30-40 simultaneous sessions can operate well on the 48 GB RTX 6000 Pro through GigaGPU’s dedicated hosting. The key advantage of dedicated hardware here is predictable latency — an advisor cannot wait three seconds for a suggestion mid-conversation.
Recommended Stack
- vLLM for high-throughput LLM serving with continuous batching — critical for sustaining dozens of concurrent sessions.
- Faster-Whisper for real-time speech-to-text on the advisor-caller audio stream.
- Mistral 7B-Instruct or LLaMA 3 8B fine-tuned on NHS Pathways decision trees and anonymised historical call data.
- LangChain with a retrieval-augmented generation (RAG) component pulling from the latest clinical guidelines and formulary data.
- WebSocket API for real-time bidirectional communication between the advisor’s workstation and the inference server.
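On the advisor side, each query to the vLLM server is a small HTTP request against its OpenAI-compatible endpoint. The sketch below builds that request; the URL and model name (`mistral-7b-triage`) are deployment-specific assumptions.

```python
# Sketch of one co-pilot query to a vLLM server exposing the
# OpenAI-compatible /v1/chat/completions endpoint. Endpoint URL and model
# name are assumed; substitute your deployment's values.
import json
import urllib.request

VLLM_URL = "http://localhost:8000/v1/chat/completions"  # assumed deployment

def triage_request(prompt: str, model: str = "mistral-7b-triage") -> dict:
    """Build the JSON body for one co-pilot query. vLLM's continuous
    batching interleaves dozens of these concurrently on the GPU."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 150,   # short suggestion keeps p95 latency low
        "temperature": 0.2,  # near-deterministic output for clinical consistency
    }

def send(body: dict) -> dict:
    """POST the query to the vLLM server and return the parsed response."""
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return json.load(resp)
```

In production this request would travel over the WebSocket bridge mentioned above rather than a blocking HTTP call, so the advisor's screen updates as soon as a suggestion arrives.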
Adding document AI capabilities lets the system also parse incoming GP referral letters and patient summaries, feeding relevant medical history into the triage LLM context for more informed suggestions.
Cost vs. Alternatives
Proprietary clinical decision support systems from established vendors carry licensing fees of £200,000-£500,000 annually, and they offer limited flexibility to incorporate new AI capabilities. Building on open-source LLMs hosted on dedicated infrastructure gives the provider full ownership of the model, the ability to fine-tune on their own call data, and no per-query costs regardless of call volume.
The economic case strengthens when measuring clinical outcomes. Reducing over-triage by even 5% — sending fewer low-acuity patients to A&E — saves the wider system significant per-attendance costs. Reducing under-triage carries an even more compelling argument, measured in patient safety rather than pounds.
Getting Started
Pilot with a single symptom pathway — chest pain is the standard benchmark because it is high-stakes and well-studied. Fine-tune the LLM on 10,000 anonymised chest pain call transcripts with known dispositions. Deploy in shadow mode alongside 10 advisors for four weeks, comparing AI-suggested dispositions against advisor decisions and eventual clinical outcomes.
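Shadow-mode comparison reduces to simple counting once dispositions are logged in pairs. A minimal scoring sketch, with an illustrative acuity ordering (the real ordering comes from NHS Pathways disposition codes):

```python
# Shadow-mode scoring: compare AI-suggested dispositions against the
# advisor's actual decision. The acuity ranking below is illustrative.
ACUITY = {"self-care": 0, "GP appointment": 1, "A&E": 2, "999 ambulance": 3}

def score(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """pairs: (ai_disposition, advisor_disposition) per call.
    Returns agreement plus under-/over-triage rates relative to the advisor."""
    n = len(pairs)
    agree = sum(ai == advisor for ai, advisor in pairs)
    under = sum(ACUITY[ai] < ACUITY[advisor] for ai, advisor in pairs)
    over = sum(ACUITY[ai] > ACUITY[advisor] for ai, advisor in pairs)
    return {"agreement": agree / n, "under_triage": under / n, "over_triage": over / n}

result = score([
    ("A&E", "A&E"),
    ("self-care", "GP appointment"),   # AI under-triaged
    ("999 ambulance", "A&E"),          # AI over-triaged
    ("GP appointment", "GP appointment"),
])
```

Comparing these rates against eventual clinical outcomes, not just advisor decisions, is what determines whether the co-pilot is safe to move out of shadow mode.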
GigaGPU provides private AI hosting with the latency guarantees triage demands and the UK data residency NHS commissioners require. Scale from pilot to full deployment on the same infrastructure by upgrading GPU tier as concurrent session counts grow.
GigaGPU’s UK-based dedicated GPU servers deliver the sub-second latency and data sovereignty NHS triage demands.
Explore Dedicated GPU Plans