Patient Triage AI: LLM-Powered Symptom Assessment

An urgent care centre fielding 1,200 calls per day needs AI-driven symptom assessment that matches trained clinician accuracy — without routing patient conversations through US cloud providers.

The Challenge: Overwhelmed Phone Lines, Inconsistent Triage

An NHS 111 service provider covering the West Midlands handles approximately 1,200 calls every 24 hours. Each call follows a clinical decision support algorithm: a trained health advisor works through symptom questions to determine whether the caller needs a 999 ambulance, A&E attendance, GP appointment within hours, or simple self-care advice. The problem is variability. Triage outcomes depend heavily on individual advisor experience, and during winter peaks when agency staff supplement the workforce, disposition accuracy drops measurably. The provider wants an AI co-pilot that suggests the correct triage pathway in real time, reducing under- and over-triage rates simultaneously.

Commercial symptom-checking APIs exist, but routing live patient conversations — complete with names, dates of birth, and detailed symptom descriptions — through third-party servers creates GDPR compliance exposure the provider cannot accept. The AI must run on infrastructure the organisation controls.

AI Solution: Fine-Tuned LLM as Clinical Co-Pilot

A large language model fine-tuned on NHS Pathways clinical content and historical triage call transcripts can serve as a real-time co-pilot. As the health advisor speaks with the caller, the system captures the conversation (via Whisper-based transcription), extracts symptom mentions, and queries the LLM to suggest the next clinical question and a preliminary disposition category.

The architecture is deliberately advisory: the LLM never communicates directly with the patient. It presents suggestions to the human advisor on a secondary screen, preserving clinical accountability while accelerating decision-making. Models like Mistral 7B or LLaMA 3 8B, fine-tuned on triage protocols, achieve this without the massive compute footprint of larger models. Serving through vLLM ensures sub-second response latency even under heavy concurrent load.
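A minimal sketch of the per-turn advisory request, assuming a vLLM server exposing its OpenAI-compatible chat-completions endpoint. The model name, disposition labels, and JSON reply schema below are illustrative assumptions, not part of any published NHS Pathways interface:

```python
# Build the request the co-pilot sends to the inference server each turn.
# Model name and disposition vocabulary are hypothetical placeholders.
SYSTEM_PROMPT = (
    "You are a triage co-pilot. Given the call transcript so far, reply with "
    'JSON: {"next_question": ..., "disposition": ..., "confidence": ...}. '
    "Valid dispositions: 999, A&E, GP_URGENT, SELF_CARE."
)

def build_triage_request(transcript: str, model: str = "mistral-7b-triage") -> dict:
    """Build a chat-completions payload for a vLLM OpenAI-compatible server."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
        "max_tokens": 128,   # suggestions are short; a tight cap keeps latency low
        "temperature": 0.0,  # deterministic output for clinical review
    }
```

The payload would be POSTed to the server's /v1/chat/completions route; the advisor's screen renders the suggested question and disposition, and the human makes the call.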

GPU Requirements: Real-Time Inference at Scale

The critical metric is tokens-per-second across concurrent sessions. During peak hours, 60-80 advisors may be on calls simultaneously, each generating LLM queries every 20-30 seconds. The system must sustain 80+ concurrent inference streams with p95 latency under 800 milliseconds.
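The 800 ms target can be checked with a nearest-rank p95 over observed request latencies; this sketch (sample figures illustrative) is how a load test against the inference server might score each run:

```python
# Nearest-rank p95 check against the 800 ms latency budget described above.
import math

def p95_ms(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile of observed request latencies."""
    ordered = sorted(latencies_ms)
    rank = math.ceil(0.95 * len(ordered))  # nearest-rank method
    return ordered[rank - 1]

def meets_slo(latencies_ms: list[float], budget_ms: float = 800.0) -> bool:
    """True if the 95th-percentile latency fits inside the budget."""
    return p95_ms(latencies_ms) <= budget_ms
```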

GPU Model                    VRAM     Concurrent Triage Sessions   p95 Latency (Mistral 7B)
NVIDIA RTX 5090              32 GB    ~25                          ~600 ms
NVIDIA RTX 6000 Pro          48 GB    ~50-55                       ~450-500 ms
NVIDIA RTX 6000 Pro 96 GB    96 GB    ~90                          ~350 ms

For 80 concurrent advisors, the 96 GB RTX 6000 Pro provides comfortable headroom. Smaller services covering 30-40 simultaneous sessions can operate well on a 48 GB RTX 6000 Pro through GigaGPU’s dedicated hosting. The key advantage of dedicated hardware here is predictable latency — an advisor cannot wait three seconds for a suggestion mid-conversation.

Recommended Stack

  • vLLM for high-throughput LLM serving with continuous batching — critical for sustaining dozens of concurrent sessions.
  • Faster-Whisper for real-time speech-to-text on the advisor-caller audio stream.
  • Mistral 7B-Instruct or LLaMA 3 8B fine-tuned on NHS Pathways decision trees and anonymised historical call data.
  • LangChain with a retrieval-augmented generation (RAG) component pulling from the latest clinical guidelines and formulary data.
  • WebSocket API for real-time bidirectional communication between the advisor’s workstation and the inference server.
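One way the bidirectional WebSocket traffic between advisor workstation and inference server might be framed; the event types and field names here are assumptions for illustration, not a fixed protocol:

```python
# JSON text frames for the advisor <-> inference-server WebSocket channel.
# Event vocabulary is a hypothetical example.
import json

VALID_EVENTS = {"transcript_chunk", "suggestion", "disposition_final"}

def encode_event(event_type: str, session_id: str, payload: dict) -> str:
    """Serialise one event as a JSON text frame, tagged with its call session."""
    if event_type not in VALID_EVENTS:
        raise ValueError(f"unknown event type: {event_type}")
    return json.dumps({"type": event_type, "session": session_id, **payload})

def decode_event(frame: str) -> dict:
    """Parse a received frame back into a dict."""
    return json.loads(frame)
```

Tagging every frame with a session identifier lets one inference server multiplex dozens of simultaneous calls over a single serving process.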

Adding document AI capabilities lets the system also parse incoming GP referral letters and patient summaries, feeding relevant medical history into the triage LLM context for more informed suggestions.
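A sketch of how extracted referral-letter history could be folded into the triage prompt alongside the live transcript; the section headings and truncation limit are assumptions:

```python
# Combine parsed medical history with the live transcript into one LLM context.
def assemble_context(history_snippets: list[str], transcript: str,
                     max_history_chars: int = 1500) -> str:
    """Prepend extracted medical history to the live call transcript."""
    history = "\n".join(f"- {s}" for s in history_snippets)[:max_history_chars]
    return (
        "## Relevant history (from GP referral documents)\n"
        f"{history or '- none on file'}\n\n"
        "## Live call transcript\n"
        f"{transcript}"
    )
```

Capping the history section keeps the context window predictable, which in turn keeps per-request latency stable under load.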

Cost vs. Alternatives

Proprietary clinical decision support systems from established vendors carry licensing fees of £200,000-£500,000 annually, and they offer limited flexibility to incorporate new AI capabilities. Building on open-source LLMs hosted on dedicated infrastructure gives the provider full ownership of the model, the ability to fine-tune on their own call data, and no per-query costs regardless of call volume.

The economic case strengthens when measuring clinical outcomes. Reducing over-triage by even 5% — sending fewer low-acuity patients to A&E — saves the wider system significant per-attendance costs. The case for reducing under-triage is more compelling still, measured in patient safety rather than pounds.

Getting Started

Pilot with a single symptom pathway — chest pain is the standard benchmark because it is high-stakes and well-studied. Fine-tune the LLM on 10,000 anonymised chest pain call transcripts with known dispositions. Deploy in shadow mode alongside 10 advisors for four weeks, comparing AI-suggested dispositions against advisor decisions and eventual clinical outcomes.
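The shadow-mode comparison reduces to counting agreement and directional disagreement between AI and advisor dispositions. A minimal sketch, assuming the illustrative four-level acuity ordering below:

```python
# Score shadow-mode results: agreement, over-triage (AI suggests higher
# acuity than the advisor), under-triage (AI suggests lower acuity).
# The disposition labels and ordering are assumptions for illustration.
ACUITY = {"SELF_CARE": 0, "GP_URGENT": 1, "A&E": 2, "999": 3}

def shadow_report(pairs: list[tuple[str, str]]) -> dict:
    """pairs: (ai_disposition, advisor_disposition), one tuple per call."""
    n = len(pairs)
    agree = sum(a == b for a, b in pairs)
    over = sum(ACUITY[a] > ACUITY[b] for a, b in pairs)
    under = sum(ACUITY[a] < ACUITY[b] for a, b in pairs)
    return {"agreement": agree / n, "over_triage": over / n, "under_triage": under / n}
```

Tracking these three rates weekly over the four-week shadow period gives a clear go/no-go signal before the co-pilot is shown to advisors live.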

GigaGPU provides private AI hosting with the latency guarantees triage demands and the UK data residency NHS commissioners require. Scale from pilot to full deployment on the same infrastructure by upgrading GPU tier as concurrent session counts grow.

Deploy real-time clinical triage AI on infrastructure you control.
GigaGPU’s UK-based dedicated GPU servers deliver the sub-second latency and data sovereignty NHS triage demands.

Explore Dedicated GPU Plans
