Three Thousand Consultations, Zero Typists
A GP federation spanning 42 practices across South Yorkshire processes over 3,200 patient consultations daily. Each appointment generates a narrative note — diagnosis, examination findings, management plan, prescriptions — that a clinician must enter into EMIS Web or SystmOne. GPs report spending 11 minutes per consultation on documentation versus 9 minutes on the actual patient interaction. Across the federation, that documentation burden equates to 24 full-time-equivalent clinicians doing nothing but typing.
Cloud-based transcription services such as AWS Transcribe Medical or Google’s Healthcare NLP exist, but they route audio containing patient identifiers through third-party infrastructure outside the practice’s direct control. For UK general practice operating under the UK GDPR framework, that creates a data-processing liability the federation’s Caldicott Guardian flagged as unacceptable. Self-hosted speech-to-text on private GPU infrastructure eliminates that exposure entirely.
AI Architecture for Medical Speech-to-Text
The pipeline begins with audio capture — either a USB microphone on the clinician's desk or a dedicated recording appliance in the consultation room. Audio streams to an on-site or hosted Whisper large-v3 instance running on a dedicated GPU server. Whisper produces the raw transcription; a separate diarisation model layers on speaker labels (distinguishing clinician from patient).
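Merging the two outputs — timestamped transcript segments from Whisper and speaker turns from the diarisation model — is a simple overlap-matching step. A minimal sketch, assuming hypothetical `(start, end, text)` segment tuples and `(start, end, speaker)` turn tuples (real pyannote/Whisper objects differ in shape):

```python
def assign_speakers(segments, turns):
    """Label each transcript segment with the diarisation speaker
    whose turn overlaps it the most. Data shapes are illustrative:
    segments are (start, end, text), turns are (start, end, speaker)."""
    labelled = []
    for seg_start, seg_end, text in segments:
        best_speaker, best_overlap = "UNKNOWN", 0.0
        for turn_start, turn_end, speaker in turns:
            # Overlap in seconds between the segment and this speaker turn
            overlap = min(seg_end, turn_end) - max(seg_start, turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        labelled.append((best_speaker, text))
    return labelled

segments = [(0.0, 4.2, "What brings you in today?"),
            (4.5, 9.0, "Chest pain since Monday.")]
turns = [(0.0, 4.3, "CLINICIAN"), (4.3, 9.5, "PATIENT")]
print(assign_speakers(segments, turns))
# → [('CLINICIAN', 'What brings you in today?'), ('PATIENT', 'Chest pain since Monday.')]
```

Maximum-overlap matching is robust to the small timestamp drift between the two models, which rarely agree to the exact frame.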
A second-stage model — typically a fine-tuned Llama 3 or Mistral variant — restructures the raw transcript into a SOAP-format clinical note: Subjective, Objective, Assessment, Plan. This clinical summarisation model identifies SNOMED-CT codes, drug names with dosages, and follow-up actions. The structured output is pushed to the EHR via its API, populating the correct fields without the clinician touching a keyboard.
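The hand-off to the second-stage model is a prompt that carries the speaker-labelled transcript plus the SOAP instructions. A minimal sketch — the template wording and section headings are illustrative, not the federation's actual prompt:

```python
SOAP_TEMPLATE = """You are a clinical documentation assistant.
Rewrite the consultation transcript below as a SOAP note with four
sections headed Subjective, Objective, Assessment, Plan. Preserve
drug names with dosages and list follow-up actions under Plan.

Transcript:
{transcript}
"""

def build_soap_prompt(labelled_segments):
    """Flatten (speaker, text) pairs into the summarisation prompt.
    Template wording is a hypothetical example."""
    transcript = "\n".join(f"{speaker}: {text}"
                           for speaker, text in labelled_segments)
    return SOAP_TEMPLATE.format(transcript=transcript)

prompt = build_soap_prompt([
    ("CLINICIAN", "What brings you in today?"),
    ("PATIENT", "Chest pain since Monday."),
])
```

Keeping the template in one place makes it easy to iterate on prompt wording during the pilot without touching the transcription code.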
For practices also running document AI for incoming correspondence, both workloads can share a single GPU server — voice transcription peaks during surgery hours (08:00–18:30) while document processing runs as overnight batch jobs.
GPU Requirements for Real-Time Clinical Dictation
Whisper large-v3 requires approximately 10 GB VRAM. The clinical summarisation LLM adds 8–16 GB depending on quantisation. Real-time transcription demands that the model process audio faster than it arrives — a real-time factor (RTF) below 0.3 for comfortable margin.
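RTF is simply processing time divided by audio duration, and it gives a rough ceiling on concurrent streams: each stream consumes an RTF-sized slice of the GPU's time. A back-of-envelope sketch (the headroom fraction is an assumption; real batched inference scales differently):

```python
def realtime_factor(processing_seconds, audio_seconds):
    """RTF: time spent transcribing divided by audio length.
    Below 1.0 keeps up with live audio; below 0.3 leaves margin."""
    return processing_seconds / audio_seconds

def max_streams(rtf, headroom=0.75):
    """Rough concurrent-stream ceiling: each stream uses `rtf` of
    the GPU's time; reserve the rest as headroom. Simplified model —
    real batching behaviour on a given GPU will differ."""
    return int(headroom / rtf + 1e-9)  # epsilon guards float rounding

# Transcribing 60 s of audio in 9 s:
rtf = realtime_factor(9, 60)   # 0.15
streams = max_streams(rtf)     # 5
```

At an RTF of 0.15, this crude model lands near the 4–5 concurrent streams quoted for the RTX 5090 class below.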
| GPU Model | VRAM | Concurrent Streams | RTF (Whisper large-v3) |
|---|---|---|---|
| RTX 3090 | 24 GB | 2–3 | 0.28 |
| RTX 5090 | 32 GB | 4–5 | 0.15 |
| RTX 6000 Pro | 48 GB | 8–10 | 0.12 |
| RTX 6000 Pro 96 GB | 96 GB | 16–20 | 0.07 |
A federation of 42 practices rarely has more than 30 simultaneous consultations. An RTX 6000 Pro handles peak load with room for the summarisation model. Smaller single-practice deployments can start with an RTX 5090. Read the full voice agent hosting guide for architectural patterns.
Recommended Software Stack
- Speech-to-Text: Whisper large-v3 with CTranslate2 for 2–3x inference speedup
- Speaker Diarisation: pyannote.audio 3.x for clinician/patient separation
- Clinical Summarisation: Llama 3 8B fine-tuned on MIMIC-III discharge summaries, served via an optimised inference framework such as vLLM
- Medical Coding: SNOMED-CT lookup via Elasticsearch sidecar
- EHR Integration: EMIS IM1 API, SystmOne APIs, or HL7 FHIR endpoints
- Audio Preprocessing: WebRTC VAD for silence trimming, noisereduce for ambient filtering
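To illustrate the silence-trimming step in the list above, here is a crude energy gate over PCM samples — a stand-in for WebRTC VAD, which uses a trained model rather than a fixed amplitude threshold:

```python
def trim_silence(samples, frame_len=160, threshold=0.02):
    """Drop frames whose mean absolute amplitude falls below a
    threshold. Illustrative only: WebRTC VAD's actual decision
    logic is model-based, not a simple energy cut-off."""
    kept = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(abs(s) for s in frame) / len(frame)
        if energy >= threshold:
            kept.extend(frame)
    return kept

# 160 samples = one 10 ms frame at 16 kHz (Whisper's input rate)
silence, speech = [0.0] * 160, [0.5] * 160
trimmed = trim_silence(silence + speech + silence)
# Only the speech frame survives
```

Trimming silence before transcription cuts wasted GPU time on the long pauses typical of clinical examinations.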
Compliance and Cost Comparison
Clinical audio recordings containing patient information are special-category personal data under UK GDPR Article 9. The ICO expects data controllers to demonstrate that processing occurs on infrastructure with appropriate technical and organisational measures. A dedicated server with encrypted storage, access-controlled SSH, and no multi-tenancy satisfies these requirements more straightforwardly than a shared cloud environment. Consult the UK data location guide for jurisdiction details.
| Approach | Monthly Cost (42 practices) | Data Control |
|---|---|---|
| Cloud transcription API | £3,100–£5,400 | Third-party processor |
| Commercial medical dictation SaaS | £6,200–£8,500 | Vendor-controlled |
| GigaGPU RTX 6000 Pro Dedicated Server | From £899/mo | Full sovereignty |
The dedicated server approach costs a fraction of commercial SaaS dictation licences while giving the federation full ownership of the trained models and transcription data. Additional use case studies cover similar savings in adjacent healthcare workflows.
Getting Started
Pilot with three practices over six weeks. Record consultations (with patient consent under existing clinical audio-recording policies), transcribe with Whisper, and measure word error rate (WER) against manual transcription. Medical terminology WER below 8% is the threshold most GP federations accept. Fine-tune Whisper on 200 hours of your own consultation audio to bring domain-specific WER below 5%. Once accuracy is validated, roll out federation-wide with a centralised GPU server and per-practice audio streaming clients. Practices also exploring compliance audit automation can leverage the same transcription infrastructure for audit trail generation.
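The WER measurement in the pilot step is a word-level edit distance over the reference length. A self-contained sketch (production pilots would more likely use a library such as `jiwer`, plus text normalisation before scoring):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length —
    the standard WER metric for validating transcription accuracy."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("patient reports chest pain",
                      "patient reports chess pain")
# One substitution in four words → 0.25, above the 8% threshold
```

Scoring a held-out set of manually transcribed consultations this way gives the pass/fail number for the 8% acceptance threshold and the post-fine-tuning 5% target.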
Transcribe Consultations Securely on Dedicated GPU Hardware
Run Whisper and clinical NLP models on GigaGPU — real-time dictation, full UK data residency, no per-minute API charges.
Browse GPU Servers