Train AI on EHR Data Without Exfiltration
The EHR data paradox for AI researchers
Electronic health records contain some of the most valuable training data for clinical AI: structured lab results, medication histories, diagnosis codes, vital sign trends, problem lists, and longitudinal care trajectories. Unlike imaging data, which requires pixel-level processing, EHR data is compact, structured, and can train predictive models with relatively modest compute requirements.
But EHR data is also among the most identifiable forms of clinical data. A single row of structured fields (date of service, diagnosis code, age, zip code) can be enough to re-identify a patient. Exporting raw EHR records to an AI company's training environment, even under a cloud BAA, creates an open-ended PHI exposure surface.
On-prem EHR training with Rapha Protocol
The edge appliance connects to local EHR databases (Epic, Cerner, Meditech, or custom systems) through read-only, policy-controlled data mounts. Training scripts access data through RaphaDataLoader, which counts unique records before each batch reaches model code. The raw EHR rows never leave the appliance. Only trained model weights, aggregated metrics, and cryptographic proof receipts exit.
Supported EHR training patterns:
- Clinical risk prediction — train models on real lab results, vital signs, and diagnoses to predict readmission, sepsis, or deterioration.
- Treatment outcome modelling — train on medication histories and longitudinal outcomes to model treatment efficacy.
- Patient stratification — identify high-risk cohorts from real clinical data without exposing individual patient records.
- Phenotype extraction — train NLP models on provider notes to extract structured phenotypes from unstructured text.
EHR compatibility and data formats
The platform supports multiple EHR data formats and extraction patterns:
- FHIR R4 — RESTful API access to standardised clinical resources (Patient, Observation, Condition, MedicationRequest).
- CSV/Parquet exports — pre-extracted structured datasets from hospital data warehouses.
- HL7 v2 — message-based integration for real-time or batch EHR data feeds.
- Custom SQL views — read-only views configured by the hospital's data engineering team.
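To make the FHIR R4 option concrete, here is a minimal sketch of how a single Observation resource might be flattened into one tabular training row. It uses only the Python standard library. The sample resource is made up for illustration, and the function name flatten_observation and the output column names are assumptions, not part of the platform's API.

```python
import json

# Made-up sample resource for illustration only; not real patient data.
SAMPLE_OBSERVATION = json.loads("""
{
  "resourceType": "Observation",
  "id": "example-obs-1",
  "status": "final",
  "code": {
    "coding": [
      {"system": "http://loinc.org", "code": "2345-7"}
    ]
  },
  "subject": {"reference": "Patient/example-patient-1"},
  "effectiveDateTime": "2024-01-15T08:30:00Z",
  "valueQuantity": {"value": 95, "unit": "mg/dL"}
}
""")


def flatten_observation(resource: dict) -> dict:
    """Flatten a FHIR R4 Observation into one row (hypothetical column names)."""
    if resource.get("resourceType") != "Observation":
        raise ValueError("expected a FHIR Observation resource")

    # An Observation may carry several codings; this sketch keeps the first.
    codings = resource.get("code", {}).get("coding", [])
    first = codings[0] if codings else {}

    # Only numeric results (valueQuantity) are handled here; other value[x]
    # types such as valueString or valueCodeableConcept come out as None.
    quantity = resource.get("valueQuantity", {})

    return {
        "patient_ref": resource.get("subject", {}).get("reference"),
        "code_system": first.get("system"),
        "code": first.get("code"),
        "value": quantity.get("value"),
        "unit": quantity.get("unit"),
        "effective": resource.get("effectiveDateTime"),
    }


row = flatten_observation(SAMPLE_OBSERVATION)
print(row)
```

A real extraction would also need to handle pagination of FHIR search results, multiple codings, and non-numeric values, all of which this sketch leaves out.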
Record-level settlement
Rapha Protocol uses RaphaDataLoader to count unique records consumed during training. Settlement is per-record, not per-epoch — the hospital is paid based on the distinct dataset used, not the number of training loops. This aligns incentives: researchers pay for data access, hospitals earn for data contribution, and neither side is incentivised to over-train or over-expose records.
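The difference between per-record and per-epoch counting can be illustrated with a short sketch. This is not RaphaDataLoader itself, whose interface is not described here. UniqueRecordCounter, the batches helper, and the price per record are all hypothetical names and values chosen for the example.

```python
from typing import Hashable, Iterable, Iterator, List, Set


class UniqueRecordCounter:
    """Hypothetical helper that tracks distinct record IDs across batches."""

    def __init__(self) -> None:
        self._seen: Set[Hashable] = set()

    def observe(self, record_ids: Iterable[Hashable]) -> None:
        # A set ignores IDs that were already seen in an earlier batch or epoch.
        self._seen.update(record_ids)

    @property
    def unique_count(self) -> int:
        return len(self._seen)


def batches(ids: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


record_ids = [f"rec-{i}" for i in range(100)]  # synthetic IDs
counter = UniqueRecordCounter()
rows_served = 0

for epoch in range(3):
    for batch in batches(record_ids, 32):
        counter.observe(batch)
        rows_served += len(batch)

PRICE_PER_RECORD = 0.50  # hypothetical price, for illustration only
amount_due = counter.unique_count * PRICE_PER_RECORD

print(rows_served)           # 300 rows passed to the model over three epochs
print(counter.unique_count)  # 100 distinct records
print(amount_due)            # 50.0
```

Three epochs serve 300 rows to the model, but the settlement amount is computed from the 100 distinct records, so training for more epochs does not change what is owed.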
Important: A minimum cohort size of 25 records is enforced by OPA policy. Smaller cohorts create re-identification risk and are rejected at the policy gate. Production EHR training requires institutional approval, data governance review, and applicable DPA/BAA analysis.
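The minimum cohort rule can be sketched as follows. The actual enforcement is an OPA policy, which is not reproduced here. This Python version only mirrors the stated threshold of 25, and the names check_cohort and CohortTooSmallError are hypothetical.

```python
from typing import Hashable, Iterable

MIN_COHORT_SIZE = 25  # threshold stated in the text above


class CohortTooSmallError(ValueError):
    """Raised when a cohort falls below the minimum size (hypothetical name)."""


def check_cohort(record_ids: Iterable[Hashable]) -> int:
    """Return the distinct cohort size, or raise if it is below the minimum."""
    # Count distinct IDs so that repeated rows cannot inflate a small cohort.
    size = len(set(record_ids))
    if size < MIN_COHORT_SIZE:
        raise CohortTooSmallError(
            f"cohort has {size} distinct records; minimum is {MIN_COHORT_SIZE}"
        )
    return size


print(check_cohort(range(25)))  # 25: exactly at the minimum, so accepted

try:
    check_cohort(range(24))
except CohortTooSmallError as err:
    print(f"rejected: {err}")
```

Counting distinct IDs rather than rows is a design choice in this sketch: a cohort of 100 rows that all belong to the same record is still a cohort of one.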