Clinical NLP Data

Medical NLP Dataset Access for AI Training

Clinical text is the most valuable and least accessible NLP training data

Public biomedical NLP datasets — MIMIC-III/IV, PubMed abstracts, i2b2 challenges, n2c2 shared tasks — have driven significant progress in clinical NLP research. But these datasets share a critical limitation: they are not representative of real-time clinical documentation at your target deployment site.

A model trained on MIMIC-III (US ICU data collected from 2001 to 2012) is likely to underperform on NHS discharge summaries written in 2026. A model trained on PubMed abstracts will struggle with the telegraphic, abbreviation-dense language of real radiology reports. Public datasets are approaching the limit of what they can teach. The next leap requires real clinical text.

Clinical text modalities accessible through Rapha Protocol

Radiology reports: Structured and unstructured reports from MRI, CT, X-ray, ultrasound, mammography, and nuclear medicine studies. Typically 200-500 words. Rich in anatomical terminology, disease descriptions, and clinical recommendations.
Discharge summaries: Comprehensive clinical narratives summarising a patient's hospital stay — presenting complaint, hospital course, procedures, discharge medications, follow-up instructions. Typically 500-2000 words. Critical for training summarisation and clinical reasoning models.
Operative notes: Surgical procedure documentation including indication, findings, technique, and post-operative plan. Structured enough for procedure code extraction, unstructured enough to require clinical NLP.
Pathology reports: Histopathology descriptions with diagnostic conclusions. Combines structured elements (specimen type, gross description) with unstructured clinical interpretation. Critical for oncology NLP.
Progress notes / clinic letters: Day-to-day clinical documentation from inpatient and outpatient encounters. Most variable in length, quality, and structure — and most representative of real clinical workflows.

What makes this different from MIMIC and other public NLP datasets

Public datasets (MIMIC, i2b2, n2c2)

Age: Data is years to decades old by the time it is released for research.
Representativeness: Typically drawn from a single institution or a handful of sites. MIMIC-III is ICU-only. Cannot capture regional variation, documentation practice evolution, or rare disease distributions.
Access: Requires CITI training, data use agreement, and institutional affiliation. Shared under restrictive terms.
Volume: Fixed. Cannot request additional records of specific types or from specific clinical contexts.
Commercial use: Restricted or prohibited. MIMIC's PhysioNet license explicitly limits commercial use.

Rapha Protocol — on-demand clinical text

Real-time: Train on clinical text generated this month at an active NHS trust or private hospital.
Representative: Train on text from the specific institution, department, and clinical context your model will be deployed in.
Access: Register for API access. Submit a training job with a dataset profile. Receive trained weights, not text. Institutional governance review is still required.
Volume: Variable. Submit jobs against datasets of any available size.
Commercial use: You own your trained model weights. The output of compute-to-data is your IP.
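
The access flow above (submit a job with a dataset profile, receive weights rather than text) could be sketched roughly as follows. This is only an illustration of the shape such a request might take: `DatasetProfile`, `TrainingJob`, the field names, the modality labels, and the base model name are all hypothetical and are not taken from the Rapha Protocol API, which may differ substantially.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional

# Hypothetical modality labels, mirroring the text types listed earlier.
MODALITIES = {
    "radiology_report",
    "discharge_summary",
    "operative_note",
    "pathology_report",
    "progress_note",
}


@dataclass
class DatasetProfile:
    """Which clinical text a job should train on (illustrative only)."""
    modality: str
    min_words: int = 0
    max_words: Optional[int] = None

    def __post_init__(self):
        # Reject profiles that could never match a real dataset.
        if self.modality not in MODALITIES:
            raise ValueError(f"unknown modality: {self.modality!r}")
        if self.max_words is not None and self.max_words < self.min_words:
            raise ValueError("max_words must be >= min_words")


@dataclass
class TrainingJob:
    """A fine-tuning request whose only returned artefact is an adapter."""
    base_model: str
    profile: DatasetProfile
    method: str = "lora"
    output: str = "adapter_weights"  # assumption: raw text is never an output

    def to_json(self) -> str:
        # asdict() recurses into the nested DatasetProfile.
        return json.dumps(asdict(self), sort_keys=True)


job = TrainingJob(
    base_model="my-base-llm",  # placeholder name, not a real model
    profile=DatasetProfile("radiology_report", min_words=200, max_words=500),
)
payload = job.to_json()
```

Validating the profile on the client side is a design choice in this sketch: a malformed request fails before it reaches any governance or scheduling step.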

Status: private alpha. Clinical text data availability depends on configured hospital nodes and institutional governance approval. NLP workloads are supported through LoRA fine-tuning on edge GPUs. Raw text is never exported under any circumstances; only trained model adapters are.
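
To give a sense of why LoRA adapters are a compact artefact to return, the sketch below counts trainable parameters for one adapted weight matrix. In LoRA, a frozen weight of shape d_out x d_in is updated through two low-rank factors, B (d_out x r) and A (r x d_in), so the adapter holds r * (d_out + d_in) parameters. The layer size and rank used here are illustrative assumptions, not a description of any configuration Rapha Protocol uses.

```python
def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters in one LoRA adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


def full_param_count(d_out: int, d_in: int) -> int:
    """Parameters in the frozen weight matrix the adapter is attached to."""
    return d_out * d_in


# Illustrative values only: a 4096 x 4096 projection with rank 8.
d_out = d_in = 4096
rank = 8

adapter = lora_param_count(d_out, d_in, rank)  # 8 * (4096 + 4096) = 65536
full = full_param_count(d_out, d_in)           # 4096 * 4096 = 16777216
ratio = adapter / full                         # 1/256 = 0.00390625
```

Under these assumed dimensions the adapter is 1/256 the size of the matrix it modifies, which is why returning adapters rather than full model weights keeps the exported artefact small.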
