Why Data Sharing Is Dead for Clinical AI
Data sharing between hospitals and AI companies is collapsing — and that's a good thing
For the past decade, the clinical AI industry has operated on an implicit assumption: hospitals will eventually share their data. Data use agreements will be signed. De-identification pipelines will be built. Data lakes will be filled. AI companies will train on thousands of hospitals' worth of real clinical data.
This assumption is wrong. It has always been wrong. And the evidence is now overwhelming.
Five reasons data sharing has failed for clinical AI
- Hospitals cannot share what they do not own. Patient data is held in trust by hospitals; it is not their asset to trade. The Caldicott Guardian, the data protection officer (DPO), and general counsel each have a professional duty to protect this data from unauthorised disclosure. Even a well-intentioned data sharing agreement faces institutional resistance from several gatekeepers, each with veto power.
- De-identification is a leaky abstraction. Every additional data field makes re-identification easier. The features most valuable for AI training (precise dates, detailed demographics, unstructured clinical notes) are exactly the features de-identification must strip. The result satisfies neither side of the negotiation: de-identified data is too risky for the hospital (residual re-identification risk) and too degraded for the AI company (ML signal loss).
- The legal framework is adversarial. A data use agreement (DUA) is negotiated between a hospital, which wants to minimise liability, and an AI company, which wants maximum data access. The outcome is a restrictive document that limits data use to a specific protocol, prohibits commercial exploitation, and requires data destruction after the agreement term. Every time the AI company iterates on its model, the DUA must be renegotiated. This is not a scalable model for AI development.
- Regulators are getting stricter, not looser. The GDPR enforcement trend is toward higher fines for health data breaches. The UK ICO has explicitly warned about re-identification risks in AI training datasets. The US FTC has begun regulating AI training data practices under its authority over unfair or deceptive practices. For anyone who wants to share clinical data, the regulatory environment is getting harder, not easier.
- Patient trust is eroding. High-profile incidents, including the Royal Free London / DeepMind data sharing controversy, the NHS data sharing opt-out backlash, and the concerns over a data sale in the 23andMe bankruptcy, have made patients increasingly aware that their health data can be shared in ways they did not consent to. The social licence for clinical data sharing is being revoked.
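The re-identification point in the list above can be made concrete with a toy calculation. The sketch below uses eight synthetic records (invented for this example, describing no real patients) and counts how many become unique as each additional field is released. The field names and helper functions are illustrative assumptions, not part of any real de-identification tool.

```python
from collections import Counter

# Synthetic records invented for this example; they describe no real patients.
FIELDS = ["sex", "birth_year", "postcode_area", "admit_month"]
ROWS = [
    ("F", 1980, "N1", "2023-01"),
    ("F", 1980, "N1", "2023-02"),
    ("F", 1980, "E2", "2023-01"),
    ("F", 1975, "N1", "2023-01"),
    ("M", 1980, "N1", "2023-01"),
    ("M", 1980, "E2", "2023-02"),
    ("M", 1975, "N1", "2023-02"),
    ("F", 1975, "E2", "2023-02"),
]
RECORDS = [dict(zip(FIELDS, row)) for row in ROWS]


def unique_records(records, fields):
    """Count records whose combination of `fields` values appears exactly once."""
    groups = Counter(tuple(r[f] for f in fields) for r in records)
    return sum(1 for size in groups.values() if size == 1)


def uniqueness_by_field_count(records, fields):
    """Return the unique-record count as fields are released one at a time."""
    return [unique_records(records, fields[:n]) for n in range(1, len(fields) + 1)]


if __name__ == "__main__":
    counts = uniqueness_by_field_count(RECORDS, FIELDS)
    for n, count in enumerate(counts, start=1):
        print(f"{n} field(s) released: {count} of {len(RECORDS)} records are unique")
```

In this toy set no record is unique on sex alone, but all eight are unique once the four coarse fields are combined. Adding a field can only split groups, never merge them, so the unique count never goes down.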
Compute-to-data: the architecture that doesn't require sharing
Rapha Protocol eliminates the sharing problem by eliminating the sharing requirement. The AI company does not receive data. The hospital does not export data. The regulator does not need to review a data export agreement. The patient does not need to trust that their de-identified record won't be re-identified.
Instead, the AI model trains at the hospital edge, inside a hardware-enforced trusted execution environment, under policy control and with network isolation. Its output is trained weights, not data exports. Every stakeholder's interests align.
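The compute-to-data pattern can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Rapha Protocol's API: `EgressPolicy`, `run_at_edge`, `fit_line`, and `EgressDenied` are hypothetical names, the records are synthetic, and the trusted execution environment and network isolation are not modelled at all. Only the policy gate on outputs is shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

# Synthetic (feature, label) pairs standing in for records that never leave the site.
Record = Tuple[float, float]
LOCAL_RECORDS: Sequence[Record] = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]


class EgressDenied(Exception):
    """Raised when a job tries to release something the policy does not allow."""


@dataclass(frozen=True)
class EgressPolicy:
    """Hypothetical policy: only a bounded set of named float weights may leave."""
    max_parameters: int


def fit_line(records: Sequence[Record]) -> Dict[str, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(records)
    mean_x = sum(x for x, _ in records) / n
    mean_y = sum(y for _, y in records) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in records)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in records)
    slope = cov_xy / var_x
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


def run_at_edge(
    job: Callable[[Sequence[Record]], object],
    records: Sequence[Record],
    policy: EgressPolicy,
) -> Dict[str, float]:
    """Run the job next to the data and release its output only if policy allows."""
    output = job(records)
    if not isinstance(output, dict):
        raise EgressDenied("output must be a mapping of weight names to floats")
    if len(output) > policy.max_parameters:
        raise EgressDenied("output exceeds the permitted number of parameters")
    for name, value in output.items():
        if not isinstance(name, str) or not isinstance(value, float):
            raise EgressDenied("output may contain only named float weights")
    return dict(output)


if __name__ == "__main__":
    policy = EgressPolicy(max_parameters=2)
    print(run_at_edge(fit_line, LOCAL_RECORDS, policy))
    try:
        # A job that tries to export the raw records is refused.
        run_at_edge(lambda recs: list(recs), LOCAL_RECORDS, policy)
    except EgressDenied as err:
        print("denied:", err)
```

The AI company supplies `job`, the hospital supplies `records` and `policy`, and the only thing that crosses the boundary is what `run_at_edge` returns. A shape check like this is deliberately crude: it shows where a policy gate sits, not that it is sufficient by itself.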