Privacy-Preserving Machine Learning in Healthcare

Privacy-preserving ML in healthcare: techniques ranked by real-world utility

Privacy-preserving machine learning (PPML) encompasses a family of techniques designed to train or deploy ML models while limiting exposure of training data. In healthcare, where training data is protected health information (PHI) and exposure carries regulatory and ethical consequences, PPML is not optional; it is a prerequisite. But not all PPML techniques are equal. Some provide formal privacy guarantees at the cost of model utility. Others provide hardware-enforced isolation that preserves both privacy and utility. This guide ranks them by real-world applicability to clinical AI.

Technique comparison for clinical AI training

Differential Privacy (DP)

How it works: Adds calibrated noise during model training to mask individual contributions. Provides a mathematical (ε, δ) guarantee that bounds how much any single record can change the trained model's output distribution. Smaller ε means stronger privacy and more noise.
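As a rough illustration of "calibrated noise", the sketch below follows the common DP-SGD pattern: clip each example's gradient to a fixed norm, sum, add Gaussian noise scaled to that norm, then average. The function names, the toy gradients, and the values of `max_norm` and `noise_multiplier` are assumptions made for this example. The sketch does not compute the resulting (ε, δ); that requires a separate privacy accountant.

```python
import math
import random

def clip(grad, max_norm):
    """Scale one example's gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]

def dp_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip every per-example gradient, sum, add Gaussian noise, then average."""
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise standard deviation is tied to the clipping bound, which is
    # the most any single record can contribute to the sum.
    sigma = noise_multiplier * max_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    return [v / len(per_example_grads) for v in noisy]

rng = random.Random(0)  # seeded only so the illustration is repeatable
toy_grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # hypothetical gradients
print(dp_gradient(toy_grads, max_norm=1.0, noise_multiplier=1.1, rng=rng))
```

Clipping is what bounds each record's influence; the noise then hides whatever influence remains. With only three examples the noise dominates the averaged gradient, which hints at why small cohorts suffer most.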

ML utility: 4/10. Noise degrades model accuracy, especially on rare classes and minority populations, which are the very cases clinical AI most needs to capture. ε must be carefully tuned: too low and the model is useless, too high and the privacy guarantee is weak.
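The effect on rare classes can be seen in a simplified setting. The sketch below uses the Laplace mechanism on a plain count query rather than on model training, so it is an analogy and not a measurement of model accuracy. The class sizes (10,000 and 20) and the ε values are hypothetical.

```python
import math

def laplace_std(sensitivity, epsilon):
    """Standard deviation of Laplace noise with scale b = sensitivity / epsilon."""
    return math.sqrt(2.0) * sensitivity / epsilon

def relative_noise(true_count, epsilon, sensitivity=1.0):
    """Noise standard deviation expressed as a fraction of the true count."""
    return laplace_std(sensitivity, epsilon) / true_count

for eps in (0.1, 1.0, 10.0):
    common = relative_noise(10_000, eps)  # hypothetical common diagnosis
    rare = relative_noise(20, eps)        # hypothetical rare diagnosis
    print(f"eps={eps:>4}: common class {common:.4%}, rare class {rare:.2%}")
```

The absolute noise is the same for both classes at a given ε, so the relative error is far larger for the small class.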

Privacy guarantee: 8/10. Formal mathematical guarantee. But the guarantee is only as strong as the ε budget. Real-world deployments often use ε values too high to provide meaningful privacy.

Best for: Publishing aggregate statistics. Training models where some accuracy loss is acceptable. Regulatory submissions requiring formal privacy guarantees.

Homomorphic Encryption (HE)

How it works: Encrypts data such that computations can be performed on encrypted values without decryption. The computation result, when decrypted, matches the result of the same computation on plaintext data.
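The property described above can be demonstrated with a toy version of the Paillier cryptosystem, which is additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. This is a sketch for intuition only. The primes 61 and 53 are far too small for real use, and Paillier supports addition only, not the arbitrary computation that neural network training needs.

```python
import math
import random

def keygen(p, q):
    """Build a toy Paillier key pair from two primes (illustrative sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    g = n + 1
    mu = pow(lam, -1, n)  # this shortcut for mu is valid because g = n + 1
    return (n, g), (lam, mu)

def encrypt(pub, m, rng):
    """Encrypt integer m (0 <= m < n) with fresh randomness r."""
    n, g = pub
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(pub, priv, c):
    """Recover the plaintext from ciphertext c."""
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return ((u - 1) // n * mu) % n

def add_encrypted(pub, c1, c2):
    """Homomorphic addition: the product of ciphertexts encrypts the sum."""
    n, _ = pub
    return (c1 * c2) % (n * n)

rng = random.Random(7)  # use a cryptographic RNG outside of a demo
pub, priv = keygen(61, 53)
c_sum = add_encrypted(pub, encrypt(pub, 120, rng), encrypt(pub, 45, rng))
print(decrypt(pub, priv, c_sum))
```

Even this single addition costs several modular exponentiations on numbers the size of n squared, which gives a sense of where the training-time overhead comes from.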

ML utility: 6/10. Full computation on encrypted data. In theory, no utility loss. In practice, HE is computationally expensive — training a single modern neural network with HE is orders of magnitude slower than plaintext training. Most clinical AI models cannot practically be trained with HE.

Privacy guarantee: 9/10. Strong cryptographic guarantee. Data is never decrypted during computation. But the guarantee applies only during computation: inputs are plaintext before the data owner encrypts them, and outputs are plaintext once the key holder decrypts them.

Best for: Inference on encrypted data. Small-scale computations where latency is acceptable. Not practical for training large clinical AI models.

Secure Multi-Party Computation (SMPC)

How it works: Distributes computation across multiple parties such that no single party can reconstruct the inputs. Each party holds only a share of the data and computes on that share. The output is reconstructed from the parties' result shares.
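Additive secret sharing is one of the simplest building blocks behind this idea. In the sketch below, three hospitals want the total of their case counts without revealing individual counts. The counts, the number of parties, and the modulus are assumptions chosen for the example.

```python
import random

PRIME = 2**61 - 1  # field modulus (a Mersenne prime); an illustrative choice

def share(secret, n_parties, rng):
    """Split secret into n_parties random shares that sum to it mod PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """Recombine shares; any strict subset of them looks uniformly random."""
    return sum(shares) % PRIME

# random.Random is seeded for repeatability; real protocols need a
# cryptographically secure source such as the secrets module.
rng = random.Random(42)
hospital_counts = [132, 87, 45]  # hypothetical private inputs
all_shares = [share(c, 3, rng) for c in hospital_counts]

# Party j receives one share from each hospital and adds them locally.
partial_sums = [sum(s[j] for s in all_shares) % PRIME for j in range(3)]

total = reconstruct(partial_sums)
print(total)
```

Addition needs no interaction beyond distributing shares. Multiplication, and therefore every layer of a neural network, requires extra rounds of communication between parties.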

ML utility: 5/10. In theory, full utility. In practice, SMPC introduces massive communication overhead between parties. Training a neural network with SMPC requires constant message passing between computation nodes — impractical for anything beyond simple models.

Privacy guarantee: 8/10. Strong guarantee under the protocol's trust assumption, typically an honest majority or a bound on the number of colluding parties. The guarantee depends on party independence: if more parties collude than the protocol tolerates, privacy fails.

Best for: Specific computations where multiple distrusting parties must jointly compute a function. Not practical for training clinical AI models at scale.

Compute-to-Data with TEE (Rapha Protocol)

How it works: Model training executes inside a hardware-enforced trusted execution environment (an Intel SGX enclave or TDX trust domain) at the hospital edge. The CPU encrypts the environment's memory, so even the hospital's own OS and hypervisor cannot inspect data in use. Only trained weights leave the enclave.

ML utility: 10/10. Full plaintext training on real clinical data. No noise injection. No computation on ciphertexts. No multi-party coordination. Same model quality as training on exported data, without the export.

Privacy guarantee: 8/10. Hardware-enforced isolation. Intel SGX/TDX attestation verifiable through DCAP. TPM 2.0 measured boot. Kernel-level air-gap severs network during training. Privacy guarantee depends on trust in Intel's SGX/TDX implementation, a widely studied hardware root of trust, and on keeping microcode and firmware patched against published side-channel attacks.
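The decision that attestation enables can be sketched in a few lines: accept results only from code whose measurement matches an approved value. The example below is a deliberately simplified stand-in and not the DCAP flow. Real verification checks a hardware-signed quote and its certificate chain, and the measurement is produced by the CPU rather than by hashing a byte string. `APPROVED_IMAGE`, `measure`, and `release_weights` are hypothetical names.

```python
import hashlib
import hmac

# Hypothetical allowlist entry: the one training image the hospital approved.
APPROVED_IMAGE = b"training-enclave-v1"  # placeholder bytes, not a real binary
EXPECTED_MEASUREMENT = hashlib.sha256(APPROVED_IMAGE).hexdigest()

def measure(image: bytes) -> str:
    """Stand-in for a hardware measurement: SHA-256 of the code image."""
    return hashlib.sha256(image).hexdigest()

def release_weights(reported_measurement: str, weights: bytes):
    """Return the weights only if the measurement matches the approved one."""
    # compare_digest avoids leaking how many leading characters matched.
    if hmac.compare_digest(reported_measurement, EXPECTED_MEASUREMENT):
        return weights
    return None

print(release_weights(measure(APPROVED_IMAGE), b"weights") is not None)
print(release_weights(measure(b"tampered-image"), b"weights") is not None)
```

The point of the sketch is the gate itself: any change to the training code changes the measurement, and the weights are rejected.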

Best for: Training production clinical AI models on real PHI-containing data at the hospital edge. Regulatory-compatible by design — data minimisation principle satisfied because data never leaves the institution.

Why hardware beats software for clinical AI privacy

Software-based PPML techniques — differential privacy, homomorphic encryption, SMPC — add computational overhead that scales with model size and data volume. In clinical AI, where models are large (vision transformers, LLMs) and datasets are massive (millions of DICOM studies, billions of EHR rows), this overhead renders software PPML impractical for training.

Hardware-based PPML, meaning confidential computing with SGX/TDX enclaves, provides isolation at the silicon level with low computational overhead: the enclave adds a few percent compared to native execution. The privacy guarantee comes from the hardware, not from mathematical noise or cryptographic overhead.

For clinical AI training where model quality and training throughput are critical, hardware-enforced privacy is the only one of the four techniques compared here that delivers both privacy and utility.

Rapha Protocol does not claim that trained model weights are automatically privacy-preserving. Additional leakage testing, differential privacy, and minimum cohort thresholds should be evaluated per deployment. TEE-based compute-to-data eliminates data export risk; it does not guarantee that trained weights are immune to membership inference or model inversion attacks.
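One simple form of leakage testing is a loss-threshold membership inference check: if a model's loss is systematically lower on training records than on held-out records, an attacker can guess membership from the loss alone. The sketch below is a minimal version of that idea, assuming per-record losses are already available and that the two groups are the same size. The loss values are invented for illustration.

```python
def membership_attack_accuracy(member_losses, nonmember_losses):
    """Best accuracy of the rule 'loss <= threshold means member' over all thresholds."""
    candidates = sorted(set(member_losses) | set(nonmember_losses))
    total = len(member_losses) + len(nonmember_losses)
    best = 0.0
    for t in candidates:
        correct = sum(loss <= t for loss in member_losses)
        correct += sum(loss > t for loss in nonmember_losses)
        best = max(best, correct / total)
    return best

# Hypothetical per-record losses from a trained model.
train_losses = [0.05, 0.10, 0.20, 0.15]    # records seen during training
heldout_losses = [0.60, 0.90, 0.12, 0.75]  # records never seen

print(membership_attack_accuracy(train_losses, heldout_losses))
```

With balanced groups, a result near 0.5 means this particular attack does no better than guessing, while a result near 1.0 signals memorisation worth investigating before weights are released. Passing this check does not rule out stronger attacks.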
