Medical Data De-identification
De-identify clinical notes, PDFs, DICOM images, structured data, and FHIR resources with regulatory-grade accuracy. Process billions of records without moving data from your secure environment. Includes Expert Determination, tokenization, and consistent obfuscation.
Used by Providence Health to De-identify 2 Billion Patient Notes
- Consistent obfuscation, date shifting, and tokenization per patient
- Equity analysis across gender, age, ethnicity, and geography
- 3-month "red teaming" by external security provider
- Manual review of 35,000+ notes by compliance team
- Validated at unprecedented scale on 2 billion real-world patient notes
- Adversarial testing on 790 patients with zero re-identifications
- Surpasses triple manual review by domain experts
- Published peer-reviewed methodology and results
Why This is the Most Widely Deployed
True Multimodal Processing
Consistent de-identification across modalities—unstructured clinical text, PDFs, DICOM images, structured tables, and FHIR resources. One solution, one pipeline, zero inconsistencies.
Expert Determination Included
Complete HIPAA Expert Determination process with legal documentation. Not just de-identification—full regulatory compliance with auditable validation and statistical risk analysis.
De-identify Data Where It Lives
Process data in-place within your secure environment—Databricks, Snowflake, Azure, or AWS. Zero data movement. Everything happens in memory. Your data never leaves your control.
Unmatched Accuracy at Scale
99%+ PHI detection rate, proven at a scale of 2 billion notes. Validated through independent audits and red team testing. Surpasses triple manual review by domain experts.
Deterministic & Consistent
Same PHI always produces the same token. Maintains longitudinal patient linkage across records. Built for reproducible, auditable results.
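One common way to get this property is keyed hashing. The sketch below is illustrative only and is not taken from the product: the function name `tokenize`, the `TOK` prefix, the 16-character truncation, and the normalization step are all assumptions. It uses HMAC-SHA256 from the Python standard library so that the same input and the same secret key always yield the same token.

```python
import hashlib
import hmac


def tokenize(value: str, secret_key: bytes, prefix: str = "TOK") -> str:
    """Return a deterministic token for a PHI value (hypothetical helper)."""
    # Normalize case and whitespace so trivial variants map to one token.
    normalized = " ".join(value.lower().split())
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncation length is an arbitrary choice for readability.
    return f"{prefix}_{digest[:16]}"


key = b"example-only-key"  # assumption: a real key would come from a secrets manager
token_a = tokenize("John  Smith", key)
token_b = tokenize("john smith", key)
token_c = tokenize("Jane Doe", key)

print(token_a == token_b)  # same person, same token
print(token_a == token_c)  # different person, different token
```

Because the token depends on the key, records tokenized with the same key stay linkable, while anyone without the key cannot recompute the mapping from a list of candidate names.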
Cost-Effective at Scale
80%+ cheaper than API-based solutions. Fixed-cost deployment with no per-token fees. Proven economics at billion-record scale with predictable, transparent pricing.
The Data De-identification Process
- Risk analysis
- Legal requirements review
- HIPAA Safe Harbor, HIPAA Expert Determination
- CCPA
- GDPR pseudonymization, GDPR anonymization
- Quality assurance strategy & process
- ID, name, email, patient ID, SSN, credit card, address, birthday, phone, URL, license number
- Physician name, hospital name, profession, employer, affiliation
- Racial or ethnic origin, religion, political or union affiliation, biometric or genetic data, sexual practice or orientation
- Cleanroom AI Platform (on-site)
- Annotation tool
- Active learning
- Accuracy measurement & agreement processes
- Correct sampling
- Multi-lingual
- Tabular (headers, values)
- Text (NER, text matching)
- PDF: Text or Scanned
- Images (OCR & metadata)
- DICOM (OCR & metadata)
- Replace (or delete a field)
- Mask (hash identifiers or shift dates)
- Obfuscate (name, locations, organizations)
- Generalize (disease codes, dates, addresses)
- Ongoing measurement & model improvement
- Missed sensitive data
- Incident response
- GDPR & CCPA requests
- Emergency unblinding
- Audits
De-identification Across All Medical Data Types
Clinical Text Notes
Automatically identify and obfuscate 23+ PHI entity types including patient names, doctors, hospitals, medical records, locations, dates, and contact information.
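As a rough illustration of the detect-then-replace flow, the sketch below uses a handful of regular expressions. This is a deliberately simple rule-based stand-in, not the models the product uses, and the pattern set, labels, and function names (`find_phi`, `redact`) are assumptions made for the example. Names, hospitals, and similar free-text entities generally need a trained NER model rather than patterns like these.

```python
import re

# Illustrative patterns only; real PHI detection needs far broader coverage.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def find_phi(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, label) for every pattern match, sorted by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), label))
    return sorted(hits)


def redact(text: str) -> str:
    """Replace each detected span with a tag naming its entity type."""
    pieces, cursor = [], 0
    for start, end, label in find_phi(text):
        pieces.append(text[cursor:start])
        pieces.append(f"<{label}>")
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Seen 2024-03-05. SSN 123-45-6789, phone 206-555-0100, email jane.doe@example.org."
redacted = redact(note)
print(redacted)
```

The sketch assumes the patterns never produce overlapping matches, which holds for this sample note but would need explicit handling in general.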


PDF Documents (Text & Scanned)
Process PDFs directly—preserving original structure and formatting while removing PHI with 94%+ accuracy. Handles both digital and scanned documents with built-in OCR, ensuring HIPAA-compliant de-identification without document recreation.


DICOM Medical Images
Redact burned-in PHI using intelligent pixel inpainting, with complete metadata de-identification. Handles all DICOM modalities while maintaining image utility, and saves the de-identified output in DCM format for further processing.


Structured Data & Databases
De-identify PHI from structured datasets automatically while enforcing GDPR and HIPAA compliance and maintaining linkage of clinical data across files. Process millions of records with consistent obfuscation.
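To make the idea of column-level rules with preserved linkage concrete, here is a minimal sketch. The rule names (`drop`, `token`, `year`, `zip3`), the column names, and the helper functions are hypothetical and chosen for the example; they are not the product's configuration format. The same keyed hash is applied to `patient_id` in two tables, so rows can still be joined after de-identification.

```python
import hashlib
import hmac


def pseudonymize_id(value: str, secret_key: bytes) -> str:
    """Keyed hash of an identifier, truncated for readability (assumed scheme)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:12]


# Hypothetical per-column rules; unlisted columns are kept unchanged.
RULES = {
    "name": "drop",
    "patient_id": "token",
    "birth_date": "year",
    "visit_date": "year",
    "zip": "zip3",
}


def deidentify_row(row: dict, rules: dict, secret_key: bytes) -> dict:
    out = {}
    for column, value in row.items():
        rule = rules.get(column, "keep")
        if rule == "drop":
            continue  # remove the column entirely
        if rule == "token":
            out[column] = pseudonymize_id(value, secret_key)
        elif rule == "year":
            out[column] = value[:4]  # assumes ISO dates: YYYY-MM-DD
        elif rule == "zip3":
            out[column] = value[:3]  # keep only the first three ZIP digits
        else:
            out[column] = value
    return out


key = b"example-only-key"  # assumption: shared across all files in one project
patients = [{"patient_id": "MRN-1001", "name": "Jane Doe", "birth_date": "1980-04-12", "zip": "98101"}]
visits = [{"patient_id": "MRN-1001", "visit_date": "2023-02-01", "diagnosis": "E11.9"}]

clean_patients = [deidentify_row(r, RULES, key) for r in patients]
clean_visits = [deidentify_row(r, RULES, key) for r in visits]
print(clean_patients[0]["patient_id"] == clean_visits[0]["patient_id"])
```

Whether a given rule set satisfies HIPAA or GDPR depends on the full dataset and context; the rules here only demonstrate the mechanics.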


Consistent Tokenization & Obfuscation
Protect patient privacy without breaking data relationships: consistent tokenization ensures the same individual receives identical replacement values across all documents, preserving referential integrity for research and analysis while meeting regulatory requirements.


Link Multimodal Patient Data Over Time
Maintain complete patient context across longitudinal records—preserving temporal relationships, narrative continuity, and data linkability while ensuring privacy through consistent de-identification and intelligent date shifting.
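Date shifting that preserves temporal relationships can be sketched as a single per-patient offset applied to every date for that patient. The code below is an assumption-laden illustration, not the product's algorithm: the function names, the 365-day bound, and the use of HMAC to derive the offset are all choices made for this example.

```python
import hashlib
import hmac
from datetime import date, timedelta

MAX_SHIFT_DAYS = 365  # assumed bound on how far dates may move


def patient_offset(patient_id: str, secret_key: bytes, max_shift: int = MAX_SHIFT_DAYS) -> timedelta:
    """Derive a stable, non-zero offset in [-max_shift, max_shift] days for one patient."""
    digest = hmac.new(secret_key, patient_id.encode("utf-8"), hashlib.sha256).digest()
    k = int.from_bytes(digest[:8], "big") % (2 * max_shift)
    # Map 0..max_shift-1 to negative offsets and the rest to positive ones, skipping zero.
    days = k - max_shift if k < max_shift else k - max_shift + 1
    return timedelta(days=days)


def shift_date(d: date, patient_id: str, secret_key: bytes) -> date:
    """Shift a date by the patient's fixed offset."""
    return d + patient_offset(patient_id, secret_key)


key = b"example-only-key"  # assumption: kept secret and constant for the project
admit = date(2023, 1, 10)
discharge = date(2023, 1, 17)

shifted_admit = shift_date(admit, "patient-001", key)
shifted_discharge = shift_date(discharge, "patient-001", key)

# The length of stay is unchanged even though both dates moved.
print((shifted_discharge - shifted_admit).days)
```

Because every date for a patient moves by the same amount, intervals such as length of stay or time between visits survive, while the absolute calendar dates do not.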


Multi-Language Support Across Healthcare Systems
- English
- Spanish
- French
- German
- Italian
- Portuguese
- Dutch
- Romanian
- Arabic
Customizable for Your Unique Requirements
Custom Entity Recognition
Define and train models to recognize your organization’s specific identifiers:
- Internal patient ID formats
- Custom medical record numbering
- Facility-specific codes
- Department identifiers
- Study participant IDs
Flexible Policies
Configure granular policies that balance data utility with privacy protection:
- Obfuscate names with realistic fakes
- Mask specific fields with asterisks
- Shift dates while preserving intervals
- Generalize locations (city → state)
- Tokenize for consistent linkage
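A policy of this kind can be thought of as a mapping from entity type to action. The sketch below is hypothetical: the `Span` class, the `POLICY` dictionary, the tiny city-to-state lookup, and the function names are invented for illustration and do not reflect the product's configuration API. It assumes the entity spans were already detected and do not overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str


# Illustrative lookup; a real deployment would use a full gazetteer.
CITY_TO_STATE = {"Seattle": "Washington", "Portland": "Oregon"}


def mask(text: str) -> str:
    """Replace every character with an asterisk."""
    return "*" * len(text)


def generalize_city(text: str) -> str:
    """Replace a city with its state, or a generic tag if unknown."""
    return CITY_TO_STATE.get(text, "<LOCATION>")


# Hypothetical policy: entity label -> action. Unlisted labels become a tag.
POLICY = {"PHONE": mask, "CITY": generalize_city}


def apply_policy(text: str, spans: list[Span], policy: dict) -> str:
    pieces, cursor = [], 0
    for span in sorted(spans, key=lambda s: s.start):
        pieces.append(text[cursor:span.start])
        original = text[span.start:span.end]
        action = policy.get(span.label)
        pieces.append(action(original) if action else f"<{span.label}>")
        cursor = span.end
    pieces.append(text[cursor:])
    return "".join(pieces)


def locate(text: str, fragment: str, label: str) -> Span:
    """Build a span for the first occurrence of a fragment (stand-in for a detector)."""
    start = text.index(fragment)
    return Span(start, start + len(fragment), label)


text = "Call 555-0100 from Seattle, signed Dr. Lee"
spans = [
    locate(text, "555-0100", "PHONE"),
    locate(text, "Seattle", "CITY"),
    locate(text, "Dr. Lee", "DOCTOR"),
]
result = apply_policy(text, spans, POLICY)
print(result)
```

Keeping the policy as data separate from the detection step is one way to let the same detected entities be handled differently for different datasets or regulations.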
Format-Specific Processing
Native support for every healthcare document format, preserving structure and content:
- Custom PDF templates and forms
- Proprietary EHR exports
- Legacy system formats
- Complex table layouts
- Embedded images and charts
Regulatory Compliance
Configure policies to meet your regulatory requirements:
- HIPAA Safe Harbor
- HIPAA Expert Determination
- GDPR compliance ready
- CCPA and state privacy laws
- Custom regulatory frameworks
Enterprise Integration
Seamless integration into your existing tech stack with robust API access:
- RESTful API integration
- Real-time processing
- Batch processing
- Horizontal scaling and high availability
- Full audit logs and monitoring
Enterprise Security
A secure solution with deployment options for any security posture:
- End-to-end data encryption
- On-premises or cloud deployment
- Air-gapped environments
- Zero data retention by default
Peer-Reviewed, State-of-the-Art Accuracy



Try It on Your Own Data
Run our pre-trained de-identification pipeline in a Google Colab notebook. See exactly how it works, test with your data, and customize for your needs.
Runs in One Click
Easy to Customize