Medical Data De-identification

  • Simple process & setup
  • Automatically de-identify structured data, unstructured data, documents, PDF files, and images in compliance with HIPAA, GDPR, or custom needs
  • Trusted by 5 of 8 Top Pharma Companies

How Providence Health De-Identified 700 Million Patient Notes with Spark NLP

Accuracy: 99.19% of sentences correctly de-identified
Performance: 2.46 hours to de-identify 500K patient notes

Peer-Reviewed, State-of-the-Art Accuracy

Can Zero-Shot Commercial APIs Deliver Regulatory-Grade Clinical Text DeIdentification?
Accepted at Text2Story Workshop at ECIR 2025
Beyond Accuracy: Automated De-Identification of Large Real-World Clinical Text Datasets
Machine Learning for Health (ML4H) 2023
Accurate Clinical and Biomedical Named Entity Recognition at Scale
Software Impacts, July 2022

Live Test with Your Medical Data

The Data De-identification Software

1
Analyze
Human
  • Risk analysis
  • Legal requirements review
  • HIPAA Safe Harbor, HIPAA Expert Determination
  • CCPA
  • GDPR pseudonymization, GDPR anonymization
  • Quality assurance strategy & process
Receive raw data
2
Identify
Software
  • ID, name, email, patient ID, SSN, credit card, address, birthday, phone, URL, license number
  • Physician name, hospital name, profession, employer, affiliation
  • Racial or ethnic origin, religion, political or union affiliation, biometric or genetic data, sexual practice or orientation
3
Measure
Human
  • Cleanroom AI Platform (on-site)
  • Annotation tool
  • Active learning
  • Accuracy Measurement & agreement processes
  • Correct sampling
  • Multi-lingual
4
De-identify
Software
We support:
  • Tabular (headers, values)
  • Text (NER, text matching)
  • PDF: Text or Scanned
  • Images (OCR & metadata)
  • DICOM (OCR & metadata)
So you can:
  • Replace (or delete a field)
  • Mask (hash identifiers or shift dates)
  • Obfuscate (name, locations, organizations)
  • Generalize (disease codes, dates, addresses)
Deliver de-identified data
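
The replace, mask, and generalize operations listed above can be illustrated with a small sketch in plain Python. This is not the product's implementation: the record layout, the field names, the salt, and the helper functions are all hypothetical, and the choice of SHA-256 and of three-digit ZIP prefixes is an assumption made for the example.

```python
import hashlib
from datetime import date


def mask_identifier(value: str, salt: str) -> str:
    """Mask: replace an identifier with a truncated, salted SHA-256 digest."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def generalize_zip(zip_code: str) -> str:
    """Generalize: keep only the first three digits of a ZIP code."""
    return zip_code[:3] + "**"


def deidentify_record(record: dict, salt: str) -> dict:
    """Return a de-identified copy; the input record is left unchanged."""
    out = dict(record)
    out.pop("ssn", None)                                          # delete a field
    out["patient_id"] = mask_identifier(out["patient_id"], salt)  # mask by hashing
    out["birth_year"] = out.pop("birth_date").year                # generalize date to year
    out["zip"] = generalize_zip(out["zip"])                       # generalize address
    return out


# Hypothetical example record.
record = {
    "patient_id": "MRN-000123",
    "ssn": "000-00-0000",
    "birth_date": date(1984, 5, 17),
    "zip": "02139",
}
clean = deidentify_record(record, salt="example-salt")
print(clean)
```

Which operation suits which field depends on the risk analysis and legal review done in the Analyze step; the sketch only shows the mechanics.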
5
Monitor
Human
  • Ongoing measurement & model improvement
  • Missed sensitive data
  • Incident response
  • GDPR & CCPA requests
  • Emergency unblinding
  • Audits

De-identification Solutions with Full Range of Features

John Snow Labs’ De-identification solutions AWS Medical Comprehend Microsoft Presidio Google DLP
De-identification tool
End-to-end service
Also available as a standalone library
Established new state-of-the-art accuracy in a peer-reviewed publication
Real-world reference with >99% correctly recognized PHI
Scanned PDF Integrated Separate service Separate service
DICOM Integrated Separate service Separate service
Obfuscation
Software with Multilingual support
Built on big data framework
Possible to fine-tune standard pre-trained models
Data does not leave your premises
Works on an air-gapped server with no internet access

De-identification in Action

De-Identify Unstructured Clinical Text

Automatically identify up to 23 types of protected health information entities in clinical documents, including Patient, Doctor, Hospital, MedicalRecord, IDNum, Location, and Profession, using our pretrained Spark NLP models.
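
Once a model has located the entity spans, masking amounts to replacing each span with its label. The following is a minimal sketch in plain Python, assuming the spans are already available as (start, end, label) tuples. The sample note, the spans, and the `mask_entities` helper are hypothetical and do not represent the Spark NLP API.

```python
def mask_entities(text: str, spans: list) -> str:
    """Replace each (start, end, label) span with <LABEL>.

    Assumes the spans do not overlap; they are processed in order of start offset.
    """
    pieces = []
    cursor = 0
    for start, end, label in sorted(spans):
        pieces.append(text[cursor:start])      # text before the entity
        pieces.append(f"<{label.upper()}>")    # placeholder for the entity
        cursor = end
    pieces.append(text[cursor:])               # text after the last entity
    return "".join(pieces)


def span_of(text: str, phrase: str, label: str) -> tuple:
    """Helper for the example: locate a phrase and tag it with a label."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


# Hypothetical note and entity spans, standing in for model output.
note = "John Smith was seen by Dr. Lee at Mercy Hospital."
spans = [
    span_of(note, "John Smith", "Patient"),
    span_of(note, "Lee", "Doctor"),
    span_of(note, "Mercy Hospital", "Hospital"),
]
masked = mask_entities(note, spans)
print(masked)
```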

De-identify structured data

Automatically de-identify PHI (protected health information) in structured datasets while enforcing GDPR and HIPAA compliance and maintaining linkage of clinical data across files.
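
One common way to keep records linkable across files after de-identification is to replace each identifier with a keyed hash, so the same patient ID always maps to the same pseudonym. The sketch below uses Python's standard `hmac` module. It illustrates the general technique only, not the product's method; the tables, the key, and the function names are hypothetical.

```python
import hashlib
import hmac


def pseudonymize_id(patient_id: str, key: bytes) -> str:
    """Map an identifier to a stable pseudonym using HMAC-SHA256 (truncated)."""
    return hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256).hexdigest()[:12]


def pseudonymize_table(rows: list, key: bytes, id_field: str = "patient_id") -> list:
    """Return copies of the rows with the identifier field pseudonymized."""
    return [{**row, id_field: pseudonymize_id(row[id_field], key)} for row in rows]


# Hypothetical tables that share a patient identifier.
visits = [
    {"patient_id": "P001", "visit_type": "outpatient"},
    {"patient_id": "P002", "visit_type": "emergency"},
]
labs = [
    {"patient_id": "P001", "test": "HbA1c"},
]

key = b"example-secret-key"  # assumption: in practice the key is stored securely
visits_out = pseudonymize_table(visits, key)
labs_out = pseudonymize_table(labs, key)

# The same patient receives the same pseudonym in both files.
print(visits_out[0]["patient_id"] == labs_out[0]["patient_id"])
```

Using a secret key rather than a plain hash means the pseudonyms cannot be recomputed from a list of known IDs without the key.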

De-identify PDF documents

De-identify PDF documents according to HIPAA guidelines by masking PHI with out-of-the-box Spark NLP and Spark OCR models.

De-identify DICOM documents

De-identify DICOM documents by masking PHI on the image and by either masking or obfuscating PHI in the metadata.

De-identify PHI in Multiple Languages

This pipeline de-identifies PHI in English medical texts. The PHI is masked and obfuscated in the resulting text.

Consistent Tokenization & Obfuscation

Ensure data clarity, usability, and consistency while prioritizing privacy and security. Protect sensitive information, without hindering data usability or insight extraction.
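
Consistency here means that the same original value always receives the same surrogate, so a patient mentioned in several notes still reads as one person after obfuscation. A minimal sketch of that idea in plain Python follows. The surrogate list, the secret, and `surrogate_for` are hypothetical, and with such a short list two different names can collide on the same surrogate.

```python
import hashlib

# Hypothetical pool of replacement names; a real pool would be much larger.
SURROGATE_NAMES = ["Alex Morgan", "Jamie Reed", "Taylor Brooks", "Casey Lane"]


def surrogate_for(name: str, secret: str) -> str:
    """Pick a surrogate deterministically from a hash of the normalized name."""
    normalized = " ".join(name.lower().split())  # ignore case and extra spaces
    digest = hashlib.sha256((secret + normalized).encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(SURROGATE_NAMES)
    return SURROGATE_NAMES[index]


secret = "example-secret"
first = surrogate_for("John Smith", secret)
second = surrogate_for("john  smith", secret)

# Both spellings resolve to the same surrogate name.
print(first == second)
```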

Link Multimodal Patient Data Over Time

Normalize and shift dates with ease.
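
A common approach to date shifting is to move every date belonging to one patient by the same secret offset, which hides the real dates while preserving the intervals between events. The sketch below shows that approach in plain Python under stated assumptions: the accepted date formats, the 1 to 365 day offset range, the secret, and the function names are all hypothetical choices for illustration.

```python
import hashlib
from datetime import date, datetime, timedelta

# Assumed input formats for normalization; extend as needed.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def normalize_date(text: str) -> date:
    """Parse a date string in one of the assumed formats."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text!r}")


def patient_offset_days(patient_id: str, secret: str, max_days: int = 365) -> int:
    """Derive a stable per-patient offset between 1 and max_days."""
    digest = hashlib.sha256((secret + patient_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % max_days + 1


def shift_date(d: date, patient_id: str, secret: str) -> date:
    """Shift a date by the patient's offset."""
    return d + timedelta(days=patient_offset_days(patient_id, secret))


secret = "example-secret"
admitted = normalize_date("03/01/2021")
discharged = normalize_date("2021-03-08")

shifted_admitted = shift_date(admitted, "P001", secret)
shifted_discharged = shift_date(discharged, "P001", secret)

# The length of stay is unchanged by the shift.
print((shifted_discharged - shifted_admitted).days)
```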

Data De-identification Tools: Webinars