Watch the webinar

Accurate de-identification, obfuscation, and editing of scanned medical documents and images

Scanned documents and images are one kind of noisy data that healthcare data scientists deal with, from PDF attachments of lab results, referrals, or genetic testing to DICOM files with medical imaging. These files are challenging to de-identify because protected health information (PHI) can appear anywhere in free text, so it cannot be removed with rules or regular expressions, or it can be “burned” into images, so that it is not even available as digital text to begin with.

This webinar presents a software system that tackles these challenges, with lessons learned from applying it in real-world production systems. The workflow uses:

Spark OCR to extract both digital and scanned text from PDF and DICOM files
Spark NLP for Healthcare to recognize sensitive data in the extracted free text
The de-identification module to delete, replace, or obfuscate PHI
Spark OCR to generate a new PDF or DICOM file with the de-identified data
The whole workflow runs within a local, secure environment, with no need to share data with any third party or a public cloud API
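
To make the "delete, replace, or obfuscate" step concrete, here is a minimal sketch in plain Python. It does not use the Spark OCR or Spark NLP for Healthcare APIs: the `Entity` class, the `SURROGATES` table, and the `deidentify` function are hypothetical names chosen for illustration, and the entity spans are written by hand to stand in for the output of a recognition model running on text that has already been extracted.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A PHI span, as a recognition model might report it (hypothetical shape)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # e.g. "NAME" or "DATE"


# Hypothetical surrogate values for the "obfuscate" policy.
SURROGATES = {
    "NAME": ["Jane Roe", "John Doe", "Alex Kim"],
    "DATE": ["2000-01-01", "1999-12-31"],
}


def surrogate(label, original):
    """Pick a fake value deterministically, so equal inputs get equal fakes."""
    options = SURROGATES.get(label, ["<" + label + ">"])
    digest = hashlib.sha256(original.encode("utf-8")).digest()
    return options[digest[0] % len(options)]


def deidentify(text, entities, policy="replace"):
    """Apply one policy to every span; spans are assumed not to overlap."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        pieces.append(text[cursor:ent.start])  # keep text before the span
        original = text[ent.start:ent.end]
        if policy == "delete":
            pass  # drop the span entirely
        elif policy == "replace":
            pieces.append("<" + ent.label + ">")  # mask with the entity label
        elif policy == "obfuscate":
            pieces.append(surrogate(ent.label, original))  # realistic fake value
        else:
            raise ValueError("unknown policy: " + policy)
        cursor = ent.end
    pieces.append(text[cursor:])  # keep text after the last span
    return "".join(pieces)


def span(text, value, label):
    """Helper for the example: locate a known value and wrap it as an Entity."""
    start = text.index(value)
    return Entity(start, start + len(value), label)


text = "Patient John Smith seen on 2021-03-04."
entities = [span(text, "John Smith", "NAME"), span(text, "2021-03-04", "DATE")]

for policy in ("delete", "replace", "obfuscate"):
    print(policy, "->", deidentify(text, entities, policy))
```

With the "replace" policy, the example sentence becomes `Patient <NAME> seen on <DATE>.` Obfuscation keeps the document readable by substituting plausible values, which matters when the de-identified text is written back into a new PDF or DICOM file.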

About the speaker

Dr. Alina Petukhova

Data Scientist at John Snow Labs