One kind of noisy data that healthcare data scientists deal with is scanned documents and images: from PDF attachments of lab results, referrals, or genetic testing to DICOM files with medical imaging. These files are challenging to de-identify because protected health information (PHI) can appear anywhere in free text, so it cannot be removed with rules or regular expressions, or can be “burned” into images, so that it is not even available as digital text to begin with.
This webinar presents a software system that tackles these challenges, with lessons learned from applying it in real-world production systems. The workflow uses:
- Spark OCR to extract both digital and scanned text from PDF and DICOM files
- Spark NLP for Healthcare to recognize sensitive data in the extracted free text
- The de-identification module to delete, replace, or obfuscate PHI
- Spark OCR to generate a new PDF or DICOM file with the de-identified health data
- A local, secure environment to run the whole workflow, with no need to share data with any third party or a public cloud API
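
To make the three de-identification options in the list above concrete, here is a minimal sketch in plain Python. It does not use the Spark OCR or Spark NLP for Healthcare APIs: the `Entity` class, the `deidentify` function, and the `SURROGATES` table are hypothetical names introduced only for illustration. It assumes an upstream recognition step has already produced non-overlapping character spans for each piece of PHI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A detected PHI span (hypothetical structure, not a library type)."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # entity type, e.g. "NAME" or "DATE"


# Hypothetical fixed surrogates; a real system would generate realistic fake values.
SURROGATES = {"NAME": "Jane Doe", "DATE": "01/01/2000"}


def deidentify(text, entities, mode="replace"):
    """Return text with each entity span deleted, replaced by its label, or obfuscated."""
    if mode not in ("delete", "replace", "obfuscate"):
        raise ValueError(f"unknown mode: {mode}")
    pieces = []
    cursor = 0
    # Walk the spans left to right, copying the untouched text between them.
    for ent in sorted(entities, key=lambda e: e.start):
        pieces.append(text[cursor:ent.start])
        if mode == "replace":
            pieces.append(f"<{ent.label}>")
        elif mode == "obfuscate":
            # Fall back to the label tag when no surrogate is defined.
            pieces.append(SURROGATES.get(ent.label, f"<{ent.label}>"))
        # mode == "delete": append nothing, dropping the span entirely.
        cursor = ent.end
    pieces.append(text[cursor:])
    return "".join(pieces)


note = "Patient John Smith was seen on 03/14/2021."
# Spans assumed to come from an upstream entity recognizer.
entities = [Entity(8, 18, "NAME"), Entity(31, 41, "DATE")]

for mode in ("delete", "replace", "obfuscate"):
    print(mode, "->", deidentify(note, entities, mode))
```

Replacing with a label keeps the text readable while signaling that something was removed, whereas obfuscating with surrogate values keeps the document looking natural, which may matter when the output is used to train or evaluate downstream models.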