The ability to extract clinical information at large scale and in real time from unstructured clinical notes is becoming a mission-critical capability for IQVIA. Key data elements such as tumor stage and size, Social Determinants of Health, and ejection fraction are not available in typical structured EMR records. In addition, the Cures Act Final Rule brings unstructured notes into play in the US from October 2022, and the EU is likely to follow suit. The market will rapidly expect CROs and data vendors to be able to leverage unstructured notes for clinical trial recruitment, study registries, precision medicine, and the manufacture of longitudinal datasets.

The first step in unlocking the value of unstructured healthcare data is efficient handling of Personally Identifiable Information (PII). Accurately anonymizing medical data is challenging when the data spans multiple languages beyond English, and short doctor notes pose a different challenge from well-formed documents with paragraphs and sentences.

In this talk we show the data flow for what is probably the largest multi-country EMR data platform in the world, focusing on the de-identification module that allows doctor notes to be safely ingested and opened up for analytics. We go into the details of the Spark NLP de-identification model and pipeline built to handle German texts, share results, and summarize lessons learned, including combining rules and models for entity types where a rules-based approach outperforms trained models.
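As a rough illustration of how such a hybrid rules-plus-model pipeline can be assembled in Spark NLP, the sketch below runs a pretrained German NER model alongside a rule-based RegexMatcher on a short doctor note. It is a minimal sketch, not the pipeline presented in the talk: the production German de-identification models and masking annotators ship with the licensed Spark NLP for Healthcare library, so the open-source WikiNER German model and multilingual GloVe embeddings here only stand in for them, and the single date regex rule is illustrative.

```python
# Minimal sketch of a hybrid de-identification pipeline using only
# open-source Spark NLP components. Assumptions: the wikiner_840B_300 (de)
# NER model and glove_840B_300 (xx) embeddings stand in for the licensed
# clinical de-identification models; the regex rules file is illustrative.
import sparknlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (
    SentenceDetector, Tokenizer, WordEmbeddingsModel, NerDLModel,
    NerConverter, RegexMatcher,
)

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentences = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokens = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

# Trained path: multilingual embeddings feeding a pretrained German NER model
# that flags persons, locations, and organizations in the note.
embeddings = (WordEmbeddingsModel.pretrained("glove_840B_300", "xx")
              .setInputCols(["sentence", "token"]).setOutputCol("embeddings"))
ner = (NerDLModel.pretrained("wikiner_840B_300", "de")
       .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner"))
ner_chunks = (NerConverter()
              .setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk"))

# Rules path: regex patterns for entity types where rules beat trained models,
# e.g. German date formats (one illustrative rule, format "regex,LABEL").
with open("deid_rules_de.csv", "w") as f:
    f.write(r"\d{2}\.\d{2}\.\d{4},DATE" + "\n")
rules = (RegexMatcher()
         .setInputCols(["sentence"]).setOutputCol("rule_chunk")
         .setExternalRules("deid_rules_de.csv", ",")
         .setStrategy("MATCH_ALL"))

pipeline = Pipeline(stages=[document, sentences, tokens,
                            embeddings, ner, ner_chunks, rules])
model = pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

note = spark.createDataFrame(
    [["Frau Müller, geb. 01.02.1956, wurde am 03.03.2021 in Hamburg behandelt."]]
).toDF("text")
result = model.transform(note)

# Model-detected entities and rule-detected entities, side by side; a masking
# or obfuscation step would consume the union of both chunk columns.
result.selectExpr("ner_chunk.result", "rule_chunk.result").show(truncate=False)
```

In a production setting the two chunk columns would be merged and passed to a masking or obfuscation stage, which is exactly where entity-type-specific choices between rules and trained models pay off.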