Healthcare Organizations Can Now Leverage Larger and More Diverse Datasets to Improve Operations and Care
John Snow Labs, the Healthcare AI and NLP company and developer of the Spark NLP library, today announced improvements to its automatic de-identification solution. The company recently established a new state-of-the-art record on the n2c2 standard de-identification benchmark, achieving an F1 score of 96.1% and decreasing its error rate by 33%. By enabling organizations to automatically de-identify large datasets, John Snow Labs empowers product innovation and cost savings for healthcare organizations worldwide.
John Snow Labs’ automatic de-identification solution provides the custom de-identification that data monetization requires, and it is already proving valuable to users. The service is based on the company’s Spark NLP for Healthcare library, which is built on top of the Spark big data framework and can process millions of records on large Spark or Databricks clusters. The de-identification solution can be delivered as an end-to-end system or as a software library with optional professional services.
“We are using John Snow Labs to de-identify patient notes on a massive scale and the results from the out-of-the-box de-identification models have been remarkable,” said Nadaa Taiyab, Senior Data Scientist, Tegria. “It has been simple to fine-tune models with our own annotated data and improve pipeline results by adding regular expressions and text matching where needed. Overall, the code is very modular and easy to use, making the challenges and complexities of such a large-scale project much easier to navigate.”
Healthcare providers possess vast amounts of unstructured patient-level data. This data has tremendous value, but often remains untapped due to legal and regulatory requirements. Once protected health information (PHI) is removed, however, the data becomes usable and has the potential to create new revenue streams and spark healthcare innovation. Striking the right balance is challenging: stricter de-identification rules lower the risk of re-identification, but they also decrease the usability of the data.
While manual removal of PHI is possible, it is often rife with human error and requires multiple reviews. Additionally, the larger the dataset, the more labor- and cost-intensive the project. Academic literature shows that a team with an average total compensation of $83 per hour, processing 135 notes per hour at an average length of 130 words per note, spends about $0.61 per note. For large datasets consisting of millions of records, this is simply not feasible.
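To make the manual-review arithmetic concrete, the short Python sketch below reproduces the per-note figure from the hourly cost and throughput quoted above, then scales it to a larger dataset. The function names and the one-million-note dataset size are illustrative assumptions, not figures from the cited literature or from John Snow Labs.

```python
# Figures quoted in the text above.
HOURLY_TEAM_COST_USD = 83.0  # average total compensation per hour
NOTES_PER_HOUR = 135         # notes manually de-identified per hour

# Hypothetical dataset size, chosen only for illustration.
ASSUMED_NOTE_COUNT = 1_000_000


def cost_per_note(hourly_cost: float, notes_per_hour: float) -> float:
    """Return the manual de-identification cost of one note, in USD."""
    return hourly_cost / notes_per_hour


def total_cost(num_notes: int, hourly_cost: float, notes_per_hour: float) -> float:
    """Return the manual de-identification cost of num_notes notes, in USD."""
    return num_notes * cost_per_note(hourly_cost, notes_per_hour)


if __name__ == "__main__":
    per_note = cost_per_note(HOURLY_TEAM_COST_USD, NOTES_PER_HOUR)
    overall = total_cost(ASSUMED_NOTE_COUNT, HOURLY_TEAM_COST_USD, NOTES_PER_HOUR)
    team_hours = ASSUMED_NOTE_COUNT / NOTES_PER_HOUR

    print(f"Cost per note: ${per_note:.2f}")
    print(f"Cost for {ASSUMED_NOTE_COUNT:,} notes: ${overall:,.0f}")
    print(f"Team-hours required: {team_hours:,.0f}")
```

Under these assumptions, one million notes would cost roughly $615,000 and take about 7,400 team-hours for a single review pass, before accounting for the multiple reviews that manual removal typically requires.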
“Natural language processing has made it possible to automatically de-identify valuable, but otherwise unusable, unstructured patient-level data, like clinical notes, images, and scanned documents,” said David Talby, CTO, John Snow Labs. “Once de-identified, the datasets can be shared more safely and easily with researchers and builders, ushering in a new generation of accurate and innovative healthcare solutions. Without large-scale automatic data de-identification, this would not be possible at scale.”