Spark NLP For Healthcare 2.7.3 with Biobert Extraction Models, Higher Accuracy, De-Identification, New Radiology NER Model & More

09.02.2021

Veysel Kocaman

Spark NLP for Healthcare 2.7.3 is available now! If you’re an existing customer, expect an email with upgrade instructions.

In this release we introduce the following new models, features, and improvements:

Massive improvements & feature enhancements in De-Identification module
A brand-new RelationExtractionDL Annotator
Drug Normalizer
Confidence Scores in assertion output
Cosine similarity metrics in entity resolvers
AuxLabel in the metadata of entity resolvers
New Relation Extraction models
New Entity Resolver models
New Clinical Pretrained NER model
Bug fixes & general improvements
Matching the version with Spark NLP open-source v2.7.3

Improvements in De-Identification Module

Faker Library

This release introduces the ‘faker’ library in De-Identification Module to automatically generate random data like names, dates, addresses etc. so users do not have to specify dummy data (custom obfuscation files can still be used). It also improves the obfuscation results due to a bigger pool of random values.

How to use

Structured De-Identification Module

This release comes with a new annotator to handle the de-identification of structured data. It allows users to define a mapping of columns and their obfuscation policy. Users can also provide dummy data and map them to columns they want to replace values in.

How to use

Input:

Deidentified:

Introducing SOTA Relation Extraction model using BioBert

We released a brand-new end-to-end trained BERT model, resulting in massive improvements. Another new annotator (ReChunksFilter) is also developed for this new model to allow syntactic features to work well with BioBert to extract relations.

How to use

Drug Normalizer

This release comes with Drug Normalizer to standardize units of drugs and handle abbreviations in raw text or drug chunks identified by any NER model. This normalization significantly improves the performance of entity resolvers.

How to use

Assertion models to support confidence in the output

Just like NER output, now assertion models also provide confidence scores for each prediction.

New Relation Extraction Models

This release introduces 3 new relation extraction models between body parts and clinical entities and 1 new relation extraction model between date and clinical entities. These models are trained using a binary relation extraction approach for better accuracy.

re_bodypart_direction : Relation Extraction between Body Part and Direction entities.

Example:

Text: “MRI demonstrated infarction in the upper brain stem, left cerebellum and right basil ganglia”

re_bodypart_problem : Relation Extraction between Body Part and Problem entities.

Example:

Text: “No neurologic deficits other than some numbness in his left hand.”

re_bodypart_proceduretest : Relation Extraction between Body Part and Procedure, Test entities.

Example:

Text: “TECHNIQUE IN DETAIL: After informed consent was obtained from the patient and his mother, the chest was scanned with portable ultrasound.”

re_date_clinical : Relation Extraction between Date and different clinical entities.

Example:

Text: “This 73 y/o patient had CT on 1/12/95, with progressive memory and cognitive decline since 8/11/94.”

How to use

New Resolver Models using JSL SBERT

sbiobertresolve_icd10cm_augmented
sbiobertresolve_cpt_augmented
sbiobertresolve_cpt_procedures_augmented
sbiobertresolve_icd10cm_augmented_billable_hcc
sbiobertresolve_hcc_augmented

Returning auxiliary columns mapped to resolutions: Chunk entity resolver and sentence entity resolver now returns auxiliary data that is mapped the resolutions during training. This will allow users to get multiple resolutions with a single model without using any other annotator in the pipeline (In order to get billable codes otherwise there need to be other modules in the same pipeline)

We also added a cosine similarity metric in metadata to resolve entities and find the closest matches, resulting in better, more semantically correct results.

sbiobertresolve_icd10cm_augmented_billable_hcc

Example:

Input Text: “bladder cancer”

sbiobertresolve_cpt_augmented

Example:

Input Text: “ct abdomen without contrast”

New Pretrained Clinical NER Model

NER Radiology

Input Text: “Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.”

Veysel Kocaman

Our additional expert:

Veysel is the Chief Technology Officer at John Snow Labs, improving the Spark NLP for the Healthcare library and delivering hands-on projects in Healthcare and Life Science. Holding a PhD degree in ML, Dr. Kocaman has authored more than 25 papers in peer reviewed journals and conferences in the last few years, focusing on solving real world problems in healthcare with NLP. He is a seasoned data scientist with a strong background in every aspect of data science including machine learning, artificial intelligence, and big data with over ten years of experience. Veysel has broad consulting experience in Statistics, Data Science, Software Architecture, DevOps, Machine Learning, and AI to several start-ups, boot camps, and companies around the globe. He also speaks at Data Science & AI events, conferences and workshops, and has delivered more than a hundred talks at international as well as national conferences and meetups.