Simpler & More Accurate Deidentification in Spark NLP for Healthcare

11.06.2021

Veysel Kocaman

Spark NLP for Healthcare 3.1 improves the accuracy, functionality, and ease of use of the library’s data de-identification capabilities, which are crucial for natural language processing in healthcare. All improvements come directly from customer feedback, as the library is used in real-world projects to anonymize millions of medical notes, clinical trial documents, scanned PDF reports & DICOM images. Highlights include:

New Deidentification Named Entity Recognition Models
New column returned in DeidentificationModel
New Re-identification feature
Extended regex dictionary functionality in de-identification
Chunk filtering based on confidence
New de-identification pretrained pipelines

Accuracy: New Deidentification Named Entity Recognition (NER) Models

Four new NER models have been trained to identify PHI (protected health information) that may need to be de-identified. The ner_deid_generic_augmented and ner_deid_subentity_augmented models are trained on a combination of the 2014 i2b2 Deid dataset and in-house annotations, as well as an augmented version of them. Evaluated on the same test set from the 2014 i2b2 Deid dataset, they achieve better accuracy and generalization than the previous models on several entity labels, as summarized in the following tables. We also trained the same models with glove_100d embeddings to provide more memory-friendly versions.

ner_deid_generic_augmented: Detects 7 PHI entities (DATE, NAME, LOCATION, PROFESSION, CONTACT, AGE, ID).

Models Hub Page:

https://nlp.johnsnowlabs.com/2021/06/01/ner_deid_generic_augmented_en.html

entity	ner_deid_large (v3.0.3 and before)	ner_deid_generic_augmented (v3.1.0)
CONTACT	0.8695	0.9592
NAME	0.9452	0.9648
DATE	0.9778	0.9855
LOCATION	0.8755	0.923

ner_deid_subentity_augmented: Detects 23 PHI entities (MEDICALRECORD, ORGANIZATION, DOCTOR, USERNAME, PROFESSION, HEALTHPLAN, URL, CITY, DATE, LOCATION-OTHER, STATE, PATIENT, DEVICE, COUNTRY, ZIP, PHONE, HOSPITAL, EMAIL, IDNUM, STREET, BIOID, FAX, AGE).

Models Hub Page:

https://nlp.johnsnowlabs.com/2021/09/03/ner_deid_subentity_augmented_en.html

entity	ner_deid_enriched (v3.0.3 and before)	ner_deid_subentity_augmented (v3.1.0)
HOSPITAL	0.8519	0.8983
DATE	0.9766	0.9854
CITY	0.7493	0.8075
STREET	0.8902	0.9772
ZIP	0.8	0.9504
PHONE	0.8615	0.9502
DOCTOR	0.9191	0.9347
AGE	0.9416	0.9469
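To make the size of each improvement easier to compare, the short sketch below computes the absolute gain per entity from the scores in the table above. The numbers are copied from the table; the helper name absolute_gains is only illustrative.

```python
# Scores copied from the comparison table above.
# Each tuple is (ner_deid_enriched, ner_deid_subentity_augmented).
scores = {
    "HOSPITAL": (0.8519, 0.8983),
    "DATE": (0.9766, 0.9854),
    "CITY": (0.7493, 0.8075),
    "STREET": (0.8902, 0.9772),
    "ZIP": (0.8, 0.9504),
    "PHONE": (0.8615, 0.9502),
    "DOCTOR": (0.9191, 0.9347),
    "AGE": (0.9416, 0.9469),
}


def absolute_gains(table):
    """Return {entity: new_score - old_score}, rounded to 4 decimals."""
    return {entity: round(new - old, 4) for entity, (old, new) in table.items()}


gains = absolute_gains(scores)

# Print entities from the largest to the smallest gain.
for entity, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{entity:<10} +{gain:.4f}")
```

On these figures, ZIP shows the largest gain (about 0.15) and AGE the smallest (about 0.005).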

ner_deid_generic_glove: Small version of ner_deid_generic_augmented; detects 7 entities.
ner_deid_subentity_glove: Small version of ner_deid_subentity_augmented; detects 23 entities.

Example:

Python

import pandas as pd
from pyspark.ml import Pipeline
from sparknlp_jsl.annotator import MedicalNerModel

# Pretrained PHI NER model; it expects sentence, token and embeddings columns.
deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

...

nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner,
                               ner_converter])

# All stages are pretrained, so the pipeline is fitted on an empty DataFrame.
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13,
David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old,
Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227."""]})))

Results:

Functionality: New column returned in DeidentificationModel

DeidentificationModel can now return a new column that stores the mappings between the masked/obfuscated entities and the original entities. This column is optional; enable it with the .setReturnEntityMappings(True) method (the default value is False). The column name can be changed with .setMappingsColumn("newAlternativeName"). The new column produces annotations with the following structure:

Annotation(
  type: chunk,
  begin: 17,
  end: 25,
  result: 47,
  metadata: {
    originalChunk - 01/13/93 // Original text of the chunk
    chunk - 0 // The number of the chunk in the sentence
    beginOriginalChunk - 95 // Start index of the original chunk
    endOriginalChunk - 102 // End index of the original chunk
    entity - AGE // Entity of the chunk
    sentence - 2 // Number of the sentence
  }
)

Functionality: New Re-identification feature

With the new ReIdentification annotator, the user can recover the original sentences from the mappings column and the de-identified sentences.

Example:

reDeidentification = ReIdentification() \
    .setInputCols(["mappings", "deid_chunks"]) \
    .setOutputCol("original")
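The idea behind re-identification can be illustrated without Spark. The following is a minimal plain-Python sketch, not the library's implementation: it assumes each mapping stores the position of the replacement in the de-identified text together with the original chunk, loosely mirroring the begin, end, result and originalChunk fields of the mappings column. The Mapping class and the deidentify, reidentify and span helpers are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    begin: int           # start offset of the replacement in the de-identified text
    end: int             # inclusive end offset of the replacement
    result: str          # replacement text written into the de-identified text
    original_chunk: str  # original text that was replaced


def span(text, chunk):
    """Return (begin, inclusive end) of the first occurrence of chunk in text."""
    begin = text.index(chunk)
    return begin, begin + len(chunk) - 1


def deidentify(text, replacements):
    """Replace spans and record mappings.

    replacements: (begin, inclusive_end, new_text) tuples on the original text,
    assumed to be sorted by begin and non-overlapping.
    """
    parts, mappings, cursor, length = [], [], 0, 0
    for begin, end, new_text in replacements:
        parts.append(text[cursor:begin])
        length += begin - cursor
        mappings.append(
            Mapping(length, length + len(new_text) - 1, new_text, text[begin:end + 1])
        )
        parts.append(new_text)
        length += len(new_text)
        cursor = end + 1
    parts.append(text[cursor:])
    return "".join(parts), mappings


def reidentify(deid_text, mappings):
    """Rebuild the original text from the de-identified text and its mappings."""
    parts, cursor = [], 0
    for m in sorted(mappings, key=lambda m: m.begin):
        parts.append(deid_text[cursor:m.begin])
        parts.append(m.original_chunk)
        cursor = m.end + 1
    parts.append(deid_text[cursor:])
    return "".join(parts)


original = "Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93"
replacements = [
    (*span(original, "Hendrickson, Ora"), "<PATIENT>"),
    (*span(original, "01/13/93"), "<DATE>"),
]
masked, mappings = deidentify(original, replacements)
restored = reidentify(masked, mappings)
print(masked)
print(restored)
```

Because each mapping keeps both the position of the replacement and the original chunk, the de-identified text alone is not enough to recover the input; the mappings have to be stored alongside it and protected accordingly.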

Functionality: Filtering Entities Based on Confidence

We added a new annotator ChunkFiltererApproach that allows loading a CSV file with both entities and confidence thresholds. This annotator will produce a ChunkFilterer model.

This annotator can be used to filter named entities for de-identification, but also any other type of recognized named entity, as the example below shows.

You can load the dictionary with the setEntitiesConfidenceResource() method.

An example dictionary is:

TREATMENT,0.7

With that dictionary, the user can filter out the chunks of TREATMENT entities whose confidence is lower than 0.7.

Example:

We have a ner_chunk column and sentence column with the following data:

Ner_chunk

|[{chunk, 141, 163, the genomicorganization, {entity - TREATMENT, sentence - 0, chunk - 0, confidence - 0.57785}, []},
{chunk, 209, 267, a candidate gene forType II diabetes mellitus, {entity - PROBLEM, sentence - 0, chunk - 1, confidence - 0.6614286}, []},
{chunk, 394, 408, byapproximately, {entity - TREATMENT, sentence - 1, chunk - 2, confidence - 0.7705}, []},
{chunk, 478, 508, single nucleotide polymorphisms, {entity - TREATMENT, sentence - 2, chunk - 3, confidence - 0.7204666}, []},
{chunk, 559, 581, aVal366Ala substitution, {entity - TREATMENT, sentence - 2, chunk - 4, confidence - 0.61505}, []},
{chunk, 588, 601, an 8 base-pair, {entity - TREATMENT, sentence - 2, chunk - 5, confidence - 0.29226667}, []},
{chunk, 608, 625, insertion/deletion, {entity - PROBLEM, sentence - 3, chunk - 6, confidence - 0.9841}, []}]|

Sentence

[{document, 0, 298, The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying potassium (GIRK) channel family.Here we describe the genomicorganization of the KCNJ9 locus on chromosome 1q21-23 as a candidate gene forType II diabetes mellitus in the Pima Indian population., {sentence - 0}, []},
{document, 300, 460, The gene spansapproximately 7.6 kb and contains one noncoding and two coding exons ,separated byapproximately 2.2 and approximately 2.6 kb introns, respectively., {sentence - 1}, []},
{document, 462, 601, We identified14 single nucleotide polymorphisms (SNPs), including one that predicts aVal366Ala substitution, and an 8 base-pair, {sentence - 2}, []},
{document, 603, 626, (bp) insertion/deletion., {sentence - 3}, []}]

We can filter the entities using the following annotator:

chunker_filter = ChunkFiltererApproach() \
    .setInputCols("sentence", "ner_chunk") \
    .setOutputCol("filtered") \
    .setCriteria("regex") \
    .setRegex([".*"]) \
    .setEntitiesConfidenceResource("entities_confidence.csv")

Where entities_confidence.csv has the following data:

TREATMENT,0.7

PROBLEM,0.9

We can then apply chunker_filter to the data:

chunker_filter.fit(data).transform(data)

Producing the following entities:

|[{chunk, 394, 408, byapproximately, {entity - TREATMENT, sentence - 1, chunk - 2, confidence - 0.7705}, []}, 
{chunk, 478, 508, single nucleotide polymorphisms, {entity - TREATMENT, sentence - 2, chunk - 3, 
confidence - 0.7204666}, []}, {chunk, 608, 625, insertion/deletion, {entity - PROBLEM, sentence - 3, 
chunk - 6, confidence - 0.9841}, []}]|

As you can see, only the treatment entities with a confidence score of more than 0.7, and the problem entities with a confidence score of more than 0.9 have been kept in the output.
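The filtering rule itself can be reproduced in a few lines of plain Python. This is only a sketch of the logic under stated assumptions, not the ChunkFiltererApproach implementation: the helper names load_thresholds and filter_chunks are hypothetical, a chunk is kept when its confidence is greater than or equal to its threshold (the boundary behavior of the annotator is not stated above), and entities without a threshold are kept unchanged. The chunk texts and confidences are taken from the ner_chunk column shown earlier, with whitespace normalized.

```python
import csv
import io


def load_thresholds(csv_text):
    """Parse 'ENTITY,threshold' rows into a {entity: threshold} dict."""
    return {
        row[0].strip(): float(row[1])
        for row in csv.reader(io.StringIO(csv_text))
        if row  # skip blank lines
    }


def filter_chunks(chunks, thresholds):
    """Keep chunks whose confidence reaches the threshold of their entity.

    Entities that have no threshold are kept (an assumption of this sketch).
    """
    return [
        c for c in chunks
        if c["confidence"] >= thresholds.get(c["entity"], 0.0)
    ]


thresholds = load_thresholds("TREATMENT,0.7\n\nPROBLEM,0.9\n")

chunks = [
    {"text": "the genomicorganization", "entity": "TREATMENT", "confidence": 0.57785},
    {"text": "a candidate gene forType II diabetes mellitus", "entity": "PROBLEM", "confidence": 0.6614286},
    {"text": "byapproximately", "entity": "TREATMENT", "confidence": 0.7705},
    {"text": "single nucleotide polymorphisms", "entity": "TREATMENT", "confidence": 0.7204666},
    {"text": "aVal366Ala substitution", "entity": "TREATMENT", "confidence": 0.61505},
    {"text": "an 8 base-pair", "entity": "TREATMENT", "confidence": 0.29226667},
    {"text": "insertion/deletion", "entity": "PROBLEM", "confidence": 0.9841},
]

kept = filter_chunks(chunks, thresholds)
for chunk in kept:
    print(chunk["entity"], chunk["confidence"], chunk["text"])
```

With these thresholds the sketch keeps the same three chunks as the annotator output above: two TREATMENT chunks and one PROBLEM chunk.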

Functionality: Extended Regex Dictionary Context

The RegexPatternsDictionary can now use a regex that spans the two previous tokens and the two next tokens. This feature is implemented using regex groups.

Examples:

Given the sentence "The patient with ssn 123123123", we can use the regex ssn (\d{9}) to capture the entity. Given the sentence "The patient has 12 years", we can use the regex (\d{2}) years to capture the entity.
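The role of the regex group can be checked with Python's standard re module. This sketch only illustrates how the surrounding words act as context while the group isolates the entity text; it does not show the file format or matching engine of RegexPatternsDictionary, and capture_entity is a hypothetical helper.

```python
import re


def capture_entity(pattern, sentence):
    """Return the text of the first capture group, or None if nothing matches."""
    match = re.search(pattern, sentence)
    return match.group(1) if match else None


# The literal "ssn " is context; only the nine digits in the group are captured.
ssn = capture_entity(r"ssn (\d{9})", "The patient with ssn 123123123")

# The literal " years" is context that follows the captured two-digit age.
age = capture_entity(r"(\d{2}) years", "The patient has 12 years")

# Without the expected context word, the same digits are not captured.
no_context = capture_entity(r"ssn (\d{9})", "The id is 123123123")

print(ssn, age, no_context)
```

Anchoring the pattern on neighboring tokens in this way reduces false positives compared with matching bare digit sequences.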

Ease of Use: New Pretrained De-identification Pipelines

We developed a clinical_deidentification pretrained pipeline that can be used to de-identify PHI in medical texts. The PHI will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate the AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION, CITY, COUNTRY, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, STREET, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, and IPADDR entities.

Models Hub Page: https://nlp.johnsnowlabs.com/2021/05/27/clinical_deidentification_en.html

There is also a lightweight version of the same pipeline trained with memory-efficient glove_100d embeddings. Here are the model names:

clinical_deidentification
clinical_deidentification_glove

Example:

Python:

from sparknlp.pretrained import PretrainedPipeline

# Download the pretrained de-identification pipeline from the clinical models repository.
deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

deid_pipeline.annotate("""Record date : 2093-01-13, David Hale, M.D. IP: 203.120.223.13.
The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson,
Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09,
Patient's VIN : 1HGBH41JXMN109286.""")

Result:

{'sentence': ['Record date : 2093-01-13, David Hale, M.D.',
  'IP: 203.120.223.13.',
  "The driver's license no:A334455B.",
  'the SSN:324598674 and e-mail: hale@gmail.com.',
  'Name :Hendrickson, Ora MR. #719435 Date : 01/13/93.',
  'PCP : Oliveira, 25 years-old.',
  "Record date : 2079-11-09, Patient's VIN :1HGBH41JXMN109286."],
 'masked': ['Record date :<DATE>, <DOCTOR>, M.D.',
  'IP: <IPADDR>.',
  "The driver's license <DLN>.",
  'the <SSN> and e-mail: <EMAIL>.',
  'Name : <PATIENT> MR. # <MEDICALRECORD> Date : <DATE>.',
  'PCP : <DOCTOR>, <AGE> years-old.',
  "Record date : <DATE>, Patient's VIN :<VIN>."],
 'obfuscated': ['Record date :2093-01-18, Dr Alveria Eden, M.D.',
  'IP: 001.001.001.001.',
  "The driver's license K783518004444.",
  'the SSN-400-50-8849 and e-mail: Merilynn@hotmail.com.',
  'Name : Charls Danger MR. # J3366417 Date : 01-18-1974.',
  'PCP : Dr Sina Sewer, 55 years-old.',
  "Record date : 2079-11-23, Patient's VIN :6ffff55gggg666777."],
 'ner_chunk': ['2093-01-13',
  'David Hale',
  'no:A334455B',
  'SSN:324598674',
  'Hendrickson, Ora',
  '719435',
  '01/13/93',
  'Oliveira',
  '25',
  '2079-11-09',
  '1HGBH41JXMN109286']}
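Since annotate() returns a plain dictionary of lists, the output can be post-processed with ordinary Python. The sketch below pairs each input sentence with its obfuscated counterpart for a quick visual check. It uses a two-sentence excerpt of the result above and assumes the lists are aligned by index, as they are in that result; side_by_side is a hypothetical helper.

```python
# Excerpt of the annotate() result shown above (two of the seven sentences).
result = {
    "sentence": ["IP: 203.120.223.13.", "PCP : Oliveira, 25 years-old."],
    "obfuscated": ["IP: 001.001.001.001.", "PCP : Dr Sina Sewer, 55 years-old."],
}


def side_by_side(result, left="sentence", right="obfuscated"):
    """Pair each original sentence with its de-identified counterpart."""
    if len(result[left]) != len(result[right]):
        raise ValueError("columns are not aligned sentence by sentence")
    return list(zip(result[left], result[right]))


pairs = side_by_side(result)
for original, deidentified in pairs:
    print(f"{original!r} -> {deidentified!r}")
```

The same helper can compare the sentence and masked lists by passing right="masked" on a full result.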

Get Started

Live demos & Python notebooks of medical data de-identification tools
Start a free trial of Spark NLP for Healthcare
How to build a deidentification pipeline from scratch

Veysel Kocaman

Our additional expert:

Veysel is the Chief Technology Officer at John Snow Labs, improving the Spark NLP for Healthcare library and delivering hands-on projects in Healthcare and Life Science. Holding a PhD degree in ML, Dr. Kocaman has authored more than 25 papers in peer-reviewed journals and conferences in the last few years, focusing on solving real-world problems in healthcare with NLP. He is a seasoned data scientist with over ten years of experience and a strong background in every aspect of data science, including machine learning, artificial intelligence, and big data. Veysel has broad experience consulting in Statistics, Data Science, Software Architecture, DevOps, Machine Learning, and AI for several start-ups, boot camps, and companies around the globe. He also speaks at Data Science & AI events, conferences and workshops, and has delivered more than a hundred talks at international as well as national conferences and meetups.
