was successfully added to your cart.

Simpler & More Accurate Deidentification in Spark NLP for Healthcare

Spark NLP for Healthcare 3.1 improves the accuracy, functionality, and ease of use of the library’s data de-identification capabilities. All improvements come directly from customer feedback, as the library is being used in real-world projects to anonymize millions of medical notes, clinical trial documents, scanned PDF reports & DICOM images. Highlights include: 

  • New Deidentification Named Entity Recognition Models 
  • New column returned in DeidentificationModel 
  • New Re-identification feature 
  • Extended regex dictionary fuctionality in de-identification 
  • Chunk filtering based on confidence 
  • New de-identification pretrained pipelines 



Accuracy: New Deidentification Named Entity Recognition (NER) Models 

Four new NER models have been trained to identity PHI (protected health information) data that may need to be deidentified. ner_deid_generic_augmented and ner_deid_subentity_augmented models are trained with a combination of the 2014 i2b2 Deid dataset and in-house annotations as well as an augmented version of them. Compared to the same test set coming from the 2014 i2b2 Deid dataset, we achieved better accuracy and generalization on several entity labels as summarized in the following tables. We also trained the same models with glove_100d embeddings to provide more memory-friendly versions

  • ner_deid_generic_augmented : Detects PHI 7 entities

Models Hub Page:https://nlp.johnsnowlabs.com/2021/06/01/ner_deid_generic_augmented_en.html


entity  ner_deid_large (v3.0.3 and before)  ner_deid_generic_augmented (v3.1.0) 










LOCATION  0.8755 


  • ner_deid_subentity_augmented: Detects PHI 23 entities

Models Hub Page:



ner_deid_enriched (v3.0.3 and before)  ner_deid_subentity_augmented (v3.1.0) 
HOSPITAL  0.8519 











ZIP  0.8 


PHONE  0.8615 


DOCTOR  0.9191 


AGE  0.9416 


  • ner_deid_generic_glove: Small version of ner_deid_generic_augmented and detects 7 entities. 
  • ner_deid_subentity_glove: Small version of ner_deid_subentity_augmented and detects 23 entities.





deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \ 

      .setInputCols(["sentence", "token", "embeddings"]) \ 



nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner, 

model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text")) 

results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, 
David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, 
Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227."""]}))) 


Functionality: New column returned in DeidentificationModel 

DeidentificationModel now can return a new column to save the mappings between the mask/obfuscated entities and original entities. This column is optional and you can set it up with the .setReturnEntityMappings(True) method. The default value is False. Also, the name for the column can be changed using the following method; .setMappingsColumn(“newAlternativeName”) The new column will produce annotations with the following structure, 


  type: chunk,  

  begin: 17,  

  end: 25,  

  result: 47, 


        originalChunk -> 01/13/93  //Original text of the chunk 

        chunk -> 0  // The number of the chunk in the sentence 

        beginOriginalChunk -> 95 // Start index of the original chunk 

        endOriginalChunk -> 102  // End index of the original chunk 

        entity -> AGE // Entity of the chunk 

        sentence -> 2 // Number of the sentence 



Functionality: New Re-identification feature 

With the new ReidetificationModel, the user can go back to the original sentences using the mappings columns and the deidentification sentences. 


reDeidentification = ReIdentification() 

Functionality: Filtering Entities Based on Confidence 

We added a new annotator ChunkFiltererApproach that allows loading a CSV file with both entities and confidence thresholds. This annotator will produce a ChunkFilterer model. 

This annotator can be used to filter named entity for de-identification – but also any other type of recognized named entity, as the example below shows. 

You can load the dictionary with the following property setEntitiesConfidenceResource(). 

An example dictionary is: 


With that dictionary, the user can filter the chunks corresponding to treatment entities which have confidence lower than 0.7. 


We have a ner_chunk column and sentence column with the following data: 


|[{chunk, 141, 163, the genomicorganization, {entity -> TREATMENT, sentence -> 0, chunk -> 0, confidence -> 
0.57785}, []}, {chunk, 209, 267, a candidate gene forType II  

           diabetes mellitus, {entity -> PROBLEM, sentence -> 0, chunk -> 1, confidence -> 0.6614286}, []}, 
{chunk, 394, 408, byapproximately, {entity -> TREATMENT, sentence -> 1, chunk -> 2, confidence -> 0.7705}, []}, 
{chunk, 478, 508, single nucleotide polymorphisms, {entity -> TREATMENT, sentence -> 2, chunk -> 3, 
confidence -> 0.7204666}, []}, {chunk, 559, 581, aVal366Ala substitution, {entity -> TREATMENT, sentence -> 
2, chunk -> 4, confidence -> 0.61505}, []}, {chunk, 588, 601, an 8 base-pair, {entity -> TREATMENT, sentence -> 
2, chunk -> 5, confidence -> 0.29226667}, []}, {chunk, 608, 625, insertion/deletion, {entity -> PROBLEM, 
sentence -> 3, chunk -> 6, confidence -> 0.9841}, []}]| 



[{document, 0, 298, The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying 
potassium (GIRK) channel family.Here we describe the genomicorganization of the KCNJ9 locus on chromosome 
1q21-23 as a candidate gene forType II 
diabetes mellitus in the Pima Indian population., {sentence -> 0}, []}, {document, 300, 460, The 
gene spansapproximately 7.6 kb and contains one noncoding and two coding exons ,separated byapproximately 2.2 
and approximately 2.6 kb introns, respectively., {sentence -> 1}, []}, {document, 462, 601, We identified14 
single nucleotide polymorphisms (SNPs), 
             including one that predicts aVal366Ala substitution, and an 8 base-pair, {sentence -> 2}, []}, 
{document, 603, 626, (bp) insertion/deletion., {sentence -> 3}, []}] 

We can filter the entities using the following annotator: 

chunker_filter = ChunkFiltererApproach().setInputCols("sentence", "ner_chunk") \ 

            .setOutputCol("filtered") \ 

            .setCriteria("regex") \ 

            .setRegex([".*"]) \          


Where entities-confidence.csv has the following data: 



We can use that chunk_filter: 


Producing the following entities: 

|[{chunk, 394, 408, byapproximately, {entity -> TREATMENT, sentence -> 1, chunk -> 2, confidence -> 0.7705}, []}, 
{chunk, 478, 508, single nucleotide polymorphisms, {entity -> TREATMENT, sentence -> 2, chunk -> 3, 
confidence -> 0.7204666}, []}, {chunk, 608, 625, insertion/deletion, {entity -> PROBLEM, sentence -> 3, 
chunk -> 6, confidence -> 0.9841}, []}]| 


As you can see, only the treatment entities with a confidence score of more than 0.7, and the problem entities with a confidence score of more than 0.9 have been kept in the output. 



Functionality: Extended Regex Dictionary Context 

The RegexPatternsDictionary can now use a regex that spawns the 2 previous token and the 2 next tokens. That feature is implemented using regex groups. 


Given the sentence  The patient with ssn 123123123  we can use the following regex to capture the entitty ssn (\d{9}) . Given the sentence  The patient has 12 years we can use the following regex to capture the entitty (\d{2}) years 



Ease of Use: New Pretrained De-identification Pipelines 

We developed a clinical_deidentification pretrained pipeline that can be used to deidentify PHI information from medical texts. The PHI information will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate  AGE,  CONTACT,  DATE,  ID,  LOCATION,  NAME,  PROFESSION,  CITY,  COUNTRY,  DOCTOR,  HOSPITAL,  IDNUM,  MEDICALRECORD,  ORGANIZATION,  PATIENT,  PHONE,  PROFESSION,  STREET,  USERNAME,  ZIP,  ACCOUNT,  LICENSE,  VIN,  SSN,  DLN,  PLATE,  IPADDR  entities. 

Models Hub Page:  https://nlp.johnsnowlabs.com/2021/05/27/clinical_deidentification_en.html 

There is also a lightweight version of the same pipeline trained with memory efficient glove_100dembeddings. Here are the model names: 

  • clinical_deidentification 
  • clinical_deidentification_glove 



from sparknlp.pretrained import PretrainedPipeline deid_pipeline = 
PretrainedPipeline("clinical_deidentification", "en", "clinical/models")    
deid_pipeline.annotate("Record date : 2093-01-13, David Hale, M.D. IP: 
The driver's license no:A334455B. the SSN:324598674 and e-mail: hale@gmail.com. Name : Hendrickson, 
Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, 
Patient's VIN : 1HGBH41JXMN109286.") 


{'sentence': ['Record date : 2093-01-13, David Hale, M.D.', 


   'The driver's license no:A334455B.', 

   'the SSN:324598674 and e-mail: hale@gmail.com.', 

   'Name : Hendrickson, Ora MR. # 719435 Date : 01/13/93.', 

   'PCP : Oliveira, 25 years-old.', 

   'Record date : 2079-11-09, Patient's VIN : 1HGBH41JXMN109286.'], 

'masked': ['Record date : <DATE>, <DOCTOR>, M.D.', 

   'IP: <IPADDR>.', 

   'The driver's license <DLN>.', 

   'the <SSN> and e-mail: <EMAIL>.', 

   'Name : <PATIENT> MR. # <MEDICALRECORD> Date : <DATE>.', 

   'PCP : <DOCTOR>, <AGE> years-old.', 

   'Record date : <DATE>, Patient's VIN : <VIN>.'], 

'obfuscated': ['Record date : 2093-01-18, Dr Alveria Eden, M.D.', 


   'The driver's license K783518004444.', 

   'the SSN-400-50-8849 and e-mail: Merilynn@hotmail.com.', 

   'Name : Charls Danger MR. # J3366417 Date : 01-18-1974.', 

   'PCP : Dr Sina Sewer, 55 years-old.', 

   'Record date : 2079-11-23, Patient's VIN : 6ffff55gggg666777.'], 

'ner_chunk': ['2093-01-13', 

   'David Hale', 



   'Hendrickson, Ora', 







Get Started 

High Accuracy Resolution of Medical Entities to Standard Codes Using Novel Sentence Embeddings

The release of Spark NLP for Healthcare 3.1 brings significant speed and accuracy improvements for the task of entity resolution, also known...