Simpler & More Accurate Deidentification in Spark NLP for Healthcare

Spark NLP for Healthcare 3.1 improves the accuracy, functionality, and ease of use of the library’s data de-identification capabilities, which are crucial for natural language processing in healthcare. All improvements come directly from customer feedback, as the library is being used in real-world projects to anonymize millions of medical notes, clinical trial documents, scanned PDF reports & DICOM images. Highlights include:

  • New Deidentification Named Entity Recognition Models
  • New column returned in DeidentificationModel
  • New Re-identification feature
  • Extended regex dictionary functionality in de-identification
  • Chunk filtering based on confidence
  • New de-identification pretrained pipelines

Accuracy: New Deidentification Named Entity Recognition (NER) Models

Four new NER models have been trained to identify PHI (protected health information) data that may need to be de-identified. The ner_deid_generic_augmented and ner_deid_subentity_augmented models are trained on a combination of the 2014 i2b2 Deid dataset and in-house annotations, as well as an augmented version of them. Evaluated on the same test set from the 2014 i2b2 Deid dataset, they achieve better accuracy and generalization on several entity labels, as summarized in the following tables. We also trained the same models with glove_100d embeddings to provide more memory-friendly versions.

  • ner_deid_generic_augmented: Detects 7 PHI entities.

Models Hub Page:

entity ner_deid_large (v3.0.3 and before) ner_deid_generic_augmented (v3.1.0)

Models Hub Page:


ner_deid_enriched (v3.0.3 and before) ner_deid_subentity_augmented (v3.1.0)
ZIP 0.8
PHONE 0.8615
DOCTOR 0.9191
AGE 0.9416

  • ner_deid_generic_glove: Small version of ner_deid_generic_augmented; detects 7 entities.
  • ner_deid_subentity_glove: Small version of ner_deid_subentity_augmented; detects 23 entities.



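import pandas as pd
from pyspark.ml import Pipeline
from sparknlp_jsl.annotator import MedicalNerModel

# The upstream stages (document_assembler, sentence_detector, tokenizer,
# word_embeddings) and the Spark session `spark` are assumed to be defined earlier.
deid_ner = MedicalNerModel.pretrained("ner_deid_subentity_augmented", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")  # output column name is an assumption

nlpPipeline = Pipeline(stages=[document_assembler, sentence_detector, tokenizer, word_embeddings, deid_ner])

# Fit on an empty DataFrame to obtain a PipelineModel.
model = nlpPipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

results = model.transform(spark.createDataFrame(pd.DataFrame({"text": ["""A. Record date : 2093-01-13, 
David Hale, M.D., Name : Hendrickson, Ora MR. # 7194334 Date : 01/13/93 PCP : Oliveira, 25 -year-old, 
Record date : 1-11-2000. Cocke County Baptist Hospital. 0295 Keats Street. Phone +1 (302) 786-5227."""]})))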


Functionality: New column returned in DeidentificationModel

DeidentificationModel can now return a new column that stores the mappings between the masked or obfuscated entities and the original entities. This column is optional: enable it with the .setReturnEntityMappings(True) method (the default value is False). The column name can be changed with .setMappingsColumn("newAlternativeName"). The new column produces annotations with the following structure:


type: chunk,
begin: 17,
end: 25,
result: 47,
metadata:
    originalChunk - 01/13/93 // Original text of the chunk
    chunk - 0 // The number of the chunk in the sentence
    beginOriginalChunk - 95 // Start index of the original chunk
    endOriginalChunk - 102 // End index of the original chunk
    entity - AGE // Entity of the chunk
    sentence - 2 // Number of the sentence
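
To make the shape of these mappings concrete, here is a minimal pure-Python sketch. It is an illustration only, not how DeidentificationModel is implemented: the function name mask_with_mappings, the `<LABEL>` mask format, and the input tuple layout are assumptions made for this example. It replaces each detected span with a mask and records where the replacement sits and what it replaced.

```python
def mask_with_mappings(text, entities):
    """Replace each (begin, end, label) span with <LABEL> and record a mapping.

    `entities` holds inclusive character offsets into `text`, sorted by begin.
    Hypothetical helper for illustration; not part of Spark NLP.
    """
    pieces, mappings = [], []
    cursor = 0   # next unread position in the original text
    out_len = 0  # length of the masked text built so far
    for chunk_id, (begin, end, label) in enumerate(entities):
        before = text[cursor:begin]
        mask = f"<{label}>"
        pieces.extend([before, mask])
        new_begin = out_len + len(before)
        mappings.append({
            "begin": new_begin,                 # span of the mask in the masked text
            "end": new_begin + len(mask) - 1,
            "result": mask,
            "originalChunk": text[begin:end + 1],
            "chunk": chunk_id,
            "beginOriginalChunk": begin,        # span in the original text
            "endOriginalChunk": end,
            "entity": label,
        })
        out_len = new_begin + len(mask)
        cursor = end + 1
    pieces.append(text[cursor:])
    return "".join(pieces), mappings


text = "Date : 01/13/93 PCP : Oliveira"
masked, mappings = mask_with_mappings(text, [(7, 14, "DATE"), (22, 29, "DOCTOR")])
print(masked)  # Date : <DATE> PCP : <DOCTOR>
```

Keeping both the new span and the original span in each record is what makes the replacement reversible later.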



Functionality: New Re-identification feature

With the new ReIdentification annotator, the user can go back to the original sentences using the mappings column and the de-identified sentences.


reDeidentification = ReIdentification()
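
The idea behind re-identification can be sketched in plain Python. This is an assumption-laden illustration, not the ReIdentification annotator itself: the function reidentify and the reduced mapping records (only "begin", "end", and "originalChunk") are hypothetical. Given de-identified text and the stored mappings, each replacement span is swapped back for the original chunk.

```python
def reidentify(deid_text, mappings):
    """Restore original chunks in `deid_text` using mapping records.

    Each record gives the inclusive span of the replacement in the
    de-identified text ("begin", "end") and the text it replaced
    ("originalChunk"). Spans are applied right to left so that earlier
    offsets stay valid while the string changes length.
    Hypothetical helper for illustration; not part of Spark NLP.
    """
    restored = deid_text
    for m in sorted(mappings, key=lambda m: m["begin"], reverse=True):
        restored = restored[:m["begin"]] + m["originalChunk"] + restored[m["end"] + 1:]
    return restored


deid_text = "Date : <DATE> PCP : <DOCTOR>"
mappings = [
    {"begin": 7, "end": 12, "originalChunk": "01/13/93"},
    {"begin": 20, "end": 27, "originalChunk": "Oliveira"},
]
print(reidentify(deid_text, mappings))  # Date : 01/13/93 PCP : Oliveira
```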

Functionality: Filtering Entities Based on Confidence

We added a new annotator ChunkFiltererApproach that allows loading a CSV file with both entities and confidence thresholds. This annotator will produce a ChunkFilterer model.

This annotator can be used to filter named entities for de-identification, but also any other type of recognized named entity, as the example below shows.

You can load the dictionary with the setEntitiesConfidenceResource() method.

An example dictionary is:

TREATMENT,0.7

With that dictionary, the user can filter the chunks corresponding to treatment entities which have confidence lower than 0.7.


We have a ner_chunk column and sentence column with the following data:


|[{chunk, 141, 163, the genomicorganization, {entity - TREATMENT, sentence - 0, chunk - 0, confidence - 
	0.57785}, []}, {chunk, 209, 267, a candidate gene forType II
		diabetes mellitus, {entity - PROBLEM, sentence - 0, chunk - 1, confidence - 0.6614286}, []}, 
	{chunk, 394, 408, byapproximately, {entity - TREATMENT, sentence - 1, chunk - 2, confidence - 0.7705}, []}, 
	{chunk, 478, 508, single nucleotide polymorphisms, {entity - TREATMENT, sentence - 2, chunk - 3, 
	confidence - 0.7204666}, []}, {chunk, 559, 581, aVal366Ala substitution, {entity - TREATMENT, sentence - 
	2, chunk - 4, confidence - 0.61505}, []}, {chunk, 588, 601, an 8 base-pair, {entity - TREATMENT, sentence - 
	2, chunk - 5, confidence - 0.29226667}, []}, {chunk, 608, 625, insertion/deletion, {entity - PROBLEM, 
	sentence - 3, chunk - 6, confidence - 0.9841}, []}]|


[{document, 0, 298, The human KCNJ9 (Kir 3.3, GIRK3) is a member of the G-protein-activated inwardly rectifying 
	potassium (GIRK) channel family.Here we describe the genomicorganization of the KCNJ9 locus on chromosome 
	1q21-23 as a candidate gene forType II
	diabetes mellitus in the Pima Indian population., {sentence - 0}, []}, {document, 300, 460, The 
	gene spansapproximately 7.6 kb and contains one noncoding and two coding exons ,separated byapproximately 2.2 
	and approximately 2.6 kb introns, respectively., {sentence - 1}, []}, {document, 462, 601, We identified14 
	single nucleotide polymorphisms (SNPs), 
		including one that predicts aVal366Ala substitution, and an 8 base-pair, {sentence - 2}, []}, 
	{document, 603, 626, (bp) insertion/deletion., {sentence - 3}, []}]

We can filter the entities using the following annotator:

chunker_filter = ChunkFiltererApproach().setInputCols("sentence", "ner_chunk") \
    .setOutputCol("filtered") \
    .setCriteria("regex") \
    .setRegex([".*"]) \
    .setEntitiesConfidenceResource("entities-confidence.csv")


Where entities-confidence.csv has the following data:

TREATMENT,0.7
PROBLEM,0.9

Fitting chunker_filter and transforming the data produces the following entities:

|[{chunk, 394, 408, byapproximately, {entity - TREATMENT, sentence - 1, chunk - 2, confidence - 0.7705}, []}, 
{chunk, 478, 508, single nucleotide polymorphisms, {entity - TREATMENT, sentence - 2, chunk - 3, 
confidence - 0.7204666}, []}, {chunk, 608, 625, insertion/deletion, {entity - PROBLEM, sentence - 3, 
chunk - 6, confidence - 0.9841}, []}]|

As you can see, only the treatment entities with a confidence score of more than 0.7, and the problem entities with a confidence score of more than 0.9 have been kept in the output.
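
The thresholding rule described above can be reproduced in a few lines of plain Python. This is a sketch of the rule only, not the ChunkFilterer implementation: the helper names, the "ENTITY,threshold" CSV layout, and the choice to keep a chunk whose confidence equals the threshold are assumptions. The chunk data is copied from the ner_chunk column shown earlier.

```python
import csv
import io


def load_thresholds(csv_text):
    """Parse 'ENTITY,threshold' rows (assumed layout) into a dict of floats."""
    return {row[0].strip(): float(row[1])
            for row in csv.reader(io.StringIO(csv_text)) if row}


def filter_chunks(chunks, thresholds):
    """Drop chunks whose confidence is below the threshold for their entity.

    Entities without a threshold are kept unconditionally.
    """
    return [c for c in chunks
            if c["confidence"] >= thresholds.get(c["entity"], 0.0)]


thresholds = load_thresholds("TREATMENT,0.7\nPROBLEM,0.9\n")

chunks = [
    {"chunk": 0, "entity": "TREATMENT", "confidence": 0.57785},
    {"chunk": 1, "entity": "PROBLEM", "confidence": 0.6614286},
    {"chunk": 2, "entity": "TREATMENT", "confidence": 0.7705},
    {"chunk": 3, "entity": "TREATMENT", "confidence": 0.7204666},
    {"chunk": 4, "entity": "TREATMENT", "confidence": 0.61505},
    {"chunk": 5, "entity": "TREATMENT", "confidence": 0.29226667},
    {"chunk": 6, "entity": "PROBLEM", "confidence": 0.9841},
]

kept = filter_chunks(chunks, thresholds)
print([c["chunk"] for c in kept])  # [2, 3, 6]
```

Chunks 2, 3, and 6 survive, matching the filtered output listed above.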

Functionality: Extended Regex Dictionary Context

The RegexPatternsDictionary can now use a regex that spans the two previous tokens and the two next tokens. This feature is implemented using regex groups.


Given the sentence "The patient with ssn 123123123", we can use the regex ssn (\d{9}) to capture the entity. Given the sentence "The patient has 12 years", we can use the regex (\d{2}) years to capture the entity.
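
The role of the capture group can be checked with Python's standard re module. This only demonstrates the two patterns quoted above; Python's regex engine stands in for the library's, the helper capture_entity is hypothetical, and how the library selects the surrounding tokens is not modeled here. The context words anchor the match, while the group isolates the entity text.

```python
import re


def capture_entity(pattern, sentence):
    """Return the text of the first capture group, or None if nothing matches.

    The surrounding context (e.g. "ssn " or " years") must match too,
    but only the parenthesised group is returned as the entity.
    """
    match = re.search(pattern, sentence)
    return match.group(1) if match else None


print(capture_entity(r"ssn (\d{9})", "The patient with ssn 123123123"))  # 123123123
print(capture_entity(r"(\d{2}) years", "The patient has 12 years"))      # 12
```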

Ease of Use: New Pretrained De-identification Pipelines

We developed a clinical_deidentification pretrained pipeline that can be used to de-identify PHI in medical texts. The PHI will be masked and obfuscated in the resulting text. The pipeline can mask and obfuscate AGE, CONTACT, DATE, ID, LOCATION, NAME, PROFESSION, CITY, COUNTRY, DOCTOR, HOSPITAL, IDNUM, MEDICALRECORD, ORGANIZATION, PATIENT, PHONE, STREET, USERNAME, ZIP, ACCOUNT, LICENSE, VIN, SSN, DLN, PLATE, IPADDR entities.

Models Hub Page:

There is also a lightweight version of the same pipeline trained with memory-efficient glove_100d embeddings. Here are the model names:

  • clinical_deidentification
  • clinical_deidentification_glove



from sparknlp.pretrained import PretrainedPipeline

deid_pipeline = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

# Adjacent string literals are concatenated into one input text.
deid_pipeline.annotate(
    "Record date : 2093-01-13, David Hale, M.D. IP: "
    "The driver's license no:A334455B. the SSN:324598674 and e-mail: Name : Hendrickson, "
    "Ora MR. # 719435 Date : 01/13/93. PCP : Oliveira, 25 years-old. Record date : 2079-11-09, "
    "Patient's VIN : 1HGBH41JXMN109286."
)


{'sentence': ['Record date : 2093-01-13, David Hale, M.D.',


'The driver's license no:A334455B.',

'the SSN:324598674 and e-mail:',

'Name :Hendrickson, Ora MR. #719435 Date : 01/13/93.',

'PCP : Oliveira, 25 years-old.',

'Record date : 2079-11-09, Patient's VIN :1HGBH41JXMN109286.'],

'masked': ['Record date :<DATE>, <DOCTOR>, M.D.',


'The driver's license <DLN>.',

'the <SSN> and e-mail: <EMAIL>.',


'PCP : <DOCTOR>, <AGE> years-old.',

'Record date : <DATE>, Patient's VIN :<VIN>.'],

'obfuscated': ['Record date :2093-01-18, Dr Alveria Eden, M.D.',


'The driver's license K783518004444.',

'the SSN-400-50-8849 and e-mail:',

'Name : Charls Danger MR. # J3366417 Date : 01-18-1974.',

'PCP : Dr Sina Sewer, 55 years-old.',

'Record date : 2079-11-23, Patient's VIN :6ffff55gggg666777.'],

'ner_chunk': ['2093-01-13',

'David Hale',



'Hendrickson, Ora',






