Effortless PHI De-Identification: Running De-identification and Obfuscation in Healthcare NLP

PHI De-Identification with State-of-the-Art NLP

De-identification in healthcare NLP is a critical step for safeguarding Protected Health Information (PHI) in clinical notes: the data is anonymized or obfuscated by replacing real entities with fake ones.

Multi-Mode De-identification With Healthcare NLP

We highlighted the significance of sharing clinical data in compliance with HIPAA Privacy Rules and explored the functionalities available within the Healthcare NLP library for accomplishing this objective in the blog post Format Consistency for Entity Obfuscation in De-Identification with Spark NLP.

This post covers a new feature released in the latest version, v4.3.2, of Healthcare NLP (medical natural language processing): the ability to apply different obfuscation and masking policies to different PHI entities in one pass.

Why Do We Need The Application of Multiple De-identification Policies At Once

There are several de-identification policies for replacing PHI data in Healthcare NLP. Here is a list of these options:

  • obfuscate: Replace the values with randomly generated fake ones, e.g. John Snow -> Michael Willian.
  • mask_same_length_chars: Replace the value with asterisks enclosed in square brackets while keeping the original length (length minus two asterisks, plus one bracket on each end), e.g. John Snow -> [*******].
  • mask_entity_labels: Replace the values with the entity labels, e.g. John Snow -> <NAME>.
  • mask_fixed_length_chars: Replace the value with a fixed number of asterisks, e.g. John Snow -> ****. The length can be set with setFixedMaskLength().
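
To make the four policies concrete, here is a minimal pure-Python sketch of what each one does to a single value. The function names simply mirror the policy names; they are hypothetical stand-ins written for illustration, not functions from the Healthcare NLP library, and the pool of fake names and the handling of very short values are assumptions.

```python
import random

# Hypothetical stand-ins for the four policies; NOT library functions.
# Illustrative pool of fake values (assumption); the library generates its own.
FAKE_NAMES = ["Michael Willian", "Jane Doe"]

def obfuscate(value, rng=random):
    """Replace the value with a randomly chosen fake one."""
    return rng.choice(FAKE_NAMES)

def mask_same_length_chars(value):
    """Brackets plus asterisks, with the same total length as the input.

    Assumption: values shorter than 3 characters become asterisks only.
    """
    if len(value) < 3:
        return "*" * len(value)
    return "[" + "*" * (len(value) - 2) + "]"

def mask_entity_labels(label):
    """Replace the value with its entity label in angle brackets."""
    return "<" + label.upper() + ">"

def mask_fixed_length_chars(fixed_length=4):
    """Replace the value with a fixed number of asterisks."""
    return "*" * fixed_length

print(mask_same_length_chars("John Snow"))  # [*******]
print(mask_entity_labels("NAME"))           # <NAME>
print(mask_fixed_length_chars(4))           # ****
```

Note that only mask_same_length_chars depends on the length of the original value, which is why it preserves the layout of a note while the fixed-length mask hides even the length.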

The Deidentification annotator is the core tool in Healthcare NLP for carrying out de-identification tasks. Given the detected entities and the desired mode, either obfuscate or mask, set through the setMode() parameter of this annotator, it produces de-identified output.

Previously, when a de-identified output required a combination of policies, such as obfuscating NAME and LOCATION entities while masking DATE entities with same-length characters, we had to define a separate Deidentification annotator in the same pipeline for each entity label-policy pairing and then post-process the results. Although the requirement itself was simple, this made the pipeline more complex, and the added post-processing steps disrupted the process flow.

With the release of Spark NLP for Healthcare version 4.3.2, it is now possible to apply multiple de-identification policies to different PHI entities simultaneously, within a single Deidentification annotator.

Implementing Multi-Mode Functionality in De-identification

We enhanced the Deidentification annotator by adding a new setSelectiveObfuscationModes() parameter, which requires a JSON file containing a user-defined dictionary of the policies to apply to each label. Entities whose labels are not listed in the JSON file are de-identified according to setMode() by default. The parameter also makes it possible to skip entities that we don’t want to de-identify.

Let’s assume that we are working on a de-identification task and want to apply this policy combination to the document:

  • obfuscate PHONE entities
  • mask ID entities with entity labels
  • mask NAME entities with same length chars
  • mask ZIP and LOCATION entities with fixed-length chars
  • DO NOT de-identify (skip) DATE entities

After creating a NER pipeline that can detect all these entities, we define a dictionary with these policy-label pairs and save it as a JSON file. Label casing does not matter when creating the dictionary: the feature is case-insensitive for labels, so Zip and ZIP return the same results.

import json

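# Each key is a policy; each value lists the entity labels it applies to
sample_json = {
    "obfuscate": ["PHONE"],
    "mask_entity_labels": ["ID"],
    "mask_same_length_chars": ["NAME"],
    "skip": ["DATE"],
    "mask_fixed_length_chars": ["zip", "location"]
}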

with open('multi_mode.json', 'w', encoding='utf-8') as f:
    json.dump(sample_json, f, ensure_ascii=False, indent=4)
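
The lookup behavior described above (case-insensitive labels, a default mode for unlisted labels, and a skip policy) can be sketched in plain Python. This is only an illustration of the described behavior, not the library's internal implementation; build_label_policy_map and resolve_policy are hypothetical helper names, and rejecting a label that appears under two policies is a choice made for this sketch.

```python
def build_label_policy_map(modes):
    """Invert {policy: [labels]} into {LABEL: policy}, upper-casing labels."""
    label_to_policy = {}
    for policy, labels in modes.items():
        for label in labels:
            key = label.upper()
            previous = label_to_policy.get(key)
            # Sketch-only safeguard: a label should map to exactly one policy.
            if previous is not None and previous != policy:
                raise ValueError(f"{key} is assigned to both {previous} and {policy}")
            label_to_policy[key] = policy
    return label_to_policy

def resolve_policy(label, label_to_policy, default_mode="obfuscate"):
    """Return the policy for a label, falling back to the default mode."""
    return label_to_policy.get(label.upper(), default_mode)

modes = {
    "obfuscate": ["PHONE"],
    "mask_entity_labels": ["ID"],
    "mask_same_length_chars": ["NAME"],
    "skip": ["DATE"],
    "mask_fixed_length_chars": ["zip", "location"],
}

lookup = build_label_policy_map(modes)
print(resolve_policy("Zip", lookup))   # mask_fixed_length_chars
print(resolve_policy("AGE", lookup))   # obfuscate (not listed, so the default applies)
print(resolve_policy("date", lookup))  # skip
```

Running a check like this on the dictionary before saving it can catch a mistyped policy assignment early, before the full pipeline is executed.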

Next, we define the Deidentification annotator with setMode('obfuscate') and pass the path of the JSON file to setSelectiveObfuscationModes('multi_mode.json'), in addition to the input and output column settings. This means that every entity detected by the NER models for which we did not set a specific policy will be obfuscated in the results. You can also call setFixedMaskLength() to set the number of characters used by the fixed-length mask.

That’s all! We don’t need to define any additional Deidentification annotators to apply multi-mode de-identification policies. Let’s check how this works on a sample text:

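from sparknlp_jsl.annotator import DeIdentification

# Settings as described in the text above; unlisted labels fall back to "obfuscate"
deid = DeIdentification() \
      .setInputCols(["sentence", "token", "ner_chunk"]) \
      .setOutputCol("deidentified") \
      .setMode("obfuscate") \
      .setSelectiveObfuscationModes("multi_mode.json") \
      .setFixedMaskLength(4)

text = '''
Record date : 2093-01-13 , David Hale , M.D .
Name : Hendrickson Ora MR # 7194334
PCP : Oliveira , 25 years-old Record date : 2079-11-09
Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555
'''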

Let’s check the results:

|sentence                                                             |deidentified                                          |
|Record date : 2093-01-13 , David Hale , M.D .                        |Record date : 2093-01-13 , [********] , M.D .         |
|Name : Hendrickson Ora MR # 7194334                                  |Name : [*************] MR # <ID>                      |
|PCP : Oliveira , 25 years-old Record date : 2079-11-09               |PCP : [******] , 22 years-old Record date : 2079-11-09|
|Cocke County Baptist Hospital , 0295 Keats Street , Phone 55-555-5555|**** , **** , Phone 97-182-9152                       |

As you can see above:

  • DATE entities were skipped: 2093-01-13 => 2093-01-13, 2079-11-09 => 2079-11-09
  • NAME entities were masked with same-length chars: David Hale => [********], Hendrickson Ora => [*************], Oliveira => [******]
  • ID entity was masked with the ID tag: 7194334 => <ID>
  • AGE entity was obfuscated, since we didn’t set any policy for it and set setMode() to obfuscate: 25 years-old => 22 years-old
  • LOCATION entities were masked with fixed-length chars: Cocke County Baptist Hospital => ****, 0295 Keats Street => ****
  • PHONE entity was obfuscated with a fake one: 55-555-5555 => 97-182-9152
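
As a quick sanity check on results like these, the same-length masks can be verified programmatically. The sketch below uses only the original and masked values listed above; same_length_mask_ok is a hypothetical helper written for this post, not part of the library.

```python
import re

# A same-length mask is a bracketed run of asterisks, e.g. [*****]
SAME_LENGTH_MASK = re.compile(r"\[\*+\]")

def same_length_mask_ok(original, masked):
    """True if masked is a bracketed asterisk run as long as the original."""
    return (
        SAME_LENGTH_MASK.fullmatch(masked) is not None
        and len(masked) == len(original)
    )

# (original, masked) pairs taken from the results above
name_pairs = [
    ("David Hale", "[********]"),
    ("Hendrickson Ora", "[*************]"),
    ("Oliveira", "[******]"),
]

print(all(same_length_mask_ok(o, m) for o, m in name_pairs))  # True
```

A check of this kind is useful when the de-identified text must keep the character offsets of the original note, for example when annotations are aligned by position.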


Multi-mode de-identification represents an effective solution for enhancing the process flow of de-identification tasks, thanks to its functionality, integrability, and speed advantages. By leveraging this approach, it becomes possible to implement a range of obfuscation and masking policies for different entities in a streamlined manner, while also achieving faster results due to the elimination of extra pipeline stages and post-processing requirements.

De-identification is a crucial task in the NLP world, and Healthcare NLP is one of the most popular libraries for it. John Snow Labs keeps this library up to date with new releases every two weeks. There will be new features in the upcoming releases, so keep following us!

Healthcare NLP models are licensed, so if you want to use these models, you can watch the “Get a Free License For John Snow Labs NLP Libraries” video and request one from

You can follow us on Medium and LinkedIn to get further updates, or join the Slack support channel to get instant technical support from the developers of Spark NLP. If you want to learn more about the library and start coding right away, please check our certification training notebooks.
