Watch Healthcare NLP Summit 2024. Watch now.
was successfully added to your cart.

Format Consistency for Entity Obfuscation in De-Identification with Healthcare NLP

De-Identification is a process that needs to be applied to de-identify (anonymize) or obfuscate (replace with fake entities) PHI (protected health information) data from clinical notes.

Obfuscation of PHI entities

Obfuscation of PHI entities

Sharing the clinical data of the patients with third parties is so crucial according to the HIPAA de-identification standards (Privacy Rules) and as a solution, you can de-identify PHI (Protected Health Information) entities. Spark NLP for Healthcare comes with 50+ different NER models to support de-identification tasks from 6 spoken languages (English, German, Spanish, Italian, French, and Portuguese). It also comes with several pretrained pipelines that the deidentification NER models are supported with several rule-based contextual parsers to fill the gaps (i.e. when NER fails to detect a date, regex-based modules can catch that anyway). We have a Comparison of Key Medical NLP Benchmarks — Spark NLP vs AWS, Google Cloud and Azure medium article that you can find de-identification benchmarks we got against cloud providers.

The Need for Format-Preserving Data De-Identification

Preserving the original format of sensitive information, such as date, ID, and SSN, while obfuscating the actual content with fake data is crucial for maintaining the security and privacy of the information.

  • One significant reason is that any alteration of the original format may alert intruders that the information has been obfuscated and prompt them to try to decode the data to re-identify the original information.
  • Moreover, the loss of the original format of sensitive information may result in unrealistic data, which could compromise the integrity of the data, making it less trustworthy for data analysis purposes.
  • Additionally, if the obfuscation process uses different formats, it could result in NLP models learning from the wrong formats, leading to inaccurate results when applied to real-world settings.

Therefore, preserving the original formats of sensitive information is essential to ensuring the confidentiality, integrity, and reliability of the data.

Healthcare NLP v.4.3.1 is released now and we are able to keep the formats of the entities while obfuscating them if they have a format.

Implementing De-Identification with Format Consistency

Deidentification annotator in Healthcare NLP is used for de-identification tasks. It was replacing the entities with fake ones without considering their formats before, even though they have a format. And now in v4.3.1, a new setSameLengthFormattedEntities parameter was added in this annotator that obfuscates the formatted entities like PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, PLATE and LICENSE with the fake ones in the same format. The default is an empty list ([]) so you need to set the entities in a list that you want to get in the same format.

As you can see in the example below, there are MEDICALRECORD, PHONE and IDNUM entities have a format and they were obfuscated without considering their formats in default_obfuscation column that was created without setting setSameLengthFormattedEntities. But when we set it with the list of formatted entities we want to de-identify, it will obfuscate these entities keeping their formats in consistent_format_obfuscation column.

Getting Started

You need to feed sentence, token and deid_ner_chunk (extracted entities for deidentification NLP task) columns to the Deidentification annotator and it will return obfuscated version of the detected NER chunks when you setMode("obfuscate"). In this example, we just wanted to obfuscate PHONE, MEDICALRECORD and IDNUM entities in the same format as we have in our text. You can also add another Deidentification annotator in your pipeline by setting setMode("mask") parameter for getting the detected entities masked with their labels.


obfuscated = DeIdentification()\
    .setInputCols(["sentence", "token", "deid_ner_chunk"]) \
    .setOutputCol("obfuscated") \
    .setSameLengthFormattedEntities(["PHONE","MEDICALRECORD", "IDNUM"])

sample_text = """Record date: 2003-01-13
Name : Hendrickson, Ora, Age: 25
MR: #7194334
ID: 1231511863
Phone: (302) 786-5227"""


|                        sentence|                     masking|                   obfuscation|
|         Record date: 2003-01-13|         Record date: |       Record date: 2003-03-07|
|Name : Hendrickson, Ora, Age: 25|Name : , Age: |Name : Manya Horsfall, Age: 20|
|                   MR:  #7194334|         MR: |                  MR: #4868080|
|                  ID: 1231511863|                 ID: |                ID: 2174658035|
|           Phone: (302) 786-5227|               Phone:|         Phone: (467) 302-9509|


De-identification is a crucial task in the NLP world and Healthcare Natural language processing is one of the most popular libraries for this. John Snow Labs is keeping up-to-date this library with new releases every two weeks. There will be new features in the upcoming releases, keep following us!

Spark NLP for Healthcare models are licensed, so if you want to use these models, you can watch “Get a Free License For John Snow Labs NLP Libraries” video and request one from

You can follow us on medium and Linkedin to get further updates or join slack support channel to get instant technical support from the developers of Spark NLP. If you want to learn more about the library and start coding right away, please check our certification training notebooks.

Mapping Rxnorm and NDC Codes to the NIH Drug Brand Names with Healthcare NLP

RxNorm provides normalized names for clinical drugs and links its names to many of the drug vocabularies commonly used in pharmacy management...