Register for the 5th NLP Summit, a Free Online Conference on Sep 24-26. Register now.
was successfully added to your cart.

Automation of Data De-identification


With evermore personal data being produced and stored by organizations, data privacy is becoming an increasing priority. Businesses have access to a lot of sensitive information about their customers, service providers, and employees and are required to protect that data in order to minimize the risks of scams or fraud.

De-identification is used to overcome data privacy challenges and keep information safe from unauthorized parties. This post explains what de-identification is, how it works and how natural language processing (NLP) is used to automate the process of removing sensitive data from datasets.

What is de-identification?

De-identification is a technique used to remove any data that could identify a person from a dataset. It’s a way of protecting the personal information that identifies an individual or company by doing away with all PII (personal identifiable information).

Personal identifiers include names, date of birth (DOB), email & IP addresses, location, social security numbers (SSN), bank account numbers (BANs) e.t.c.

De-identification is sometimes used interchangeably with Anonymization though there’s a little difference. While the first involves the explicit removal of personal identifiers, the latter focuses on the fact that the data cannot be linked to identify the individual.

The importance of de-identification in healthcare

In today’s world where there are a lot of privacy scandals, it is necessary to know and understand the importance of healthcare data de-identification.

De-identification ensures that individuals’ data are not leaked to third parties or exposed inappropriately thereby limiting potential harm.

De-identification has become increasingly popular after GDPR (General Data Protection Regulation) came into effect. But there are many cases where people’s personal health information (PHI) was compromised without their knowledge or consent because of a lack of security measures put in place by businesses. This prompted the birth of HIPAA which was set up by the US government to protect patient’s medical information.

The Health Insurance Portability and Accountability Act (HIPAA) is a United States federal statute that was enacted on August 21, 1996, to protect consumers’ health information.

This law, which sets the standards for protecting the privacy of an individual’s medical records and other health information, covers how this information may be used and disclosed, transferred, or otherwise handled. The Act mandates that protected health information (PHI) be removed from medical records before they can be transmitted to safeguard patient confidentiality.

De-identification procedures

The de-identification procedure can be conducted in two methods, according to the HIPAA de-identification standard. We have the Safe Harbor method and the Expert Determination method.

The Safe Harbor technique involves the removal of 18 personal identifiers, including the name, zip code, dates, telephone, email, URL, IP address, health plan id, bank account, and license id.

The latter method combines the preservation of specific personal identifiers (typically dates and demographics) with assurance from an expert that these identifiers could not be used to re-identify the patient.

Automating de-identification

The use of automated de-identification is on the rise due to its efficiency in comparison to manual processes. With the emergence of AI, this process can now be done with an accuracy that is equivalent to a manual process. This is important because manual de-identification of huge record databases can be time-consuming, expensive, and error-prone.

Automatic de-identification systems can de-identify text with a high degree of accuracy. For example, when it comes to datasets, de-identifying narrative texts tends to pose more challenges compared to that of tabular data. Why? The ambiguous nature of natural languages makes the de-identification of narrative texts particularly problematic.

This is because narrative texts do not have a standard layout or enumerated sets of fields and that the position and labeling of narrative report structures vary amongst providers. Now, with the help of machine learning (ML) and natural language processing (NLP), automated de-identification can be carried out in these narrative texts. We’ll have a closer look in the next paragraph.

De-identification automation approaches

Much of the data that is collected in various contexts are being stored on databases. When it comes to databases, it is worth mentioning that there are different types of data, such as structured and unstructured data. Unstructured data is data typically stored in its native format, while structured data is clearly defined and searchable. Here the de-identification process has to be applied differently for each type of data.

The process can vary from simple obfuscation or encryption to more complex processes like hashing or masking. De-identification bears the form of natural entity recognition (NER) in NLP and can be broken down into the following three categories:

  • Rule-based approach – This applies to using rules, pattern matching, dictionaries to de-identify text documents. Although this approach requires a whole lot of domain expertise and can be hard to cope with data drift, it’s quite explainable.
  • Model-based approach – Researchers have been using machine learning algorithms for addressing the lack of resilience in rules-based systems. This applies to using ML models to de-identify text. This approach which generalizes better has higher precision, and a better contextual capture.
  • Hybrid approach – This is a pragmatic balance of both approaches and is recommended. Recent developments in the field of deep learning and NLP have enabled systems to achieve better results, in particular in the field of named entities.

Combining human services and software, John Snow Labs is able to deliver de-identified data, by either replacing, masking, generalizing or obfuscating Protected Health Information (PHI). Out-of-the-box, high-accuracy Spark NLP for Healthcare models facilitate the de-identification of structured data, free text documents, DICOM documents, and PDFs, while enforcing GDPR and HIPAA compliance and maintaining linkage of clinical data across files.

Further Resources

For a better understanding of the subject, visit the John Snow Labs page on Full-service Data De-identification Tools, which also includes two great webinars for a deeper dive:

In this webinar, you’ll get to know more about de-identification, its use cases for structured and unstructured data, and on how to implement de-identification of these use cases using Spark NLP.

This webinar explains what’s required to de-identify medical records under the US HIPAA privacy rule, typical de-identification use cases for structured and unstructured data, as well as how to implement de-identification of these NLP use cases for Healthcare using Spark NLP.

Try Medical Data De-Indetification

See in action

Simpler & More Accurate Deidentification in Spark NLP for Healthcare

Spark NLP for Healthcare 3.1 improves the accuracy, functionality, and ease of use of the library’s data de-identification capabilities, whose are crutial...