
Using Spark NLP to De-Identify Doctor Notes in the German Language


Clinical documents and doctor’s notes are significant resources for clinical and pharma research. There are many publications and examples about clinical notes de-identification using rule-based and machine learning / natural language processing (NLP) methods in the English language. Resources about de-identification in non-English languages are much more sparse.

IQVIA and John Snow Labs recently completed a project to de-identify doctor notes written in German. This blog post describes IQVIA’s task of removing protected health information (PHI) from the notes, discusses the challenges involved, and explains how Spark NLP addressed them.




Who is IQVIA?

IQVIA, formerly known as Quintiles and IMS Health, is an American multinational company serving the combined industries of health information technology and clinical research. IQVIA provides biopharmaceutical development and commercial outsourcing services, focused primarily on Phase I-IV clinical trials and associated laboratory and analytical services, including consulting services. It has a network of more than 88,000 employees in more than 100 countries and a market capitalization of $40 billion. IQVIA is the world’s largest contract research organization.


The Challenge

IQVIA operates a large data platform in major European markets. It contains more than 30 years of data on 138 million patients. To unlock the potential of this data and analyze it, the data must first be de-identified. De-identification of German clinical notes was selected as the first use case.

There are significant challenges specific to the German language & the use case:

  • Diseases and medical procedures are often named after individuals – the names are the same as personal names.
  • The notes are short strings – there are no proper sentences.



The Solution

Spark NLP for Healthcare is a commercial extension of Spark NLP, the industry’s most widely used Natural Language Processing library. It includes algorithms and models for building solutions that process language specific to the healthcare sector.


We have proceeded in the following steps:

  • Annotation guidelines: The first step of a successful annotation project is building annotation guidelines. This is a document prepared by the John Snow Labs annotation team and agreed upon by the customer’s subject-matter experts (SMEs). It covers in detail the information to be identified, including the corner cases.



  • Annotating data in Annotation Lab: The next step is annotating the data in John Snow Labs’ Annotation Lab. First, the quality of the annotation guidelines is tested by calculating inter-annotator agreement: multiple annotators annotate the same sample, and their agreement needs to be an order of magnitude more accurate than the expected accuracy of the NLP models. Then, we annotated four sets of 200 notes, totaling 800 notes.
  • De-identification pipeline in Spark NLP for Healthcare: The elements of the production pipeline are noted in the image below.
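To illustrate the inter-annotator agreement check from the annotation step above, here is a minimal sketch in Python. The post does not say which agreement measure was used; Cohen's kappa over token-level labels is an assumption made for this example, and the label names and sample annotations are invented for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens.

    Assumption: agreement is measured per token; the blog post does not
    specify the measure that was actually used in the project.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty sample")

    n = len(labels_a)
    # Fraction of tokens on which the two annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented example: two annotators labelling the same eight tokens.
annotator_a = ["NAME", "O", "O", "DATE", "O", "O", "NAME", "O"]
annotator_b = ["NAME", "O", "O", "DATE", "O", "O", "O", "O"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

A low score on a shared sample would usually point to gaps or ambiguities in the guidelines rather than to careless annotators, which is why the check comes before the bulk annotation.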



We have trained specific NER models for the project. The training is relatively simple, but the details are beyond the scope of this blog post.


Achieving High Accuracy

Before evaluating the results, we have to consider suitable metrics, aligned with business needs. We have used “Partial Chunk Per Token.”
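The post does not define the metric further. One plausible reading, sketched below in Python, is that the B-/I- chunk prefixes are ignored and each token is scored on its own, so a partially recovered entity still earns credit for the tokens that were caught. The function names, tag names, and sample sequences are hypothetical, and this is not Spark NLP's implementation.

```python
def strip_iob(tag):
    """Drop a leading B- or I- prefix so only the entity label remains."""
    return tag[2:] if tag[:2] in ("B-", "I-") else tag


def token_recall(gold_tags, pred_tags):
    """Token-level recall over PHI tokens, ignoring chunk boundaries.

    Assumption: a PHI token counts as recovered when the predicted entity
    label matches the gold label, regardless of B-/I- position.
    """
    if len(gold_tags) != len(pred_tags):
        raise ValueError("gold and predicted sequences must have the same length")

    gold = [strip_iob(t) for t in gold_tags]
    pred = [strip_iob(t) for t in pred_tags]

    # Only tokens that are PHI in the gold annotation matter for recall.
    phi_pairs = [(g, p) for g, p in zip(gold, pred) if g != "O"]
    if not phi_pairs:
        return None  # recall is undefined without any gold PHI tokens

    hits = sum(g == p for g, p in phi_pairs)
    return hits / len(phi_pairs)


# Invented example: the name is found (with a wrong chunk boundary),
# the date is missed entirely.
gold = ["B-NAME", "I-NAME", "O", "B-DATE", "O"]
pred = ["B-NAME", "B-NAME", "O", "O", "O"]

recall = token_recall(gold, pred)
print(f"token-level recall: {recall:.2f}")
```

Under this reading, the boundary mistake on the name costs nothing, while the missed date lowers recall. That matches the business need: for de-identification, what matters is whether each PHI token was masked, not whether the chunk boundaries were exact.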


A single model was trained on all four datasets. Recall ranges from 75% to 100%, with an average of 95%.

An industry benchmark suggests that manual de-identification, performed by a skilled annotation team applying the four-eyes principle, achieves a recall of around 94% (see the blog post “Large-scale data de-identification enables healthcare data monetization”). This figure is probably lower than most people expect. We can therefore conclude that the preliminary results are on par with the manual de-identification process. (In the past, IQVIA used a legacy approach that reached higher recall through a very aggressive tagging policy, i.e., at the cost of precision.)



A manual review of the errors shows that the entities the model missed correspond to the challenges of the German text discussed above. The next steps under consideration are:

  • Treat abbreviated first names, e.g., J. Smith, E. Miller, etc.
  • Treat phone numbers with Contextual Parser.
  • Optimize the threshold to increase recall at the expense of precision.
  • Train the models on more data.
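The threshold step in the list above can be illustrated with a small sketch in Python. It assumes the model exposes a per-token confidence that the token is PHI; the scores, labels, and thresholds below are invented for illustration and are not project results.

```python
def precision_recall(scores, is_phi, threshold):
    """Precision and recall when tokens scoring >= threshold are tagged as PHI."""
    predicted = [s >= threshold for s in scores]

    tp = sum(p and g for p, g in zip(predicted, is_phi))
    fp = sum(p and not g for p, g in zip(predicted, is_phi))
    fn = sum((not p) and g for p, g in zip(predicted, is_phi))

    # Convention for this sketch: report 1.0 when a ratio is undefined.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Invented per-token PHI confidences and gold labels.
scores = [0.95, 0.80, 0.55, 0.40, 0.30, 0.10]
is_phi = [True, True, False, True, False, False]

# Lowering the threshold tags more tokens as PHI.
for threshold in (0.7, 0.5, 0.35):
    precision, recall = precision_recall(scores, is_phi, threshold)
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")
```

In this toy data, dropping the threshold from 0.7 to 0.35 recovers the one missed PHI token but also masks one harmless token. For de-identification that trade is often acceptable, since a leaked identifier is costlier than an over-masked word.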



IQVIA and John Snow Labs developed a system for de-identifying clinical notes in German. The data set poses several challenges: the notes consist of short strings without the usual sentence structure, and many diseases and procedures carry names that are also German personal names.

Despite those challenges, preliminary results show recall figures on par with the well-run manual de-identification procedure.

As the next step, the system will be developed further and deployed to the IQVIA production environment.



