
Understanding the context of clinical terms in Spark NLP

Analyzing medical reports is one of the most exciting applications of NLP, because it turns unstructured text into valuable insight. Instead of reading through hundreds of reports manually, medical staff can rely on NLP algorithms to analyze medical notes and extract their meaning, enabling doctors and medical workers to make data-based decisions.

Thanks to recent improvements to the Spark NLP text processing library by John Snow Labs, information can now be extracted from medical reports more accurately, more specifically, and faster. The number of supported terminologies has increased significantly, with higher accuracy and broader support for different medical terminologies than the top three cloud providers offer.

Here’s an overview of what you can expect from the latest Spark NLP advancements and how they facilitate the analysis of medical reports at scale.

Leverage the context around a medical term

Spark NLP can be used to understand the meaning of medical reports and gain insights, such as how many patients deal with a certain health issue, which body parts are most often injured, or how severe an injury is. In a process called Entity Resolution, the system analyzes the clinical terms used in medical reports to predict UMLS codes, which are unique identifiers for medical concepts such as specific health conditions.

The illustration below shows a sentence extracted from a medical text. The old version of the Spark NLP library could extract single words from such a sentence, but not their meaning in relation to the words surrounding them. So medical workers could learn about the number of fractures, for example, but they had no insight into which body parts were most affected.

The new feature added to Spark NLP, called “SentenceChunkEmbeddings”, enables the understanding of entire sentences by analyzing how neighboring words relate to each other. In the example discussed above, we can now understand where the fracture happened and could find out how many patients had a fracture in the lower left leg in particular. As a result, the UMLS codes mentioned above are now more accurate and specific.
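
To make the idea concrete, here is a toy sketch in plain Python. It does not use Spark NLP and is not the library's implementation; it only illustrates one way context could be folded in, by blending a vector for the extracted term with a vector for its whole sentence before looking up the closest code. The three-dimensional vectors, the PLACEHOLDER labels (stand-ins, not real UMLS codes), the function names, and the chunk_weight parameter are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def blend(chunk_vec, sentence_vec, chunk_weight=0.5):
    """Weighted average of the term (chunk) vector and its sentence vector."""
    return [
        chunk_weight * c + (1.0 - chunk_weight) * s
        for c, s in zip(chunk_vec, sentence_vec)
    ]


def resolve(query_vec, candidates):
    """Return the candidate label whose vector is most similar to the query."""
    return max(candidates, key=lambda label: cosine(query_vec, candidates[label]))


# Hand-made toy vectors; dimensions loosely mean [fracture, leg, arm].
chunk_vec = [1.0, 0.0, 0.0]      # the word "fracture" on its own
sentence_vec = [0.6, 0.8, 0.0]   # the full sentence, which mentions the leg

# Placeholder labels, NOT real UMLS codes.
candidates = {
    "PLACEHOLDER_FRACTURE_UNSPECIFIED": [1.0, 0.0, 0.0],
    "PLACEHOLDER_FRACTURE_LOWER_LEG": [0.7, 0.7, 0.0],
    "PLACEHOLDER_FRACTURE_ARM": [0.7, 0.0, 0.7],
}

chunk_only = resolve(chunk_vec, candidates)
with_context = resolve(blend(chunk_vec, sentence_vec), candidates)

print("term only:   ", chunk_only)
print("with context:", with_context)
```

With the term alone, the closest candidate is the unspecified fracture; once the sentence vector is blended in, the lower-leg candidate becomes the closest match.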

Fine-tune pre-trained models to understand different words with the same meaning

Another advancement in the updated Spark NLP library is the ability to adapt the pre-trained model to the particular vocabulary a medical institution or doctor uses. This comes in handy because the terms used in medical notes do not always correspond exactly to the words used in official sources. For example, if a doctor describes a patient’s eye color with the word “sapphire”, the model would understand that the color is blue.

The model can be re-trained so that it maps new words onto existing terms with the same meaning. The system is then able to analyze these added terms with the same accuracy as the original terms from the main dataset.
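
The following toy sketch, again in plain Python and without Spark NLP, shows the general shape of that workflow: a resolver is fitted on base term pairs, fails on an unfamiliar synonym, and handles it after being fitted again with institution-specific pairs. The TermResolver class, its methods, and the concept labels are hypothetical names invented for this illustration; an actual re-training run adjusts a learned model rather than a lookup table.

```python
class TermResolver:
    """Toy exact-match resolver mapping surface terms to a canonical concept."""

    def __init__(self):
        self._index = {}

    def fit(self, pairs):
        """Add (term, concept) pairs; calling fit again extends the vocabulary."""
        for term, concept in pairs:
            self._index[term.strip().lower()] = concept
        return self

    def resolve(self, term):
        """Return the concept for a term, or None if the term is unknown."""
        return self._index.get(term.strip().lower())


# Base vocabulary, standing in for terms taken from official sources.
base_pairs = [
    ("blue", "eye color: blue"),
    ("brown", "eye color: brown"),
]
resolver = TermResolver().fit(base_pairs)

# The institution-specific synonym is unknown at first.
before = resolver.resolve("sapphire")

# "Re-train" with the institution's own vocabulary.
resolver.fit([("sapphire", "eye color: blue")])
after = resolver.resolve("Sapphire")

print("before:", before)
print("after: ", after)
```

The base terms keep resolving as before, so extending the vocabulary does not disturb what the resolver already knew.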

Add proprietary, doctor-curated training dataset, based on real-world documents

The third improvement to Spark NLP concerns the terminology of specific areas of expertise within the healthcare industry, which can now be taken into account as well.

The jargon of medical institutions varies depending on their specialty. A medical report from a cancer treatment center uses different terms than one from a children’s hospital, for example. Spark NLP is pre-trained on datasets from official sources, which cover the regular clinical terms, so it can reach its limits when analyzing a report full of specialist jargon.

Medical documentation provided by doctors can be used to re-train the model and adjust it to very specific jargon. The tool is able to learn from these new (and large) datasets and analyze text at scale, without compromising on accuracy.


The enhanced version of the Spark NLP library delivers a more accurate and specific analysis of free-text clinical notes, revealing even deeper insights for medical institutions and doctors.

New features include the ability to put clinical terms into context, understand similar terms, and re-train the model with new terms. As a result, we see significant improvements in entity resolution, with better results in terms of both accuracy and inference.

Moving forward, we’ll keep fine-tuning and improving the library, with the ultimate goal of leveraging NLP’s full potential for the healthcare industry.

Get started with Spark NLP and more open-source state-of-the-art natural language processing libraries here.
