Summarizing Clinical Jargon to Layman Terms at Scale

12.08.2023

David Cecchini

Data Scientist at John Snow Labs

In this post, we show how to use Healthcare NLP to summarize clinical jargon in layman terms at scale on Spark using the Summarizer annotator.

Introduction

Reading clinical notes when you are not a healthcare expert can be challenging due to the amount of clinical jargon present in the documents written by clinicians. For example, given the excerpt below:

“The patient was brought to the OR, anesthesia was applied. The patient was placed in dorsal lithotomy position. The patient was prepped and draped in the usual sterile fashion. A 23-French scope was inserted inside the urethra into the bladder. The entire bladder was visualized, which appeared to have a large tumor, lateral to the right ureteral opening.”

it may be difficult to understand terms such as “dorsal lithotomy position”. Asking GPT-3.5 to translate to layman terms, we get “positioned on their back with their legs up in stirrups” instead, which most people can understand.

Large language models (LLMs) acquire real-world knowledge by being trained on huge amounts of data, so they can be trained to summarize long documents as well as to explain concepts and jargon. They only need data and examples. To delve deeper into the topic, read the article about what an LLM is and how it works.

The most successful architecture among recent developments in machine learning for natural language processing is the transformer, the base architecture of most large language models, such as GPT-4, BLOOM, and Falcon. Since version 4.4.0, John Snow Labs’ Healthcare NLP comes with support for large language models specialized in the healthcare domain, and since version 4.4.3, the layman summarizer can be used to translate complex clinical notes into plain English.

In the following sections, we are going to show how to use this annotator and give some examples.

Quick introduction to Spark NLP and Healthcare NLP

John Snow Labs is a leading provider of state-of-the-art natural language processing (NLP) solutions, with products that include Spark NLP, Healthcare NLP, Clinical NLP, Finance NLP, and Legal NLP. These technologies help organizations extract valuable insights from text data in various domains, including healthcare and beyond.

Spark NLP is an open-source library built on Apache Spark, designed to empower data scientists and developers with powerful NLP capabilities. It provides a scalable and efficient framework for processing and analyzing large volumes of unstructured text data, enabling advanced text mining, sentiment analysis, named entity recognition, part-of-speech tagging, and other essential NLP tasks. With its rich suite of pre-trained models and pipelines, Spark NLP facilitates quick development and deployment of NLP solutions across diverse industries.

Healthcare NLP, another flagship offering from John Snow Labs, focuses specifically on transforming healthcare-related text data into meaningful insights. Leveraging advanced machine learning techniques and deep medical domain expertise, Healthcare NLP enables tasks such as clinical entity recognition, medical code mapping, adverse drug event detection, clinical text de-identification, and more. By unlocking the vast potential of medical records, research papers, and healthcare literature, Healthcare NLP empowers healthcare professionals, researchers, and organizations to derive valuable insights, improve patient care, and drive innovation in the healthcare industry.

All these products are backed by John Snow Labs’ commitment to quality, accuracy, and performance. With a dedicated team of data scientists and researchers, they continuously update and expand their models and pipelines to stay at the forefront of NLP advancements. Whether it’s for general-purpose NLP applications or specialized healthcare/finance/legal use cases, John Snow Labs’ products provide robust and efficient solutions, fueling data-driven decision-making and unlocking the true potential of text data analysis.

To use both the open-source Spark NLP and the licensed Healthcare NLP, it is recommended to use the library johnsnowlabs:

pip install johnsnowlabs

Then, after obtaining your license keys, it is easy to install all the libraries with:

from johnsnowlabs import nlp

nlp.install(force_browser=True)

For alternative methods of installation in different environments (air-gapped, different OS, Databricks, etc.), you can check the documentation page. You can also request a free trial license.

Building the pipeline

To build the pipeline, you can use annotators from the open-source library through the nlp module, and annotators from the healthcare library through the medical module. The other available modules (finance, legal, visual, and viz) are not covered in this article.

So, let’s import the modules we use for the layman summarizer and start the Spark session.

from johnsnowlabs import nlp, medical

# Start the Spark session
spark = nlp.start()

To create the pipeline, we only need two stages: one to create the DOCUMENT annotation from raw text and one for the summarizer.

doc_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)

summarizer = (
    medical.Summarizer.pretrained(
        "summarizer_clinical_laymen", "en", "clinical/models"
    )
    .setInputCols(["document"])
    .setOutputCol("summary")
    .setMaxNewTokens(66)  # Change according to how long the input is
)
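
The value passed to setMaxNewTokens caps the length of the generated summary, so it is worth adapting to the input. The sketch below is one possible heuristic, not something prescribed by the library: the function name, the 0.5 ratio, and the floor and ceiling values are all assumptions chosen for illustration, and whitespace-separated words are only a rough proxy for model tokens.

```python
def estimate_max_new_tokens(text, ratio=0.5, floor=32, ceiling=512):
    """Hypothetical heuristic: budget output tokens as a fraction of input words.

    ratio, floor and ceiling are illustrative defaults, not library values.
    """
    n_words = len(text.split())      # crude proxy for the number of tokens
    budget = int(n_words * ratio)    # aim for a summary shorter than the input
    return max(floor, min(budget, ceiling))


# An 80-word note gets a budget of 40 new tokens with the defaults above.
note = "The patient was placed in dorsal lithotomy position. " * 10
max_new_tokens = estimate_max_new_tokens(note)
print(max_new_tokens)
```

The result could then be passed to the summarizer in place of the fixed value, for example with .setMaxNewTokens(max_new_tokens).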

Then we create a pipeline and fit it on an empty DataFrame to obtain the PipelineModel. If you are not familiar with these concepts, you can review them here.

pipeline = nlp.Pipeline(stages=[doc_assembler, summarizer])

# Get the PipelineModel
empty_df = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_df)

Now we can use the model on Spark data frames with different example texts. Here we will use one example, and in the next section we will show the results obtained on different inputs.

example = """
The patient was brought to the OR, anesthesia was applied.
The patient was placed in dorsal lithotomy position.
The patient was prepped and draped in the usual sterile fashion.
A 23-French scope was inserted inside the urethra into the bladder.
The entire bladder was visualized, which appeared to have a large tumor, \
lateral to the right ureteral opening.
"""

spark_df = spark.createDataFrame([[example]]).toDF("text")
prediction = model.transform(spark_df)
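
The same pattern extends from one note to many, which is where running on Spark pays off. As a minimal sketch, the helpers below prepare a list of raw notes in the one-column row shape used above. The helper names are hypothetical, and collapsing whitespace and dropping empty notes are our own assumptions about reasonable preprocessing, not steps the summarizer is documented to require.

```python
import re


def clean_note(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def to_rows(notes):
    """Wrap each non-empty note in a one-element list (one row per note)."""
    cleaned = (clean_note(n) for n in notes)
    return [[c] for c in cleaned if c]


notes = [
    "\nThe patient was brought to the OR,\nanesthesia was applied.\n",
    "   ",  # blank note, dropped by to_rows
    "A 23-French scope was inserted\tinside the urethra.",
]
rows = to_rows(notes)
print(rows)

# With a live session, the rows could then be turned into a DataFrame:
# spark_df = spark.createDataFrame(rows).toDF("text")
```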

To retrieve the prediction, we can query the dataframe:

prediction.select("summary.result").show(truncate=False)
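
Selecting summary.result yields an array of strings for each input row, so some post-processing is usually needed before the text can be used outside Spark. The sketch below shows one way to do it on already-collected data. The helper name is hypothetical, and the input is a hand-written mock with placeholder strings, not real model output.

```python
def flatten_summaries(collected):
    """Join the strings in each row's result list into one summary per note."""
    return [" ".join(chunks).strip() for chunks in collected]


# Mock of the collected column: one list of strings per input note.
# With a live session, it could be built along the lines of:
# collected = [row["result"] for row in
#              prediction.select("summary.result").collect()]
mock_collected = [
    ["placeholder summary for note one"],
    ["placeholder part A", "placeholder part B"],
]
summaries = flatten_summaries(mock_collected)
print(summaries)
```

Collecting brings all results to the driver, so for large datasets it is usually better to keep the summaries in the DataFrame and write them out with Spark.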

The query displays the generated summary, with the clinical jargon rewritten in layman terms.

Conclusion

Healthcare text data can contain complex jargon used by professional clinicians, and understanding these documents can be a challenge for people outside the domain. The layman summarizer from Healthcare NLP can help perform two tasks at once:

Summarize longer documents
Translate clinical jargon into plain English (layman terms)

We showed how to use the library to create pipelines that scale easily on the Spark ecosystem to perform those tasks.
