In this post, we show how to use Healthcare NLP to summarize clinical jargon in layman’s terms at scale on Spark using the Summarizer annotator.
Reading clinical notes when you are not a healthcare expert can be challenging due to the amount of clinical jargon present in the documents written by clinicians. For example, given the excerpt below:
“The patient was brought to the OR, anesthesia was applied. The patient was placed in dorsal lithotomy position. The patient was prepped and draped in the usual sterile fashion. A 23-French scope was inserted inside the urethra into the bladder. The entire bladder was visualized, which appeared to have a large tumor, lateral to the right ureteral opening.”
it may be difficult to understand terms such as “dorsal lithotomy position”. Asking GPT-3.5 to translate it into layman’s terms, we get “positioned on their back with their legs up in stirrups” instead, which most people can understand.
Because large language models (LLMs) acquire knowledge of the real world by being trained on huge amounts of data, they can be trained to summarize long documents as well as to explain concepts and jargon. They only need data and examples. To delve deeper into the topic, read the article about what an LLM is and how it works.
The most successful architecture among recent developments in machine learning models for natural language processing is the transformer, the base architecture of most large language models, such as GPT-4, BLOOM, and Falcon. Since version 4.4.0, John Snow Labs’ Healthcare NLP has supported large language models specialized in the healthcare domain, and since version 4.4.3, the layman summarizer can be used to translate complex clinical notes into plain English.
In the following sections, we are going to show how to use this annotator and give some examples.
Quick introduction to Spark NLP and Healthcare NLP
John Snow Labs is a leading provider of state-of-the-art natural language processing (NLP) solutions, with a product line that includes Spark NLP, Healthcare NLP, Clinical NLP, Finance NLP, and Legal NLP. These technologies have changed the way organizations extract valuable insights from text data in various domains, including healthcare and beyond.
Spark NLP is an open-source library built on Apache Spark, designed to empower data scientists and developers with powerful NLP capabilities. It provides a scalable and efficient framework for processing and analyzing large volumes of unstructured text data, enabling advanced text mining, sentiment analysis, named entity recognition, part-of-speech tagging, and other essential NLP tasks. With its rich suite of pre-trained models and pipelines, Spark NLP facilitates quick development and deployment of NLP solutions across diverse industries.
Healthcare NLP, another flagship offering from John Snow Labs, focuses specifically on transforming healthcare-related text data into meaningful insights. Leveraging advanced machine learning techniques and deep medical domain expertise, Healthcare NLP enables tasks such as clinical entity recognition, medical code mapping, adverse drug event detection, clinical text de-identification, and more. By unlocking the vast potential of medical records, research papers, and healthcare literature, Healthcare NLP empowers healthcare professionals, researchers, and organizations to derive valuable insights, improve patient care, and drive innovation in the healthcare industry.
All these products are backed by John Snow Labs’ commitment to quality, accuracy, and performance. With a dedicated team of data scientists and researchers, they continuously update and expand their models and pipelines to stay at the forefront of NLP advancements. Whether it’s for general-purpose NLP applications or specialized healthcare/finance/legal use cases, John Snow Labs’ products provide robust and efficient solutions, fueling data-driven decision-making and unlocking the true potential of text data analysis.
To use both the open-source Spark NLP and the licensed Healthcare NLP, it is recommended to install the johnsnowlabs library:
pip install johnsnowlabs
Then, after obtaining your license keys, it is easy to install all the libraries with:
from johnsnowlabs import nlp
nlp.install(force_browser=True)
For alternative methods of installation in different environments (air-gapped, different OS, Databricks, etc.), you can check the documentation page. You can also ask for a free trial license in this link.
Building the pipeline
To build the pipeline, you can use annotators from the open-source library through the nlp module and annotators from the healthcare library through the medical module. Not covered in this article are the other available models.
So, let’s import the modules we use for the layman summarizer and start the Spark session.
from johnsnowlabs import nlp, medical

# Start the Spark session
spark = nlp.start()
To create the pipeline, we only need two stages. One to create the DOCUMENT annotation from raw texts and one for the summarizer.
doc_assembler = (
    nlp.DocumentAssembler().setInputCol("text").setOutputCol("document")
)
summarizer = (
    medical.Summarizer.pretrained(
        "summarizer_clinical_laymen", "en", "clinical/models"
    )
    .setInputCols(["document"])
    .setOutputCol("summary")
    .setMaxNewTokens(66)  # Change according to how long the input is
)
Then we create a pipeline and fit it on an empty data frame to obtain the PipelineModel. If you are not familiar with these concepts, you can review them here.
pipeline = nlp.Pipeline(stages=[doc_assembler, summarizer])

# Get the PipelineModel
empty_df = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_df)
Now we can use the model on Spark data frames with different example texts. Here we will use one example, and in the next section we will show the results obtained on different inputs.
example = """
The patient was brought to the OR, anesthesia was applied. The patient was placed in dorsal lithotomy position. The patient was prepped and draped in the usual sterile fashion. A 23-French scope was inserted inside the urethra into the bladder. The entire bladder was visualized, which appeared to have a large tumor, \
lateral to the right ureteral opening.
"""

spark_df = spark.createDataFrame([[example]]).toDF("text")
prediction = model.transform(spark_df)
To retrieve the prediction, we can query the dataframe:
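A minimal sketch of such a query, assuming the output column is named summary as configured above and that each annotation exposes a result field (the usual Spark NLP annotation layout). The helper extract_summaries and the stand-in rows are hypothetical and only illustrate the shape of the data; the Spark calls are shown as comments because they need a running session and a licensed model.

```python
def extract_summaries(rows, column="summary"):
    """Return one summary string per row by joining the `result` of each annotation."""
    return [" ".join(ann["result"] for ann in row[column]) for row in rows]


# With the pipeline above (not executed here), this could be used as:
#   rows = prediction.select("summary").collect()
#   print(extract_summaries(rows)[0])
# or, to simply display the text in Spark:
#   prediction.select("summary.result").show(truncate=False)

# Stand-in rows with the same shape as collected annotations, for illustration only.
sample_rows = [{"summary": [{"result": "This is a clinical note about a patient."}]}]
print(extract_summaries(sample_rows)[0])
```

The summary obtained for the example above is: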
This is a clinical note about a patient who had surgery to remove a large tumor from their bladder. The patient was given anesthesia and a scope was inserted into their bladder to look inside. The surgeon found a tumor in the bladder, which was lateral to the right ureteral opening.
With the same approach, we can run predictions on different texts. Here we show a few examples. Note that the model was trained on full clinical notes (up to 1024 tokens) and to output a summary of at most 512 tokens, but the maxNewTokens parameter can control how long the output can be (which could depend on how long the input text is).
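One possible way to pick a value for maxNewTokens is to scale it with the input length. The sketch below is only a heuristic under stated assumptions: estimate_tokens and suggest_max_new_tokens are hypothetical helpers, the word count is a rough stand-in for the model's real tokenizer, and the ratio and floor are arbitrary. Only the 512-token ceiling comes from the training setup described above.

```python
def estimate_tokens(text):
    """Rough token estimate: whitespace-separated words.
    The model's own tokenizer will usually produce more tokens than this."""
    return len(text.split())


def suggest_max_new_tokens(text, ratio=0.75, floor=32, ceiling=512):
    """Hypothetical heuristic: output budget proportional to input length,
    clamped to the range [floor, ceiling]."""
    budget = int(estimate_tokens(text) * ratio)
    return max(floor, min(ceiling, budget))


# Possible use with the summarizer stage defined earlier (not executed here):
#   summarizer.setMaxNewTokens(suggest_max_new_tokens(example))

short_note = "The patient was placed in dorsal lithotomy position."
print(suggest_max_new_tokens(short_note))
```

Very short inputs fall back to the floor, and very long ones are capped at the ceiling, so the value passed to the annotator always stays within the range the model was trained for.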
The examples below are short texts to show the model capabilities.
Chief Complaint: The patient presents with acute dyspnea and tachypnea upon exertion, accompanied by bilateral crackles on pulmonary auscultation. Differential diagnosis includes congestive heart failure, pulmonary edema, and exacerbation of chronic obstructive pulmonary disease (COPD). Further investigation through chest X-ray and arterial blood gas analysis is warranted.
This is a clinical note about a patient who is experiencing acute dyspnea and tachypnea, which means they have difficulty breathing while exerting. They also have crackles on their lungs when they are breathing. The diagnosis is congestive heart failure, pulmonary edema, and exacerbation of chronic obstructive pulmonary disease (COPD). The patient is being evaluated for further testing through chest X-ray and arterial blood gas analysis. COPD is a lung disease that can cause lung problems.
Assessment: The patient exhibits signs and symptoms consistent with an acute myocardial infarction. ECG findings reveal ST-segment elevation in leads II, III, and aVF, suggestive of inferior wall involvement. Urgent cardiac catheterization is indicated to assess coronary artery patency and identify potential revascularization strategies.
This is a clinical note about a patient who has a heart attack. The ECG shows that the heart is working properly, but the ST-segment in the leads II, III, and aVF are elevated, which suggests that the heart is causing the attack. The patient is also being monitored for any other issues. The ECG is a test that measures the heart’s electrical activity. The patient will need to have a cardiac catheterization to check for any problems with the heart’s electrical activity.
Plan: The patient, diagnosed with type 2 diabetes mellitus, presents with hyperglycemia, polyuria, and unexplained weight loss. Given the clinical picture and elevated HbA1c levels, initiation of insulin therapy is recommended in conjunction with dietary modifications and regular exercise. Blood glucose monitoring and patient education on self-administration techniques will be provided.
This is a clinical note about a patient with type 2 diabetes mellitus who has hyperglycemia, polyuria, and weight loss. The patient is recommended to take insulin therapy to manage their high blood sugar and high blood pressure. The doctor will also provide blood glucose monitoring and patient education on self-administration techniques. The plan includes introducing insulin therapy with dietary changes and regular exercise.
Progress Note: The patient, a known hypertensive individual, presented with a hypertensive crisis characterized by severely elevated blood pressure readings of 190/110 mmHg. Urgent intervention involved administration of intravenous antihypertensive agents, including nitroprusside, and continuous blood pressure monitoring. The patient’s blood pressure gradually normalized to within target range, warranting transition to oral antihypertensive medications.
This is a progress note about a patient who has a high blood pressure and is experiencing a hypertension crisis. The patient was given medication to help with the pressure and was monitored closely. The patient’s blood pressure was high at 190/110 mmHg, which is a dangerous level. The medication was given through an IV and monitored closely. The patient’s blood pressure gradually normalized and they were prescribed oral antihypertensive medications.
Consultation Note: The patient, referred for evaluation of chronic abdominal pain and weight loss, underwent a comprehensive gastroenterological examination. Upper endoscopy revealed erosive gastritis and duodenal ulcers, consistent with Helicobacter pylori infection. Triple therapy comprising proton pump inhibitors, clarithromycin, and amoxicillin has been initiated to eradicate the pathogen and alleviate symptoms.
This is a clinical note about a patient who was referred for evaluation of chronic abdominal pain and weight loss. The patient underwent a comprehensive gastroenterological examination and found erosive gastritis and duodenal ulcers, consistent with Helicobacter pylori infection. The patient has been given triple therapy to eradicate the pathogen and alleviate symptoms. The doctor has also started antibiotics to treat the infection.
Healthcare text data can contain complex jargon used by professional clinicians, and understanding these documents can be a challenge for people outside the domain. Using the layman summarizer from Healthcare NLP can help to perform two tasks at once:
- Summarize longer documents
- Translate clinical jargon into plain English (layman’s terms)
We showed how to use the library to create pipelines that perform those tasks and scale easily on the Spark ecosystem.