Meet us at Intelligent Health UK on the 24th & 25th of May. Schedule a meeting here.
was successfully added to your cart.

Detect Radiology Related Entities with Spark NLP

In this post, we will explain how to use the new Spark NLP‘s pre-trained NER NLP model ner_radiology that can identify entities related to radiology.

Introduction

Named Entity Recognition (NER) is a dynamic field of Natural Language Processing (NLP) research and is defined as an automatic system to detect “entities” in a text. These entities depend on the application scenario and can be People, Location, and Organization for a generic news article text, or, as in the case here, entities related to radiology.

The capability to automatically identify radiology entities can help health practitioners to structure and organize big data sets at scale. Usually, without an automatic model, practitioners need to manually label the entities in the text if they want to analyze the data they have or create a model that uses this information as input.

John Snow Labs’ Spark NLP

Spark NLP is an open-source Java and Python NLP library built on top of Apache Spark from John Snow Labs. It is one of the top growing NLP production-ready solutions with support for all main tasks related to NLP in more than 46 languages.

Figure 1: Open-Source Spark NLP

Apart from the open-source library, the company maintains the licensed library specialized for healthcare solutions, SPARK NLP for Healthcare.

Figure 2: Spark NLP for Healthcare

Spark NLP for Healthcare contains a pre-trained model for Named Entity Recognition in the scope of radiology.

Spark NLP pipeline

Spark NLP is built on top of Apache Spark and is based on the concept of pipelines, where we can perform several NLP tasks in our data with a unified pipeline. For an introduction of Spark NLP functionalities, refer to [1].

Here, we will explore the Spark NLP for Healthcare capabilities for Radiology NER using the new model.

The new ner_radiology model was trained on a custom dataset comprising of MIMIC-CXR [5] and MT Radiology texts, and can identify the following entities:

  • ImagingTest: Name of an imaging test (Ultrasound)
  • Imaging_Technique: Name of the technique used
  • ImagingFindings: Diagnostic or identifications found on tests (Ovoid mass)
  • OtherFindings: Other findings
  • BodyPart: Name of anatomic part of the body (Shoulder)
  • Direction: Detail direction of the image test (Bilateral
  • Test: Name of the test (not imaging)
  • Symptom: Symptom identified
  • Disease_Syndrome_Disorder: Name of health condition (Lipoma)
  • Medical_Device: Name of medical devices
  • Procedure: Name of procedure
  • Measurements: Measurements such as 0.5 x 0.4
  • Units: Units such as cm, ml

In Spark NLP, we build a pipeline containing all the steps (stage) of the model, and in the case of NER models, we need to transform the text into a document annotator (DocumentAssembler), identify sentences (SentenceDetector), tokenize (Tokenizer), transform to embeddings vectors (WordEmbeddingsModel), identify entities (NerModel) and finally convert the entities to standard formats (NerConverter). The full pipeline in Python can be seen below.

# Import modules from pyspark.ml.pipeline import PipelineModel from sparknlp.base import DocumentAssembler from sparknlp.annotator import SentenceDetector, Tokenizer, WordEmbeddingsModel from sparknlp.annotator import NerDLModel, NerConverter document_assembler = DocumentAssembler() \         .setInputCol("text") \         .setOutputCol("document") sentence_detector = SentenceDetector()\         .setInputCols(["document"])\         .setOutputCol("sentence") tokenizer = Tokenizer()\         .setInputCols(["sentence"])\         .setOutputCol("token") embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\         .setInputCols("document", "token") \         .setOutputCol("embeddings") ner =  NerDLModel.pretrained("ner_radiology", "en", "clinical/models") \         .setInputCols(["document", "token", "embeddings"]) \         .setOutputCol("ner") ner_converter = NerConverter() \         .setInputCols(["document", "token", "ner"]) \         .setOutputCol("ner_chunk") pipeline = Pipeline(stages=[         document_assembler,         sentence_detector,         tokenizer,         embeddings,         ner,         ner_converter])

We then can initialize the pipeline on an example:

Bilateral breast ultrasound was subsequently performed, which demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue or a lipoma.

To do that, we will use two pipeline objects: the Pipeline and the LightPipeline. The first one is useful to keep all the stages in one single object and is used on spark data frames, while the second one is useful for fast predictions and can be used directly on strings (or list of strings).

# Send example to spark data frame
example = spark.createDataFrame([['''Bilateral breast ultrasound was subsequently performed, which 
demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within the 
anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to the 
adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous tissue 
or a lipoma.''']]).toDF("text")

# Initialize the model
model = nlpPipeline.fit(example)

# Make predictions on the example (full model)
result = model.transform(example)

# Create a light model for fast predictions
lmodel = LightPipeline(model)

# Use texts directly
result_light = lmodel.fullAnnotate("Bilateral breast ultrasound was subsequently performed, which 
demonstrated an ovoid mass measuring approximately 0.5 x 0.5 x 0.4 cm in diameter located within 
the anteromedial aspect of the left shoulder. This mass demonstrates isoechoic echotexture to 
the adjacent muscle, with no evidence of internal color flow. This may represent benign fibrous 
tissue or a lipoma."

To visualize the results of the light model, we can run:

for res in result_light[0]['ner_chunk']:
    print(f"Chunk: {res.result}, Entity: {res.metadata['entity']}")

Resulting:

Chunk: Bilateral breast, Entity: BodyPart Chunk: ultrasound, Entity: ImagingTest Chunk: ovoid mass, Entity: ImagingFindings Chunk: 0.5 x 0.5 x 0.4, Entity: Measurements Chunk: cm, Entity: Units Chunk: anteromedial aspect of the left shoulder, Entity: BodyPart Chunk: mass, Entity: ImagingFindings Chunk: isoechoic echotexture, Entity: ImagingFindings Chunk: muscle, Entity: BodyPart Chunk: internal color flow, Entity: ImagingFindings Chunk: benign fibrous tissue, Entity: ImagingFindings Chunk: lipoma, Entity: Disease_Syndrome_Disorder

And for the full model:

result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
      .select(F.expr("cols['0']").alias("chunk"),
              F.expr("cols['1']['entity']").alias("entities"))\
      .show(truncate=False)

Resulting:

chunk entities
Bilateral Direction
breast BodyPart
ultrasound ImagingTest
ovoid mass ImagingFindings
0.5 x 0.5 x 0.4 Measurements
cm Units
anteromedial aspect Direction
left Direction
shoulder BodyPart
mass ImagingFindings
isoechoic echotexture ImagingFindings
muscle BodyPart
internal color flow ImagingFindings
benign fibrous tissue ImagingFindings
lipoma Disease_Syndrome_Disorder

Overall, the model achieves good performance on the entities (except for the OtherFindings that is less relevant), as can be seen in the table below.

entity tp fp fn total precision recall f1
OtherFindings 8 15 63 71 0.3478 0.1127 0.1702
Measurements 481 30 15 496 0.9413 0.9698 0.9553
Direction 650 137 94 744 0.8259 0.8737 0.8491
ImagingFindings 1,345 355 324 1,669 0.7912 0.8059 0.7985
BodyPart 1,942 335 290 2,232 0.8529 0.8701 0.8614
Medical_Device 236 75 64 300 0.7588 0.7867 0.7725
Test 222 41 48 270 0.8441 0.8222 0.8330
Procedure 269 117 116 385 0.6969 0.6987 0.6978
ImagingTest 263 50 43 306 0.8403 0.8595 0.8498
Symptom 498 101 132 630 0.8314 0.7905 0.8104
Disease_Syndrome_Disorder 1,180 258 200 1,380 0.8206 0.8551 0.8375
Units 269 10 2 271 0.9642 0.9926 0.9782
Imaging_Technique 140 38 25 165 0.7865 0.8485 0.8163

The model achieves and micro accuracy of 0.8315 and a macro accuracy of 0.7524.

Conclusion

In this post, we introduced a new clinical Named Entity Resolution model that identifies entities related to radiology. If you want to try them on your data, you can ask for a Spark NLP Healthcare free trial license.

Being used in enterprise projects and built natively on Apache Spark and TensorFlow as well as offering all-in-one state-of-the-art NLP solutions, Spark NLP library provides simple, performant as well as accurate NLP notations for machine learning pipelines that can scale easily in a distributed environment.

If you want to find out more and start practicing Spark NLP, please check out the reference resources below.

Further reading

White Swan's Impressions of John Snow Labs' Clinical Named Entity Recognition

At an industry conference in 2018, our volunteer Rob Lovelock was introduced to a company called John Snow Labs, which specialized in...