
DICOM de-identification at scale in Visual NLP 3/3

This post will explore how Visual NLP can manipulate pixel and overlay data within DICOM images.

In the following examples, we will work with two transformers: DicomToImageV3, which extracts frame images, and DicomDrawRegions, which draws rectangular regions on the frames and is helpful when building de-identification pipelines.


DicomToImageV3 extracts images from the pixel and overlay data and adds them to the Spark DataFrame as an Image structure.

It supports the following PhotometricInterpretations:

  • MONOCHROME2: This Photometric Interpretation represents monochrome images, which are grayscale images with varying shades of gray. It is often used for medical images like X-rays and grayscale photographs.
  • RGB: RGB stands for Red, Green, and Blue. This Photometric Interpretation is used for full-color images, where each pixel is represented by three color channels: red, green, and blue. Combining these three channels in varying intensities creates a wide range of colors, making it suitable for standard color images.
  • YBR: YBR stands for YCbCr (Luminance, Chrominance Blue, Chrominance Red). It is a color space used to represent color images in a way that separates the luminance (brightness) information from the chrominance (color) information. It’s often used in medical imaging and JPEG compression.
  • YBR FULL: This extension of the YBR color space provides full-color information. It still separates luminance and chrominance but includes all color information needed for accurate color representation.
  • YBR FULL 422: This variation of YBR FULL uses 4:2:2 chroma subsampling. It reduces the amount of chrominance data while preserving good color quality, making it useful for compression without significant loss of image quality.
  • PALETTE COLOR: This Photometric Interpretation uses a color palette to represent images. Instead of storing individual color values for each pixel, it indexes a color palette to represent the colors in the image. It’s an efficient way to store and transmit color images with a limited color set, such as in GIF images.
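
To make the color-space descriptions above more concrete, here is a minimal pure-Python sketch of two conversions a decoder has to perform before a frame can be shown: mapping a YBR FULL pixel to RGB and resolving PALETTE COLOR indices through a lookup table. The function names and the sample palette are hypothetical, and the coefficients are the usual full-range ITU-R BT.601 ones. This is an illustration only, not Visual NLP's internal code.

```python
def clamp(value):
    """Round to the nearest integer and clip to the 8-bit range 0..255."""
    return max(0, min(255, int(round(value))))


def ybr_full_to_rgb(y, cb, cr):
    """Convert one 8-bit YBR FULL pixel to RGB (full-range BT.601 coefficients)."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return clamp(r), clamp(g), clamp(b)


def palette_to_rgb(indices, palette):
    """Replace each palette index with the RGB triple stored in the lookup table."""
    return [palette[i] for i in indices]


# Neutral chrominance (Cb = Cr = 128) leaves only the luminance, so the pixel is gray.
gray = ybr_full_to_rgb(128, 128, 128)

# Hypothetical three-entry palette: black, red, white.
sample_palette = [(0, 0, 0), (255, 0, 0), (255, 255, 255)]
row = palette_to_rgb([0, 1, 1, 2], sample_palette)
```

The palette case shows why PALETTE COLOR is compact: each pixel stores a single small index, and the three color components live only once in the table.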

Let’s extract frames from one of the test DICOM files:

dicom_to_image = DicomToImageV3() \
    .setInputCols(["content"]) \
    .setOutputCol("image")

result = dicom_to_image.transform(dicom_df)

|               image|exception|pagenum|                path|   modificationTime|length|
|{file:/Users/nmel...|         |      0|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      1|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      2|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      3|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      4|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      5|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      6|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      7|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      8|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |      9|file:/Users/nmeln...|2023-08-20 14:17:23|426776|
|{file:/Users/nmel...|         |     10|file:/Users/nmeln...|2023-08-20 14:17:23|426776|

Each frame gets its own row with an image, and the pagenum column contains the frame number. Let’s display the frames as images using the display_images function:

display_images(result, limit=2)

DicomToImageV3 can handle files with 1000 or more frames; the practical limit depends on available memory. While debugging a pipeline on files with many frames, it helps to extract only a few of them. To do this, set the frameLimit parameter:

dicom_to_image = DicomToImageV3() \
    .setInputCols(["content"]) \
    .setOutputCol("image") \
    .setFrameLimit(1)

result = dicom_to_image.transform(dicom_df)

|               image|exception|pagenum|                path|   modificationTime|length|
|{file:/Users/nmel...|         |      0|file:/Users/nmeln...|2023-08-20 14:17:23|426776|

To handle big files (2 GB or more), you must use path as the input column instead of content. The transformer then reads the file directly from the file system instead of having it loaded into the DataFrame.

dicom_to_image = DicomToImageV3() \
    .setInputCols(["path"]) \
    .setOutputCol("image")
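
One way to apply this rule in a script is to pick the input column from the file size before building the stage. The sketch below is only an illustration: the helper name choose_input_col and the exact cutoff constant are hypothetical, and only the "content" and "path" column names come from the examples above.

```python
# Threshold taken from the "2 GB or more" guidance above (interpreted here as GiB).
BIG_FILE_BYTES = 2 * 1024 ** 3


def choose_input_col(size_bytes):
    """Return "path" for big files and "content" otherwise."""
    return "path" if size_bytes >= BIG_FILE_BYTES else "content"


small = choose_input_col(426776)          # length of the test file shown earlier
large = choose_input_col(3 * 1024 ** 3)   # a hypothetical 3 GiB study
```

The returned name could then be passed to setInputCols when constructing DicomToImageV3.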


DicomDrawRegions draws regions on the frames of a DICOM file. It updates both pixel and overlay data.

It supports the same PhotometricInterpretations as DicomToImageV3.

Let’s build the simplest de-identification: detect text on the image and hide it. We already know how to extract frame images with DicomToImageV3. We need to set keepInput to True to be able to compare the results with the original images.

dicom_to_image = DicomToImageV3() \
    .setInputCols(["content"]) \
    .setOutputCol("image") \
    .setKeepInput(True)

Next, we need to detect text. We can use ImageTextDetectorV2 here:

text_detector = ImageTextDetectorV2 \
    .pretrained("image_text_detector_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("regions") \
    .setScoreThreshold(0.5) \
    .setTextThreshold(0.2)

As the final step, we draw filled rectangles using DicomDrawRegions:

draw_regions = DicomDrawRegions() \
    .setInputCol("path") \
    .setInputRegionsCol("regions") \
    .setOutputCol("dicom_cleaned") \
    .setRotated(True)
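
Conceptually, hiding text comes down to overwriting every pixel inside each detected rectangle. The following pure-Python sketch shows that idea on a tiny grayscale frame stored as a list of rows. The function fill_regions and the (x, y, width, height) region layout are hypothetical and ignore rotation; DicomDrawRegions works on the real pixel and overlay data and may be implemented quite differently.

```python
def fill_regions(frame, regions, value=0):
    """Return a copy of frame with every (x, y, width, height) box set to value."""
    height = len(frame)
    width = len(frame[0]) if frame else 0
    masked = [row[:] for row in frame]  # copy so the original frame is untouched
    for x, y, w, h in regions:
        # Clip the box to the frame so out-of-range regions cannot raise IndexError.
        for row in range(max(0, y), min(height, y + h)):
            for col in range(max(0, x), min(width, x + w)):
                masked[row][col] = value
    return masked


# A 4x6 frame of mid-gray pixels with one detected "text" box at x=1, y=1, 3 wide, 2 high.
frame = [[128] * 6 for _ in range(4)]
cleaned = fill_regions(frame, [(1, 1, 3, 2)])
```

Working on a copy mirrors the idea of keeping the input so that the original and cleaned frames can be compared side by side.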

To run this, we define a Spark ML pipeline and call it:

from pyspark.ml import PipelineModel

pipeline = PipelineModel(stages=[
    dicom_to_image,
    text_detector,
    draw_regions
])

result = pipeline.transform(dicom_df)

|       dicom_cleaned| exception|                path|             content|
|[52 75 62 6F 20 4...|          |file:/Users/nmeln...|[52 75 62 6F 20 4...|

Let’s display the original and cleaned DICOM files using the display_dicom function:

display_dicom(result, "content,dicom_cleaned", show_meta=False, limit_frame=2)

DicomDrawRegions also supports the following compressions:

  • RLELossless: A lossless compression method used in DICOM for medical image storage. It is based on run-length encoding, which stores each run of consecutive identical pixel values as a count followed by the value itself. Because the compression is lossless, the original image can be reconstructed exactly from the compressed data.
  • JPEGBaseline8Bit: The baseline variant of JPEG (Joint Photographic Experts Group) compression with 8 bits per color channel, which gives a 24-bit color depth for RGB images. This method is inherently lossy: it reduces file size by discarding some image data while aiming to maintain diagnostic image quality.
  • JPEGLSLossless: Lossless compression based on the JPEG-LS standard (ISO/IEC 14495-1), which is a separate standard from the original JPEG and its lossless mode. It stores and transmits medical images without any loss of quality by combining predictive coding, context modeling, and entropy coding, which makes it suitable for applications where preserving diagnostic image quality is paramount.
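
The run-length idea behind RLELossless can be illustrated with a few lines of Python. This is a deliberately simplified sketch that stores [count, value] pairs; the actual DICOM RLE transfer syntax uses a PackBits-style byte encoding over separate byte segments, which is not reproduced here. The function names are hypothetical.

```python
def rle_encode(pixels):
    """Collapse consecutive identical values into [count, value] pairs."""
    runs = []
    for value in pixels:
        if runs and runs[-1][1] == value:
            runs[-1][0] += 1          # extend the current run
        else:
            runs.append([1, value])   # start a new run
    return runs


def rle_decode(runs):
    """Expand [count, value] pairs back into the original sequence."""
    pixels = []
    for count, value in runs:
        pixels.extend([value] * count)
    return pixels


row = [0, 0, 0, 0, 255, 255, 17, 0, 0]
encoded = rle_encode(row)
restored = rle_decode(encoded)
```

Decoding returns exactly the input row, which is what "lossless" means here. The scheme pays off on images with large uniform areas, such as the black background of many scans.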

We can choose the compression by setting the compression parameter. To also compress pixel data in files that were stored uncompressed, set the forceCompress parameter to True:

draw_regions = DicomDrawRegions() \
    .setInputCol("content") \
    .setInputRegionsCol("regions") \
    .setOutputCol("dicom_cleaned") \
    .setRotated(True) \
    .setCompression(DicomCompression.RLELossless) \
    .setForceCompress(True)

The last step in today’s post is saving the results to files. To get the name of the original file from the path column, let’s define a function that we will wrap in a UDF:

def get_name(path, keep_subfolder_level=0):
    path = path.split("/")
    path[-1] = path[-1].split('.')[0]
    return "/".join(path[-keep_subfolder_level-1:])
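
To see what this helper returns, it can be called outside Spark on plain strings. The paths below are made up for illustration; the function body is the same as above.

```python
def get_name(path, keep_subfolder_level=0):
    path = path.split("/")
    path[-1] = path[-1].split('.')[0]  # keep only the part before the first dot
    return "/".join(path[-keep_subfolder_level-1:])


# Hypothetical input paths, not taken from the post's dataset.
name_only = get_name("file:/data/study1/scan_01.dcm")
with_folder = get_name("file:/data/study1/scan_01.dcm", keep_subfolder_level=1)
dotted = get_name("file:/data/study1/scan.v2.dcm")
```

Here name_only is "scan_01" and with_folder is "study1/scan_01". Note that a file name containing several dots is cut at the first one, so scan.v2.dcm becomes "scan".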

To save the DataFrame with the cleaned DICOM files to output_path using the binaryFormat data source, we need to specify a few options:

  1. ‘type’: the type of the file.
  2. ‘field’: the column that contains the DICOM file.
  3. ‘prefix’: a prefix for the file names.
  4. ‘nameField’: the column that contains the file’s name.

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

output_path = "./deidentified/"

result.withColumn("fileName", udf(get_name, StringType())(col("path"))) \
    .write \
    .format("binaryFormat") \
    .option("type", "dicom") \
    .option("field", "dicom_cleaned") \
    .option("prefix", "ocr_") \
    .option("nameField", "fileName") \
    .mode("overwrite") \
    .save(output_path)

The Jupyter Notebook with the full code can be found here.

In this post, we have constructed the simplest de-identification pipeline to conceal all text in DICOM images. In the next post, we will create a more complex pipeline using NER de-identification models with Spark NLP and Spark NLP for Healthcare.

