Medical Images De-Identification

Two previous blogs discussed medical imaging challenges. The first gave a brief introduction to medical imaging, its different modalities, the standards used (DICOM), efficiency measures, and what is meant by Picture Archiving and Communication System (PACS) and Radiology Information System (RIS).

The second blog discussed the ontologies in radiology.

The first blog also listed some publicly available medical image repositories, which may contain thousands or millions of medical images of high value to healthcare research organizations and institutes.

Some preparatory steps should be taken before their use. The images should pass some quality checks before being marked ready for use. One of those important quality standards is the protection of any personal health information.

The use of personal health information in research is regulated by the “Privacy Rule”.

The Privacy Rule is a Federal regulation under the Health Insurance Portability and Accountability Act (HIPAA) of 1996. [1]

It protects specific health information that could identify living or deceased individuals. The rule preserves patient confidentiality while keeping the values and information that different research purposes may need.

This blog attempts to explain the different methodologies used to de-identify health information in compliance with the HIPAA de-identification rules, with a focus on medical imaging (DICOM) data.

Before proceeding further, the reader should understand what is meant by a covered entity and a hybrid entity. A covered entity is a healthcare entity that transmits electronic health information in connection with a transaction for which the Department of Health and Human Services (HHS) has adopted a standard.

A hybrid entity, on the other hand, is a covered entity that performs both covered and non-covered functions according to the Privacy Rule.

A covered entity de-identifies data by removing 18 elements. These are the elements that could help identify the patient or the patient's relatives, employers, or household members.

Those 18 elements include the following [1]:

  • Numbers: telephone, fax, Social Security, medical record, health plan beneficiary, account, certificate/license, vehicle license plate, and device serial numbers, as well as Internet Protocol (IP) addresses.
  • Electronic mail addresses.
  • Universal Resource Locators (URLs).
  • Biometric Identifiers (e.g.: fingerprints or voiceprints).
  • Full-face photographs.
  • All elements of dates related to an individual (except the year).
  • All elements (including year) for ages over 89. Such ages may be aggregated into the age category (90 or older).
  • With regard to geographic data, the following should be removed:

– All subdivisions smaller than a state.

– ZIP Codes and their equivalent geocodes.

  • The initial three digits of a ZIP Code may be kept if, according to Bureau of the Census data, the following conditions are met:

– The geographic unit formed by combining all ZIP Codes with the same three initial digits contains more than 20,000 people.

– For all units containing 20,000 or fewer people, the initial three digits are changed to 000.
  • Any other unique identifying number, characteristic, or code, except a code that the covered entity assigns to permit re-identification. [1]
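
Three of the rules above (dates, ages over 89, and ZIP Codes) are generalizations rather than plain removals, and they can be sketched in a few lines of Python. This is a minimal illustration, not a compliance tool: the function names and the `ZIP3_POPULATION` table are hypothetical, and the population figures are made up. Real figures would come from current Bureau of the Census data.

```python
from datetime import date

# Hypothetical lookup: population covered by each 3-digit ZIP prefix.
# The numbers are invented for illustration only.
ZIP3_POPULATION = {"100": 1_500_000, "036": 12_000}


def generalize_date(d: date) -> str:
    """Keep only the year of a date related to an individual."""
    return str(d.year)


def generalize_age(age: int) -> str:
    """Aggregate every age over 89 into a single '90+' category."""
    return "90+" if age > 89 else str(age)


def generalize_zip(zip_code: str) -> str:
    """Keep the 3-digit prefix only if it covers more than 20,000 people."""
    prefix = zip_code[:3]
    if ZIP3_POPULATION.get(prefix, 0) > 20_000:
        return prefix
    # Small or unknown prefixes are replaced by 000.
    return "000"
```

Treating an unknown prefix as a small one is a deliberately conservative choice in this sketch. Note also that for patients over 89, date elements including the year would have to be dropped as well, which `generalize_date` does not handle.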

Sometimes statistical methods are used for de-identification without removing the 18 identifiers. In that case, the whole process and methodology must be certified and documented by a qualified expert. The issued certificate should be kept in either paper or electronic format for at least 6 years.

An Overview of the Available Anonymization Tools

Over the last two decades, many research institutes, organizations, and software houses have tried to put the HIPAA requirements into practice by developing tools that perform de-identification automatically.

The following section highlights some of these efforts with a brief overview of the most prominent projects and products:

1- PrivacyGuard or DICOM Confidential (Open Source):

The main features can be summarized as follows:

  • It works on multiple platforms.
  • Provides an anonymizer.
  • Capable of importing from different sources (DICOM or file system) and exporting to DICOM or the local file system, or transferring the data over SFTP.
  • Configurable removal of burned-in annotations.
  • The Privacy Policy is formulated as an XML (eXtensible Markup Language) document that defines de-identification rules.
  • The de-identification rules are implemented by Java classes and distributed in signed JAR files (libraries).

2- Sante DICOM Editor (Commercial)

  • Works on Windows only.
  • Provides an anonymizer.
  • Capable of processing a single file or multiple files in a folder.
  • Configurable removal of burned-in annotations (up to four rectangles).
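
Rectangle-based removal of burned-in annotations, as listed for the tools above, comes down to overwriting user-defined regions of the pixel data. The sketch below shows the idea on a plain 2D list of grayscale values. The `black_out` function and the `(top, left, height, width)` rectangle layout are assumptions made for illustration and do not reflect how any of these tools is implemented.

```python
from typing import List, Tuple

# (top, left, height, width) in pixels; this layout is an assumption.
Rect = Tuple[int, int, int, int]


def black_out(pixels: List[List[int]], zones: List[Rect], fill: int = 0) -> List[List[int]]:
    """Return a copy of a 2D grayscale image with every zone set to `fill`."""
    out = [row[:] for row in pixels]  # copy so the original stays untouched
    n_rows = len(out)
    n_cols = len(out[0]) if out else 0
    for top, left, height, width in zones:
        # Clip each rectangle to the image bounds before writing.
        for r in range(max(top, 0), min(top + height, n_rows)):
            for c in range(max(left, 0), min(left + width, n_cols)):
                out[r][c] = fill
    return out


image = [[255] * 6 for _ in range(4)]       # 4 x 6 all-white test image
cleaned = black_out(image, [(0, 0, 1, 6)])  # mask the whole top row
```

Returning a copy keeps the original pixels available for quality checks before the identifiable version is discarded.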

3- Clinical Trials Processor (CTP) (Open Source)

  • Developed by the RSNA (Radiological Society of North America).
  • A stand-alone application.
  • Anonymization can be configured through a scripting language.
  • Suitable for clinical trials.

It has some limitations:

  • It cannot be installed as a Windows service.
  • Not capable of dealing with burned-in annotations.

4- Google Cloud Healthcare API

Google provides a tutorial that walks through the de-identification of DICOM data with the Cloud Healthcare API.

How to Select the Best Anonymization Tool?

Look for the following features:

  • Runs on different operating systems.
  • Can black out user-defined zones of the images.
  • Automatically identifies DICOM files under a specific folder.
  • Can receive DICOM objects sent using DICOM ‘push’.
  • Can transfer the de-identified data to remote devices using SFTP and DICOM.
  • Provides a DICOM encryption mechanism.
  • Generates audit logs.
  • Preserves tracking information in the DICOM header.

If a tool offers all of these features, you can most probably use it safely.
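
Two of the features above, audit logs and tracking information in the header, can be illustrated with a small sketch. It treats a DICOM header as a plain dictionary keyed by attribute keyword, which is a simplification. The `scrub_header` function and the short `IDENTIFYING_KEYWORDS` list are hypothetical, and reading "tracking information" as the `PatientIdentityRemoved` and `DeidentificationMethod` attributes is an assumption. A real tool would cover far more attributes.

```python
from datetime import datetime, timezone

# Illustrative subset of identifying attributes, named by DICOM keyword.
IDENTIFYING_KEYWORDS = ("PatientName", "PatientID", "PatientBirthDate", "PatientAddress")


def scrub_header(header: dict, method: str = "sketch-v0") -> tuple:
    """Return (clean_header, audit_entries) without mutating the input."""
    clean = dict(header)
    audit = []
    for keyword in IDENTIFYING_KEYWORDS:
        if keyword in clean:
            del clean[keyword]
            # Log what was done, never the removed value itself.
            audit.append({
                "time": datetime.now(timezone.utc).isoformat(),
                "keyword": keyword,
                "action": "removed",
            })
    # Record in the header that de-identification took place.
    clean["PatientIdentityRemoved"] = "YES"
    clean["DeidentificationMethod"] = method
    return clean, audit


header = {"PatientName": "DOE^JANE", "PatientID": "12345", "Modality": "CT"}
clean, audit = scrub_header(header)
```

The audit entries deliberately store only the attribute keyword and a timestamp, so the log cannot itself leak protected health information.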

Ready-Made Solutions

If your research project is on a tight schedule or the research team lacks the necessary experience, using an application (whether commercial or open source) might not be the best option for you. Instead, you can have the work done for you by one of the specialized companies in the market.

John Snow Labs is one of the well-known companies in the field. Its team of experts has wrangled and curated thousands of valuable datasets, and de-identifying DICOM data is among the team’s interests.

The John Snow Labs catalog has many interesting related datasets. Among the medical imaging datasets, you can find a valuable dataset on imaging studies.

An imaging study is the representation of the content produced in a Digital Imaging and Communications in Medicine (DICOM) imaging study. A study comprises a set of series, each of which includes a set of Service-Object Pair Instances (SOP Instances: images or other data) acquired or produced in a common context. A series has only one modality (e.g. X-ray, CT, MR, ultrasound), but a study may have multiple series of different modalities.
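
The study, series, and instance hierarchy described above can be sketched as nested data classes. This is only an illustration of the structure. The class and field names are assumptions and do not reflect the schema of the John Snow Labs dataset.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class SOPInstance:
    sop_instance_uid: str  # one image or other data object


@dataclass
class Series:
    series_uid: str
    modality: str  # a series has exactly one modality, e.g. "CT"
    instances: List[SOPInstance] = field(default_factory=list)


@dataclass
class ImagingStudy:
    study_uid: str
    series: List[Series] = field(default_factory=list)

    def modalities(self) -> Set[str]:
        """A study may mix modalities across its series."""
        return {s.modality for s in self.series}


study = ImagingStudy("study-1", [
    Series("series-1", "CT", [SOPInstance("img-1"), SOPInstance("img-2")]),
    Series("series-2", "MR", [SOPInstance("img-3")]),
])
```

Here `study.modalities()` collects the modality of every series, which reflects the rule that modality is fixed per series but not per study.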

In addition, John Snow Labs services can be extended to develop Natural Language Processing solutions for medicine and radiology.


[1] Department of Health and Human Services. Protecting Personal Health Information in Research: Understanding the HIPAA Privacy Rule. DHHS. 2003. 32 p.
