
Unified Medical Language System (UMLS)

What is UMLS?

The UMLS is a set of files and software that brings together many biomedical terminologies and standards to support seamless integration and interoperability between different systems.

As discussed before, there are always barriers to successfully retrieving machine-readable information. One of them is that the same concept can be expressed in different ways across different databases and systems.

UMLS supports interoperability of drug names, billing codes, health information, and medical terms across different computer systems. This in turn helps with reliable data mining and with producing accurate healthcare statistical reports.

The UMLS has three tools (known as “Knowledge Sources”) that can be used separately or together, as described below.


Metathesaurus

The Metathesaurus is a large multilingual vocabulary database that contains information about biomedical and health-related concepts, their various names, and the relationships among them.

It brings together terms and codes from many vocabularies, including MeSH®, MedDRA, RxNorm, and SNOMED CT® (English and Spanish).

The 2019AB Metathesaurus contains approximately 4.26 million concepts and 15.2 million unique concept names from 211 terminologies (source vocabularies).

The Metathesaurus is populated from what are called “source vocabularies”. These can be derived from lists of controlled terms used in patient care, health services billing, public health statistics, or any clinical or health services research. The lists used to build the Metathesaurus need high-quality data that is clean and presented in a standardized format.
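To make the idea of many names sharing one concept more concrete, here is a minimal Python sketch. It assumes the names arrive as pipe-delimited rows of the form "concept identifier | source vocabulary | name". That simplified layout, the placeholder identifiers, the source labels, and the function names are all illustrative assumptions, not the actual Metathesaurus file format.

```python
from collections import defaultdict

# Illustrative rows only: "concept id | source vocabulary | name".
# Identifiers and source labels are placeholders, not real Metathesaurus values.
SAMPLE_ROWS = [
    "X0000001|SOURCE_A|Myocardial Infarction",
    "X0000001|SOURCE_B|Heart attack",
    "X0000001|SOURCE_C|Infarction, Myocardial",
    "X0000002|SOURCE_A|Amoxicillin",
]


def group_names_by_concept(rows):
    """Map each concept identifier to the (source, name) pairs recorded for it."""
    concepts = defaultdict(list)
    for row in rows:
        concept_id, source, name = row.split("|")
        concepts[concept_id].append((source, name))
    return dict(concepts)


def build_name_index(rows):
    """Map each lower-cased name to the concept identifier it belongs to."""
    index = {}
    for row in rows:
        concept_id, _source, name = row.split("|")
        index[name.lower()] = concept_id
    return index


concepts = group_names_by_concept(SAMPLE_ROWS)
index = build_name_index(SAMPLE_ROWS)

# Two systems that use different names end up at the same concept identifier.
print(index["heart attack"] == index["myocardial infarction"])
```

The point of the sketch is the lookup direction: once every name resolves to a shared identifier, records from different systems can be joined on that identifier rather than on free-text names.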

Semantic Network

It is concerned with broad categories (semantic types) and their relationships (semantic relations). It contains 135 broad categories and 54 relationships between them; the ‘isa’ (is a) relationship can be considered the primary link between most of the semantic types.

Every concept in the Metathesaurus is assigned at least one semantic type from the Semantic Network, where each type is defined by a textual description and by the information inherent in its position in the hierarchy.
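The role of the ‘isa’ link can be sketched as a simple parent lookup. The hierarchy fragment below is hand-written for illustration and should be checked against the Semantic Network itself; the names `ISA_PARENT`, `ancestors`, and `is_a` are hypothetical helpers, not part of any UMLS tool.

```python
# Child -> parent links for a small, illustrative fragment of an 'isa' hierarchy.
# This fragment is an assumption for demonstration purposes.
ISA_PARENT = {
    "Antibiotic": "Pharmacologic Substance",
    "Pharmacologic Substance": "Chemical Viewed Functionally",
    "Chemical Viewed Functionally": "Chemical",
    "Chemical": "Substance",
    "Substance": "Physical Object",
    "Physical Object": "Entity",
}


def ancestors(semantic_type, parents=ISA_PARENT):
    """Return the chain of broader types reached by following 'isa' links upward."""
    chain = []
    current = semantic_type
    while current in parents:
        current = parents[current]
        chain.append(current)
    return chain


def is_a(semantic_type, candidate, parents=ISA_PARENT):
    """True if semantic_type equals candidate or inherits from it via 'isa'."""
    return semantic_type == candidate or candidate in ancestors(semantic_type, parents)


print(ancestors("Antibiotic"))
print(is_a("Antibiotic", "Chemical"))
```

Because a type inherits from everything above it, a query for a broad category such as "Chemical" can also pick up concepts tagged with any narrower type beneath it.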

SPECIALIST Lexicon and Lexical Tools

It is concerned with parts of speech, variant information, and programs for language processing. It contains 200,000 lexical items.
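As a rough illustration of why variant handling matters, the sketch below reduces surface variants of a term to one comparable key. It is a deliberately simplified stand-in written for this article: it only lower-cases, drops punctuation, and sorts words, and it is not the behavior of the SPECIALIST lexical tools. The function name `normalize_term` is hypothetical.

```python
import re


def normalize_term(term):
    """Build a crude comparison key: lower-case, strip punctuation, sort the words.

    Simplified assumption for illustration: ASCII letters and digits only,
    with no handling of inflection, spelling variants, or stop words.
    """
    words = re.sub(r"[^a-z0-9]+", " ", term.lower()).split()
    return " ".join(sorted(words))


variants = ["Myocardial Infarction", "Infarction, Myocardial", "myocardial  infarction"]
keys = {normalize_term(v) for v in variants}

# All three spellings collapse to a single key.
print(keys)
```

A real lexical tool goes much further, but even this crude key shows how word order, case, and punctuation differences can be removed before terms are matched.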

Most healthcare research projects need high-quality data that is not only clean but also follows specific standards or is available in a specific format. Having clean data that is compatible with UMLS standards is often one of the constraints on a healthcare project. For a researcher, obtaining data of this quality may consume about 60% of their time.

John Snow Labs is considered one of the leading organizations offering a catalog of diverse datasets, including many UMLS datasets. These healthcare NLP datasets are reviewed both manually and by machine.

The JSL catalog contains two interesting datasets that are worth exploring to get a better understanding of the topic.

The first dataset provides information on the relationships between concepts or atoms known to the Metathesaurus for the semantic type “Antibiotic”. For asymmetrical relationships, one row is given for each direction of the relationship.
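The "one row per direction" convention can be sketched as follows. The relationship labels, the `INVERSE` table, the placeholder identifiers, and both function names are assumptions made for illustration; they are not the dataset's actual column names or values.

```python
# Assumed inverse labels for a few asymmetrical relationship types.
INVERSE = {
    "broader_than": "narrower_than",
    "narrower_than": "broader_than",
    "parent_of": "child_of",
    "child_of": "parent_of",
}


def expand_both_directions(relations):
    """Emit one row per direction for each (source, label, target) relationship.

    Labels without an entry in INVERSE are treated as symmetrical and reused.
    """
    rows = []
    for source, label, target in relations:
        rows.append((source, label, target))
        rows.append((target, INVERSE.get(label, label), source))
    return rows


def relationships_of(rows, concept_id):
    """Find every relationship of a concept by scanning the first column only."""
    return [row for row in rows if row[0] == concept_id]


# Placeholder identifiers, not real concept identifiers.
rows = expand_both_directions([
    ("X0000010", "parent_of", "X0000011"),
    ("X0000011", "sibling_of", "X0000012"),
])

print(relationships_of(rows, "X0000011"))
```

Storing both directions doubles the row count for asymmetrical relationships, but it means a lookup on a single column is enough to find everything a concept is related to.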

The other dataset provides information about the entire concept structure of the Unified Medical Language System (UMLS) Metathesaurus for the semantic type “Antibiotic”.

This dataset connects the different names of all the concepts for a specific semantic type. The relation between a Metathesaurus concept and its semantic types is one-to-many; some concepts are assigned as many as five semantic types.
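The one-to-many relation between a concept and its semantic types can be sketched with a pair of small lookups. The rows, the placeholder identifiers, and the function names below are illustrative assumptions, not values taken from the dataset.

```python
from collections import defaultdict

# Illustrative (concept id, semantic type) pairs; identifiers are placeholders.
TYPE_ROWS = [
    ("X0000020", "Antibiotic"),
    ("X0000020", "Organic Chemical"),
    ("X0000021", "Antibiotic"),
    ("X0000022", "Organic Chemical"),
]


def types_by_concept(rows):
    """Map each concept identifier to the list of semantic types assigned to it."""
    result = defaultdict(list)
    for concept_id, semantic_type in rows:
        result[concept_id].append(semantic_type)
    return dict(result)


def concepts_with_type(rows, semantic_type):
    """Return the sorted concept identifiers that carry the given semantic type."""
    return sorted({cid for cid, stype in rows if stype == semantic_type})


print(types_by_concept(TYPE_ROWS)["X0000020"])
print(concepts_with_type(TYPE_ROWS, "Antibiotic"))
```

Filtering by a single semantic type, as the datasets above do for “Antibiotic”, still returns concepts that also carry other types, which is worth keeping in mind when counting or joining rows.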
