
Mining the Surveillance, Epidemiology, and End Results (SEER) Registries Case Study: Oral Malignant Melanoma (OMM)

Getting to Know SEER

The Surveillance, Epidemiology, and End Results (SEER) Program is run by the National Cancer Institute (NCI). It provides information on cancer incidence and survival in the United States. The National Center for Health Statistics is responsible for the mortality rates provided by SEER.

Analyzing data from the SEER database can yield a wide range of research findings, provided that we follow a sound search strategy and research methodology.

One of the benefits is the ability to extract the prognostic factors for a specific medical condition and use them to predict the outcome of that condition.

Prognostic Factors

Prognosis is the prediction of how a disease is likely to progress: whether its signs and symptoms will improve or get worse, and how fast this could happen. It also covers the expected effect of the disease on quality of life and life expectancy, and the probability of developing disease-related complications.

Prognostic factors are variables found to correlate strongly with the outcome of a disease, such as recovery, relapse, or the development of disease-related complications.

There are some points that you need to learn about and take into consideration before working with the SEER database:

1- International Classification of Diseases (ICD)

Prepare and try all the relevant keywords in your search strategy to obtain the best results. Your keywords must comply with the International Classification of Diseases (ICD) standards. If you are working on oncology research, use the International Classification of Diseases for Oncology (ICD-O).

The John Snow Labs catalog might be of great help here. It contains complete datasets for both ICD-10 and ICD-O.

2- Determine your exact research needs (what to extract)

The SEER database contains many fields, and not all of them will matter to your research. Select only the fields that are relevant to your research needs. You might only need specific data (e.g. demographic data, the primary site of the tumor, and the tumor extent).
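As a rough illustration of this step, the sketch below keeps only a chosen subset of fields from a small CSV extract. The column names and values are hypothetical and made up for the example; real SEER exports use their own field names.

```python
import csv
import io

# Hypothetical column names and values, for illustration only;
# real SEER exports use their own field names.
RAW_CSV = """patient_id,age_at_diagnosis,sex,primary_site,tumor_extent,marital_status,insurance
1,67,F,Hard palate,Localized,Married,Insured
2,54,M,Upper gum,Regional,Single,Uninsured
"""

# Fields judged relevant to the research question.
RELEVANT_FIELDS = ["age_at_diagnosis", "sex", "primary_site", "tumor_extent"]


def select_fields(rows, fields):
    """Keep only the requested fields from each record."""
    return [{name: row[name] for name in fields} for row in rows]


records = list(csv.DictReader(io.StringIO(RAW_CSV)))
selected = select_fields(records, RELEVANT_FIELDS)
print(selected[0])
```

Deciding on the field list before loading the data keeps the working dataset small and avoids carrying fields that would later need de-identification.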

3- Data De-Identification

Any use of the data provided by SEER must comply with HIPAA regulations and respect patient confidentiality.

Dates, geographical location, medical record numbers and physician names should be de-identified before usage in any research work.

There are different ways to do this. You can write a tailor-made Python script or search online for an existing one, since others have already developed Python scripts for the same purpose.

Date of birth (DOB) can be converted to age, while patient names and medical record numbers are removed completely. A medical record number can be replaced by an index number that serves as the unique identifier for the record.

Physician names are removed completely, while hospital geolocation data is replaced by an index number (a unique identifier).

Other critical data that might reveal the identity of the patient or that might be misused by others should be completely removed (telephone numbers, fax numbers, electronic mail addresses, or social security numbers).
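The rules above can be sketched in a few lines of Python. This is only a minimal illustration on a hypothetical record layout with made-up values, not a HIPAA-compliant tool: the field names, the `deidentify` helper, and the reference date are all assumptions.

```python
from datetime import date

# Hypothetical record layout and values, for illustration only.
records = [
    {"name": "Jane Doe", "mrn": "MRN-00731", "dob": date(1950, 3, 14),
     "physician": "Dr. A", "phone": "555-0100", "primary_site": "Hard palate"},
    {"name": "John Roe", "mrn": "MRN-01892", "dob": date(1962, 11, 2),
     "physician": "Dr. B", "phone": "555-0101", "primary_site": "Upper gum"},
]

# Identifying fields that must not appear in the output.
FIELDS_TO_DROP = {"name", "physician", "phone", "mrn", "dob"}


def age_on(dob, reference):
    """Age in whole years on the reference date."""
    had_birthday = (reference.month, reference.day) >= (dob.month, dob.day)
    return reference.year - dob.year - (0 if had_birthday else 1)


def deidentify(rows, reference):
    """Drop identifiers, replace the MRN by an index and the DOB by an age."""
    cleaned = []
    for index, row in enumerate(rows, start=1):
        kept = {k: v for k, v in row.items() if k not in FIELDS_TO_DROP}
        kept["record_id"] = index                     # index number replaces the MRN
        kept["age"] = age_on(row["dob"], reference)   # age replaces the DOB
        cleaned.append(kept)
    return cleaned


deidentified = deidentify(records, reference=date(2020, 1, 1))
print(deidentified)
```

Listing the fields to drop explicitly makes the script easy to audit, but any new identifying field must be added to that list by hand.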

Data de-identification is not an easy process, and it would need a complete blog post of its own to discuss.

4- Data Visualization

Your findings must be presented clearly to readers. Most oncology studies focus on determining the overall survival (OS) and disease-specific survival (DSS) after diagnosis. These can be shown to readers by plotting Kaplan-Meier curves, which is easy to do with the Python library "lifelines".

You first have to import the fitter class from this library:

from lifelines import KaplanMeierFitter
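To make clear what a Kaplan-Meier curve actually computes, here is a minimal pure-Python sketch of the product-limit estimator on made-up follow-up data. It is not how lifelines is implemented; `KaplanMeierFitter` produces the same kind of estimate when fitted on durations and event indicators, and adds confidence intervals and plotting. The `kaplan_meier` function name and the data are assumptions for illustration.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival probability)] at each observed event time.

    durations: follow-up time per patient (e.g. survival in months)
    events:    1 if the event (death) was observed, 0 if censored
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            # Multiply by the fraction of at-risk patients surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Patients who died or were censored at t leave the risk set.
        at_risk -= leaving
    return curve


# Made-up follow-up data, for illustration only.
months = [5, 8, 8, 12, 20, 24]
observed = [1, 1, 0, 1, 0, 1]

for t, s in kaplan_meier(months, observed):
    print(f"month {t}: S(t) = {s:.3f}")
```

Censored patients (event indicator 0) never lower the curve themselves; they only shrink the number at risk for later time points.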

Most real-life oncology studies follow the roadmap above. The SEER registries have transformed our knowledge of tumors, their causes, their prognosis, and effective treatments.

The following is one of the studies that used the SEER registries to determine the prognostic factors which affect the overall survival (OS) and disease-specific survival (DSS) for Oral Malignant Melanoma (OMM).

Case Study: Oral Malignant Melanoma (OMM)

OMM is a malignant tumor originating from neural crest-derived melanocytes in the basal layer of the oral mucous membrane. It is highly aggressive, with high metastatic activity.

The authors of this study [1] used the SEER database to analyze both the patient and disease characteristics to determine their effect on the overall survival (OS) and disease-specific survival (DSS) rates.

As explained before, the search keywords were determined first. The authors used histologic codes from the International Classification of Diseases for Oncology (ICD-O), Third Edition, covering the following terms and codes:

malignant melanoma, not otherwise specified (8720/3); nodular melanoma (8721/3); amelanotic melanoma (8730/3); superficial spreading melanoma (8743/3); desmoplastic melanoma, malignant (8745/3); and mucosal lentiginous melanoma (8746/3).
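A code list like this translates directly into a record filter. The sketch below uses the six codes above; the record layout and the `histology` field name are hypothetical, and the example records are made up.

```python
# ICD-O-3 histology codes listed in the study; the trailing /3 is the
# behavior code for a malignant primary tumor.
OMM_HISTOLOGY_CODES = {
    "8720/3": "malignant melanoma, not otherwise specified",
    "8721/3": "nodular melanoma",
    "8730/3": "amelanotic melanoma",
    "8743/3": "superficial spreading melanoma",
    "8745/3": "desmoplastic melanoma, malignant",
    "8746/3": "mucosal lentiginous melanoma",
}

# Hypothetical records, for illustration only.
records = [
    {"record_id": 1, "histology": "8720/3"},
    {"record_id": 2, "histology": "8070/3"},  # squamous cell carcinoma, excluded
    {"record_id": 3, "histology": "8746/3"},
]

# Keep only the records whose histology code is on the melanoma list.
melanoma_cases = [r for r in records if r["histology"] in OMM_HISTOLOGY_CODES]

for case in melanoma_cases:
    print(case["record_id"], OMM_HISTOLOGY_CODES[case["histology"]])
```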

Next, the relevant fields were selected. The authors decided to select the “age at diagnosis, year at diagnosis, sex, race, histologic subtype, primary site, tumor extent and size from both collaborative stage and extent of disease coding methods, treatment with surgery, treatment with radiation, county socioeconomic status (SES), survival in months, and cause of death”. [1]
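Two of these fields, survival in months and cause of death, are what a survival analysis is built on. Under the usual definitions, OS counts death from any cause as the event, while DSS counts only death from the disease and treats other deaths as censored. The sketch below shows that coding on made-up patients; the cause-of-death labels are hypothetical, and the study's own coding is not described here.

```python
# Hypothetical cause-of-death labels and values, for illustration only.
patients = [
    {"months": 14, "cause_of_death": "melanoma"},
    {"months": 30, "cause_of_death": "other"},
    {"months": 60, "cause_of_death": None},  # alive at last follow-up
]


def os_event(patient):
    """Overall survival: death from any cause counts as the event."""
    return 1 if patient["cause_of_death"] is not None else 0


def dss_event(patient):
    """Disease-specific survival: only death from the disease counts;
    deaths from other causes are treated as censored."""
    return 1 if patient["cause_of_death"] == "melanoma" else 0


os_events = [os_event(p) for p in patients]
dss_events = [dss_event(p) for p in patients]
print("OS events: ", os_events)
print("DSS events:", dss_events)
```

The same durations are used for both analyses; only the event indicator changes.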

The next step was de-identifying the data, as explained before. The final step was to plot the Kaplan-Meier curves to estimate OS and DSS. Plotting the curves correctly made it possible to determine how specific factors correlate with the progress of the disease. These correlations can be summarized as follows:

Some factors correlate strongly with a decreased survival rate, such as lower socioeconomic status (SES), increased age, a greater extent of disease, and larger tumor size.

Other factors appeared to correlate strongly with an increased survival rate, such as the decade of diagnosis, surgical treatment, and radiation therapy.

One of the most important factors associated with survival appeared to be the extent of the disease at the time of diagnosis.

This case study indicates how important early diagnosis of this disease is, and how valuable the information from the SEER registries can be if we follow the right research methodology.


[1] Lee RJ, Lee SA, Lin T, Lee KK, Christensen RE. Determining the epidemiologic, outcome, and prognostic factors of oral malignant melanoma by using the Surveillance, Epidemiology, and End Results database. J Am Dent Assoc [Internet]. 2017;148(5):288–97. Available from:
