Gene Disease Associations

$179 / year

This dataset contains relationships between genes and diseases. Each relationship was inferred because the gene and the disease independently share a relationship with the same chemical. The inferences were made through curation of research publications, construction of network diagrams, and statistical analysis.


This dataset from the Comparative Toxicogenomics Database (CTD) provides several standardized identifiers for each gene and disease. These identifiers give cross-platform compatibility: they make it possible to look up the gene and the disease in major scientific databases and to locate the references for the research on which the inference was based. The dataset also provides an inference score, which helps determine the importance of each inference.

Chemicals are among the main environmental factors that influence health, and the ways in which they cause disease are not fully understood. The purpose of the Comparative Toxicogenomics Database (CTD) is to provide a tool for generating new hypotheses about the mechanisms by which chemicals contribute to the development of disease. It does this by collecting curated data reported in the scientific literature on chemicals, genes and diseases, and by making inferences about the relationships among these three elements. This is accomplished through transitive inference. For example, when a chemical and a disease share interactions with one or more genes, a relationship between the chemical and the disease is inferred, linked to a process or product of those genes. From this information one can further infer the mechanism of action of the chemical on the gene in producing the disease, the genes linked to the disease, the pathophysiology of the disease, and other relationships. “For example, if chemical A interacts with gene B, and independently gene B is associated with disease C, then chemical A is inferred to have a relationship with disease C (via gene B).” (1) Inferences can also be made in other directions; for example, a gene and a disease may share the same group of chemicals. Some inferences are also supported by direct evidence, meaning published research that demonstrates the relationship. Others have no direct evidence in the literature and can be used to create new testable hypotheses about the mechanism of disease, initiate new research on the relationship, and potentially predict disease treatment and prevention.
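The gene-to-disease direction of transitive inference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the interaction lists are toy data (not taken from CTD), and the function name `infer_gene_disease` is hypothetical rather than part of any CTD tool.

```python
from collections import defaultdict

# Toy curated interactions (hypothetical, for illustration only).
chemical_gene = [
    ("chemical A", "gene B"),
    ("chemical D", "gene B"),
    ("chemical D", "gene E"),
]
chemical_disease = [
    ("chemical A", "disease C"),
    ("chemical D", "disease C"),
]

def infer_gene_disease(chemical_gene, chemical_disease):
    """Return {(gene, disease): set of shared chemicals supporting the inference}."""
    # Index the genes each chemical interacts with.
    genes_by_chemical = defaultdict(set)
    for chemical, gene in chemical_gene:
        genes_by_chemical[chemical].add(gene)

    # A gene and a disease are linked when they share at least one chemical.
    inferred = defaultdict(set)
    for chemical, disease in chemical_disease:
        for gene in genes_by_chemical.get(chemical, ()):
            inferred[(gene, disease)].add(chemical)
    return dict(inferred)

inferences = infer_gene_disease(chemical_gene, chemical_disease)
for (gene, disease), chemicals in sorted(inferences.items()):
    print(gene, "->", disease, "via", sorted(chemicals))
```

In this toy input, gene B is linked to disease C through two chemicals and gene E through one. The number of shared chemicals is the kind of information a scoring step could later take into account.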

The CTD datasets can be used to build a query tool that returns inferred relationships between genes, chemicals and diseases together with the significance of each inference. To prioritize inferences, CTD uses the inference score, which ranks how likely the inferred relationship is to be true. The score is derived from a network in which the chemicals, genes and diseases are nodes and the relationships between them (inferences) are edges. The statistical analysis takes into account the number of nodes (genes, diseases or chemicals) that interact with the node of interest, the number of inferences with direct evidence, and the location of the node of interest, using the hypergeometric clustering coefficient and common-neighbor statistics. Inferences should then be ranked from highest to lowest inference score; those with the highest scores are the most significant.
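The ranking step can be illustrated with a short Python sketch. It does not compute the inference score; it only orders records that already carry one. The records reuse values from the sample rows of this dataset, while the function name `rank_inferences` and the `min_score` cut-off are assumptions introduced for the example.

```python
# A few records with values taken from the sample rows of this dataset.
records = [
    {"Disease_Name": "Anemia",
     "Inference_Chemical_Name": "Water Pollutants, Chemical",
     "Inference_Score": 4.21},
    {"Disease_Name": "Breast Neoplasms",
     "Inference_Chemical_Name": "Endocrine Disruptors",
     "Inference_Score": 8.68},
    {"Disease_Name": "Birth Weight",
     "Inference_Chemical_Name": "Endocrine Disruptors",
     "Inference_Score": 5.72},
    {"Disease_Name": "Death",
     "Inference_Chemical_Name": "Water Pollutants, Chemical",
     "Inference_Score": 4.85},
]

def rank_inferences(records, min_score=0.0):
    """Sort records from highest to lowest Inference_Score.

    min_score is a hypothetical cut-off; records scoring below it are dropped.
    """
    kept = [r for r in records if r["Inference_Score"] >= min_score]
    return sorted(kept, key=lambda r: r["Inference_Score"], reverse=True)

ranked = rank_inferences(records, min_score=4.5)
for r in ranked:
    print(f'{r["Inference_Score"]:.2f}  {r["Disease_Name"]}  '
          f'({r["Inference_Chemical_Name"]})')
```

A fixed threshold is only one way to shortlist inferences. Keeping the top N records per gene would be an equally reasonable choice.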

1. Davis AP, Grondin CJ, Johnson RJ, Sciaky D, King BL, McMorran R, Wiegers J, Wiegers TC, Mattingly CJ. The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Res. 2016 Sep 19;[Epub ahead of print]

Date Created


Last Modified




Update Frequency


Temporal Coverage


Spatial Coverage



John Snow Labs; Comparative Toxicogenomics Database;

Source License URL

Source License Requirements

Publicly available and free for research use, but citation is required. Permission must be requested for commercial use.

Source Citation

Davis AP, Grondin CJ, Johnson RJ, Sciaky D, King BL, McMorran R, Wiegers J, Wiegers TC, Mattingly CJ. The Comparative Toxicogenomics Database: update 2017. Nucleic Acids Res. 2016 Sep 19;[Epub ahead of print]


Toxicogenomics, Gene Disease Association, Gene Chemical Pathways, Gene and Disease Relationship, Heterogeneous Exposure Information, Comparative Toxicogenomics Database, Relationships Between Genes and Diseases, Chemical and Disease Inferences, Chemical Disease Hypotheses

Other Titles

Genes Involved in Molecular Diseases, Literature Curated Database of Genes and Disease Relationships, Genetic Base of Disease

Gene_Symbol (string): Short-form abbreviation of the name of the gene associated with the disease. The approved symbols for human genes are collected in the HUGO Gene Nomenclature Committee database; each name and symbol is unique to a gene and can be applied to other species.
Gene_ID (integer; required: 1; level: Nominal): Unique identifier for the gene in the National Center for Biotechnology Information (NCBI) Entrez Gene database. This unique integer can be looked up in the Entrez system online to find the nomenclature, sequence, products and other specific details of the gene. The identifier is species specific: the gene ID of a human gene cannot be applied to the same gene in a different species.
Disease_Name (string; required: 1): Name of the disease associated with the gene.
Disease_ID (string; required: 1): Unique identifier assigned to the disease by MeSH or OMIM, linked to the source record(s) for the disease. OMIM (Online Mendelian Inheritance in Man) is a database of human genes and genetic disorders that displays the type of genetic variation and expression; OMIM uses a six-digit identifier for each gene or genetic disorder. MeSH is a controlled vocabulary of thousands of biomedical terms (including diseases) that standardizes the terminology used in published life-science texts. Each MeSH term has a unique identifier, which can be 7 to 8 characters long. The MeSH unique identifier was changed to a length of 10 characters after November 2013.
Direct_Evidence (string): Type of evidence for the association published in the scientific literature ('|'-delimited list). Therapeutic association means that the gene's actions or products, or modifications of the gene, have been found to be a potential therapy for the disease. Marker or mechanism means that the gene has been found to take part in the mechanism of disease development or that a mutation of the gene serves as a marker for the disease.
Inference_Chemical_Name (string): Name of the chemical through which the association between the gene and the disease was inferred.
Inference_Score (number; level: Ratio): Score calculated for the probability of the inference. The inference score is calculated using statistics that take into account the connectivity of the chemical with the disease, the number of genes used to make the inference of association, and the connectivity of each of the genes. The higher the score, the more likely the inference is to be true.
Omim_ID (string): Identification number(s) for the disease in the OMIM database ('|'-delimited list). OMIM (Online Mendelian Inheritance in Man) is a database of human genes and genetic disorders that displays the type of genetic variation and expression; OMIM uses a six-digit identifier for each gene or genetic disorder.
PubMed_ID (string): Identification number(s) of the text(s) published in the PubMed database ('|'-delimited list) that serve as direct evidence of the chemical/gene association with the disease. PubMed is a US National Library of Medicine citation database that contains millions of abstracts, references and full-text links for biomedical literature from different trusted sources.
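As a minimal sketch of how the field definitions above might be applied when reading a row, the Python below converts a row of strings into typed values. The function names `split_pipe` and `parse_record` and the output keys `Disease_Vocabulary` and `Disease_Accession` are hypothetical. The code also assumes that Disease_ID carries a vocabulary prefix before a colon, as in the `MESH:` values of the sample rows.

```python
def split_pipe(value):
    """Split a '|'-delimited field into a list, ignoring empty parts."""
    return [part.strip() for part in value.split("|") if part.strip()]

def parse_record(raw):
    """Convert a row of strings into typed values per the field definitions."""
    # Assumed form "VOCABULARY:ACCESSION", e.g. "MESH:D001724".
    vocabulary, _, accession = raw["Disease_ID"].partition(":")
    score_text = raw.get("Inference_Score", "").strip()
    return {
        "Gene_Symbol": raw["Gene_Symbol"],
        "Gene_ID": int(raw["Gene_ID"]),          # integer, required
        "Disease_Name": raw["Disease_Name"],     # required
        "Disease_Vocabulary": vocabulary,        # hypothetical helper key
        "Disease_Accession": accession,          # hypothetical helper key
        "Direct_Evidence": split_pipe(raw.get("Direct_Evidence", "")),
        "Inference_Chemical_Name": raw.get("Inference_Chemical_Name", ""),
        "Inference_Score": float(score_text) if score_text else None,
        "Omim_ID": split_pipe(raw.get("Omim_ID", "")),
        "PubMed_ID": split_pipe(raw.get("PubMed_ID", "")),
    }

# One of the sample rows of this dataset, expressed as a dict of strings.
raw_row = {
    "Gene_Symbol": "11-BETA-HSD3",
    "Gene_ID": "100174880",
    "Disease_Name": "Birth Weight",
    "Disease_ID": "MESH:D001724",
    "Direct_Evidence": "",
    "Inference_Chemical_Name": "Endocrine Disruptors",
    "Inference_Score": "5.72",
    "Omim_ID": "",
    "PubMed_ID": "27152464|29518214",
}

record = parse_record(raw_row)
print(record["Disease_Vocabulary"], record["Disease_Accession"], record["PubMed_ID"])
```

Empty optional fields become empty lists or `None` rather than empty strings, which keeps later filtering on Direct_Evidence or Inference_Score straightforward.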
Gene Symbol   Gene ID    Disease Name                     Disease ID    Direct Evidence  Inference Chemical Name     Inference Score  Omim ID  PubMed ID
11-BETA-HSD3  100174880  Abnormalities, Drug-Induced      MESH:D000014                   Endocrine Disruptors        5.17                      22659286
11-BETA-HSD3  100174880  Anemia                           MESH:D000740                   Water Pollutants, Chemical  4.21                      26546277
11-BETA-HSD3  100174880  Anemia, Hemolytic                MESH:D000743                   Water Pollutants, Chemical  4.51                      22425172
11-BETA-HSD3  100174880  Asthenozoospermia                MESH:D053627                   Water Pollutants, Chemical  5.08                      25179371
11-BETA-HSD3  100174880  Birth Weight                     MESH:D001724                   Endocrine Disruptors        5.72                      27152464|29518214
11-BETA-HSD3  100174880  Breast Neoplasms                 MESH:D001943                   Endocrine Disruptors        8.68                      20646273
11-BETA-HSD3  100174880  Breast Neoplasms                 MESH:D001943                   Water Pollutants, Chemical  8.68                      20164002
11-BETA-HSD3  100174880  Cell Transformation, Neoplastic  MESH:D002471                   Water Pollutants, Chemical  4.21                      26210350
11-BETA-HSD3  100174880  Chromosome Aberrations           MESH:D002869                   Water Pollutants, Chemical  4.68                      20732340
11-BETA-HSD3  100174880  Death                            MESH:D003643                   Water Pollutants, Chemical  4.85                      22471926|24552493