
John Snow Labs’ Data Market: Data Procedures and Data Quality


In today’s world, given the exponential growth of data and its intensive use for analysis, pattern discovery or decision making, data quality and freshness are of the highest importance. At John Snow Labs, we made a mission out of solving the existing data quality issues within the healthcare and life science domains. Our experts constantly work to ensure that our datasets are clean, normalized, enriched, up-to-date and ready to use by our customers. Furthermore, our catalog is constantly expanding as new datasets are added each month.

Currently, John Snow Labs curates and publishes more than 1650 datasets and more than 150 data packages. Some of those data products are distributed for free (Core datasets – around 200 datasets) while others require a subscription fee (Healthcare, Life Science and Terminology). A subscription gives customers access for one year and includes all relevant updates released during that period. Our medical datasets and data packages are of the highest quality. They have been checked by our experts, who have extensive domain knowledge and hold Ph.D. and master's degrees.

Our commitment to data quality

John Snow Labs has well-established processes for quality assurance, both manual and automatic. Our highly skilled and experienced team of domain experts constantly works on identifying relevant and useful datasets published by trusted sources. They then manually curate and normalize those datasets, link them to existing domain standards, and enrich them with additional information from other sources in order to increase their usability and added value. The resulting datasets go through three levels of validation, both manual and automatic, to eliminate any issues they might contain. Two manual reviews are done by domain experts. Then, an automated set of 60+ validations ensures that every datum matches its metadata and defined constraints.
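To give a feel for what an automated check of a datum against its metadata and constraints can look like, here is a minimal sketch. It is not our actual validation suite: the `SCHEMA` descriptors, the field names and the `validate_row` helper are all hypothetical and chosen only for illustration.

```python
from datetime import date

# Hypothetical field descriptors (illustration only, not a real dataset schema).
SCHEMA = [
    {"name": "patient_age", "type": int, "required": True, "minimum": 0, "maximum": 130},
    {"name": "admission_date", "type": date, "required": True},
    {"name": "gender", "type": str, "required": False, "enum": {"F", "M", "U"}},
]


def validate_row(row, schema=SCHEMA):
    """Return a list of human-readable problems found in one record."""
    errors = []
    for field in schema:
        name = field["name"]
        value = row.get(name)
        # Missing values are only a problem for required fields.
        if value is None:
            if field.get("required"):
                errors.append(f"{name}: missing required value")
            continue
        # The value must have the declared type.
        if not isinstance(value, field["type"]):
            errors.append(f"{name}: expected {field['type'].__name__}")
            continue
        # Optional range and enumeration constraints.
        if "minimum" in field and value < field["minimum"]:
            errors.append(f"{name}: {value} is below minimum {field['minimum']}")
        if "maximum" in field and value > field["maximum"]:
            errors.append(f"{name}: {value} is above maximum {field['maximum']}")
        if "enum" in field and value not in field["enum"]:
            errors.append(f"{name}: {value!r} is not an allowed value")
    return errors


good = {"patient_age": 42, "admission_date": date(2020, 1, 15), "gender": "F"}
bad = {"patient_age": -5, "gender": "X"}
print(validate_row(good))
print(validate_row(bad))
```

A record passes when the returned list is empty; otherwise each entry names the field and the constraint it violated.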

In a second stage, our data researchers codify and automate this process using Autobots, so that the process is repeatable and the datasets always stay up to date.

We do all the time-consuming work for you!

It is well known that currently, data scientists spend the bulk of their time cleaning and preparing datasets for analysis. Our mission at John Snow Labs is to help you bypass that hurdle and let you enjoy the land of clean data!

All our data is normalized into one unified type system:

  • All dates, units, codes and currencies look the same;
  • All null values are normalized to the same value;
  • All dataset and field names are SQL and Hive compliant.
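As an illustration of the kind of normalization listed above, the following sketch rewrites field names into identifiers made only of lower-case letters, digits and underscores, and maps assorted missing-value markers to a single null. It is a simplified example under our own assumptions: the `NULL_TOKENS` set and the helper names are hypothetical, and real SQL or Hive naming rules (reserved words, length limits) are not covered.

```python
import re

# Hypothetical list of tokens treated as "missing"; adjust to the source data.
NULL_TOKENS = {"", "na", "n/a", "null", "none", "nan", "-"}


def normalize_field_name(name):
    """Turn an arbitrary column header into a lower-case snake_case identifier."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()
    if not cleaned:
        cleaned = "field"
    # Identifiers starting with a digit are awkward in many SQL dialects.
    if cleaned[0].isdigit():
        cleaned = "f_" + cleaned
    return cleaned


def normalize_null(value):
    """Map every recognised missing-value token to None; keep other values."""
    if value is None:
        return None
    if isinstance(value, str) and value.strip().lower() in NULL_TOKENS:
        return None
    return value


def normalize_row(row):
    """Apply both normalizations to one record given as a dict."""
    return {normalize_field_name(k): normalize_null(v) for k, v in row.items()}


print(normalize_row({"Drug Name": "aspirin", "Dose (mg)": "N/A"}))
```

With names and nulls handled once, upstream of any analysis, the same query or join works across datasets without per-file special cases.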

All data is available in both CSV and Apache Parquet format, optimized for high read performance on distributed Hadoop, Spark and MPP clusters. Metadata is provided in the open Frictionless Data standard, and every metadata field is normalized and validated. Our data updates support replace-on-update: outdated foreign keys are deprecated, not deleted.

Throughout this process, domain expertise is crucial! When curating datasets, the field names, descriptions and normalized values are chosen by people who actually understand their meaning. Healthcare and life science experts add categories, search keywords, descriptions and more to each dataset. Our data is always up to date, even when the source requires manual effort to get updates. Every data source’s license is manually verified to allow for royalty-free commercial use and redistribution. Finally, support for data subscribers is provided directly by the domain experts who curated the datasets.

Data operations are taken very seriously at John Snow Labs, and we place great emphasis on excellent data quality, data security and data integration.

Our datasets

We have over 1000 datasets available on our Data Market that are pertinent to the healthcare field and over 350 in the life science field. We also have over 350 Terminology datasets and over 200 Core datasets that we share for free with authenticated users, so that everybody can take a look at the data and experience the ease of use and the quality that we enforce.

Furthermore, we provide samples for all our datasets and detailed descriptions of their metadata and added value.

In order to further ease our customers’ use of datasets, we have predefined data packages that include datasets that can be easily linked because of their semantic connections. This approach has great advantages:

  • Gives great insights and context for the included datasets. Everybody knows that the power of information comes from its context. Well, our data packages offer both data and context and can be easily exploited within the BI tool of your choice.
  • Ensures a 25% saving on the subscription price when compared to the subscription fees for the individual datasets.

Choose our datasets and save more than 4000 hours of data curation a year, or just schedule a demo to find out how we can help you.


At John Snow Labs we provide thousands of datasets and data packages through our Data Market, either for free or for a reasonable subscription fee. Our domain experts clean, normalize, enrich and keep our data up-to-date in order to ensure a frictionless reuse.
