Healthcare Data Sets

2,200+ Clean, Current, Enriched, and Expert Curated Datasets for Data Scientists

How It Works

The data is available under two types of licenses:

Welcome to the Land of Clean Data!

Each dataset goes through 3 levels of quality review
  • 2 Manual reviews are done by domain experts
  • Then, an automated set of 60+ validations enforces that every datum matches its metadata and defined constraints
Data is normalized into one unified type system
  • All dates, units, codes, currencies look the same
  • All null values are normalized to the same value
  • All dataset and field names are SQL and Hive compliant
Data and Metadata
  • Data is available in both CSV and Apache Parquet format, optimized for high read performance on distributed Hadoop, Spark & MPP clusters
  • Metadata is provided in the open Frictionless Data standard, and every field is normalized & validated
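
To make the idea of validating every datum against metadata and constraints concrete, here is a minimal sketch in Python. It is not the actual validation suite: the field descriptors, column names, and the small subset of constraints handled (required, minimum, pattern) are assumptions written in the style of a Frictionless Table Schema.

```python
import re

# Hypothetical field descriptors in the style of a Table Schema;
# the names and constraints are illustrative assumptions.
FIELDS = [
    {"name": "state", "type": "string",
     "constraints": {"required": True, "pattern": "[A-Z]{2}"}},
    {"name": "bed_count", "type": "integer",
     "constraints": {"minimum": 0}},
]

def check_value(field, raw):
    """Return a list of problems for one raw (string) cell value."""
    constraints = field.get("constraints", {})
    if raw == "":
        # An empty cell is only a problem when the field is required.
        return ["missing required value"] if constraints.get("required") else []
    problems = []
    value = raw
    if field["type"] == "integer":
        try:
            value = int(raw)
        except ValueError:
            return ["not an integer"]
    if "minimum" in constraints and value < constraints["minimum"]:
        problems.append("below minimum")
    # Assumption: the pattern must match the whole value.
    if "pattern" in constraints and not re.fullmatch(constraints["pattern"], raw):
        problems.append("pattern mismatch")
    return problems

def validate_row(row):
    """Map each failing field name to its list of problems."""
    errors = {}
    for field in FIELDS:
        problems = check_value(field, row.get(field["name"], ""))
        if problems:
            errors[field["name"]] = problems
    return errors

print(validate_row({"state": "ny", "bed_count": "-3"}))
```

A row that satisfies every constraint yields an empty dictionary, so a whole file can be checked by collecting the non-empty results row by row.
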
Data Updates
  • Data updates support replace-on-update: outdated foreign keys are deprecated, not deleted

Welcome to Expert Curated Data!

Field names, descriptions, and normalized values are chosen by people who actually understand their meaning

Healthcare & life science experts add categories, search keywords, descriptions and more to each dataset

Both manual and automated data enrichment are supported for clinical codes, providers, drugs, and geo-locations

The data is always kept up to date – even when the source requires manual effort to get updates

Support for data subscribers is provided directly by the domain experts who curated the data sets

Every data source’s license is manually verified to allow for royalty-free commercial use and redistribution

Welcome to Easy to Use Data!

Format, Download and Updates
  • Read CSV or Parquet data with one-liners from the standard libraries of Python, R, SAS, SPSS, or Spark;
  • Full download of data enables you to get the most out of your memory, database, or cluster;
  • Subscribe to dataset updates to automate them;
  • 26 out-of-the-box integrations with the world’s most popular analytics tools, via our partnership;
  • SQL and SPARQL queries via a web UI or REST API.
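
As an illustration of reading a downloaded CSV file with Python's standard library, the sketch below uses the built-in csv module. The sample data and column names are hypothetical stand-ins for a real dataset file.

```python
import csv
import io

# Hypothetical sample standing in for a downloaded dataset file;
# the column names are illustrative, not taken from a real dataset.
SAMPLE_CSV = """provider_id,provider_name,state
1001,Alice Smith,NY
1002,Bob Jones,CA
"""

def read_rows(handle):
    """Return every CSV row as a dict keyed by the header line."""
    return list(csv.DictReader(handle))

rows = read_rows(io.StringIO(SAMPLE_CSV))
print(rows[0]["provider_name"])
```

With a file on disk, the same function can be called as read_rows(open("dataset.csv", newline="")), where dataset.csv is a placeholder file name.
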
Standardized and Complete Schemas
  • Need to load 1,000 datasets into a SQL or Hive DB? Create and populate all tables with one script, thanks to the complete & standardized schemas in metadata.
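
The following sketch shows how a schema held in metadata could drive table creation and loading from a single script. It is an assumption-laden example, not the vendor's script: the descriptor, table name, and type mapping are hypothetical, and SQLite stands in for whichever SQL or Hive database is actually used.

```python
import sqlite3

# Hypothetical Table Schema descriptor; the field names are illustrative.
SCHEMA = {
    "fields": [
        {"name": "provider_id", "type": "integer"},
        {"name": "provider_name", "type": "string"},
        {"name": "enrollment_date", "type": "date"},
    ]
}

# Assumed mapping from Table Schema types to SQLite column types.
SQL_TYPES = {"integer": "INTEGER", "number": "REAL", "boolean": "INTEGER",
             "string": "TEXT", "date": "TEXT"}

def create_table_sql(table, schema):
    """Build a CREATE TABLE statement from the schema's field list."""
    columns = ", ".join(
        f'"{field["name"]}" {SQL_TYPES.get(field.get("type", "string"), "TEXT")}'
        for field in schema["fields"]
    )
    return f'CREATE TABLE "{table}" ({columns})'

def load(conn, table, schema, rows):
    """Create the table, then insert the rows with bound parameters."""
    conn.execute(create_table_sql(table, schema))
    placeholders = ", ".join("?" for _ in schema["fields"])
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)

conn = sqlite3.connect(":memory:")
load(conn, "providers", SCHEMA, [(1001, "Alice Smith", "2020-01-15")])
print(create_table_sql("providers", SCHEMA))
```

Looping the load function over many descriptor and data file pairs is what would turn this into a one-script bulk load; because field names are described as SQL and Hive compliant, they can be used as column names directly.
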
Enriched Metadata
  • Don’t know the jargon? Our experts curate extra search terms so that you can also find “NPPES” by searching for “all US doctors” or “national providers database”.
  • Not sure what the data is about? Metadata is provided in human-readable PDF in addition to JSON.

26 out of the box data integrations

What customers are saying

The data sets were clean, easy to access and easy to use. It was a joy to be able to use the data provided.
Eric Rothman
Co-Founder, Threat Sync
The data sets make excellent reference data and are at their most powerful when combined with unstructured data – to bring order to the chaos if you will.
Mark Pinches
The provided data sets were of good quality, clean and ready to use. The access method was extremely easy to understand, as well as the search engine.
Roxana Radu
Project Manager, The Synergyst
Many people told me the datasets were great and very easy to use.
Jason Jim
HopHacks Organizer

Frequently Asked Questions

1. General Questions

Through the Data Market, John Snow Labs offers a wide range of health and life science datasets and data packages.

John Snow Labs offers access to datasets that have been curated by a team of specialists in the health and life science domains. Thanks to the team's extensive expertise and experience in data acquisition, curation, normalization, and publishing, our datasets are cleaner, better documented, better structured, and more richly annotated than the free equivalents offered by various well-established and trustworthy data publishers.

Our datasets are extremely easy to understand, use and integrate into your existing systems and tools. You can find a list of our databases on our vendors page.

Every single dataset on John Snow Labs has a fully transparent link back to its source. This means you can always verify the data as published by its original source. Transparency is the ultimate enabler of trust.

The main customers targeted by John Snow Labs Data Library are:

  • Healthcare and Life Science application providers;
  • Data integrators that want to provide data-centered services and are interested in John Snow Labs datasets;
  • SMEs that want to develop new products based on health and life science data;
  • CIOs, CEOs, and CTOs of healthcare-related businesses;
  • Data scientists;
  • Data publishers that want to integrate their datasets with complementary health and life science datasets for a richer context and relevance.

The John Snow Labs Data Library is an online data repository that allows users to access, download, and use datasets or data packages (groups of related datasets) curated by the John Snow Labs team of experts. It is a quick and easy-to-access gateway to the John Snow Labs data catalog, a unique collection of normalized, clean, and enriched health and life science datasets.

The data library contains virtual products in the form of datasets and data packages that can be downloaded and used:

  • for research purposes for free and
  • for commercial purposes after paying a subscription fee.

As long as the subscription is valid, the user has a commercial license to use the datasets and receives all available updates.

2. Data Library Functionalities

The Data Library provides a dedicated web page where users can search for the datasets they are interested in and explore the available data catalog.

The search functionality works on both dataset name and dataset description. By default, all available datasets are displayed as a list of products.

The following information is available for each product, on the main shop page:

  • name of the dataset;
  • relevant short description;
  • image that identifies the name of the data package that includes the current dataset;
  • data download button for logged in users.

The Data Library provides dedicated pages for all available datasets and data packages.

The dataset details page includes the following information:

  • the dataset name;
  • license information for the logged in user;
  • direct download links for CSV data, PDF reference file, and JSON metadata file;
  • the image associated with the dataset;
  • a short description of the dataset;
  • a detailed description of the dataset;
  • a clear description of the list of fields together with typing information;
  • data preview;
  • a data package section that briefly describes the data package that includes the current dataset;
  • a related dataset section containing all datasets that are in the same accelerator as the current dataset.

A data package is a group of related datasets. In other words, datasets in the same package describe the same data from different points of view, or describe complementary or otherwise related data.

3. Data Info

The datasets published on John Snow Labs Data Library are premium quality datasets already tested, optimized and customized in a ready to use format.

Extensive efforts have been invested in preparing and optimizing those datasets for immediate use:

  • They have been curated by human experts;
  • They come in data formats optimized out of the box for R, Python, SAS, Hadoop, Spark, SQL & BI tools;
  • Daily updates are integrated and published, so the user gets automatic, versioned, clean & tested updates as they happen;
  • All data is under one license with royalty-free, commercial redistribution rights;
  • Datasets are triple checked, both automatically and manually, to make sure that they are error-free and ready for production use;
  • Our datasets are clean and interoperable. To achieve this, we use a unified, standards-based data model covering numbers, dates, units, currency, null values, identifiers & references.

By using our datasets you will save more than 4,000 hours in data preparation (cleaning, transformation, normalization, etc.) each month.

We offer you turnkey data for analysis already tested, optimized and customized in a ready to use format for your big data, data science or visualization platform.

4. Subscriptions

A user can cancel any order that has on-hold status. On-hold status means that the payment has not been processed yet. Once the payment is processed, the user receives a commercial license agreement for the entire data catalog and the order can no longer be canceled.

An order cancellation does not imply any payment/penalty.

Any active subscription on John Snow Labs Data Library can be canceled at any time. Canceling a subscription stops future renewal charges but does not result in a refund of your order.

Commercial use of the dataset(s) is still allowed until the day the current subscription expires.

The use of John Snow Labs datasets is free forever for academics, researchers, and students.

The Data Library allows users to easily buy a subscription to the entire catalog. The subscription functionality is accessible from the Data Library main page.

Clicking the Subscribe to Data Library button on the Data Library main page, or on a dataset details page, adds the subscription to your cart. Once the order is placed and the payment is confirmed, the user gains commercial rights to all datasets.

The subscription is valid for one year and entitles the user to instantly access all available data updates and an unlimited number of downloads.

The payment methods currently supported by John Snow Labs Data Library are:

  • Credit card directly on our website;
  • Bank transfer to the account received via e-mail once the order is confirmed.

Subscriptions to John Snow Labs Data Library are not returnable or refundable after purchase.

Orders with status on hold can be canceled for free.

Active subscriptions can be canceled but we do not provide any reimbursement for the already paid subscriptions.

Commercial exploitation rights to the datasets will be valid until the day the current subscription expires.

Your list of subscriptions can be accessed in your account section of the Data Library.