John Snow Labs’ State-Of-The-Art Big Data Quality Frameworks

Dataset Requirements for Compliance with State-Of-The-Art Big Data Quality Frameworks

The term data quality can best be defined as “fitness for use”, which implies that data quality is relative: data of appropriate quality for one use may not be of sufficient quality for another. (1)

Data is usually collected from different sources that use various platforms at different levels of care. The data management lifecycle comprises the capture, cleansing, and storage of data from clinical and business sources.

This lifecycle mandates successful Data Integration to develop a successful Clinical and Business Intelligence model.

A major challenge for Data Integration in the healthcare sector is the fragmented nature of the data, which can occur even within a single healthcare organization.

Different data fragments could originate from Patient Admission/Discharge/Transfer (ADT), laboratories and investigations, past medical history, diagnosis, treatment plans/prescriptions, radiographs, various levels of care and different physicians.

All of this data should flow in a coherent form from the point of origin to the decision support end-point, where further analysis takes place.


Quality Control and Data Cleansing Challenges

The process of loading the data from the source system to the data warehouse (or the storage area) is known as Extract, Transform, and Load (ETL).

Data cleansing has become an essential step in the ETL process, so the sequence is now Extract-Clean-Transform-Load.

After data is entered at the source, it must be retrieved using either a full extract or an incremental extract.
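
The difference between the two extract methodologies can be sketched as follows. This is a minimal illustration, not a description of any particular ETL tool: the `SOURCE_ROWS` data, the `updated_at` column, and the watermark approach to incremental extraction are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical source rows; "updated_at" is an assumed last-modified column.
SOURCE_ROWS = [
    {"id": 1, "name": "A", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "B", "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "name": "C", "updated_at": datetime(2024, 1, 9)},
]


def full_extract(rows):
    """Return every row, regardless of when it last changed."""
    return list(rows)


def incremental_extract(rows, watermark):
    """Return only the rows modified after the previous run's watermark."""
    return [row for row in rows if row["updated_at"] > watermark]


def next_watermark(rows, previous):
    """Advance the watermark to the newest timestamp among the extracted rows."""
    return max((row["updated_at"] for row in rows), default=previous)
```

A full extract is simpler but re-reads everything on each run; an incremental extract reads less but depends on the source recording a reliable modification timestamp.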

Given this fragmentation, problems are to be expected, especially where the human factor is involved. As long as data is entered by hand, we have to expect missing fields, wrong data, and outdated data.
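
A cleansing step could flag these three kinds of problems before transformation. The sketch below is illustrative only: the field names in `REQUIRED_FIELDS`, the one-year `MAX_AGE_DAYS` threshold, and the future-birth-date check are assumptions chosen for the example, not rules taken from any specific framework.

```python
from datetime import date

REQUIRED_FIELDS = ("patient_id", "birth_date", "last_updated")  # assumed schema
MAX_AGE_DAYS = 365  # assumed freshness threshold


def find_issues(record, today):
    """Return a list of labels describing the problems found in one record."""
    issues = []

    # Missing fields: required keys that are absent or empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            issues.append(f"missing:{field}")

    # Wrong data: a birth date in the future cannot be valid.
    birth = record.get("birth_date")
    if isinstance(birth, date) and birth > today:
        issues.append("wrong:birth_date_in_future")

    # Outdated data: the record has not been touched within the threshold.
    updated = record.get("last_updated")
    if isinstance(updated, date) and (today - updated).days > MAX_AGE_DAYS:
        issues.append("outdated:last_updated")

    return issues
```

Records with a non-empty issue list could then be routed to correction or rejected, depending on how strict the transformation rules need to be.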

Another factor is the nature of the data schema, such as the presence of unstructured data (e.g., free-text fields), which is another major challenge for data integration.

The more strictly the transformation rules enforce cleansing, scrubbing, and conformity to data standards, the fewer challenges we face with regard to integration.

John Snow Labs is one of the leading organizations in applying strict data standards and providing clean data that could save your organization up to 4000 hours in data preparation each month.

Here are some recommended standards for data unification:

  • Make identifiers unique (for example, sex categories: M, F, N/A)
  • Convert null values into a standardized format (N/A)
  • Use a standard, predeclared format for phone numbers and ZIP codes (for example, xxx-xxxx-xxx)
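
The unification rules above could be implemented along these lines. This is a sketch under stated assumptions: the `NULL_LIKE` and `SEX_MAP` lookup tables are hypothetical, and the phone formatter assumes a 10-digit number so that it fits the xxx-xxxx-xxx pattern given in the list.

```python
import re

NULL_TOKEN = "N/A"
# Assumed spellings of "no value" that should collapse to one token.
NULL_LIKE = {"", "null", "none", "nan", "n/a", "na", "unknown"}
# Assumed raw spellings mapped onto the unified sex categories.
SEX_MAP = {"m": "M", "male": "M", "f": "F", "female": "F"}


def normalize_null(value):
    """Map None and null-like strings to the standard N/A token."""
    if value is None or str(value).strip().lower() in NULL_LIKE:
        return NULL_TOKEN
    return value


def normalize_sex(value):
    """Reduce free-form sex values to M, F, or N/A."""
    value = normalize_null(value)
    if value == NULL_TOKEN:
        return NULL_TOKEN
    return SEX_MAP.get(str(value).strip().lower(), NULL_TOKEN)


def normalize_phone(value):
    """Reformat a 10-digit number as xxx-xxxx-xxx; anything else becomes N/A."""
    digits = re.sub(r"\D", "", str(normalize_null(value)))
    if len(digits) != 10:
        return NULL_TOKEN
    return f"{digits[:3]}-{digits[3:7]}-{digits[7:]}"
```

For example, `normalize_sex(" Male ")` yields `"M"` and `normalize_phone("(555) 123-4567")` yields `"555-1234-567"`. ZIP codes would need their own rule, which is left out here.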

The last process is Data Loading, which usually targets a database. This process is carried out in a consistent manner.
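
One common way to keep loading consistent is to write each batch inside a single database transaction, so that a failing row leaves the target unchanged. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `patients` table and its columns are hypothetical, and a real warehouse would use its own schema and driver.

```python
import sqlite3


def load_rows(conn, rows):
    """Insert a batch in one transaction; roll back if any row fails."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT INTO patients (patient_id, sex) VALUES (?, ?)", rows
        )


# Hypothetical target table, created in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patients (patient_id TEXT PRIMARY KEY, sex TEXT NOT NULL)"
)
load_rows(conn, [("p1", "M"), ("p2", "F")])
```

If a later batch contains a duplicate `patient_id`, `sqlite3.IntegrityError` is raised and none of that batch's rows are kept.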


Tools and frameworks to assess data quality


The Data Quality Assessment Methods and Tools (DatQAM):

DatQAM provides a range of quality measures that rely mainly on official statistics, with a focus on user satisfaction. It considers factors such as relevance, sampling and non-sampling errors, production dates (for timeliness), availability of metadata and forms for dissemination, changes over time, geographical differences, and coherence (Eurostat, 2007).


The Quality Assurance Framework (QAF):

Developed by Statistics Canada (2010), the QAF includes a number of quality measures used to assess data quality. It is concerned with measuring timeliness, relevance, interpretability, accuracy, coherence, and accessibility.

Health and Social Care Information Centre (HSCIC) – now called NHS Digital – used DatQAM for data quality assessments.


The Data Quality Audit Tool (DQAT)

The Data Quality Audit Tool (DQAT) is utilized by the WHO and Global Fund.

A “Confidentiality” criterion has been agreed upon as an addition to the data quality framework that was adapted from the DQAT.

Most recent data quality frameworks assume the presence of data cleansing logic that runs on raw data as part of the ETL process.

Accordingly, “Data Cleansing” was also added to most of the recent state-of-the-art data quality frameworks to assure the application of data cleansing methods.

In 2004, Price and Shanks developed the “Quality Category Information Framework”, in which they added different criteria to ensure a high-quality data collection process. They also focused on the “Objectivity” of datasets, that is, whether the datasets are fully independent of user and use.

Eppler (2001) and Price and Shanks (2004) highlighted “Accessibility” as one of the necessary criteria for a sound data quality framework. The Accessibility criterion comprises a checklist covering access authorization and protection against misuse and bias.

Most data scientists work on tight schedules, are always short of time, and are under pressure from deadlines. This makes applying such meticulous procedures and following such a precise framework a time-consuming process.

There is no need to “reinvent the wheel” in such a demanding work environment when there are organizations offering ready-made, clean, and standardized datasets.

Do yourself a favor, save your time and schedule a call with any of those organizations which provide such clean, matched, current & compliant data.



  • (1) Tayi GK, Ballou DP. Examining Data Quality. Communications of the ACM 1998;41(2):54-7.
