A key requirement of producing data that is ready for analysis is that the data must be “good”. Over the years we have found that organizations differ on the definition of good data quality, and that these definitions fit their maturity in analytics and data science.
The maturity model analogy fits here for two reasons. First, the levels in the model build on each other, and it is impossible to get to a higher level before mastering the lower ones. Second, moving to a higher level is not just a matter of tools or algorithms; it also requires different processes and organizational thinking.
Consider this an initial proposal – rebuttals are most welcome.
Level 1: Data Provenance
Knowing where your data came from, how it was collected, how it was already transformed, why and by whom, is the most basic requirement for any usable dataset. For example, in a clinical setting – if we do not know which lab ran the tests, who funded the research, whether blood pressure was taken standing or sitting down, or whether some patients were removed from the trial’s results before we got them – the data is practically useless.
Having clear provenance, lineage, owner and other metadata about each dataset – even before looking at the data itself – is a basic requirement prior to any data analysis. In healthcare, disclosures of conflicts of interest, funding sources, privacy and other ethical considerations are key as well.
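One way to make this concrete is to keep a small provenance record next to every dataset and to check it before any analysis starts. The sketch below is only an illustration: the `Provenance` fields and the `missing_provenance` helper are names we made up for this post, not a standard schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Provenance:
    """Hypothetical minimal provenance record for one dataset."""
    source: Optional[str] = None             # e.g. which lab ran the tests
    collection_method: Optional[str] = None  # e.g. how blood pressure was taken
    owner: Optional[str] = None              # who is accountable for the dataset
    funding: Optional[str] = None            # who paid for the collection
    transformations: List[str] = field(default_factory=list)  # prior edits, in order


def missing_provenance(record: Provenance) -> List[str]:
    """Return the names of the required fields that were left empty."""
    required = ["source", "collection_method", "owner", "funding"]
    return [name for name in required if not getattr(record, name)]


record = Provenance(
    source="Lab A",
    collection_method="blood pressure taken sitting",
    transformations=["removed patients who withdrew consent"],
)
print(missing_provenance(record))  # ['owner', 'funding']
```

A dataset whose record comes back with missing fields would be sent back to its owner rather than forwarded to analysts.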
As the saying goes – “if you torture data enough, it will tell you anything” – so at the most basic level, at least start with data that was actually intended to address your analysis question.
Level 2: Basic hygiene
At this level, teams are concerned with having a uniform representation of basic data elements:
- Numbers (formatting “3,500” to “3500”)
- Null values (formatting “null”, “N/A”, “na”, “?”, “-1” and “unspecified” to a single value)
- Standard units and conversions to canonical units (formatting “30lbs” to “13.6” in kg)
- Geo-spatial points, circles, arcs and polygons
- Date & time, currencies, names, addresses, emails, Boolean and other common types
There are international standards for most of these values, although often there is more than one “standard” at play. For the data consumer, the correct choice is always “the format that my tool natively understands” – so the answer may be different if that tool is Tableau, R, pandas or Excel.
Most of the tools providing data quality metrics, data completeness or richness metrics, automated data normalization or master data management operate at this level.
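As a rough illustration of the first three bullets above, here is a minimal sketch in plain Python. The function names and the `NULL_TOKENS` set are our own assumptions. Note in particular that treating "-1" as null, or a comma as a thousands separator, is a per-dataset decision rather than a universal rule.

```python
from typing import Optional

# Assumed spellings that all mean "no value"; adjust per dataset.
# Treating "-1" as null is only safe when -1 is not a legitimate value.
NULL_TOKENS = {"", "null", "n/a", "na", "?", "-1", "unspecified"}

LB_TO_KG = 0.45359237  # definition of the international avoirdupois pound


def normalize_null(raw: str) -> Optional[str]:
    """Map every known null spelling to None; pass other values through."""
    text = raw.strip()
    return None if text.lower() in NULL_TOKENS else text


def parse_number(raw: str) -> Optional[float]:
    """Parse "3,500" style numbers (assumes comma is a thousands separator)."""
    text = normalize_null(raw)
    return None if text is None else float(text.replace(",", ""))


def weight_to_kg(raw: str) -> Optional[float]:
    """Convert "30lbs" or "13.6 kg" to kilograms, rounded to one decimal."""
    text = normalize_null(raw)
    if text is None:
        return None
    text = text.lower().replace(" ", "")
    for suffix, factor in (("lbs", LB_TO_KG), ("lb", LB_TO_KG), ("kg", 1.0)):
        if text.endswith(suffix):
            return round(float(text[: -len(suffix)]) * factor, 1)
    raise ValueError(f"unknown weight unit in {raw!r}")


print(parse_number("3,500"))  # 3500.0
print(parse_number("N/A"))    # None
print(weight_to_kg("30lbs"))  # 13.6
```

Raising on an unknown unit, instead of guessing, keeps silent unit mix-ups out of the cleaned data.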
Level 3: Outliers, flurries and unlikely combinations
The next level looks beyond individual data elements into descriptive statistics and likely errors. For example, a patient taking 12,345 different types of medications is most likely a data entry error, or at least an outlier that should be excluded when computing outlier-sensitive descriptive statistics like the mean and standard deviation.
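To see how much a single bad entry can distort the mean, consider the small sketch below. It flags candidates with the modified z-score, which is built on the median and the median absolute deviation (MAD) so that the outlier does not distort its own yardstick. The sample values are invented, the 3.5 cutoff is a commonly used rule of thumb rather than a law, and `flag_outliers` is a name we chose for illustration.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The score uses the median and the median absolute deviation (MAD),
    which are barely affected by the outliers we are looking for.
    """
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # no spread to measure against; review manually
    return [v for v in values if 0.6745 * abs(v - median) / mad > threshold]


# Invented sample: number of medication types per patient.
medications_per_patient = [2, 3, 1, 4, 2, 5, 3, 12345]

print(statistics.mean(medications_per_patient))    # 1545.625
print(statistics.median(medications_per_patient))  # 3.0
print(flag_outliers(medications_per_patient))      # [12345]
```

The mean jumps to over 1,500 because of one record, while the median stays at 3. Flagged values are candidates for review, not automatic deletions.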
Sometimes the issue is not an individual outlier but a flurry of exceptions that occur close together. For example, in a dataset of second-by-second web traffic to a website, a 100x increase from the historical norm over 10 minutes is far more likely the result of bots than of real user activity. Flurries appear often in time series data, and the decision to keep or remove them must be made on a case-by-case basis.
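A flurry differs from a lone spike in that the exceptional values are sustained. One simple way to surface them, sketched below, is to look for runs of consecutive samples that stay far above a historical baseline. The `find_flurries` name, the per-minute aggregation, the sample counts and the `min_length` parameter are all assumptions for illustration. Only the 100x factor comes from the example above.

```python
def find_flurries(counts, baseline, factor=100, min_length=3):
    """Return (start, end) index pairs, end exclusive, of runs where the
    counts stay at or above factor * baseline for at least min_length
    consecutive samples."""
    runs, start = [], None
    for i, count in enumerate(counts):
        if count >= factor * baseline:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
    # Close a run that is still open at the end of the series.
    if start is not None and len(counts) - start >= min_length:
        runs.append((start, len(counts)))
    return runs


# Invented requests-per-minute series with a historical norm of about 10.
per_minute = [10, 12, 9, 1100, 1250, 1300, 1150, 11, 10, 2000, 9]
print(find_flurries(per_minute, baseline=10))  # [(3, 7)]
```

The isolated spike at index 9 is not reported because it does not last. The function only locates flurries; whether to keep or drop them remains a human decision.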
Domain experts should also dig deeper to uncover errors due to unlikely combinations. In healthcare in particular, clinically unlikely combinations can often be identified. For example, in one case a weight of 535 pounds for a 25-year-old female was plausible on its own, but looking at other measurements and lab results of the same patient, it became fairly obvious that this was a typing error for a 53.5 lbs woman. Such “clinically unlikely combinations” are more often than not data entry errors, which should be eliminated prior to downstream data analysis.
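Part of this check can be supported by code, as long as the result is treated as a hint for a domain expert rather than an automatic fix. The sketch below assumes we have earlier readings for the same patient in the same unit, and asks whether a suspicious value would fit that history if the decimal point moved by one place. The function name, the history values and the 15% tolerance are hypothetical.

```python
import statistics


def decimal_slip_candidate(value, history, tolerance=0.15):
    """Return value rescaled by a factor of ten if that rescaling fits the
    patient's history and the raw value does not; otherwise return None."""
    typical = statistics.median(history)

    def close(x):
        return abs(x - typical) <= tolerance * typical

    if close(value):
        return None  # consistent with history, nothing to suggest
    for scaled in (value / 10, value * 10):
        if close(scaled):
            return scaled
    return None  # inconsistent, but not explained by a decimal slip


# Hypothetical earlier weight readings for the same patient, same unit.
history = [52.0, 54.5, 53.0]
print(decimal_slip_candidate(535, history))   # 53.5
print(decimal_slip_candidate(53.5, history))  # None
```

A returned candidate is a suggestion to verify against the original record, not a correction to apply blindly.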
Level 3 requires deeper knowledge of statistics as well as deeper domain expertise from your DataOps team. As the examples show, at this level “fixing” data quality can no longer be fully automated away.
Level 4: Coverage gaps
The next level advances beyond describing the quality of a dataset as a universal measure – to discuss its fit for a given analytics project. This in particular relates to identifying gaps in the data provided, and finding ways to complement them.
For example, consider a project tracking vaccination rates among children in the UK. Collected data may have spatial coverage gaps (e.g. no data was collected in Wales), temporal coverage gaps (e.g. there is data for 15 years, except for 2010 due to budget constraints that year), or demographic coverage gaps (e.g. kids who are not in school are not surveyed).
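Gaps like these are easy to miss because missing rows do not raise errors. A simple safeguard is to state what coverage the project expects and compare it with what the data contains. The sketch below does this for the spatial and temporal examples. The `coverage_gaps` helper, the field names and the tiny record set are invented for illustration.

```python
def coverage_gaps(records, expected):
    """For each dimension, list the expected values that have no records."""
    gaps = {}
    for dimension, expected_values in expected.items():
        seen = {record[dimension] for record in records}
        gaps[dimension] = sorted(set(expected_values) - seen)
    return gaps


# Invented records; a real dataset would have many rows per nation and year.
records = [
    {"nation": "England", "year": 2009},
    {"nation": "Scotland", "year": 2011},
    {"nation": "Northern Ireland", "year": 2012},
]

# What the project expects to be covered (an assumption of the analysis).
expected = {
    "nation": ["England", "Scotland", "Wales", "Northern Ireland"],
    "year": range(2009, 2013),
}

print(coverage_gaps(records, expected))
# {'nation': ['Wales'], 'year': [2010]}
```

Demographic gaps, such as children who are not in school, are harder: they need an external reference population to compare against, since the dataset itself cannot show who was never surveyed.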
The significance of such gaps depends heavily on the project. For example, if the goal is to identify kids at risk, then finding kids who are not at school may be critical; however, if the goal is to compare boys versus girls, such gaps may not undermine the whole analysis. Making the right decision and corrections requires collaboration between your analytics and DataOps teams.
Level 5: Bias & Prejudice
At the highest level, you run fully integrated project teams that combine data scientists, DataOps experts and software engineers. People work on the same project daily, which enables finding and addressing nuanced but critical biases in data that has “passed” all the previous levels’ quality gates.
This applies most commonly to machine learning and data mining projects. For example, assume that we are looking to build an algorithm that automatically assigns ICD-10 diagnosis and procedure codes given a patient’s hospital clinical record. To train it, we can only get data from one of two hospitals in town. Both hospitals have internal medicine wards, but one specializes in (and gets most cases of) cardiology and oncology, while the other specializes in immunology, endocrinology and geriatric medicine.
Notice that no matter which hospital we choose – the distribution of diagnosis and procedure codes in our training data is going to be skewed from what we will observe in “the real world”. This will skew machine learning algorithms, since the a-priori distributions in the training data will be different from what will be observed online.
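One way to make this skew visible is to compare the label distribution of the training data with a sample from the population the model will serve. The sketch below uses total variation distance, which is 0 for identical distributions and 1 for distributions with no overlap. The specialty labels and counts are invented, and the function names are ours. In practice the labels would be the diagnosis and procedure codes themselves.

```python
from collections import Counter


def label_distribution(labels):
    """Return the relative frequency of each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def total_variation(p, q):
    """Half the L1 distance between two discrete distributions."""
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


# Invented case mix: one hospital's records versus a town-wide sample.
training = ["cardiology"] * 6 + ["oncology"] * 3 + ["endocrinology"] * 1
population = (["cardiology"] * 3 + ["oncology"] * 2
              + ["endocrinology"] * 3 + ["geriatrics"] * 2)

p = label_distribution(training)
q = label_distribution(population)
print(round(total_variation(p, q), 3))  # 0.4
```

A distance of 0.4 means that 40% of the probability mass would have to move for the two case mixes to match. The sketch also exposes labels, like geriatrics here, that never appear in the training data.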
The choice of dataset will also cause other, less obvious biases. Each hospital will by nature experience a different distribution of ages, genders and co-morbidities – since in healthcare, all of these are related. This means that both supervised and unsupervised learning algorithms will be skewed in subtle but important ways due to these relationships and correlations.
This has important implications in practice, as we rely more and more on machines to make day-to-day decisions that impact people’s health and well-being. Your choice of training data implicitly ignores the people not represented in it, and may overly punish or reward those who are represented based on their past behavior. This is fundamentally a data quality issue.
Knowing that such problems exist and addressing them effectively requires deep, ongoing collaboration between data scientists and DataOps experts – and is a requirement for producing machine-learned models or predictive analytics that are free from undisclosed prejudice and stand the test of real-world use.