Your DataOps team is the one making sure your data scientists do not spend 50-80% of their time preparing data for analysis. Here is what this involves.
First and foremost, a DataOps expert must help you find the data you need. Just as in a real-world library, if you already know exactly what you’re looking for, you can get it yourself, but if you don’t, you need a librarian.
You should be able to ask generic questions like “find me data that is relevant to building a healthcare anti-fraud app”, and have the librarian come back with datasets about licensed physicians, suspended licenses, pharma payments to physicians, census data matched to physicians’ addresses, groupings of physicians into peer groups by sub-specialty and more.
If you’re asking for data on drug prices, a librarian should be the expert guiding you through the nuances of prices: distributor vs. consumer pricing, Medicare limits vs. street pricing, insured vs. uninsured pricing, and billed vs. actual paid amounts. They may also offer other relevant datasets, like mappings of brand to generic medications, clinically equivalent therapies and lists of drug name synonyms.
A great librarian understands your intent and shows you content you wouldn’t have thought of yourself.
Data science fundamentals are another key part of the DataOps job description.
A couple of years ago, we came across a project in which an analytics team wanted to generate a population health cost index, and to do so decided to extrapolate metrics from a large set of Medicare clinical claims. It took someone deeply familiar with the data to point out that Medicare is largely used by senior citizens, meaning that the distribution of diseases, chronic conditions and procedures they bill for is heavily skewed towards that age group – and not representative of the overall population.
Similarly, it takes deep familiarity with the specific dataset, the problem being solved and machine learning fundamentals to know that a given dataset is a poor fit for a given supervised classification problem because the distribution of classes in the training dataset does not match what will be observed in production.
There are two ways to address such “gotchas”: data scientists with strong domain-specific knowledge, or DataOps experts with a strong data science background. Or both.
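As a rough illustration of the class-distribution check described above, here is a minimal Python sketch. The labels, the sample sizes and the 0.1 threshold are all made up for the example, and total variation distance is just one simple way to quantify the gap, not a method taken from any specific project.

```python
from collections import Counter


def class_distribution(labels):
    """Return {class: share of samples} for an iterable of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {cls: n / total for cls, n in counts.items()}


def total_variation_distance(p, q):
    """Half the L1 distance between two distributions (0 = identical, 1 = disjoint)."""
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


# Hypothetical data: a balanced training set vs. a production sample
# in which the positive class is rare.
train_labels = ["fraud"] * 50 + ["legit"] * 50
prod_labels = ["fraud"] * 2 + ["legit"] * 98

train_dist = class_distribution(train_labels)
prod_dist = class_distribution(prod_labels)
shift = total_variation_distance(train_dist, prod_dist)

THRESHOLD = 0.1  # assumed cut-off; choose one that suits your problem
if shift > THRESHOLD:
    print(f"Class distribution shift of {shift:.2f} exceeds {THRESHOLD}")
```

A check like this only flags the mismatch. Deciding whether to resample, reweight or look for a different dataset still takes someone who knows the data.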
What do you do if the necessary data does not exist? You generate it.
Simulated data is usually required either to “smooth” gaps in data coverage, or to reduce reliance on highly sensitive data. For example, in a past project we were asked to provide a dataset of patient stories (full patient histories and inpatient visit records) that would cover the full range of adverse events that can happen within hospitals. While a substantial number of real inpatient records were available, they were incomplete, were hard to get approved for a broad study due to privacy concerns, and still did not cover all adverse events for all relevant demographics, since some of them are relatively rare.
This proved to be a fun and complex data simulation project – each new patient story had to make clinical sense (age, gender, symptoms, medications, order of events, etc.); adverse events had to happen according to their real expected distribution; and patients with no adverse events had to be added, to maintain these distributions, while also keeping the overall distributions (of demographics, specialties, co-morbidities) realistic. That was one example where producing the data was harder than the data science project it was used for – and it also required substantial data research to find the relevant adverse event tables, distributions, correlations with demographics and disease states, and more.
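To make the distribution requirement concrete, here is a deliberately simplified Python sketch. The event names, rates and age bands are invented placeholders, not figures from the project described above, and the sketch treats events as independent. A real simulator would also have to enforce clinical consistency and the correlations mentioned earlier.

```python
import random

# Hypothetical per-admission adverse event rates (placeholders, not real figures).
ADVERSE_EVENT_RATES = {"fall": 0.03, "pressure_ulcer": 0.02, "medication_error": 0.05}
# Hypothetical age-band mix for the simulated population.
AGE_BANDS = [("18-44", 0.25), ("45-64", 0.35), ("65+", 0.40)]


def simulate_patient(rng):
    """Draw one patient: an age band plus zero or more adverse events."""
    bands, weights = zip(*AGE_BANDS)
    age_band = rng.choices(bands, weights=weights, k=1)[0]
    # Each event is drawn independently at its target rate (a simplification).
    events = [name for name, rate in ADVERSE_EVENT_RATES.items() if rng.random() < rate]
    return {"age_band": age_band, "adverse_events": events}


def simulate_cohort(n, seed=0):
    """Simulate n patients with a seeded generator so runs are reproducible."""
    rng = random.Random(seed)
    return [simulate_patient(rng) for _ in range(n)]


def observed_rate(cohort, event):
    """Share of simulated patients who experienced the given event."""
    return sum(event in p["adverse_events"] for p in cohort) / len(cohort)


cohort = simulate_cohort(20000, seed=42)
for event, target in ADVERSE_EVENT_RATES.items():
    print(f"{event}: target {target:.3f}, simulated {observed_rate(cohort, event):.3f}")
```

Most simulated patients end up with no adverse events at all, which is what keeps the simulated rates close to their targets.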
You don’t only need the right dataset at the right quality – you also need it right in the platform you do your analysis in, in the optimal format for that platform and toolset.
Let’s assume you are running a natural language analysis on ten million clinical records, and your tool of choice is Apache Hadoop or Apache Spark. Your DataOps expert should know the data formatting and access choices for these platforms. For example, they might recommend Parquet as the read-optimized data serialization format for that data, transform the data into that format, load it into the cluster for you, generate Hive or SparkSQL tables, and only then call you in to do your job.
On the other hand, if you are running a geo-spatial analysis of people’s access to hospitals, and ElasticSearch is your platform of choice, then a very different recommendation is in order. Given several thousand hospitals at most, a viable choice would be to format the data as one index of hospitals, using GeoJSON for the geo-spatial coordinates or polygons, and load it all into memory for the analysis.
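To show what “GeoJSON, all in memory” can look like at this scale, here is a small Python sketch that is independent of any search platform. The hospital names, coordinates and helper functions are hypothetical; the only fixed convention is that GeoJSON positions are written as [longitude, latitude].

```python
import math

# Hypothetical hospitals as a GeoJSON FeatureCollection.
# GeoJSON positions are [longitude, latitude].
hospitals = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-122.4194, 37.7749]},
            "properties": {"name": "Hospital A"},
        },
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [-121.8863, 37.3382]},
            "properties": {"name": "Hospital B"},
        },
    ],
}


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometers between two lon/lat points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_hospital(lon, lat, collection):
    """Return (name, distance_km) of the closest hospital by linear scan."""
    def distance(feature):
        h_lon, h_lat = feature["geometry"]["coordinates"]
        return haversine_km(lon, lat, h_lon, h_lat)

    best = min(collection["features"], key=distance)
    return best["properties"]["name"], distance(best)


name, km = nearest_hospital(-122.27, 37.80, hospitals)
print(f"Nearest hospital: {name} ({km:.1f} km)")
```

With only a few thousand hospitals, a linear scan like this is fast enough; a spatial index starts to matter only at much larger scales.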
Formatting and moving data around isn’t fancy, but it’s a core part of preparing data for analysis, and hence of the DataOps job description.
Last but not least
A common thread in all of the roles above is that your DataOps partner has to be a deep domain expert in the space you are working in, and also has to be part of the project team. Make sure people know what problem you’re trying to solve and why, and then raise your expectations of them.