The data scientist’s point of view on Turnkey Data
When you start an analytics project – say, an analysis of adherence to medication, or prediction of depression in chronic patients – you need a lot of data from multiple sources prepared upfront. The data has to be of high quality, up to date, and inter-operable across its different sources. It needs to be well documented, so that you easily know what each field and data elements means. And most fundamentally, it has to just be there, already formatted and loaded into your analytics platform of choice.
This data engineering aspect of DataOps – making sure that as a data scientist or analyst you spend zero time loading and formatting data and metadata – is what this post is about. Here are some requirements and best practices we have found to be useful; we’d love to learn from you what else proved useful.
Data Serialization & Formats
First and foremost, data has to be in a native format for the analyst’s tool of choice. For tabular data, this usually means CSV (comma-separated values) for tools like Excel, Tableau or Qlik; JSON format for NoSQL databases like MongoDB or Cassandra; and binary formats for large datasets on either SQL, Hadoop or Spark.
For some of these serialization formats there are standards, or at least de-facto standards, like IRC 4180 which defines the text/csv MIME type. Whatever your needs, there is an open standard for every type and variant of data element, so there is never a reason to invent your own or accept a proprietary one.
However, just applying a standard is not enough – because as a user of the data, the correction answer for your required format is “whatever works out of the box within my platform”. So if all your date and time fields are perfectly represented according to ISO 8601 but your version of SAS or SPSS reads them as strings, then that format is wrong for you. You should require that data is formatted for you in a way that’s easily accessible by your platform’s native libraries – no proprietary libraries should be needed.
The only way to make sure this consistently works is to ensure your data providers actually test the data in the format provided against your platform, in advance, and keep that testing up to date as new platform & dataset versions are released.
Optimization
Your next requirement is optimization. The format has to be optimized for your platform of choice, and minimize computation time, network bandwidth and the amount of memory used. For example, the Hadoop & Spark ecosystem supports a variety of text-based file formats, but also the newer Apache Parquet columnar storage format.
Compared to text, XML or JSON formats, Parquet can boost queries to run 30 times faster while also reducing storage needs by 75%. The exact figures depend of course on the specific dataset and query benchmark, but are substantial. Cloudera released performance metrics that showed Parquet 2.0 as the best performing data format for Impala queries, compared to all other available options including Avro, RC files and compressed text. Parquet is a binary, compressed, columnar, self-describing data format, which was designed specifically to scale for big data distributed storage & query platforms like Spark.
Parquet is not the optimal data solution for every platform and analysis. The key point is that different technologies can produce dramatic differences in efficiency (and analyst productivity), so you should require your DataOps team to optimize their choice of format for your needs, and explain their choices.
Metadata
If you ask someone to give you a dataset, and they give you a dataset, then you got half of what you asked for.
In addition to the data, the metadata – dataset description, first and last update times, keywords, schema definitions, keys and other constraints, license information, temporal and spatial coverage, data quality scores and other key facts – are as important as the data itself to make it usable.
Metadata has to be provided in both human-readable format – i.e. a nicely formatted document that a researcher can read to quickly evaluate if a dataset is even relevant for them – as well as machine-readable format, so that the dataset can be easily ingested into your data governance, lineage, catalog, discovery, security & privacy or version control tools.
The best practices for how to format metadata files are the same as they are for data. Make sure it’s clean, up to date, and versioned along with the data itself. Stick with open standards – but test that they work with your specific tool of choice. Optimize the format based on how you’re going to use it.
Metadata Formats
“The nice thing about standards is that you have so many to choose from.” – Andrew S. Tanenbaum
There are multiple (competing) standards for metadata from multiple organizations. Here are some of the prominent ones, which we have experience with:
- Open Knowledge – Data Packages and Tabular Data Packages
- Open Data Institute & W3C – Data Catalog Vocabulary
- Project Open Data at cio.gov – Common Core Metadata Schema
- The Dublin Core – Metadata Initiative
- JSON Schema & JSON Table Schema
- Open mHealth – Clinical measures schema library
In addition, in some cases the requirement is just for a database schema or data model, Hive meta-store viewer or full list of search indices. Usually, whatever is easier for you to use within your current platform is your right choice, and the one you should expect your data provider to support.
Full & Automated Data Ingestion
The final stage of preparing data for analysis is to actually load the data into your platform of choice. This should be done in an automated fashion, which prevents manual errors and allows for fast & easy updates as soon as they happen.
Your DataOps team or partner should do this for you. For example, if one hundred datasets need to be loaded into a SQL database, this requires the generation of 100 “create table” statements based on the dataset’s metadata (field names, types, constraints, etc.), execution of that script, execution of another script which loads the actual data into the database, configuration of permissions and other metadata elements, and automated validation that it all works.
If you’re using Hadoop, then the load process would involve parallel ingestion of the data in Parquet format into HDFS, and then creating a Hive meta-store external table to enable SQL-like access to that data. Other scripts and validation would be required if a data lineage tool (like Cloudera Navigator or Apache Falcon) are deployed, or if another distributed query engine (Impala, Drill, Presto, etc.) are used.
You should require your DataOps team or provider to be able to produce multiple formats of both data and metadata, since it is very likely that business-critical datasets will be used across multiple platforms or platform versions over time. Automatically generating all of them from the same source of truth provides essential piece of mind that all representations of the data & metadata will be consistent.
To API or not to API
A lot of attention over the past few years has gone to providing data via API’s – usually REST API’s. This is a practical and valuable way to share data and enable applications to easily query and consume it – but often does not work well for analytics.
Running aggregate analysis, statistical methods, machine learning or data mining algorithms often requires use of entire datasets, and very often in very different access patterns that what user-facing applications need. Many analytics platform are now optimized – or have even re-implemented distributed query & machine learning algorithms – to best utilize available disk, memory, bandwidth and the processing framework they run on.
If you have access to a big data platform (like Spark or Hadoop), an MPP database (like Netezza, Vertica or Greenplum), or any database providing in-database analytics (practically all the major commercial ones) – then you must load the entire dataset into these databases before starting your analysis. We have found this to be by far a more common use case than calling out to an API, across the analytics space and in particular in healthcare, where security & privacy requirements further limit the ability to connect to external Internet services from within your analytics platform.