A lot of the time spent in preparing data for analysis is not spent on just formatting data the right way, but also on making sure that data elements from different sources mean the same thing. For data to be inter-operable, the semantic meaning of each data elements must be consistent across all the datasets you depend on. The richer the semantic is, the deeper the data quality and the assumptions that can be made on combining datasets.
Here are some of the best practices we’ve learned and applied over the years, in curating and enriching data from dozens of sources into one fully inter-operable semantic standard. We are always looking to learn and improve, so please share your experience.
A Unified Type System
The core of your semantic standard must be a unified type system: A full, well-defined definition of numbers, strings, Booleans, date, time, location, currency and units. The type system should be fine grained, so that it can be mapped to the most specific type available in each platform where the data is used. For example, if you have a ‘short’ integer type then it will be mapped to ‘smallint’ in a SQL database, ‘short’ in ElasticSearch or ‘Number’ in JSON. There is no loss of specificity in any case.
There should be a standard for representing a null values, enumerations (selection of a value from a closed list of possibilities), and true / false values for Boolean fields. The representation of these values may change when the data is mapped to different platforms (i.e. NULL in SQL versus null in Parquet), but the meaning should be the same and fully preserved in any mappings.
Derived data types
Beyond the fundamental types that every programming language or platform supports, there should be standard definitions for commonly used data types that are derived from them. For example:
- Derivatives of strings: Email, URL, phone number, street address, hostname, IP address, etc.
- Derivatives of numbers: Currencies (32 EUR, 64 GBP), Units (10 kg, 25 m/sec).
- Derivatives of date and time: Date only, time only, duration, time interval, etc.
- Derivatives of location: Geo-point, circle, arc, rectangle, polygon, etc.
Unified and standards based data model – including numbers, dates, units, currency, null
There should be a standard way to define semantic constraints on fields of different types:
- Strings: Must have a minimum and maximum length, must match a regular expression, etc.
- Numbers: Must be between a minimum and maximum, must be divisible by some number, etc.
- Date and time: Must be within a date range, must be during a workday, must be in the past, etc.
- All: Required fields, Unique fields, Value must be from an enumerated close list of choices, etc.
Note that some established derived types specify constraints (i.e. a weight cannot be negative, a phone number must match a regular expression). As with types, it is useful for constraints to be granular, so that they can be mapped effectively to different metadata standards (see below).
However, to enable this mapping to always happen, it is recommended to avoid supporting full programming-language style constraints – i.e. “end_data >= start_date + days(7) – since this require a full-blown expression language which is not supported by most metadata and data governance tools. Stick with semantics you can implement on any platform, current and future.
Levels of Measurement
Since we are preparing data for analytics, making sure that the level of measurement of each numeric field is explicitly defined is a key feature of any analytics-ready semantic framework.
Statistics got its bad reputation by people applying methods that are not statistically valid, without understanding the underlying assumptions, getting junk results, and being surprised when those results do not reflect reality. Not specifying the level of measurement for every numeric variable, and considering all “numbers” to be the same, has this exact effect – especially since most software and BI packages automatically compute descriptive statistics and correlation between variables.
Computing a mean or standard deviation on a nominal or ordinal variable is plain wrong and delivers a meaningless result. Computing Pearson correlation between categorical fields is equally wrong and misleading. The four levels of measurement must be explicitly defined for every field to prevent this.
Note that these levels must be manually curated and cannot in general be automatically deduced. For example, there is just no way to know if a column containing the values 3, 5, 17, 31, 55, 99 is nominal (reason codes), interval (house numbers) or interval (temperatures). Automated assumptions are dangerous, since they can lead to invalid descriptive statistics and correlations, which are then often published to business analysts and decision makers.
Beyond the schema of each dataset – i.e. the name, description, type and constraints for each field – it is also essential to document key facts of the dataset as a whole:
- A short description of the dataset (to appear in search results), as well as a long description for any additional information that the publishers deemed relevant
- Temporal and spatial coverage
- Keywords and synonyms, to make the dataset more easily discoverable
- When the data was first published, when it was last updated and by whom
- Copyright and license information
Many of these facts should have a standard representation, so that it can be mapped directly to other data management platforms. For example, if the “last updated” fact is always documented (and validated) to be in the exact same format across all of your dataset, it becomes easy to index all datasets and the query for all datasets that changed in the past seven days. We recommend that you adopt strict guidelines for each fact of interest, prior to beginning to prepare datasets at scale.
Curated, not Inferred
As a best practice, you should require your DataOps team to curate and document the dataset properties and every field manually, using their domain expertise and familiarity with the dataset. There should definitely be automated validations that valid types, non-conflicting constraints and standard facts are documented, but the curation itself should be manually and not automatically inferred.
Most modern analytics libraries provide the ability to automatically infer the types of data in tabular form, and for example propose certain fields as numeric while others as categorical. However, by nature they cannot always be right – this requires domain expertise and a knowledge of the data’s meaning beyond the numbers themselves, which is what your DataOps team should provide.