Data Curation

18.09.2017

Svitlana Rybka

In our previous post, we covered Data Engineering as one of the levels of the maturity model of a productive analytics platform.

In this blog post, we are going to examine the problem of Data Curation.

Data Curation

Use best source
License compliance
Freshness
Metadata
Representative of production use

To build a model, you need data to train it on, and it is critical that the data you feed the model is appropriate for the problem you are trying to solve. Almost everything can be found on the Internet nowadays, but that availability comes with no guarantee of quality. If you do not select and use the best source, your efforts will be wasted or, even worse, lead to wrong results. There is always a temptation to collect and use all the data that is available, but this is not always a good idea. Accumulated data may have gaps, and you may never know whether a value is genuinely missing, was temporarily unavailable during collection, or was left out by intent. The right approach is to ask a domain expert, who knows how and where to get relevant data and can guide you through the myriad of data sources. You need to clearly understand what problem your model will solve, be able to recognize relevant data, and use a source that is trusted not only by you but also by all the users of your model.

When you have found a data source, don’t forget to check license compliance. In our experience of data curation, we have seen well over 100 different types of data distribution licenses, each of which needs to be carefully read and followed. The number of potential licenses is not the only pressure on researchers: a resource that is currently redistributable under a particular license can change its terms of reuse over time, so the license should be consulted before each data publication. Be aware that the absence of license requirements for a particular dataset does not mean that anyone is entitled to use the data for any purpose. Note that global obligations may also apply, for example local policy, funder obligations, institutional obligations, or data repository obligations. Sometimes a multiple licensing approach is used, in which case you have to choose, from a specified set, the license under which you will use the data. Another possibility, which is quite rare, is to negotiate a special license or contract that allows you to use the data even though this was not allowed initially. Make sure that every dataset you use, from every source, is redistributable.
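As a rough illustration of the license check described above, the sketch below flags datasets that have no recorded license or whose license is not on an approved list. The `Dataset` class, the `license_problems` function, and the contents of `REDISTRIBUTABLE` are all assumptions made for this example; a real allow-list should come from a legal review, not from code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical allow-list of license identifiers approved for redistribution.
# The actual list is an assumption here and must be set by legal review.
REDISTRIBUTABLE = {"CC0-1.0", "CC-BY-4.0", "ODbL-1.0"}


@dataclass
class Dataset:
    name: str
    license_id: Optional[str]  # None means no license information was found


def license_problems(datasets: List[Dataset]) -> Dict[str, str]:
    """Map each problematic dataset name to the reason it was flagged."""
    problems = {}
    for ds in datasets:
        if ds.license_id is None:
            # A missing license does not mean the data is free to use.
            problems[ds.name] = "no license found; do not assume free use"
        elif ds.license_id not in REDISTRIBUTABLE:
            problems[ds.name] = f"license {ds.license_id} is not on the allow-list"
    return problems


sources = [
    Dataset("weather", "CC-BY-4.0"),
    Dataset("scraped_prices", None),
    Dataset("vendor_feed", "Proprietary"),
]
for name, reason in license_problems(sources).items():
    print(f"{name}: {reason}")
```

Because license terms can change over time, a check like this would need to be re-run, with refreshed license information, before each data publication.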

The next step is to check the freshness of the data, making sure you have the latest data available. Keeping data up to date is not a one-time activity but a continuous process. Needless to say, everything you build on top of the data loses its value if the data comes from outdated sources or is not updated regularly. If you are still not getting accurate predictions, go back to the training set and check both that the data has not “expired” and that its quality is adequate.
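A freshness check can be as simple as comparing the time of the last update with a maximum acceptable age. The following is a minimal sketch; the function name `is_stale` and the seven-day threshold are assumptions for illustration, and the right threshold depends on how quickly your data changes.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_stale(last_updated: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """Return True when the data is older than max_age."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_updated > max_age


# Example with fixed, timezone-aware timestamps (illustrative values only).
last_refresh = datetime(2017, 9, 1, tzinfo=timezone.utc)
check_time = datetime(2017, 9, 18, tzinfo=timezone.utc)
print(is_stale(last_refresh, timedelta(days=7), now=check_time))  # True
```

Since keeping data up to date is a continuous process, a check like this would typically run on a schedule rather than once.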

Metadata for curated data should be correct and current, so that you always know the data schema, the types of fields, and which fields are required and which are optional. Metadata must be sufficient to explain to others what data exists and why, when, by whom, and how it was generated. It is very important that metadata explains not only the content of a dataset but also its context, because a term in one dataset may not mean the same thing in another. Before merging different data sources, make sure that they use the same context. To ensure a high level of accuracy, it is recommended that a human collects, reviews, versions, and updates the metadata.
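To make the idea of schema metadata concrete, here is a small sketch that records field types and required/optional status, then checks a record against them. The `SCHEMA` layout, the field names, and `validate_record` are hypothetical and chosen only for this example; they do not reflect any particular metadata standard.

```python
from typing import Any, Dict, List

# Hypothetical schema metadata: expected type and whether the field is required.
SCHEMA = {
    "user_id": {"type": int, "required": True},
    "country": {"type": str, "required": True},
    "age": {"type": int, "required": False},
}


def validate_record(record: Dict[str, Any], schema: Dict[str, dict]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for field, rule in schema.items():
        value = record.get(field)
        if value is None:
            # Optional fields may be absent; required ones may not.
            if rule["required"]:
                errors.append(f"missing required field: {field}")
            continue
        if not isinstance(value, rule["type"]):
            errors.append(
                f"field {field} should be {rule['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    for field in record:
        if field not in schema:
            errors.append(f"field not described in metadata: {field}")
    return errors


print(validate_record({"user_id": 1, "country": "DE"}, SCHEMA))
print(validate_record({"user_id": "1", "zip": "10115"}, SCHEMA))
```

A check like this only covers the content side of metadata. Whether two datasets use a term in the same context still needs human review.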

Being representative of production use is one more characteristic that curated data must have. The data on which you train the model must represent how your application or service will be used in production. For example, if you train a model of shopping preferences on data collected in the USA, it will not be of much use on the European market, because the available products are different: users may not be able to buy from the recommended brand, and purchases suggested ahead of the 4th of July will make no sense to a European customer.
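One simple way to get a first signal on representativeness is to compare how a categorical attribute, such as the customer’s country, is distributed in the training data and in a sample of production traffic. The sketch below uses total variation distance for that comparison. This is one technique among many, not a method prescribed by this post, and the function name and sample values are assumptions for illustration.

```python
from collections import Counter
from typing import Hashable, Sequence


def total_variation_distance(train_values: Sequence[Hashable],
                             prod_values: Sequence[Hashable]) -> float:
    """Return a value in [0, 1]: 0 for identical distributions, 1 for disjoint ones."""
    if not train_values or not prod_values:
        raise ValueError("both samples must be non-empty")
    train_counts = Counter(train_values)
    prod_counts = Counter(prod_values)
    categories = set(train_counts) | set(prod_counts)
    # Half the sum of absolute differences between category shares.
    return 0.5 * sum(
        abs(train_counts[c] / len(train_values) - prod_counts[c] / len(prod_values))
        for c in categories
    )


# Illustrative samples: training data from one market, production from two.
train_countries = ["US", "US", "US", "US"]
prod_countries = ["US", "US", "DE", "DE"]
print(total_variation_distance(train_countries, prod_countries))  # 0.5
```

A large distance suggests the training data may not reflect production use for that attribute. What counts as "large" is a judgment call that depends on the application.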

In this blog post, we discussed Data Curation as one of the levels of the maturity model of a productive analytics platform.

In future blog posts in this series, we’ll cover other levels of the maturity model of a productive analytics platform:

Data Quality
Data Integration
Data Security & Privacy

Please note that Data Quality, as the most critical topic, was covered before the series started. You can check the real-life examples and study the 5 Levels of Data Readiness for Analytics.
