Data Evolution: Tips on Keeping Reusable Data Up to Date

29.08.2016

David Talby

Chief technology officer at John Snow Labs

Having the freshest, most up-to-date data at your fingertips makes analysis more relevant and accurate. When data is updated, you would like to simply copy the new datasets over the old ones and continue working. The ability to access and compare with previous versions is also useful.

The many complexities of making this “just work” have spawned decades of research and development in schema evolution, taxonomy management, configuration control and related areas. Here are some of the most common use cases and best practices we have found useful in the data science & analytics realm.

The Ops Part of DataOps: Knowing what has changed

The world’s information is not organized in nice, clean libraries and APIs (not until we’re done with it!). Different datasets are updated on different schedules (quarterly, monthly, daily, or whenever enough data is available), and often on an irregular basis (whenever a committee gets to meet about it, or whenever changes are deemed urgent).

There is no standard or formal way to announce changes. For example, the recent addition of 3,651 ICD-10 hospital inpatient procedure codes and about 1,900 ICD-10 diagnosis codes was announced via – drum roll – meeting notes published on the CMS website (YouTube videos of the meeting are also available). Another example is finding out about drug shortages, for which there is no API, only two mobile apps that provide notifications. Keeping up with such changes requires an in-house domain expert who continuously monitors for these changes and integrates them as soon as they arrive. Most companies choose to outsource this ongoing operational burden.

The Ops Grind: Clean updates every day

Some business-critical datasets are updated on a daily or weekly basis: licensed physicians, licensed drugs, clinical coding edits (LCDs and NCDs), newswire feeds, and multiple types of public health warnings. Keeping these datasets fresh requires automating the extraction, cleansing, validation and formatting of the data into clean output.

Doing this on your own requires an automated data pipeline into which different input formats can be easily plugged, so that the cleansing, validation and publication code is shared and kept consistent across all of your data sources. The input format will almost always be custom, since there is so much variety in how data and updates are published across the ecosystem. The pipeline should output both the data itself and its metadata, so that any changes in the data’s schema or provenance are versioned and published with the data itself.
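A minimal sketch of such a pipeline, under stated assumptions: the registry `PARSERS`, the functions `register`, `cleanse`, `validate` and `run_pipeline`, and the metadata fields are all hypothetical names chosen for illustration, not part of any particular product. The point it shows is that only the parser differs per input format, while the cleansing, validation and metadata code is shared.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, str]
Parser = Callable[[str], Iterable[dict]]

# Hypothetical registry: one parser per input format, everything else shared.
PARSERS: Dict[str, Parser] = {}


def register(fmt: str) -> Callable[[Parser], Parser]:
    """Decorator that plugs a new input format into the pipeline."""
    def wrap(fn: Parser) -> Parser:
        PARSERS[fmt] = fn
        return fn
    return wrap


@register("csv")
def parse_csv(text: str) -> Iterable[dict]:
    return csv.DictReader(io.StringIO(text))


@register("jsonl")
def parse_jsonl(text: str) -> Iterable[dict]:
    return (json.loads(line) for line in text.splitlines() if line.strip())


def cleanse(rec: dict) -> Record:
    """Shared cleansing: trim whitespace; missing values become empty strings."""
    return {
        k.strip(): ("" if v is None else str(v).strip())
        for k, v in rec.items()
        if k is not None  # csv.DictReader files surplus columns under None
    }


def validate(rec: Record, required: List[str]) -> bool:
    """Shared validation: every required field must be present and non-empty."""
    return all(rec.get(field) for field in required)


def run_pipeline(text: str, fmt: str, required: List[str],
                 source: str) -> Tuple[List[Record], dict]:
    """Parse, cleanse and validate; return clean rows plus their metadata."""
    records = [cleanse(r) for r in PARSERS[fmt](text)]
    clean = [r for r in records if validate(r, required)]
    metadata = {
        "source": source,
        "input_format": fmt,
        "schema": sorted({k for r in clean for k in r}),
        "row_count": len(clean),
        "rejected": len(records) - len(clean),
    }
    return clean, metadata


rows, meta = run_pipeline("code, name\nA1 , Aspirin\nB2,\n",
                          "csv", ["code", "name"], "example feed")
print(rows, meta)
```

Supporting a new source then means writing one more parser function; the rules that decide what counts as clean data stay in a single place.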

In simple cases, such as adding to an existing dataset or updating parts of existing records, the new version can just be uploaded and used as-is. More complex cases, like schema evolution or changes to taxonomies or referential records, are discussed below.

Another common requirement is to publish not only a full updated dataset, but also an incremental dataset – including only the changes – so that online systems can quickly ingest the changes. For example, the AERS database of reported adverse events goes back years, which makes it much faster to ingest only the most recent new entries on every update, instead of reloading the entire database.
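As a rough illustration, an incremental dataset can be derived by comparing two full versions on a stable record key. The sketch below assumes every record has such a key (the field name `report_id` is hypothetical) and that both versions fit in memory; `compute_delta` and `apply_delta` are illustrative names.

```python
from typing import Dict, List

Row = Dict[str, str]


def index_by_key(rows: List[Row], key: str) -> Dict[str, Row]:
    """Index records by their unique key for fast lookup."""
    return {row[key]: row for row in rows}


def compute_delta(old_rows: List[Row], new_rows: List[Row], key: str) -> dict:
    """Return only what changed between two full versions of a dataset."""
    old = index_by_key(old_rows, key)
    new = index_by_key(new_rows, key)
    return {
        "added": [new[k] for k in new if k not in old],
        "updated": [new[k] for k in new if k in old and new[k] != old[k]],
        "removed": [k for k in old if k not in new],
    }


def apply_delta(rows: List[Row], delta: dict, key: str) -> List[Row]:
    """What a downstream system does: ingest the changes, not the full file."""
    current = index_by_key(rows, key)
    for k in delta["removed"]:
        current.pop(k, None)
    for row in delta["added"] + delta["updated"]:
        current[row[key]] = row
    return list(current.values())


# Hypothetical records keyed by "report_id".
v1 = [{"report_id": "1", "event": "rash"},
      {"report_id": "2", "event": "nausea"}]
v2 = [{"report_id": "1", "event": "rash"},
      {"report_id": "2", "event": "severe nausea"},
      {"report_id": "3", "event": "headache"}]

delta = compute_delta(v1, v2, "report_id")
print(delta)
```

Publishing the delta alongside the full dataset lets online systems apply the small file on each update, while new consumers can still bootstrap from the full one.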

Version Control

One best practice that should always be in place is a version control system. It should be easy for everyone to access previous versions, visually compare versions, and see a documented list of changes. There are many good choices here, which can also provide access control, auditing and backup functionality. One point to consider is that some datasets can be very large (many gigabytes or even terabytes), so file-based version control systems like the ones used for software development or document management aren’t good choices – you should look for a system that supports large files well.

Another key best practice is to always version metadata together with data. Each versioned dataset should have a paired, versioned metadata file, which describes that data’s contents, provenance, schema and key facts. It is essential that these two files are always updated, versioned and published together. This way, if a field is added to or removed from a dataset across versions (schema evolution), the schema of each version is always correct, and the changes are easy to see when comparing the two metadata files.
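One possible way to keep the pair together, sketched under assumptions: the metadata layout (`version`, `sha256`, `schema`, `provenance`) and the function names below are hypothetical. The content hash ties a metadata file to exactly one data file, and comparing the `schema` entries of two versions surfaces added, removed or retyped fields.

```python
import hashlib
import json


def build_metadata(data: bytes, version: str, schema: dict, provenance: str) -> dict:
    """Create the metadata record that is published next to the data file."""
    return {
        "version": version,
        "sha256": hashlib.sha256(data).hexdigest(),  # binds metadata to this data
        "schema": schema,  # field name -> type
        "provenance": provenance,
    }


def matches(data: bytes, metadata: dict) -> bool:
    """Check that a data file and a metadata file belong to the same version."""
    return hashlib.sha256(data).hexdigest() == metadata["sha256"]


def diff_schema(old_meta: dict, new_meta: dict) -> dict:
    """Summarize schema evolution between two versions."""
    old, new = old_meta["schema"], new_meta["schema"]
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "retyped": sorted(f for f in set(old) & set(new) if old[f] != new[f]),
    }


# Hypothetical two versions of a small dataset.
data_v1 = b"npi,name\n1,Smith\n"
data_v2 = b"npi,name,state\n1,Smith,NY\n"
meta_v1 = build_metadata(data_v1, "1.0", {"npi": "int", "name": "str"},
                         "example registry export")
meta_v2 = build_metadata(data_v2, "2.0",
                         {"npi": "str", "name": "str", "state": "str"},
                         "example registry export")

print(json.dumps(diff_schema(meta_v1, meta_v2), indent=2))
```

Changes in collection method or coverage would not show up in a schema diff, so they belong in a free-text field such as `provenance` that reviewers compare by eye.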

Also, sometimes the changes will not be obvious in the data, since they relate to how the data was collected (e.g. a change from collecting data face-to-face to asking over the phone), or what its coverage is (e.g. the sampled population for a survey has changed). Sometimes these “minor” changes can have a dramatic impact on both descriptive and predictive analytical results from the data, so they have to be documented as part of the metadata of each version.

Terminology Changes

The last common use case that requires special treatment by your DataOps expert is when ontologies, taxonomies or terminologies are changed. Consider for example a longitudinal study on clinical outcomes of ways to treat coronary disease. The study will evaluate the effectiveness of drugs, procedures and lifestyle changes on different types of heart patients.

The challenge is that the canonical names and codes for symptoms, procedures, drugs, co-morbidities and even demographic information change over the years. Thousands of changes happen every year – for example, in the recent version 18.1 of the MedDRA (Medical Dictionary for Regulatory Activities) terminology, 1,323 change requests were approved and implemented. New versions with similar change activity come out twice a year.

So how do we plot the longevity of patients with a certain heart condition who take a certain medication, if the names or codes for both the condition and the medication have been changed 20 times over ten years? What if new synonyms were added and used, drugs became generic and were rebranded, and ICD-9 became ICD-10?

Dealing with such changes well has been the subject of much academic and industry activity, and requires careful modelling. Most importantly, referential data should never be deleted – i.e. once a drug code has been used and stored in a patient record, this code should never disappear (or even worse, be reassigned). Instead, it should be marked as deprecated, and there should be a standard way of mapping these old values to one or more other values, so that longitudinal studies can use these semantic mappings to track changing but equivalent values over time. Mappings are often more complex than simple equivalences, since sometimes one value is expanded into multiple (more granular) values, and sometimes the reverse happens and multiple values are mapped into a single parent node.
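To make the deprecate-and-map idea concrete, here is a small sketch. The table `TERMS` and its codes are invented for illustration and do not come from any real terminology; `resolve` and `overlaps` are hypothetical helper names. Codes are never removed from the table, only flagged as deprecated and pointed at their successors, which covers one-to-one renames, splits and merges.

```python
from typing import Dict, Set

# Hypothetical terminology table: entries are deprecated, never deleted.
TERMS: Dict[str, dict] = {
    "OLD0": {"deprecated": True, "maps_to": ["OLD1"]},            # renamed
    "OLD1": {"deprecated": True, "maps_to": ["NEW1A", "NEW1B"]},  # split
    "OLD2": {"deprecated": True, "maps_to": ["NEW2"]},            # merged
    "OLD3": {"deprecated": True, "maps_to": ["NEW2"]},            # merged
    "NEW1A": {"deprecated": False, "maps_to": []},
    "NEW1B": {"deprecated": False, "maps_to": []},
    "NEW2": {"deprecated": False, "maps_to": []},
}


def resolve(code: str, terms: Dict[str, dict]) -> Set[str]:
    """Follow mappings from a possibly deprecated code to its current codes."""
    current: Set[str] = set()
    seen: Set[str] = set()
    stack = [code]
    while stack:
        c = stack.pop()
        if c in seen:  # guards against accidental cycles in the mappings
            continue
        seen.add(c)
        entry = terms[c]  # a KeyError here means a code was deleted, not deprecated
        if entry["deprecated"] and entry["maps_to"]:
            stack.extend(entry["maps_to"])
        else:
            # Either a current code, or a deprecated one with no successor.
            current.add(c)
    return current


def overlaps(a: str, b: str, terms: Dict[str, dict]) -> bool:
    """True if two codes from different years resolve to a shared current code."""
    return bool(resolve(a, terms) & resolve(b, terms))


print(sorted(resolve("OLD0", TERMS)))
print(overlaps("OLD2", "OLD3", TERMS))
```

A longitudinal query can then group patient records by their resolved codes rather than by the codes as originally recorded. Note that after a split, an old code overlaps with each of its successors without being strictly equal to any one of them, which is a modelling decision the analysis has to make explicitly.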

Whether you are building or buying ontology, taxonomy or terminology datasets, make sure that the source has real domain expertise in dealing with the specific challenges of modelling updates in them correctly. You should be able to just “replace on update” such datasets, which requires the source to do the heavy lifting in the backend and provide you with an easy-to-use solution.

To ensure the ongoing relevance and utility of reusable data, integrating tools such as Generative AI in Healthcare and a Healthcare Chatbot can provide valuable insights and support. These technologies not only enhance data management practices but also facilitate real-time updates and accessibility, allowing organizations to stay informed and responsive in a rapidly evolving landscape.

Our additional expert:

David Talby is a chief technology officer at John Snow Labs, helping healthcare & life science companies put AI to good use. David is the creator of Spark NLP – the world’s most widely used natural language processing library in the enterprise. He has extensive experience building and running web-scale software platforms and teams – in startups, for Microsoft’s Bing in the US and Europe, and to scale Amazon’s financial systems in Seattle and the UK. David holds a PhD in computer science and master’s degrees in both computer science and business administration.
