Why is data licensing so, so hard?
As a data consumer, I want to easily get and use free open data, and have a straightforward license for paid proprietary data. If only life were that simple.
“Opening data while prohibiting using it”, a study from March 2016, found that 48% of surveyed USA open datasets are encumbered by copyright restrictions. This confirms what we have found at John Snow Labs: well over a hundred different licenses across the datasets we have processed over the years, many sites silent on which license they use, and, much more often than you’d expect, conflicting license or copyright information within the same page or website.
In fact, most of the data sources we work with do not reuse a standard license, such as Creative Commons. When they invest in clarifying their legal language, they most often come up with their own custom license, which creates its own problem: data consumers need to read, understand, track and comply with many one-off requirements across the data sources they use.
Common types of data licensing clauses include:
- Attribution requirements
- Regular reporting requirements
- Privacy training of people who have access to data
- Field of use restrictions
- Internal use restrictions
- Geographic restrictions
- Redistribution restrictions
Payment models vary as well: per user, per machine, per viewer, per call, or by organization type, size or revenue. This happens even for core healthcare datasets like CPT or SNOMED-CT, which are legally required to be used in a broad array of scenarios.
This complexity often makes it extremely hard in practice to fully comply and to understand costs. This is a headache both for large companies, who are taking a legal risk, and for startups, who find themselves surprised when asked for data license disclosures (similar to open source license disclosures) as a standard part of investors’ due diligence.
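One way to keep such one-off requirements manageable is to record each license's clauses as structured data and check a planned use against them. The following is a minimal sketch, not a legal tool: the `LicenseTerms` and `PlannedUse` classes, their field names and the example dataset are all hypothetical, and a real license would need review by counsel before being encoded this way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseTerms:
    """Hypothetical record of the clauses found in one dataset's license."""
    dataset: str
    attribution_required: bool = False
    internal_use_only: bool = False
    redistribution_allowed: bool = True
    allowed_fields_of_use: frozenset = frozenset()  # empty means unrestricted
    allowed_countries: frozenset = frozenset()      # empty means unrestricted


@dataclass(frozen=True)
class PlannedUse:
    """Hypothetical description of how a team intends to use the dataset."""
    field_of_use: str
    country: str
    external_product: bool
    redistributes_data: bool


def find_violations(terms: LicenseTerms, use: PlannedUse) -> list[str]:
    """Return the names of the clauses that the planned use would break."""
    violations = []
    if terms.internal_use_only and use.external_product:
        violations.append("internal use restriction")
    if use.redistributes_data and not terms.redistribution_allowed:
        violations.append("redistribution restriction")
    if terms.allowed_fields_of_use and use.field_of_use not in terms.allowed_fields_of_use:
        violations.append("field of use restriction")
    if terms.allowed_countries and use.country not in terms.allowed_countries:
        violations.append("geographic restriction")
    return violations


# Invented example: an internal-use-only dataset licensed for the US.
terms = LicenseTerms(
    dataset="example-registry",
    internal_use_only=True,
    redistribution_allowed=False,
    allowed_countries=frozenset({"US"}),
)
use = PlannedUse(
    field_of_use="research",
    country="DE",
    external_product=True,
    redistributes_data=False,
)
print(find_violations(terms, use))
# ['internal use restriction', 'geographic restriction']
```

Clauses that are obligations rather than restrictions, such as attribution, reporting or privacy training, do not block a use outright, so this sketch leaves them out of the violation check.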
Privacy & the needs of data publishers
In healthcare, the need to deal with highly sensitive personal information, which has criminal law protections, adds further layers of complexity to data sharing and reuse. Data re-identification is the top concern for healthcare data sharing, rated about twice as much as the next concerns on the list in this December 2015 survey. This makes sense given past examples of how easily records that were believed to be anonymized were re-identified, and the lawsuits and bad press that followed.
Data owners have several legitimate concerns about making data public:
- They are taking a legal risk amid ever-improving technological capability to combine and infer connections between data elements, making re-identification attacks easier.
- They are taking a legal risk because the standard methods of HIPAA-compliant de-identification are evolving, and even basic questions like the meaning of consent in health data sharing are still unclear.
- Patients or other study subjects may refuse to participate if their personal data may be shared, or if the limitations on data sharing are fuzzy.
- De-identification is always a trade-off between privacy and utility. Since the degree of de-identification (as measured by k-anonymity, l-diversity, t-closeness or related privacy metrics) needs to be different based on how public the data becomes, it is sometimes simply the wrong trade-off to de-identify to a level that enables full public disclosure of the dataset.
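To make the privacy versus utility trade-off concrete, here is a minimal sketch of measuring k-anonymity: k is the size of the smallest group of records that share the same values on every quasi-identifier column. The function name, column names and toy records are assumptions made for illustration, and the sketch ignores l-diversity and t-closeness entirely.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values for every quasi-identifier column."""
    if not records:
        raise ValueError("k-anonymity is undefined for an empty dataset")
    groups = Counter(
        tuple(record[column] for column in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Invented toy records, not real patient data.
records = [
    {"age": 34, "age_band": "30-39", "zip3": "941"},
    {"age": 37, "age_band": "30-39", "zip3": "941"},
    {"age": 42, "age_band": "40-49", "zip3": "941"},
    {"age": 45, "age_band": "40-49", "zip3": "941"},
]

# Exact ages make every record unique on its quasi-identifiers.
k_exact = k_anonymity(records, ["age", "zip3"])
# Generalizing ages into ten-year bands puts each record in a group of two.
k_banded = k_anonymity(records, ["age_band", "zip3"])
print(k_exact, k_banded)  # 1 2
```

Generalizing exact ages into bands raises k from 1 to 2 here, but any analysis that needs exact ages is no longer possible on the banded data. A dataset released to a small set of vetted researchers might reasonably keep more detail than one released to the public.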
On top of that, there is overhead in data sharing that is unrelated to privacy & security:
- Opening up to questions about data quality, and the overhead of supporting other data users.
- Developing a data management capability and the software needed to prepare & publish the data.
- Anonymizing data to enable publication, and documenting the methods used to do so.
- Correcting the data when other users point out issues, and publishing data updates as they happen.
- Paying for hosting and securing the data.
- Deciding on or writing a data license, which in the case of personal information cannot be a fully open one.
- Bearing the overhead of checking that users comply with the terms.
Considering this list, it’s no wonder that the default choice by many is to simply not publish data – even though this causes academic, commercial and societal harm.
What data consumers need to do today
Those of us building software, analytics or data science solutions today cannot wait for the industry to fix itself. What we can do is understand the world as it is – the perspectives of both data publishers and consumers, as summarized above – and then make sure to play within the rules.
Just as using open source software is not free (you have the overhead of tracking compliance, ensuring attribution and providing disclosures to investors), using open data is also not free. In healthcare, you may also expose yourself to serious legal risk by neglecting security and privacy requirements.
Your action item is to make data compliance a key deliverable of your DataOps team, whether that team is in-house or outsourced. That team needs to always stay on top of license & regulatory requirements, and perform the operational tasks of reporting, attribution, disclosures and process-oriented privacy and security controls. It’s the law, it’s being a good citizen of the data ecosystem, and it also builds trust with your data partners, making them more likely to share sensitive and early-release data with you.
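Some of these operational tasks lend themselves to simple automation. The sketch below, under stated assumptions, generates attribution notices and flags overdue usage reports from a list of per-dataset obligations. The `DatasetObligation` class, its fields, and every dataset and publisher name are hypothetical; the actual wording and timing of notices and reports depend on each license.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class DatasetObligation:
    """Hypothetical compliance record for one dataset in use."""
    name: str
    publisher: str
    license_name: str
    attribution_required: bool
    report_every_days: Optional[int] = None  # None means no reporting clause
    last_report: Optional[date] = None


def attribution_notices(obligations):
    """Build the attribution lines to ship with a product or disclosure."""
    return [
        f"{o.name}, published by {o.publisher}, used under {o.license_name}"
        for o in sorted(obligations, key=lambda o: o.name)
        if o.attribution_required
    ]


def overdue_reports(obligations, today):
    """Return the names of datasets whose periodic usage report is overdue."""
    overdue = []
    for o in obligations:
        if o.report_every_days is None:
            continue
        # A report that was never filed counts as overdue.
        if o.last_report is None or o.last_report + timedelta(days=o.report_every_days) < today:
            overdue.append(o.name)
    return overdue


# Invented inventory for illustration only.
obligations = [
    DatasetObligation("Example Drug Codes", "Example Health Agency",
                      "Example Custom License 1.0", True,
                      report_every_days=90, last_report=date(2024, 1, 1)),
    DatasetObligation("Example Census Extract", "Example Statistics Office",
                      "CC BY 4.0", True),
    DatasetObligation("Example Internal Benchmark", "Example Vendor",
                      "Example Commercial License", False,
                      report_every_days=30),
]

for line in attribution_notices(obligations):
    print(line)
print(overdue_reports(obligations, today=date(2024, 6, 1)))
```

The same inventory can feed the data license disclosures that investors ask for during due diligence, which is one reason to keep it as structured data and not scattered across documents.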
John Snow Labs provides full data compliance operations as an always-included part of our support package. We also help our customers find, in advance, datasets that fit their licensing requirements (e.g. can be used within a commercial product, in their target markets, etc.), and choose between free and paid third-party datasets depending on their exact needs. We encourage other data & DataOps providers to do the same, since this both eliminates a major customer pain point and enables customers to make faster progress by focusing on their core strengths: software and data science.
What data publishers need to do today
We believe that making healthcare data open and easily reusable is a moral responsibility of data owners, and provides many benefits to society. We also understand the difficulty and overhead involved in sharing data, which often make it impractical despite good intentions.
We would like to help.
Making data ready for analysis is the only thing we do.
If you have data that you would like to share:
- Under a limited policy, with strong privacy & security controls
- As a paid revenue stream (data monetization)
Then we provide a turnkey solution to achieve that. Once you provide us the data under contract, we will clean, anonymize, normalize, document, validate, reformat and publish it for you. We will then distribute it online, via our partners (at your discretion) and marketing channels. We will provide support to data consumers, apply corrections and updates to the data as they happen, host and serve the data and its documentation in multiple formats, help data consumers comply with your data use policy, and, if it’s monetized, also handle all aspects of billing. This can be done as a paid service, a revenue share or a combination of both.
Whatever path you choose for publishing your data, we hope you do publish it, and under standard, easy-to-use terms, so that together we slowly move our industry in the right direction.