The report from the House of Lords Select Committee on Artificial Intelligence is 183 pages long. To quote from it, “All indications suggest that the present status quo will be disrupted to some extent by the upcoming General Data Protection Regulation (GDPR), which the UK is planning to adhere to regardless of the outcome of Brexit, and the new ePrivacy Regulation, which both come into force across the EU from 25 May 2018. With respect to data access, the GDPR’s introduction of a right to data portability is probably the most significant feature.”
Despite heavy criticism of the firm’s track record, Facebook founder Mark Zuckerberg has promised the United States Congress that Facebook will implement GDPR-like controls across the globe.
Sir Tim Berners-Lee, inventor of the World Wide Web, has proposed Solid (a term derived from “social linked data”), a set of standards and tools that would allow internet users to choose where their personal data is kept and how it is used.
At the other end of the spectrum, SenseTime is the technology partner enabling China’s nationwide facial recognition program.
Regulators are focusing on “data monopolies” so that end users can escape the “data poverty” bucket. These changes benefit end users by giving them control over, in addition to ownership of, their data.
Source of Truth: Federated vs. Central
Traditional organizations that are not moving to the cloud en masse, or that are cherry picking applications or silos to move to the cloud, may end up with a federated data infrastructure spanning on-premise infrastructure, the cloud, and hybrids (public and privately managed clouds). Organizations that go all in on the cloud, or stay entirely on-premise, do not qualify as centralized by default, although that may be the paradigm they fall into inadvertently.
As you can see, ending up with federated or centralized data sources is almost never a deliberate choice but rather an accident or side effect. Given that, how do you ensure data quality and governance by design? Where there is no single source of truth, tools-driven product development best practices are the most successful.
Focusing on a single source of truth almost never attracts the same C-suite sponsorship as cloud migration, cloud budgeting, or even data quality; a single source of truth is a means to reach some of those funded objectives. Data pipelines are a great way to ensure the auditability and reproducibility of data used in any experiment, including the code, data, model, and annotations that go with the experiment.
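One lightweight way to make an experiment auditable is to fingerprint every input to a run (code, data, model, and annotations) and store the resulting manifest alongside the experiment. The sketch below is illustrative only, using the Python standard library and hypothetical artifact names; it is not a John Snow Labs API:

```python
import hashlib
import json

def sha256_bytes(data: bytes) -> str:
    """Content hash used to fingerprint each artifact."""
    return hashlib.sha256(data).hexdigest()

def experiment_manifest(code: bytes, data: bytes,
                        model: bytes, annotations: dict) -> dict:
    """Bundle the fingerprints of everything an experiment depends on,
    so any run can later be audited and reproduced exactly."""
    return {
        "code_sha256": sha256_bytes(code),
        "data_sha256": sha256_bytes(data),
        "model_sha256": sha256_bytes(model),
        # Serialize annotations deterministically before hashing.
        "annotations_sha256": sha256_bytes(
            json.dumps(annotations, sort_keys=True).encode()
        ),
    }
```

Two runs with identical manifests used identical inputs; any drift in the data or annotations shows up as a changed hash, which keeps audits and exact reproduction tractable.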
John Snow Labs allows you to white label and use, under a perpetual license, the same platform we have been using to answer our clientele’s modeling and algorithm needs. The advantage is that we are acknowledged experts in handling unstructured data alongside other data types. In addition, the platform is environment agnostic by design: cloud, on-premise, and hybrid systems. Our Spark NLP 1.5 benchmarks have just been published; for NLP questions, please join our Slack channel.
Third Party Data Sets
The common refrain about the most frequently encountered bottleneck among data scientists, analysts, and machine learning engineers is that you are only as good as your data. Best-of-breed inference algorithms cannot reach their full potential without good training data; likewise, cross validation cannot reach its full accuracy potential without good data quality. Data enrichment does not depend on internal data alone: you have to augment your data using reliable third party data sources. If your pipeline deals in C. diff drug predictions, for example, you may need microbiome and metagenome datasets to augment your data set. John Snow Labs’ team of PhDs and MDs has long been generating aggregated datasets of the highest quality. Please visit our catalog to find data sets that can augment your modeling and algorithm needs; John Snow Labs can also source additional data sets not listed in our catalog for your use.
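Mechanically, enrichment with a third party dataset is often just a left join on a shared key: internal rows are kept even when no external match exists. The sketch below uses plain Python dictionaries and hypothetical field names (`sample_id`, `microbiome_score`); a real pipeline would typically do this in Spark or pandas:

```python
def enrich(internal_rows, third_party_rows, key):
    """Left-join internal records with a third party dataset on a shared
    key, keeping internal rows even when no external match exists."""
    lookup = {row[key]: row for row in third_party_rows}
    enriched = []
    for row in internal_rows:
        extra = lookup.get(row[key], {})
        merged = {**extra, **row}  # internal values win on field collisions
        enriched.append(merged)
    return enriched
```

Letting internal values win on collisions is one defensible default; the opposite choice (trusting the third party source) should be made explicitly, per field, as part of your governance policy.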
The data quality and governance standards you apply to your first party data for handling, processing, and classification should be no different from, and should always augment, those you apply to third party data (in addition to the third party’s agreement with its own first party sources). The primary takeaway from the Facebook and Cambridge Analytica hearings is that buying or selling data does not mean its terms and conditions expire at the time of the transaction. This is particularly relevant for confidential health care and life sciences data when it is aggregated. A case in point is the “Golden State Killer” arrest, made on the basis of a family member’s DNA: the importance of careful data handling never gets more apparent than when degrees of separation are involved.
Logging, Audit Bots
Security, healthcare, and DevOps practitioners generally choose to log all their data and store it permanently, while other industries prefer to meet a minimum bar by logging as little data as possible. The current state of streaming data and inference on real-time production data has given rise to “observability” over monitoring. While “testing in production” is still a hot button issue, observability and replay are notions that the data stewardship community can actively get behind. Even reproducibility is run serverless these days; no code is okay, but observability is a must.
Pulling from archived storage to recreate timelines of events is now the norm as a business practice rather than a chaos engineering notion. Where IT custodians stop at defining policy, compliance usually takes over to ensure audit requirements are met on wall clock time scales.
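A minimal sketch of a replayable, append-only audit log, assuming JSON-lines entries and an in-memory stream purely for illustration (a production system would write to durable, archived storage):

```python
import io
import json
import time

class AuditLog:
    """Append-only, structured (JSON-lines) audit log that can be
    replayed to reconstruct a timeline of events."""

    def __init__(self, stream=None):
        # Any writable text stream works; StringIO keeps the demo self-contained.
        self.stream = stream or io.StringIO()

    def record(self, actor: str, action: str, **details):
        """Append one timestamped, structured event."""
        entry = {"ts": time.time(), "actor": actor, "action": action, **details}
        self.stream.write(json.dumps(entry) + "\n")
        return entry

    def replay(self):
        """Re-read the log from the start, in original order."""
        self.stream.seek(0)
        return [json.loads(line) for line in self.stream if line.strip()]
```

Because every event is structured and ordered, recreating a timeline for an audit is a deterministic replay rather than a forensic exercise over free-text log lines.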
Cognitive logging bots that anticipate red flags and keep relevant data in near-term archival storage are an example of effective automation in this space. There is also a move back to Solr-based logging in place of Elasticsearch because of time and cost.
Black Boxes and Fine Prints
System integrators today want their methods to be transparent and clearly stated rather than provided in copious binders that are never read. Local Interpretable Model-Agnostic Explanations (LIME) and the Open Neural Network Exchange (ONNX) are steps in the right direction toward unboxing the black box. The call for a Hippocratic Oath equivalent among data stewards should lead to accountability, and that requires all the tooling we can get. With GDPR there is more than just a fine as an incentive to be acknowledged here: this is a chance to get things right, with privacy and interpretability as first class citizens. Popular GDPR tools include Collibra and Protegrity, which are built on the notions of master data management and a single source of truth; they are cloud native as well as environment agnostic for the remaining on-premise and hybrid holdouts in many healthcare and life sciences organizations.
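To make the intuition behind LIME concrete without depending on the lime library itself, the sketch below probes a black-box model locally: perturb one feature at a time and report the finite-difference slope of the prediction. This is only the core idea; actual LIME fits a weighted local surrogate model over many random perturbations:

```python
def local_sensitivity(predict, x, eps=1e-3):
    """Crude local explanation in the spirit of LIME: nudge one feature
    at a time and report how much the black-box output moves.

    `predict` takes a list of floats and returns a float."""
    base = predict(x)
    scores = []
    for i in range(len(x)):
        perturbed = list(x)
        perturbed[i] += eps
        # Finite-difference slope: local influence of feature i.
        scores.append((predict(perturbed) - base) / eps)
    return scores
```

The output is one influence score per feature at this particular input, which is exactly the “local” in LIME: the explanation holds near this prediction, not globally.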
John Snow Labs Inc. is a DataOps company, accelerating progress in analytics and data science by taking on the headache of managing data and platforms. A third of the team have a PhD or MD degree and 75% have at least a Master’s degree from multiple disciplines covering data research, data engineering, data science, and security and compliance. We are a USA Delaware Corporation, run as a global virtual team located in 15 countries around the globe. We believe in being great partners, in making our customers wildly successful, and in using data philanthropy to make the world a better place.