In the previous parts of this series we covered the earlier levels of the maturity model of a productive analytics platform.

In this blog post, we are going to examine the problem of Data Security & Privacy.


Data Security & Privacy

  1. Security controls
  2. Compliance
  3. De-identification
  4. Privacy-Utility tradeoff analysis
  5. Attacking & reverse engineering models


Analytics platforms often deal with personally identifiable data. You cannot avoid it completely, because some private information is usually needed for the analysis, so you have to protect this data.

Basic security controls are the simplest measures and must always be in place: role-based access, user management, encrypted communication, and enforced security rules (e.g. when people leave the organization, their access to data is revoked).
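As a rough illustration of role-based access combined with offboarding, here is a minimal sketch. The role names, permission strings, and the `User` class are hypothetical and chosen only for this example; a real platform would normally rely on its identity provider rather than hand-rolled checks.

```python
from dataclasses import dataclass, field

# Hypothetical role-to-permission mapping; real platforms define their own.
ROLE_PERMISSIONS = {
    "analyst": {"read:deidentified"},
    "data_steward": {"read:deidentified", "read:identified"},
    "admin": {"read:deidentified", "read:identified", "manage:users"},
}


@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)
    active: bool = True  # set to False when the person leaves the organization


def is_allowed(user: User, permission: str) -> bool:
    """Return True only if the user is active and one of their roles grants the permission."""
    if not user.active:
        return False
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user.roles)


def offboard(user: User) -> None:
    """Revoke all access for a departing user."""
    user.active = False
    user.roles.clear()
```

With this sketch, a user holding only the `analyst` role can read de-identified data but not identified data, and loses all access once `offboard` is called.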

Compliance means following the standards required by law or by your industry (such as ISO standards, HIPAA or PCI DSS). It introduces more security controls: audit logs, minimum password strength, no reuse of previous passwords, and regular log review (for example, every week someone actually opens the logs and checks whether there is anything suspicious).
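Two of the controls above, minimum password strength and not reusing previous passwords, can be sketched in a few lines. The thresholds (`MIN_LENGTH`, `HISTORY_SIZE`) and function names are assumptions made for this example and are not taken from any particular standard; check the standard you must comply with for the actual requirements.

```python
import hashlib
import hmac
import os
import re

MIN_LENGTH = 12    # assumed policy value, not taken from a specific standard
HISTORY_SIZE = 5   # assumed number of previous passwords that cannot be reused


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a salted hash so that plain-text passwords are never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def is_strong(password: str) -> bool:
    """Require a minimum length plus lower-case, upper-case and digit characters."""
    return (
        len(password) >= MIN_LENGTH
        and re.search(r"[a-z]", password) is not None
        and re.search(r"[A-Z]", password) is not None
        and re.search(r"\d", password) is not None
    )


def was_used_before(password: str, history: list) -> bool:
    """history is a list of (salt, digest) pairs, oldest first."""
    recent = history[-HISTORY_SIZE:]
    return any(
        hmac.compare_digest(hash_password(password, salt), digest)
        for salt, digest in recent
    )


def change_password(password: str, history: list) -> bool:
    """Accept the new password only if it is strong and not recently used."""
    if not is_strong(password) or was_used_before(password, history):
        return False
    salt = os.urandom(16)
    history.append((salt, hash_password(password, salt)))
    return True
```

Only salted hashes are kept in the history, so the reuse check works without ever storing an old password in readable form.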

De-identification means anonymizing data and getting rid of restricted data that is not needed for analytics. If you are dealing with personal information, the first candidates for de-identification are: full name, full address (maybe leave the zip code), and exact date of birth (just keep the age). The data that identifies individuals must be stripped out (irreversibly anonymized) before the analysis.
Note that de-identification has a weakness: an attacker may be able to re-identify the data. A famous example is the de-anonymization of the Netflix Prize dataset. Fully de-identified data means that it is not possible to identify an individual from the data itself or from that data in combination with other data. In many projects, you cannot fully de-identify.
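The first de-identification candidates listed above can be handled with a small transformation. This is only a sketch: the field names (`full_name`, `street_address`, `zip_code`, `birth_date`) are hypothetical, and dropping these fields alone does not guarantee that a record cannot be re-identified.

```python
from datetime import date

# Hypothetical field names for direct identifiers that are dropped entirely.
DIRECT_IDENTIFIERS = {"full_name", "street_address"}


def age_on(birth_date: date, today: date) -> int:
    """Age in whole years on the given day."""
    years = today.year - birth_date.year
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1  # birthday has not happened yet this year
    return years


def deidentify(record: dict, today: date) -> dict:
    """Drop direct identifiers and replace the exact birth date with the age."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in DIRECT_IDENTIFIERS and key != "birth_date"
    }
    cleaned["age"] = age_on(record["birth_date"], today)
    return cleaned
```

The zip code and any analytical fields pass through unchanged, while the name, street address and exact birth date never reach the analysis dataset.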

Privacy-utility trade-off analysis means checking the data field by field to see what cannot be de-identified because your specific project needs it. The more privacy, the less utility: datasets become less useful.
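One way to put a number on the privacy side of this trade-off is k-anonymity: the size of the smallest group of records that share the same values in the fields an attacker could use for linking. This measure is our choice for illustration, not the only option, and the field names and bucket sizes below are hypothetical.

```python
from collections import Counter


def k_anonymity(records: list, quasi_identifiers: list) -> int:
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    if not records:
        return 0
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())


def generalize_age(record: dict, bucket: int) -> dict:
    """Return a copy with the exact age replaced by the lower bound of its bucket."""
    out = dict(record)
    out["age"] = (record["age"] // bucket) * bucket
    return out


# Toy data: exact ages make every record unique.
records = [
    {"zip": "02139", "age": 31},
    {"zip": "02139", "age": 34},
    {"zip": "02139", "age": 38},
    {"zip": "02139", "age": 39},
]

k_exact = k_anonymity(records, ["zip", "age"])
k_5_years = k_anonymity([generalize_age(r, 5) for r in records], ["zip", "age"])
k_10_years = k_anonymity([generalize_age(r, 10) for r in records], ["zip", "age"])
```

Coarser age buckets raise k (more privacy) but leave the analysis with less precise ages (less utility), which is exactly the field-by-field decision described above.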

Machine learning systems are increasingly tempting targets for attacks and reverse engineering. Models can be tampered with by training them on wrong examples (data poisoning); this security problem was covered at OWASP Summit 2017. Another vulnerability is reverse engineering, or stealing, of machine learning models. You build, train and invest in a model in order to provide a prediction API for commercial use (machine learning as a service), but someone can reverse engineer the model through that API (a so-called model extraction attack), and all your effort is lost. Currently the effort required to extract a model is quite low: as was pointed out at the 25th USENIX Security Symposium, it can take only hundreds, or at most thousands, of online queries to steal your model. You should always remember that using your model API should be cheaper than attempting to extract your model. This does not mean that you should bill as little as possible for each API call; it means that your security must be strong enough to prevent the stealing of the model.
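To see why extraction can be so cheap, consider a toy case, assuming the model is a plain logistic regression and the API returns exact confidence scores. The log-odds of the score are then linear in the input, so one query at the origin reveals the bias and one query per feature reveals each weight. The "victim" API below is a hypothetical stand-in built for this example; real attacks and real models are more involved.

```python
import math


def make_prediction_api(weights: list, bias: float):
    """Hypothetical victim: a logistic regression served behind an API
    that returns a confidence score for each input vector."""
    def predict(x: list) -> float:
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        return 1.0 / (1.0 + math.exp(-z))
    return predict


def logit(p: float) -> float:
    """Inverse of the sigmoid: maps a confidence score back to w.x + b."""
    return math.log(p / (1.0 - p))


def extract(predict, n_features: int):
    """Recover weights and bias using n_features + 1 queries."""
    bias = logit(predict([0.0] * n_features))  # query at the origin gives b
    weights = []
    for i in range(n_features):
        x = [0.0] * n_features
        x[i] = 1.0                             # unit vector isolates w_i
        weights.append(logit(predict(x)) - bias)
    return weights, bias


secret_weights, secret_bias = [0.8, -1.5, 2.0], 0.3
api = make_prediction_api(secret_weights, secret_bias)
stolen_weights, stolen_bias = extract(api, n_features=3)
```

Here four queries are enough to copy a three-feature model. Returning only labels or rounded scores and limiting the number of queries per API key are commonly discussed ways to raise the cost of such an attack, although they make extraction harder rather than impossible.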

To avoid issues like these and make the most of machine learning, it is important to follow all five levels of the maturity model of a productive analytics platform. Tell us what you are doing and what complications you are facing, and John Snow Labs will include our solutions within your operations.