
Data Engineering – Maturity Model of Productive Analytics Platform

By Big Data, Data Science, DataOps

There has long been debate over what data operations means, and the term DataOps is still settling into a commonly accepted definition. What is certain is that interest in it is growing rapidly in IT communities, and it is already clear that DataOps is becoming a new trend.

So why is DataOps more and more in demand? Today it is no longer enough to just deploy an analytics platform to production as part of your Data Engineering activities. Very soon you will need to maintain the data used by the system and start a Data Curation initiative. Keep in mind that the data should go through all five levels of the Data Quality maturity model. Does that look complex enough? And the effort does not end there: don’t forget the need for Data Integration and for Data Security & Privacy.




Here at John Snow Labs we have considerable experience in the areas of analytics platform maturity, and we believe we know how to help you accelerate data science. We have summarized the most typical problems, pain points and complaints that we hear when we start working with new clients.

This blog post kicks off a series of blog posts in which we will give recommendations on doing your DataOps right.

In this first post, we will focus on Data Engineering.


Data Engineering


  1. Deploy models to production
  2. Deploy retrain pipelines
  3. Auto-load data, models & metadata
  4. Online & offline measurement
  5. Correct concept drift

Data scientists are good at prototyping machine learning models, but they may struggle when deploying those models to production, especially when it comes to maintenance: the availability, reliability and security of the platform. These two activities, building a model and deploying it to a production environment at scale, demand completely different skill sets. On the one hand, data scientists conduct research, perform analysis and implement machine learning projects. They are busy visualizing data and preparing reports, and most of the time they have a degree in statistics. On the other hand, deploying models usually requires a computer science education, knowledge of programming languages (Java, Scala, C++, Python) and of parallel data processing frameworks (Apache Spark, Hadoop MapReduce). You are lucky if you have someone on the team who is equally good at both, but most of the time an engineer needs to partner with the data scientist to make the prototyped model work smoothly in production.

Keeping the model effective requires deploying retrain pipelines, which is rather time-consuming if the process is not automated. This automation is also a job for the data engineer.

Another way a data engineer can save time (and ultimately budget) is to set up auto-loading of data, models & metadata. With this in place, pushing new data and retraining models in production becomes far easier.

Have you ever checked how accurate your model is? Or how well it performs? Being able to measure model performance is a crucial component of a successful project. If you had online & offline measurements of the model, you would already be looking for ways to improve those indicators. The more performant your model is, the more extensively it can be used operationally in production. Remember one very important point: do not treat performance measured on test datasets as the real performance of the data analytics platform. Compare performance on several independent production-like datasets. Although performance is an important measurement, you should always balance the effort spent on performance optimization against the effort spent on improving the accuracy of the platform. The rate of correct predictions out of all predictions made is the ultimate key to success. Depending on the area of application and the corresponding cost of an error, even 99% accuracy may not be enough.
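As a minimal sketch of the comparison described above, the following Python snippet computes accuracy (correct predictions divided by all predictions) separately for a test set and for several production-like datasets. Everything in it is an assumption made for illustration: the `threshold_model` function, the dataset names and the toy data are hypothetical and do not come from any real platform.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# One labelled example: (feature vector, true label)
Example = Tuple[Sequence[float], int]


def accuracy(predict: Callable[[Sequence[float]], int],
             dataset: List[Example]) -> float:
    """Rate of correct predictions out of all predictions made."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    correct = sum(1 for features, label in dataset
                  if predict(features) == label)
    return correct / len(dataset)


def accuracy_report(predict: Callable[[Sequence[float]], int],
                    datasets: Dict[str, List[Example]]) -> Dict[str, float]:
    """Accuracy of the same model on each named dataset."""
    return {name: accuracy(predict, data) for name, data in datasets.items()}


def threshold_model(features: Sequence[float]) -> int:
    """Hypothetical toy model: predicts 1 when the first feature is >= 0.5."""
    return 1 if features[0] >= 0.5 else 0


# Hypothetical toy data: one test set and two production-like datasets.
datasets = {
    "test_set": [([0.9], 1), ([0.1], 0), ([0.7], 1), ([0.2], 0)],
    "production_like_a": [([0.6], 1), ([0.4], 1), ([0.3], 0), ([0.8], 1)],
    "production_like_b": [([0.55], 0), ([0.45], 1), ([0.9], 1), ([0.1], 0)],
}

report = accuracy_report(threshold_model, datasets)
worst = min(report, key=report.get)  # dataset on which the model does worst

for name, value in report.items():
    print(f"{name}: {value:.2f}")
print(f"worst dataset: {worst}")
```

In this toy setup the model looks perfect on the test set but does noticeably worse on the production-like data, which is exactly the gap the test-set figure alone would hide.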

Specialists also tend to forget how crucial it is to keep models up to date and not let them degrade over time. How many times have you seen concept drift (also referred to as dataset shift), where the accuracy of a model dropped from 99% to almost 50%? Concept drift is a generic term covering changes in the data, and the computational problems they cause, as time passes. These changes can be of different types, and there are different adaptation techniques, so a generic solution is hardly possible. Reliable detection of such changes is needed to maintain high performance and meaningful analyses of datasets.

The role of the data engineer is becoming extremely important as you look for a person who will assist you with the five problems mentioned above. You may also request help from John Snow Labs (JSL) specialists, who have considerable experience in resolving such situations in production. Whether you tackle these potential problems or ignore them is up to you. What matters is that you are now aware of them and can evaluate the risks.
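One simple way to detect this kind of degradation, sketched below under stated assumptions, is to track accuracy over a sliding window of recent labelled predictions and flag a drop below a baseline. This is only one of many possible detection techniques and not a generic solution; the class name `AccuracyDriftMonitor`, the window size and the `max_drop` threshold are hypothetical choices made for illustration.

```python
from collections import deque
from typing import Optional


class AccuracyDriftMonitor:
    """Flags possible concept drift when recent accuracy falls below a baseline.

    Illustrative sketch: it assumes true labels eventually become available
    for production predictions, which is not the case in every application.
    """

    def __init__(self, baseline_accuracy: float,
                 window_size: int = 100, max_drop: float = 0.10) -> None:
        self.baseline_accuracy = baseline_accuracy
        self.max_drop = max_drop
        # Keeps only the most recent `window_size` outcomes (True = correct).
        self.outcomes = deque(maxlen=window_size)

    def record(self, prediction: int, actual: int) -> None:
        """Store whether one production prediction was correct."""
        self.outcomes.append(prediction == actual)

    def window_accuracy(self) -> Optional[float]:
        """Accuracy over the current window, or None if nothing is recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def drift_detected(self) -> bool:
        """True once the window is full and accuracy dropped by more than max_drop."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return self.baseline_accuracy - self.window_accuracy() > self.max_drop


# Usage with hypothetical numbers: baseline 99%, window of 10 predictions.
monitor = AccuracyDriftMonitor(baseline_accuracy=0.99, window_size=10, max_drop=0.10)
for _ in range(10):
    monitor.record(prediction=1, actual=1)   # model is still correct
print(monitor.drift_detected())

for _ in range(5):
    monitor.record(prediction=1, actual=0)   # the data has changed
print(monitor.window_accuracy(), monitor.drift_detected())
```

A flag from such a monitor would typically be the trigger for the retrain pipeline mentioned earlier, although which adaptation technique fits depends on the type of change.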


In this blog post, we discussed the Data Engineering level of the maturity model of a productive analytics platform.

In future posts in this series, we’ll cover the other levels of the maturity model of a productive analytics platform:

  • Data Curation
  • Data Quality
  • Data Integration
  • Data Security & Privacy

Big Data VS Addiction & Gambling

By Big Data Healthcare, Data Science

Before writing this blog post, I spent a few days wondering whether to research and write about the role of big data in fighting addiction or about addiction to big data itself. I decided to give priority to the role of big data in fighting addiction. Addiction here covers different types, such as drug abuse and gambling addiction.

The analysis of big data has revealed, to a great extent, the behavioral processes associated with drug abuse and its treatment. It has also revealed which people are most likely to develop a gambling addiction.

Analyzing huge data sets allows scientists to turn data into knowledge and facts.


Drug abuse

The Big Data to Knowledge (BD2K) program was launched in 2014 by the National Institutes of Health (NIH). You can read more about the project at:

Such large data projects need a pilot phase to test storage, accessibility and sharing features. The NIH established this pilot phase, known as the “Data Commons”, in 2017, and it is expected to continue until 2020. The expected budget for the project is $55.5 million.

Several organizations are concerned with drug abuse, such as the National Institute on Drug Abuse (NIDA) and the National Advisory Council on Drug Abuse (NACDA). Both recommend a set of guidelines concerning drug abuse.

The NIDA, in collaboration with the National Institute on Alcohol Abuse and Alcoholism, the National Cancer Institute, the National Human Genome Research Institute and the Office of Behavioral and Social Sciences Research, has identified a set of measures to promote comparable data collection across different studies.

Those measures can be found in the Substance Abuse and Addiction Collection of the PhenX Toolkit, available at:

PhenX contains 523 measures, grouped by topic. You can browse through Collections and view Measures and Protocols, or browse by “Domains” or “Measures”.



Gambling revenues could reach $500 billion. Over half of UK citizens gamble regularly, and some of them may develop a gambling addiction.

[Screen capture showing the PowerCrunch application, Access date: 13 June 2017, Source:]

PowerCrunch is an application developed by BetBuddy, a UK-based software house. It relies mainly on machine learning and data mining techniques to analyze gamblers’ actions and events, detect high-risk players who are liable to develop addictive or inappropriate behaviors, and send them personalized messages, including tips for sustaining safe gaming behavior.

BetBuddy uses a Three-Tier Model for at-risk gambling detection, which has been assessed and published in peer-reviewed journals specializing in gambling addiction science. This new trend in analysis is known as “Responsible Gaming Analytics Technology”. The first tier is concerned with the analysis of exhibited behavior. The second tier is concerned with the results of a self-assessment test. The third tier is concerned with inferred behavior, which results from combining the outcomes of tiers one and two, and it can then lead to building predictive models.

Psychologists have reported that online gaming vendors can track and analyze data to develop consumer behavior models. Online gaming vendors may want to know which games attract gamblers the most, how long gamblers spend on each one, and how much they can spend on each game.

PowerCrunch is not the only such application; there are also Playscan and Observer. They use methodologies similar to that of PowerCrunch and can also send notifications to gamblers at risk. Such applications can close a gambler’s account if no improvement is noticed.

According to the Ontario Lottery and Gaming Corporation (OLG), nine out of ten gamblers reported that the software helped them control their behavior.


Data Science External Teams – Bound to Succeed

By Big Data Healthcare, Data Science, DataOps, Data Curation

All organizations, no matter how large, have limited resources. All successful companies and teams, no matter how skilled or diversified, perform certain tasks very well and others not as well. By collaborating with external teams, companies can focus on what they do best and let their partners complement them in areas where they do not have core competencies.


The Role Of External Teams

External teams play a pivotal role in successful collaboration between partnering organizations. If you are part of an external team, you will have felt continuous pressure from both sides. One side is your own organization, which wants you to develop a long-term relationship with the customer. The other side is the customer, who wants the project completed successfully while you adhere to their processes and standards. Following a set of best practices will help your team achieve both of these goals effectively.

Projects come in various sizes with varied needs, and no single type of external team will succeed in all of these engagements. But each team type should emphasize a few key factors for successful customer engagements in order to win repeat business and referrals. Here are the most common external team types and the most effective ways to build a long-lasting relationship with customers.


Staff Augmentation

Equivalent to the Collaboration model, the intent is to develop in-house experts down the line. Accordingly, the customer’s team needs to work closely with the external team and set checkpoints at which it replicates the setups and absorbs the knowledge of the augmented team.

Best Practices:

  • Frequent interaction with the customer’s team – This helps, first, to understand the requirements well. Second, the customer’s team gains confidence that it is in command of the tasks, which makes the later handover of those tasks easier.
  • Regular knowledge sharing – The external team should impart knowledge along with the work. This gives the customer’s team ample time to develop its internal resources.
  • Adapt to the customer’s processes – As the output of the external team will eventually be used and maintained by the customer, it is better to create it according to the processes, tools and systems used by the customer. This avoids rework caused by the transition from the external team’s systems to the customer’s systems.
  • Frequent integrations – Often the services or system developed by the external team are part of a bigger system developed by the customer. It is good practice to do small, timely integrations between the two systems. This helps resolve issues earlier and reduces rework or redesign at later stages.
  • Knowledge of the customer’s existing system – It is always helpful to understand the customer’s system, even if only at a high level. This helps the external team see the bigger picture and offer useful suggestions or ideas for improving the overall system.
  • Honest and open communication – The external team needs to be open and honest in its communication. Customers generally perceive that the external team knows everything, as they are the experts. In reality there may be scenarios or use cases that the external team has not dealt with before. In such cases it is better to inform the customer openly and ask for time to research or to seek expert help.
  • Timely retrospectives – As both teams interact on a daily basis, retrospectives tend to be sidelined on the assumption that the collaboration is working fine and both teams’ productivity is good. But retrospectives are a good exercise for testing “What you Believe is What you Hear”. They may uncover hidden surprises or perceptions. This gives the external team and the customer’s team a chance to make any adjustments needed for better results.


Project Outsourcing

Equivalent to a complete silo: the customer receives the whole package. This model fits when little support is required, when the data is not confidential, or when support can easily be provided by the same external team or by other resources.

Best Practices:

  • Regular communication – Although the communication needs are much lower than in the “staff augmentation” model, there should still be some regular communication. It could take the form of weekly meetings between the project managers of the external team and the customer, followed by a status report to all the stakeholders.
  • Stakeholder identification – It is good practice for the external team to know the other stakeholders on the customer’s side. This helps in stakeholder communication and in gaining wider acceptance and acknowledgement of the team’s work.
  • Timely feedback – Intermediate deliveries are often made to the customer from time to time. Make sure to gather feedback and incorporate it into your subsequent deliveries. It is always a good habit to check and recheck the customer’s expectations with each of these deliveries.
  • Identify a single point of contact on the customer side – There will always be queries or clarifications needed on one task or another, so ask the customer for a single point of contact. This helps resolve these queries quickly.
  • Be aware of changes in the customer’s environment – As “change” is the only constant, the external team should be aware of the changes happening in the customer’s environment. This helps the team quickly adapt to and align with the customer’s changing needs and goals.
  • Jargon-free communication – Often you will be dealing with the executive team rather than developers, so it is better to communicate in language that is easy for them to understand. Curb your instinct to show off technical breadth and depth that is very specific to your domain. Instead, use examples or use cases that are more aligned with their environment.
  • Adapt and educate – Every organization has its own working style and culture. There may be processes on the customer’s side that you need to adapt to so that the delivery aligns with their system. At the same time, there may be processes that you need to introduce to your customer to help increase productivity. In this case, educate them and show them the benefits of these improved or new processes.


By following these best practices, an external team can become more aligned with its customer, leading to a common direction and a harmonization of individual energies. The result is a shared vision and an understanding of how to complement each other’s efforts, which leads to the success of your customers and, eventually, to a successful collaboration.