
Successful Data Science Strategies and Early Detection of Diseases

Case Study: Arizona State University (ASU) Research Foundation

One of the main principles I learned during my work at John Snow Labs is to learn from experts. The first rule of any project initiation phase is to seek expert judgment. Reading case studies, white papers, and previous trials in the same field, and learning from both success and failure stories, is a reliable way to build a successful strategy.

The efforts of the Arizona State University (ASU) Research Foundation and Prof. Dr. Joshua LaBaer offer one of the most prominent roadmaps for any organization or company working in the field of biomedical data science. According to U.S. News, ASU was ranked among the most innovative schools in America.[1]

A successful strategy for any organization working in data science, especially in the domain of biomedicine, must take into account many complex factors, including cybersecurity, data security, data integrity, available funding, data management, data storage, data visualization, data analytics, computing capabilities, and the continuous development of smart devices.

Dr. LaBaer supervised the collection of 1,000 breast cancer-related genes. The work later expanded to 15,000 genes.

A great deal of work is also under way there on another life-threatening problem, namely pediatric low-grade astrocytomas (PLGAs). PLGA is fatal and is the most common brain cancer among children. Its current chemotherapies have harmful side effects. Dr. LaBaer's team is working on finding better treatments and reducing the harmful side effects of current chemotherapy. ASU's inspirational success has encouraged others to follow in its footsteps for the sake of humanity.

John Snow Labs catalogs contain more than 1,775 normalized datasets, most of them freshly curated and validated both by machine and manually.

The majority of these datasets belong to the Population Health catalog. Driven and inspired by ASU's achievements in fighting cancer, the JSL team decided to put relevant, high-quality curated data within affordable reach of cancer researchers worldwide. The Population Health catalog includes several curated datasets on global breast and cervical cancer mortality. Using such curated data in cancer research can save up to 60% of a data scientist's time.

Guided by the success of the ASU research team, the JSL team made the following high-quality curated training datasets available to cancer researchers worldwide with a single click:

Brain Cancer by Tumor Site

Cancer Types Grouped by Age

Cancer Types Grouped by Site

Cancer Types Grouped by Area

Childhood Cancer Survival in England 1990 to 2016

Childhood Cancer Registry

Breast Cancer Mortality Statistics

Female Breast Invasive Cancer Incidence Data 2013

Female Breast Cancer Death Data 2013

This data package includes 9 datasets related to cancer statistics in the United States and England, including female breast age-adjusted invasive cancer incidence.

Many other datasets are also available, most of them related to childhood cancer, brain cancer, and breast cancer.


Arizona State University: A Successful Case Study:

There is no doubt that ASU followed state-of-the-art strategies in data science and became one of the leading organizations in the world. ASU's efforts can serve as a case study for anyone interested in the field of biomedical data science.

ASU applied the Next-Generation Cyber Capability (NGCC) as an approach to meet the computing and data needs of its research-related networks. In addition, it applied NimbleStorage’s predictive flash storage approach to data management. Building a successful business model was another important factor in ASU's strategy.

This post summarizes and explains the successful strategy of the ASU Research Foundation from 2 perspectives: the business model and the technology (mainly the storage and the NGCC approach).


Building a successful business model:

Any research project needs funding. The technical needs of the project may call for funding well beyond the means of the research institute. Seeking the right merger or partnership could be a suitable solution.

The Mill startup is an organization dedicated to funding researchers in exchange for shares in the resulting patents. The Translational Genomics Research Institute (TGen) is a non-profit genomics research institute concerned with genetic discoveries and the development of smarter diagnostics and therapeutics. TGen already had a deal with NimbleStorage. After the preliminary trials, the High-Performance Computing (HPC) group and NimbleStorage agreed on a visionary plan to support whatever small business output came out of The Mill startup. The expected output was 4 small business projects, one of which was related to the development of smarter cancer diagnostics and therapeutics. Finally, NimbleStorage created an on-premises cloud at ASU, where researchers can be granted access at low cost.


Choosing the right technology approach

To predict and prevent real-time performance problems caused by overwhelming data growth, ASU decided to use NimbleStorage’s predictive flash storage.

NimbleStorage is headquartered in San Jose, California. Some 8,000 users across 50 countries have chosen NimbleStorage, a solution that combines predictive analytics with flash performance. It is based on 2 main technologies:

– Unified Flash Fabric: a technique that combines all-flash and adaptive flash arrays, where the arrays leverage CASL (Cache-Accelerated Sequential Layout) to improve performance.

The array has an eight-terabyte cache, with an all-flash shelf capacity equivalent to 600 raw terabytes.

– InfoSight Predictive Analytics: a cloud-based monitoring and management system that monitors the client’s infrastructure to predict and prevent real-time performance problems.


Next-Generation Cyber Capability (NGCC)

With more than 90,000 students and 3,000 faculty members, ASU had to develop its own data science strategy.

This strategy must take into account the nature of genomic research, with its advanced computing needs and overwhelming data growth, as well as cybersecurity, network infrastructure, data management, storage, data integrity, and integration. The NGCC architecture depends on cloud-based storage in addition to local and virtual resources.

Integrating physical and logical capabilities so that they perform as a single unit is the main aim set out by Dr. Kenneth Buetow (Director, Computational Sciences and Informatics Program, Complex Adaptive Systems Initiative, Arizona State University).

The physical infrastructure supports daily computing needs through different components linked by a high-speed connection to a huge data storage capacity. This architecture is based on the harmonious interaction of 3 clusters, as follows:

  • The first cluster is a large one with fast processors and a moderate memory capacity.
  • The second cluster is smaller than the first, but it has access to a larger shared memory.
  • The third cluster is composed of nodes connected through high-speed links, each with a large memory and data storage capacity.
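One way to picture how work might be divided across the three clusters described above is the minimal Python sketch below. It is an illustration only: the `Job` fields, the thresholds, and the routing rule in `choose_cluster` are hypothetical assumptions and are not taken from ASU's actual NGCC scheduler.

```python
from dataclasses import dataclass
from enum import Enum


class Cluster(Enum):
    """The three clusters, named here by their distinguishing trait."""
    COMPUTE = "large cluster: fast processors, moderate memory"
    SHARED_MEMORY = "smaller cluster: larger shared memory"
    DATA_INTENSIVE = "high-speed-linked nodes: large memory and storage per node"


@dataclass(frozen=True)
class Job:
    name: str
    memory_gb: float   # peak memory the job is expected to need
    dataset_tb: float  # amount of data the job has to scan


# Hypothetical thresholds, chosen only to make the example concrete.
MODERATE_MEMORY_GB = 64.0
LARGE_DATASET_TB = 1.0


def choose_cluster(job: Job) -> Cluster:
    """Route a job to the cluster whose strengths match its profile."""
    if job.dataset_tb >= LARGE_DATASET_TB:
        # Data-heavy work goes to the nodes with local storage and fast links.
        return Cluster.DATA_INTENSIVE
    if job.memory_gb > MODERATE_MEMORY_GB:
        # Memory-hungry work goes to the cluster with the larger shared memory.
        return Cluster.SHARED_MEMORY
    # Everything else runs on the large general-purpose cluster.
    return Cluster.COMPUTE


if __name__ == "__main__":
    for job in (
        Job("sequence-alignment", memory_gb=16, dataset_tb=0.1),
        Job("genome-assembly", memory_gb=512, dataset_tb=0.5),
        Job("tumor-cohort-scan", memory_gb=128, dataset_tb=40),
    ):
        print(f"{job.name}: {choose_cluster(job).value}")
```

The point of the sketch is the design idea rather than the numbers: each cluster is specialized, so a job is matched to the resource that is its bottleneck.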



Again, science and technology are not the only skills needed for success. Business education is also an important factor. Human resources, cost, and time management plans are important components of any project management plan. They can determine the success or failure of a project to a great extent.

In the case of NGCC, exceptional and rare talents are needed, which makes the mission more complicated. Moreover, communication between different departments is needed, because NGCC depends to a great extent on cost-effective on-demand capabilities, which in turn depend on the human factor and on tight coordination to ensure correct deployment.

The business needs of the NGCC were met by defining several roles, which can be summarized as follows:

  • Program Manager: responsible for the successful delivery of all the proposed roles and responsibilities throughout the whole lifecycle of the project
  • Project Manager: responsible for day-to-day operations
  • Business Manager: monitors and oversees the budget and ensures that it stays within the permissible limits
  • Administrative Assistant: responsible for the time management plan
  • Writer: dedicated to writing external communications and developing the needed training materials and documentation
  • Communication staff member: responsible for writing the website content



According to Jay A. Etchings (former ASU Director of Research Computing), the first 3 years of implementation using this strategy were an outstanding success. A study focusing on the life span of 100 tumors across 12 types was expected to take 120,000 days. The actual time after the successful migration to the new Apache Spark/NGCC system was only 20 minutes.[2]

Just as there are models and standards for success, there are also criteria for failure. Another way to achieve success in your project is to know the reasons for failure and avoid them. I believe reading Phil Simon’s book “Why new systems fail”[3] is important for anyone who wants a complete picture before settling on the final strategy for any data science project.
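To put those two figures on the same scale, the short Python sketch below converts the planned 120,000 days to minutes and divides by the reported 20 minutes. The function name `speedup_factor` is only illustrative, and the calculation simply takes both numbers at face value as quoted above.

```python
MINUTES_PER_DAY = 24 * 60


def speedup_factor(planned_days: float, actual_minutes: float) -> float:
    """Ratio of the planned duration to the actual duration, both in minutes."""
    return (planned_days * MINUTES_PER_DAY) / actual_minutes


if __name__ == "__main__":
    planned_days = 120_000   # expected duration quoted in the text
    actual_minutes = 20      # reported duration after the migration

    planned_minutes = planned_days * MINUTES_PER_DAY
    print(f"Planned: {planned_minutes:,} minutes")
    print(f"Speedup: {speedup_factor(planned_days, actual_minutes):,.0f}x")
```

Taken literally, the planned estimate is roughly 329 years of work, so the reported result corresponds to a speedup of several million times.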



[1] Compass USNC, See the Most Innovative Schools Methodology. The 10 Most Innovative Universities in America [Internet]. U.S. News & World Report. U.S. News & World Report; [cited 2019Feb1]. Available from:

[2] Etchings J. Strategies in biomedical data science: driving force for innovation. Hoboken, NJ: John Wiley & Sons, Inc.; 2017.

[3] Simon P. Why new systems fail: an insider’s guide to successful IT projects. Boston, MA: Course Technology/Cengage Learning; 2011.
