Spark NLP – the best thing since NumPy

John Snow Labs' Spark NLP is an open source text processing library built on top of Apache Spark and its Spark ML library. Its goal is to provide an easy-to-use API for NLP annotations that scales in distributed, large-scale environments.
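
As a rough sketch of what that API looks like (the annotator and class names below come from the public library; the sample text is made up), a pipeline is assembled from ordinary Spark ML stages:

    # A minimal Spark NLP pipeline: annotators are regular Spark ML stages.
    import sparknlp
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, Tokenizer
    from pyspark.ml import Pipeline

    spark = sparknlp.start()  # SparkSession with Spark NLP on the classpath

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

    data = spark.createDataFrame([["Spark NLP runs on top of Spark ML."]]).toDF("text")
    model = Pipeline(stages=[document, sentence, token]).fit(data)
    model.transform(data).select("token.result").show(truncate=False)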

World's Best Performance & Scale

When compared to spaCy:

  • Spark NLP was 38 times faster to train on 100 KB of data

  • Spark NLP was 80 times faster to train on 2.6 MB of data


Frictionless Reuse

The standard Spark NLP package is a ready-to-use open source library.

Healthcare extensions also exist, providing pre-trained clinical NLP models.
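
As an illustrative sketch of that reuse (the pipeline name below is an example from the public model hub; availability depends on your release), a pretrained pipeline is one download away:

    # Hedged example: download and run a ready-made pretrained pipeline.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()
    pipeline = PretrainedPipeline("explain_document_dl", lang="en")  # example name
    result = pipeline.annotate("John Snow Labs is based in Delaware.")
    print(result["entities"])  # entities found by the bundled NER stage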


Enterprise Grade

Production-grade codebase.

Active development with frequent releases.

Growing community.


Why is language understanding hard?

Human language is:

Nuanced

Fuzzy

Contextual

Medium-specific

Domain-specific

Healthcare language is even harder, with more specific needs:

Specialized Core Annotators

- Part of speech, spell checking, …

Different Vocabularies Need Harmonization

- Ontologies, relationships, word embeddings, …

Dedicated ML & DL Models

- Named entity recognition, entity resolution, … (see the sketch after this list)

  • Built on the Spark ML APIs

  • Apache 2.0 Licensed

  • Active development & support
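
A sketch of the dedicated ML & DL models point, using the open source annotators (model names such as glove_100d and ner_dl are examples from the public hub; the licensed healthcare models plug into the same estimator/transformer pattern):

    # Pretrained NER pipeline sketch; clinical models follow the same pattern.
    import sparknlp
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (SentenceDetector, Tokenizer,
                                    WordEmbeddingsModel, NerDLModel, NerConverter)
    from pyspark.ml import Pipeline

    spark = sparknlp.start()

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    embeddings = WordEmbeddingsModel.pretrained("glove_100d", "en") \
        .setInputCols(["sentence", "token"]).setOutputCol("embeddings")
    ner = NerDLModel.pretrained("ner_dl", "en") \
        .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner")
    chunks = NerConverter().setInputCols(["sentence", "token", "ner"]).setOutputCol("entities")

    pipeline = Pipeline(stages=[document, sentence, token, embeddings, ner, chunks])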

Our Choice of Architecture

Clinical coders must often read through 100+ pages of documentation for a single patient – resulting in mistakes and missed revenue. Applying OCR, summarization, clinical entity resolution, and case complexity classification enables more cases to be completed faster and more accurately.

Assigning Evaluation & Management clinical codes requires understanding fuzzy concepts such as the complexity of clinical decision making and of the physical exam. Automating this coding task from free-text visit summaries requires NLP & ML models that are accurate, consistent, and explainable.

When fraud is suspected based on analyzing medical claims, the next step is to read the encounter notes. This is a slow and expensive manual process – unless it can be automated by applying question answering, semantic similarity, and entity resolution models at scale.
Accurately predicting how many adverse events are expected to occur due to a drug – as well as the type of events and their severity – is becoming increasingly feasible. This combines traditional machine learning and deep learning techniques with NLP pipelines that mine better information from reports and academic papers.

This project builds a continuously updating knowledge graph that maps the relationships between researchers, diseases, therapies, and genes. It required the combination of domain-specific NLP models and embeddings trained on PubMed, current ontologies and terminologies, and a clinical team for labeling and measurement.

Participating in a clinical trial requires a patient to match a long list of eligibility criteria. This cannot be done using structured data from EHR systems alone, and hence requires either lengthy manual processes or advanced domain-specific NLP models.

Less than 10% of safety events are formally reported – but mining progress notes can uncover medication changes and other consequences of such events. Entity and fact extraction models enable much better estimation and classification of safety events.

Prior authorization is now required by many US payers for dozens of procedure codes. Automated question answering from pre-auth request forms reduces costs and enables patients to get the treatment they need faster.

Predicting how many hospital beds and nurses of each certification will be needed is important for providing quality service and avoiding gridlock. Using features from free-text emergency room notes significantly improved accuracy over time series models that only used structured data.

Benchmarks: Training

Run on a desktop PC: Linux Mint with 16 GB RAM, local SSD drives, and an Intel Core i5-6600K processor running 4 cores at 3.5 GHz.

Data was taken from the American National Corpus (http://www.anc.org), using the MASC 3.0.2 written corpora from the newspaper section.

The pipeline consists of sentence boundary detection, tokenization, and part-of-speech tagging.
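
A hedged reconstruction of that pipeline (the corpus path is a placeholder; the MASC data must first be converted to a token|tag training format):

    # Trainable POS pipeline matching the benchmarked stages.
    import sparknlp
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import SentenceDetector, Tokenizer, PerceptronApproach
    from sparknlp.training import POS
    from pyspark.ml import Pipeline

    spark = sparknlp.start()
    train = POS().readDataset(spark, "masc_newspaper_pos.txt", delimiter="|")  # placeholder path

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    sentence = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
    token = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
    pos = PerceptronApproach() \
        .setInputCols(["document", "token"]) \
        .setOutputCol("pos") \
        .setPosColumn("tags")  # gold-tag column produced by POS().readDataset

    model = Pipeline(stages=[document, sentence, token, pos]).fit(train)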

Spark NLP was 38 times faster to train on 100 KB of data.

Spark NLP was 80 times faster to train on 2.6 MB of data.

Benchmarks: Scaling

Spark NLP against itself:

  • 2.5x speedup with a 4-node cluster
  • Zero code changes

Spark NLP scales as Spark does: 1 to 3 orders of magnitude faster, depending on cluster setup.
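
An illustrative sketch of "zero code changes": only the session configuration points at a cluster (the master URL and package version below are placeholders), while the pipeline code itself is untouched:

    # Same pipeline code; only the SparkSession configuration changes.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("spark-nlp-benchmark")
             .master("spark://head-node:7077")  # was "local[*]" on the desktop run
             .config("spark.jars.packages",
                     "com.johnsnowlabs.nlp:spark-nlp_2.12:5.1.0")  # illustrative version
             .getOrCreate())

    # ...define and fit the exact same Pipeline as before; Spark distributes the work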

No comparison to spaCy here, since it cannot leverage a cluster.