Register for the 5th NLP Summit, a Free Online Conference on Sep 24-26. Register now.
was successfully added to your cart.

The NLP Evolutionary Story: Solutions and Use Cases for 2021 and the Future

Between June 10 and August 13 2021, John Snow Labs and Gradient Flow conducted the 2021 NLP Industry Survey across 50+ countries. This survey confirmed some of the findings from the 2021 AI in Healthcare Survey Report. It also revealed some interesting new findings around how healthcare organizations approach NLP for a range of applications, and plan to do so in over the coming years.

This article explores some of these findings and also makes some predictions about the future of NLP.

NLP: A Key Foundational Technology for Healthcare

According to the 2021 AI in Healthcare Survey Report, NLP is one of the fastest-growing foundational technologies in healthcare. One key reason for accelerated NLP adoption is that NLP models now support the development of cutting-edge AI applications for healthcare where unstructured textual data is a critical input.

The 2021 NLP survey reiterates the growing importance of NLP as a foundational technology, with 60% of tech leaders saying that their NLP budgets have grown by at least 10% over the previous year. In 2020, only 53% said the same thing. These facts are not surprising, considering that the number of real-world use cases for healthcare natural language processing are continuously increasing thanks to:

    • Better pre-trained medical language data models
    • Explainable and interpretable NLP
    • Healthcare-specific word and sentence embeddings
    • Open source tools like Spark NLP, the Annotation Lab and NLP Server

Critical Features in NLP Solutions: Accuracy, Production Readiness and Scalability

For tech leaders and healthcare organizations, these three qualities are the key to evaluating and selecting NLP solutions for their needs:

    • Accuracy
    • Production readiness
    • Scalability

Accuracy – which refers to the effectiveness of pre-trained models included with NLP libraries – is particularly important, since almost all healthcare NLP projects and language applications involve pipelines where the results from a previous task and model are used downstream. So, any inaccuracies in the model will generate errors and distortions that will impact the next stages of the NLP application, and eventually, the integrity, reliability and usefulness of results.

Production readiness also, with 24% of technical leaders citing it as an important evaluation criterion while selecting NLP solutions or libraries. A production-ready NLP solution can be easily deployed. Equally important, it can work with real-world use cases, handle large amounts of data, and provide reliable results.

Scalability is also important while selecting an NLP solution, reflecting the growth in technological maturity and NLP budgets. It also indicates that organizations are increasingly open to leveraging NLP to make sense of the vast amounts of unstructured data they typically generate.

Popular NLP Use Cases in Healthcare

In 2019, the most popular applications of NLP technologies in healthcare were:

    • Document classification
    • Named Entity Recognition (NER)
    • Sentiment analysis
    • Knowledge graphs
    • De-identification

In particular, document classification and NER were the most popular NLP applications for healthcare organizations that were further along the NLP maturity curve. These two applications remained the most popular healthcare NLP use cases in 2020 as well.

NER refers to locating and identifying key entities, such as persons, locations, company names, product names, etc. within text. It enables users to segment named entities into predefined categories. With NER, healthcare providers can extract meaningful information from unstructured text like clinical notes and reports to improve clinical assertion status detection, clinical entity resolution, relation extraction, and data de-identification.

NER is an important component of many language applications, and is often the first step in many information extraction projects. In both 2019 and 2020, mature companies that are further along the NLP adoption curve are using NER at a higher rate compared to early-stage companies. NER is increasingly applied alongside another use case that is drawing attention due to the rise of AI: entity linking and knowledge graphs.

In 2020, 48% of large companies used NLP for document classification. This particular use case is also popular with small and medium companies, as well as with companies that are still in the early stage or exploratory phase of NLP adoption.

In the U.S., over a billion clinical documents are produced each year, with 60% of them in unstructured form, such as dictated notes or paper records. Most computer-based record systems cannot measure or analyze such data, much less make it ready for human consumption. NLP-based document classification classifies text in unstructured clinical narratives into categories, making it easier to manage and sort, and reducing the human burden of handling such data.

In 2020, NLP use also expanded in highly visible healthcare applications, such as the analysis of COVID research. In the coming years, other NLP applications, such as Q&A and natural language generation will grow, with many use cases already emerging in these areas.

Popular NLP Solutions Used in Healthcare

In 2020 and 2021, the number of available NLP solutions continues to grow. Their adoption is also on an upswing.

NLP Cloud Services

Per the 2021 NLP Industry Survey, 85% of respondents used at least one of these NLP cloud services:

    • Google Cloud Natural Language AI
    • AWS Comprehend
    • Azure Text Analytics

Google Cloud NLP remains the most popular cloud NLP service for the second straight year. IBM Watson NLU is popular with another 19% of responding organizations.

And yet, the use of such services is not without its challenges. One is that organizations face difficulties tuning and customizing models for their specific use cases, domains and applications. This is usually because general-purpose models are trained on open datasets like Wikipedia, news or media sources, or datasets used for benchmarking specific NLP tasks. This is why NER models trained on media sources perform poorly when used in healthcare-specific areas.

The cost of these services is another big challenge, with 33% of organizations saying that such services are too expensive for their specific use cases. This finding makes sense since pricing strategies vary greatly among providers. For a small number of documents, most services are fairly cost-effective. But as organizations process more documents on a regular basis, the cost can go up substantially. In particular, cost is a huge concern for companies that regularly process a high volume of documents of 500,000+ per month. Unless cloud NLP providers start offering discounts for high-volume NLP use cases, more organizations will shift to open source solutions to meet their NLP needs.

Other common challenges when using cloud NLP services are:

    • Low accuracy
    • Lack of required functionality
    • Lack of support for the human language the organization’s text is in
    • Organization is unable or unwilling to share data with a third party

NLP Libraries

Spark NLP is the most popular library in the survey, used by 31% of respondents. It is especially popular among users that are already using big data frameworks like Apache Spark. John Snow Labs, the creator of Spark NLP, are also constantly improving Spark NLP for Python users. By August 2021, Spark NLP was being downloaded 260,000 per week because it:

    • Offers production-grade, scalable, and trainable versions of the latest NLP research
    • Delivers extremely high accuracy (93%)
    • Includes thousands of fully-trained state-of-the-art NER multi-lingual models in 40+ languages including English
    • Includes multilingual sentence embeddings for 100+ languages
    • Supports the import of models from TFHub and Hugging Face for Java and Scala ecosystems
    • Is optimized for training domain-specific NLP models
    • Is supported by a fast-growing community

Furthermore, Spark NLP is easy to use, enables fast training and tuning of state-of-the-art models, and is highly scalable so that no code changes are required to scale a pipeline to a Spark cluster. Spark NLP has a large representation within healthcare, which is why it is so popular in this sector.

In addition, another 53% users also use at least one of these NLP libraries popular within the Python ecosystem:

    • Hugging Face (which contains 16,000+ models built on JAX)
    • spaCy
    • Natural Language Toolkit (NLTK)
    • Gensim
    • Flair

Other NLP libraries that are popular with mature, early stage and exploring organizations are:

    • Stanford CoreNLP/Stanza
    • Allen NLP
    • Rasa NLU

In practice, Python NLP developers use some libraries for quick prototyping and experimentation, and others for production workloads and projects. Generally speaking, most users typically use two or more of these libraries together in their NLP pipelines.


From the 2021 NLP Industry Survey, it’s clear that NLP has a bright future in healthcare, especially around specific use cases like Q&A, NLG and abstractive summarization. For organizations evaluating NLP solutions, scalability, production readiness and accuracy will remain important features. Price will also be a key factor, and will drive the adoption of open source NLP tools.

The limited feasibility of re-training general-purpose models for specific use cases will remain a challenge, until tools become available to simplify tuning and make it more cost-efficient. Also, the demand for annotation solutions like John Snow Labs’ Annotation Library, and production-ready, healthcare-specific NLP libraries like Spark NLP will grow to support an increasing number of NLP use cases in healthcare.

Explore how our team uses Spark NLP and its libraries in real NLP case studies.

Try NLP in Healthcare

See in action

Creating a Clinical Knowledge Graph with Spark NLP and neo4j

The knowledge graph represents a collection of connected entities and their relations. A knowledge graph that is fueled by machine learning utilizes...