Spark NLP 2.5 delivers state-of-the-art accuracy for spell checking and sentiment analysis

12.05.2020

Maziyar Panahi

Principal AI / ML Engineer and a Senior Team Lead

John Snow Labs is thrilled to announce the immediate availability of Spark NLP 2.5, a new major version of the world’s most widely used natural language processing library in the enterprise, with new capabilities for spell checking and sentiment analysis. The library can be used from Python, Java, and Scala APIs and comes with over 150 pre-trained models and pipelines.

“When we started planning for Spark NLP 2.5 a few months ago, the world was a different place. We have been blown away by the use of Natural Language Processing for early outbreak detection, question-answering chatbot services, text analysis of medical records, monitoring efforts to minimize the spread of COVID-19, and many more,” said Maziyar Panahi, a lead contributor to Spark NLP.

Spark NLP 2.5 is another milestone in John Snow Labs’ quest to provide the open-source community with the most accurate NLP algorithms and models ever invented. By making the most recent academic advances available as a production-grade, scalable, and trainable software library, John Snow Labs aims to help the global data science community make faster progress towards putting AI to good use. Here are the major accuracy-enhancing capabilities this new release makes available.

ALBERT and XLNet embeddings

“Beyond BERT” embeddings have been part of Spark NLP for Healthcare for a while and are now coming to the open-source package. ALBERT is “a Lite BERT” and provides almost the same accuracy as BERT (for example, when used for named entity recognition) while requiring only about 6% of the memory. You can use it on memory-limited edge devices, or when loading models quickly on startup is a priority.

XLNet is a more advanced contextual embedding architecture than BERT and is known to perform particularly well on tasks like question answering. It is now available within Spark NLP, and the library takes care of the engineering heavy lifting required for caching, distributing, tokenizing, and reusing it across NLP pipelines.

Spark NLP already has native support for word, chunk, sentence, and document encodings. The Universal Sentence Encoder has been part of the library since 2.4 and does a good job of measuring semantic similarity between sentences.

New Contextual Spell Checker

This is a whole new, trainable, deep-learning-based spell checking algorithm that takes into account a word’s context before recommending how to correct it:

“I will call my siter.” [sister]

“Due to bad weather, we had to move to a different siter.” [site]

“We travelled to three siter in the summer.” [sites]

“During the summer we have the best ueather.” [weather]

“I have a black ueather jacket, so nice.” [leather]

“I introduce you to my sister, she is called ueather.” [Heather]

Note how well the model handles singular vs. plural nouns and personal names (these examples use the pre-trained English model). This model delivers a word error rate of 8.09% for fully automatic correction on the Holbrook benchmark. This is the best result we are aware of; for comparison, JamSpell attains a 20.24% error rate on the same benchmark.
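To make the word error rate figures concrete, here is a minimal Python sketch of one common way to compute the metric: the word-level edit distance between the corrected text and the reference, divided by the number of reference words. The function name word_error_rate and the sample sentences are illustrative assumptions; the exact evaluation protocol used for the Holbrook benchmark is not described in this post and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (assumption: not taken from the Holbrook corpus).
reference = "we travelled to three sites in the summer"
uncorrected = "we travelled to three siter in the summer"
sample_wer = word_error_rate(reference, uncorrected)  # 1 wrong word out of 8

# Relative error reduction implied by the two figures quoted in the text.
relative_reduction = (20.24 - 8.09) / 20.24

print(f"sample WER: {sample_wer:.3f}")
print(f"relative reduction: {relative_reduction:.1%}")
```

Taken at face value, the two quoted error rates correspond to roughly a 60% relative reduction in errors compared with JamSpell.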

New Deep-Learning Sentiment Analysis

The SentimentDL annotator applies contextual embeddings and a state-of-the-art deep learning architecture to train multi-class sentiment analysis models. Two pre-trained models are also part of this release: one trained on IMDB reviews with an accuracy of 91%, and one trained on “Twitter sentiment 140 – 1.6 million tweets” with an accuracy of 89%.

SentimentDL can also handle neutral statements (in addition to positive and negative ones) and returns a ratio between 0 and 1 for how positive (or negative) a statement is.
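One plausible way to turn a score between 0 and 1 into three labels is to apply a confidence threshold, as in the Python sketch below. This is an illustration only: the function label_sentiment, the 0.6 threshold, and the sample scores are assumptions made here, and they do not describe SentimentDL’s internals or its API.

```python
def label_sentiment(positive_score: float, threshold: float = 0.6) -> str:
    """Map a positivity score in [0, 1] to positive, negative, or neutral.

    Hypothetical helper: a statement is labeled neutral when neither the
    positive score nor its complement reaches the threshold.
    """
    if not 0.0 <= positive_score <= 1.0:
        raise ValueError("positive_score must be between 0 and 1")
    if not 0.5 < threshold <= 1.0:
        raise ValueError("threshold must be in (0.5, 1] to leave a neutral band")

    if positive_score >= threshold:
        return "positive"
    if 1.0 - positive_score >= threshold:
        return "negative"
    return "neutral"


# Made-up scores, for illustration only.
for score in (0.93, 0.55, 0.08):
    print(score, label_sentiment(score))
```

A higher threshold widens the neutral band, which trades coverage for more confident positive and negative labels.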

Document Classification

The deep-learning Document Classification annotator now supports up to 100 classes (up from 50 in the previous release). It also comes with two new pre-trained models, trained on the TREC-6 and TREC-50 benchmark datasets for question classification.

The Spark NLP community has been growing rapidly, with monthly downloads increasing by over 50% from January to April 2020 alone. This release builds on that growth by providing direct support for 14 new languages and adding 87 new out-of-the-box NLP models. As always, we thank our community for the feedback, bug reports, and contributions that made this release possible.

With the advancements in Spark NLP, particularly in accuracy for tasks like spell-checking and sentiment analysis, the integration of Generative AI in Healthcare and the use of a Healthcare Chatbot can significantly improve patient interactions and enhance the overall quality of healthcare services.

Our additional expert:

Maziyar Panahi is a Principal AI / ML Engineer and a Senior Team Lead with over a decade of experience in public research. He leads the team behind Spark NLP at John Snow Labs, one of the most widely used NLP libraries in the enterprise. He develops scalable NLP components using the latest techniques in deep learning and machine learning, including classic ML, Language Models, Speech Recognition, and Computer Vision. He is an expert in designing, deploying, and maintaining ML and DL models at the production level in the JVM ecosystem and on distributed computing engines (Apache Spark). He has extensive experience in computer networks and DevOps, and has been designing and implementing scalable solutions on cloud platforms such as AWS, Azure, and OpenStack for the last 15 years. In the past, he also worked as a network engineer in high-level places after completing his Microsoft and Cisco training (MCSE, MCSA, and CCNA). He is a lecturer at The National School of Geographical Sciences, teaching Big Data Platforms and Data Analytics. He is currently employed by The French National Centre for Scientific Research (CNRS) as an IT Project Manager and works at the Institute of Complex Systems of Paris (ISCPIF).
