John Snow Labs is thrilled to announce the immediate availability of Spark NLP 2.5, a new major release of the world’s most widely used natural language processing library in the enterprise, now with spell checking and sentiment analysis. The library can be used through its Python, Java, and Scala APIs and comes with over 150 pre-trained models and pipelines.
“When we started planning for Spark NLP 2.5 a few months ago, the world was a different place. We have been blown away by the use of Natural Language Processing for early outbreak detections, question-answering chatbot services, text analysis of medical records, monitoring efforts to minimize the spread of COVID-19, and many more.” – said Maziyar Panahi, a lead contributor to Spark NLP.
Spark NLP 2.5 is another milestone in John Snow Labs’ quest to provide the open-source community with the most accurate NLP algorithms & models ever invented. Making the most recent academic advances available as a production-grade, scalable, and trainable software library helps the global data science community make faster progress towards putting AI to good use. Here are the major accuracy-enhancing capabilities this new release makes available.
ALBERT and XLNet embeddings
“Beyond BERT” embeddings have been part of Spark NLP for Healthcare for a while and are now coming to the open-source package. ALBERT is “A Lite BERT” and provides almost the same accuracy as BERT (for example, when used for named entity recognition) while requiring only about 6% of the memory. You can use it on memory-limited edge devices, or when loading models quickly on startup is a priority.
XLNet is a more advanced contextual embedding architecture than BERT and is known to perform particularly well on tasks like question answering. It is now available within Spark NLP, and the library takes care of the engineering heavy lifting required for caching, distributing, tokenizing, and reusing it across NLP pipelines.
Spark NLP already has native support for word, chunk, sentence, and document encodings. The Universal Sentence Encoder has been part of the library since 2.4 and does well at measuring semantic similarity between sentences.
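To make “semantic similarity between sentences” concrete, here is a minimal Python sketch of one common way to compare two sentence embeddings: cosine similarity. It does not call Spark NLP. The function name and the tiny four-dimensional vectors are made up for illustration, and real sentence embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example only (not real encoder output).
emb_call_sister = [0.9, 0.1, 0.3, 0.0]
emb_phone_sibling = [0.8, 0.2, 0.4, 0.1]
emb_bad_weather = [0.0, 0.9, 0.1, 0.7]

sim_close = cosine_similarity(emb_call_sister, emb_phone_sibling)
sim_far = cosine_similarity(emb_call_sister, emb_bad_weather)

# Sentences with similar meaning should score higher than unrelated ones.
print(sim_close > sim_far)
```

A score near 1 means the two vectors point in almost the same direction, which is how embeddings of sentences with similar meaning are expected to behave.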
New Contextual Spell Checker
This is a whole new, trainable, deep-learning-based spell checking algorithm that takes into account a word’s context before recommending how to correct it:
“I will call my siter.” [sister]
“Due to bad weather, we had to move to a different siter.” [site]
“We travelled to three siter in the summer.” [sites]
“During the summer we have the best ueather.” [weather]
“I have a black ueather jacket, so nice.” [leather]
“I introduce you to my sister, she is called ueather.” [Heather]
See how well the model handles singular vs. plural nouns and personal names (these examples use the pre-trained English model). This model delivers a word error rate of 8.09% for fully automatic correction on the Holbrook benchmark. This is the best result we are aware of; compare it with the 20.24% error rate that JamSpell attains on the same benchmark.
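For readers unfamiliar with the metric, the following Python sketch shows one simple way a word error rate for spelling correction could be computed. It is an assumption for illustration, not the benchmark's official scoring procedure: it compares words position by position and requires both sentences to have the same number of words. The function name is hypothetical. The last line uses only the two error rates quoted above to express the gap as a relative reduction.

```python
def word_error_rate(corrected, reference):
    """Fraction of reference words that the corrected text gets wrong.

    Assumes both sentences split into the same number of
    whitespace-separated words, so they can be compared by position.
    """
    corrected_words = corrected.split()
    reference_words = reference.split()
    if not reference_words:
        raise ValueError("reference must contain at least one word")
    if len(corrected_words) != len(reference_words):
        raise ValueError("sentences must have the same number of words")
    errors = sum(
        1 for got, want in zip(corrected_words, reference_words) if got != want
    )
    return errors / len(reference_words)


reference = "Due to bad weather, we had to move to a different site."
good_fix = "Due to bad weather, we had to move to a different site."
bad_fix = "Due to bad weather, we had to move to a different sister."

wer_good = word_error_rate(good_fix, reference)  # no wrong words
wer_bad = word_error_rate(bad_fix, reference)    # 1 wrong word out of 12

# Relative error reduction implied by the two figures quoted above.
relative_reduction = (20.24 - 8.09) / 20.24
print(wer_good, wer_bad, round(relative_reduction, 2))
```

Under this reading, going from 20.24% to 8.09% means roughly 60% fewer wrongly corrected words.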
New Deep-Learning Sentiment Analysis
The SentimentDL annotator applies contextual embeddings and a state-of-the-art deep learning architecture to train multi-class sentiment analysis models. Two pre-trained models are also part of this release: one trained on IMDB reviews with an accuracy of 91%, and one trained on “Twitter sentiment 140 – 1.6 million tweets” with an accuracy of 89%.
SentimentDL can also handle neutral statements (in addition to positive and negative ones) and returns a ratio between 0 and 1 for how positive (or negative) a statement is.
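The announcement does not say how a neutral result is decided. One plausible approach, sketched below in plain Python, is to fall back to a neutral label whenever the most confident class scores below a cutoff. This is an illustrative assumption, not SentimentDL's actual implementation. The function name, the 0.6 cutoff, and the example scores are all hypothetical.

```python
def label_sentiment(scores, threshold=0.6, neutral_label="neutral"):
    """Pick the top-scoring class, or fall back to neutral when unsure.

    `scores` maps class names to confidences between 0 and 1.
    The default threshold is an arbitrary value chosen for this example.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    best_label = max(scores, key=scores.get)
    if scores[best_label] < threshold:
        return neutral_label
    return best_label


# Made-up confidence scores for three statements.
clearly_positive = {"positive": 0.93, "negative": 0.07}
clearly_negative = {"positive": 0.10, "negative": 0.90}
on_the_fence = {"positive": 0.52, "negative": 0.48}

for scores in (clearly_positive, clearly_negative, on_the_fence):
    print(label_sentiment(scores))
```

Raising the cutoff sends more borderline statements to the neutral class, while lowering it forces more of them into positive or negative.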
The deep-learning Document Classification annotator now supports up to 100 classes (up from 50 in the previous release). It also comes with two new pre-trained models, trained on the TREC-6 and TREC-50 benchmark datasets for question classification.
The Spark NLP community has been growing rapidly, with monthly downloads up by over 50% just from January to April 2020. This release gives that community substantially more to work with by providing direct support for 14 new languages and adding 87 new out-of-the-box NLP models. As always, we thank our community for their feedback, bug reports, and contributions that made this release possible.