
State of the Art Python NLP

John Snow Labs' NLP is an open-source text processing library for Python, Java, and Scala. It provides production-grade, scalable, and trainable versions of the latest research in natural language processing.


Most Widely Used in the Enterprise

Widely deployed production-grade codebase.

New releases every 2 weeks since 2017.

Growing community.

State of the Art Accuracy

First production-grade versions of novel deep learning NLP research.

Fine-tune pre-trained models to fit your data.

Unmatched Speed & Scale

Spark NLP was 80x faster than spaCy to train locally on 2.6MB of data.

Scale to a Spark cluster with zero code changes.

The most widely used NLP library in the Enterprise, by far

Gradient Flow NLP Survey, 2021

Why John Snow Labs' Natural Language Processing?

Accuracy

Spark NLP delivered the best performing accuracy on multiple public academic benchmarks.

To the left are F1 scores for the Named Entity Recognition task on the CoNLL 2003 dataset.

Scalability

Zero code changes are needed to scale a pipeline to any Spark cluster.

Speed

Optimized builds for the latest chips from Intel (CPU), Nvidia (GPU), Apple (M1/M2), and AWS (Graviton) enable the fastest training & inference of state-of-the-art models.

This benchmark compares image transformer inference speed on the 34k ImageNet dataset on a single machine. Spark NLP is 34% faster than Hugging Face when running on a single CPU, and 51% faster than Hugging Face on a single GPU.

Out Of The Box Functionality

Entity Recognition

Algorithms

Split Text

Sentence Detector
Deep Sentence Detector
Tokenizer
nGram Generator

Understand Grammar

Stemmer
Lemmatizer
Part of Speech Tagger
Dependency Parser

Information Extraction

Algorithms

Clean Text

Spell Checking
Spell Correction
Normalizer
Stopword Cleaner

Find in Text

Text Matcher
Regex Matcher
Date Matcher
Chunker

Sentiment Analysis

Content

Transformers

GloVe
ELMO
BERT
ALBERT
XLNet
USE
Small BERT
ELECTRA
BioBERT
LaBSE

Pre-trained Models

250+

Pretrained

Information Extraction

Content

46 Languages

Pre-trained Pipelines

90+

Pretrained

Trainable & Tunable

Scalable to a Cluster

Fast Inference

Hardware Optimized

Community

Trainable to understand your language

Spark NLP is optimized for training domain-specific NLP models, so you can adapt it to learn the nuances of jargon and documents you must support.


Introducing Spark NLP at Top Level AI Conferences

Frequently Asked Questions

How do I use Spark NLP in Python?

To use Spark NLP in Python, follow these steps:

1. Installation:

pip install spark-nlp

If you don’t have PySpark, you should also install the following dependencies:

pip install pyspark numpy

2. Initialize SparkSession with Spark NLP:

import sparknlp
spark = sparknlp.start()

3. Use Annotators: Spark NLP offers a variety of annotators (e.g., Tokenizer, SentenceDetector, Lemmatizer). To use them, first create the appropriate pipeline.

Example using a Tokenizer:

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

# Convert the raw "text" column into a Spark NLP document annotation
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
# Split each document into tokens
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
pipeline = Pipeline(stages=[documentAssembler, tokenizer])

4. Transform Data: Once you have a pipeline, you can transform your data.

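data = spark.createDataFrame([["Spark NLP is an open-source library."]]).toDF("text")  # any DataFrame with a "text" column
result = pipeline.fit(data).transform(data)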

5. Explore and Utilize Models: Spark NLP offers pre-trained models for tasks like Named Entity Recognition (NER), sentiment analysis, and more. You can easily plug these into your pipeline and customize as needed.

6. Further Reading: Dive deeper into the official documentation for more detailed examples, a complete list of annotators and models, and best practices for building NLP pipelines.
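The steps above can be combined into one function. This is a minimal sketch, not an official recipe: the function names `to_rows` and `tokenize_with_spark_nlp` are hypothetical, and the Spark NLP imports are placed inside the function only so the file can be loaded on a machine where Spark NLP is not installed yet.

```python
def to_rows(texts):
    """Wrap each string in a one-element list so Spark sees single-column rows."""
    return [[text] for text in texts]


def tokenize_with_spark_nlp(texts):
    """Run DocumentAssembler + Tokenizer and return one list of tokens per text.

    Hypothetical helper for illustration; requires spark-nlp and pyspark.
    """
    # Imported lazily so this module loads even without Spark NLP installed.
    import sparknlp
    from pyspark.ml import Pipeline
    from sparknlp.annotator import Tokenizer
    from sparknlp.base import DocumentAssembler

    spark = sparknlp.start()
    data = spark.createDataFrame(to_rows(texts)).toDF("text")

    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    pipeline = Pipeline(stages=[document_assembler, tokenizer])

    result = pipeline.fit(data).transform(data)
    # "token" holds annotation structs; "token.result" keeps only the token strings.
    rows = result.selectExpr("token.result AS tokens").collect()
    return [list(row["tokens"]) for row in rows]
```

Calling `tokenize_with_spark_nlp(["Spark NLP runs on Apache Spark."])` should return one list of token strings for each input text. The first call starts a local Spark session, so it is noticeably slower than later calls.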

Is Spark NLP free?

Short answer: 100%! Free forever, including any commercial use.

Longer answer: Yes, Spark NLP is an open-source library and can be used freely. It’s released under the Apache License 2.0. Users can use, modify, and distribute it without incurring costs.

What is the difference between spaCy and Spark NLP?

Both spaCy and Spark NLP are popular libraries for Natural Language Processing, but Spark NLP shines when it comes to scalability and distributed processing. Here are some key differences between the two:

1. Scalability & Distributed Processing:

Spark NLP: Built on top of Apache Spark, it’s designed for distributed processing and handling large datasets at scale. This makes it especially suitable for big data processing tasks that need to run on a cluster.
spaCy: Designed for processing data on a single machine; it is not natively built for distributed computing.

2. Language Models & Pretrained Pipelines:

Spark NLP: Offers over 18,000 pre-trained models and pipelines for over 235 languages, making it easy to get started on various NLP tasks. It also makes it easy to import your custom models from Hugging Face in TensorFlow and ONNX formats. Spark NLP also offers a large number of state-of-the-art Large Language Models (LLMs) like BERT, RoBERTa, ALBERT, T5, OpenAI Whisper, and many more for Text Embeddings (useful for RAG), Named Entity Recognition, Text Classification, Question Answering, Automatic Speech Recognition, and more. These models can be used out of the box or fine-tuned on your own data.
spaCy: Provides support for multiple languages with its models and supports tasks like tokenization, named entity recognition, and dependency parsing out of the box. However, spaCy does not have a Models Hub, and the number of models it offers out of the box is limited.

3. Licensing & Versions:

Spark NLP: The core library is open-source under the Apache License 2.0, making it free for both academic and commercial use.
spaCy: Open-source and released under the MIT license.

What models does Spark NLP provide?

Spark NLP provides a range of models to tackle various NLP tasks. These models are often pre-trained on large datasets and can be fine-tuned or used directly for inference. Some of the primary categories and examples of Spark NLP models include:

1. Named Entity Recognition (NER):

Pre-trained models for recognizing entities such as persons, organizations, locations, etc.
Specialized models for sectors like healthcare to detect medical entities.

2. Text Classification:

Models for tasks like sentiment analysis, topic classification, and more.

3. Word Embeddings:

Word2Vec, GloVe, and BERT embeddings.
Models to generate embeddings for words or sentences, useful in many downstream tasks.

4. Language Models:

Models like BERT, ALBERT, and ELECTRA are available pre-trained and can be fine-tuned for specific tasks.

5. Dependency Parsing:

Models that analyze the grammatical structure of a sentence and determine relationships between words.

6. Spell Checking and Correction:

Models that can detect and correct spelling mistakes in the text.

7. Sentence Embeddings:

Models to generate vector representations for entire sentences, such as Universal Sentence Encoder.

8. Translation and Language Detection:

Models to detect the language of a given text or translate text between languages.

9. Text Matching:

Models that can be used for tasks like textual similarity, paraphrase detection, etc.

10. Pretrained Pipelines:

Ready-to-use pipelines that combine multiple models and annotators for common tasks, allowing users to quickly start processing text without building a custom pipeline.

For the latest list of models, detailed documentation, and instructions on how to use them, see the official Spark NLP Models Hub.
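As an illustration of the pretrained pipelines mentioned above, the sketch below loads one by name and annotates a single string. It assumes the `explain_document_dl` English pipeline is available for download; the function names `annotate_with_pretrained` and `select_outputs` are hypothetical helpers, and the output columns you get back depend on the pipeline you choose.

```python
def annotate_with_pretrained(text, name="explain_document_dl", lang="en"):
    """Annotate one string with a pretrained pipeline and return a dict of outputs.

    Hypothetical helper; the pipeline name is an assumption and is downloaded
    on first use, so this needs network access plus spark-nlp and pyspark.
    """
    # Imported lazily so this module loads even without Spark NLP installed.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    sparknlp.start()
    pipeline = PretrainedPipeline(name, lang=lang)
    # annotate() on a string returns a dict: output column name -> list of strings.
    return pipeline.annotate(text)


def select_outputs(annotations, wanted):
    """Keep only the requested output columns that are present in the result."""
    return {key: annotations[key] for key in wanted if key in annotations}
```

A typical use would be `select_outputs(annotate_with_pretrained("Some text"), ["token", "pos"])`, which keeps only the columns you care about and silently skips any that the chosen pipeline does not produce.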

Where can I get prebuilt versions of Spark NLP?

Prebuilt versions of Spark NLP can be obtained through multiple channels, depending on your development environment and platform:

1. PyPI (for Python Users): You can install Spark NLP using pip, the Python package installer.

pip install spark-nlp

2. Maven Central (for Java/Scala Users): If you are using Maven, you can add the following dependency to your pom.xml:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>LATEST_VERSION</version>
</dependency>

Make sure to replace LATEST_VERSION with the desired version of Spark NLP.

3. Spark Packages: For those using the spark-shell, pyspark, or spark-submit, you can include Spark NLP directly via Spark Packages:

--packages com.johnsnowlabs.nlp:spark-nlp_2.12:LATEST_VERSION

4. Pre-trained Models & Pipelines: Apart from the library itself, Spark NLP provides a range of pre-trained models and pipelines. These can be found on the Spark NLP Models Hub.

Always make sure to consult the official documentation or the GitHub repository for the latest instructions and versions available.
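The Spark Packages coordinate shown for spark-shell, pyspark, and spark-submit can also be set from Python when you build the session yourself instead of calling `sparknlp.start()`. The sketch below is one possible way to do that; `spark_nlp_coordinate` and `start_session` are hypothetical helper names, and the version you pass should match the spark-nlp version installed with pip.

```python
def spark_nlp_coordinate(version, scala_version="2.12"):
    """Build the Maven coordinate used with --packages / spark.jars.packages."""
    return f"com.johnsnowlabs.nlp:spark-nlp_{scala_version}:{version}"


def start_session(version, master="local[*]"):
    """Create a SparkSession that pulls the Spark NLP jar from Spark Packages.

    Hypothetical helper; requires pyspark and network access on first run.
    """
    # Imported lazily so this module loads even without PySpark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("Spark NLP")
        .master(master)
        .config("spark.jars.packages", spark_nlp_coordinate(version))
        .getOrCreate()
    )
```

Passing a cluster URL as `master` instead of `local[*]` is the only change this sketch needs to target a cluster; the pipeline code itself stays the same.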