State of the Art Python NLP

John Snow Labs' NLP is an open source text processing library for Python, Java, and Scala. It provides production-grade, scalable, and trainable versions of the latest research in natural language processing.
Most Widely Used in the Enterprise

Widely deployed production-grade codebase.

New releases every 2 weeks since 2017.

Growing community.

State of the Art Accuracy

First production-grade versions of novel deep learning NLP research.

Start from pre-trained models and train them to fit your data.

Unmatched Speed & Scale

Spark NLP was 80x faster than spaCy to train locally on 2.6MB of data.

Scale to a Spark cluster with zero code changes.

The most widely used NLP library in the Enterprise, by far

Gradient Flow NLP Survey, 2021

Why John Snow Labs' Natural Language Processing?

Accuracy

Spark NLP delivered the best accuracy on multiple public academic benchmarks.

To the left are F1 scores for the Named Entity Recognition task on the CoNLL 2003 dataset.

Scalability

Zero code changes are needed to scale a pipeline to any Spark cluster.

Speed

Optimized builds for the latest chips from Intel (CPU), Nvidia (GPU), Apple (M1/M2), and AWS (Graviton) enable the fastest training & inference of state-of-the-art models.

This benchmark compares the speed of image transformers inference on the 34k ImageNet dataset on a single machine. Spark NLP is 34% faster than Hugging Face when running on a single CPU, and 51% faster than Hugging Face on a single GPU.
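
As a rough illustration of how a hardware-specific build might be selected from Python, the sketch below maps a machine description to keyword arguments for `sparknlp.start()`. It assumes a recent Spark NLP release in which `start()` accepts the `gpu`, `apple_silicon`, and `aarch64` flags; the helper names `choose_start_flags` and `start_session` are hypothetical and not part of the library.

```python
import platform


def choose_start_flags(system, machine, has_gpu=False):
    """Map a platform description to keyword arguments for sparknlp.start().

    Hypothetical helper: the flag names (gpu, apple_silicon, aarch64) are
    assumed to match the installed Spark NLP release.
    """
    machine = machine.lower()
    if has_gpu:
        return {"gpu": True}            # Nvidia GPU build
    if system == "Darwin" and machine == "arm64":
        return {"apple_silicon": True}  # Apple M1/M2 build
    if system == "Linux" and machine in ("aarch64", "arm64"):
        return {"aarch64": True}        # ARM servers such as AWS Graviton
    return {}                           # default CPU build


def start_session(has_gpu=False):
    """Start Spark NLP with flags chosen for the current machine."""
    import sparknlp  # imported lazily so the helper above works without Spark

    flags = choose_start_flags(platform.system(), platform.machine(), has_gpu)
    return sparknlp.start(**flags)
```

Whether a GPU is present is passed in explicitly here, since detecting one reliably depends on the environment.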

Out Of The Box Functionality

Algorithms
Split Text
  • Sentence Detector
  • Deep Sentence Detector
  • Tokenizer
  • nGram Generator
Understand Grammar
  • Stemmer
  • Lemmatizer
  • Part of Speech Tagger
  • Dependency Parser
Clean Text
  • Spell Checking
  • Spell Correction
  • Normalizer
  • Stopword Cleaner
Find in Text
  • Text Matcher
  • Regex Matcher
  • Date Matcher
  • Chunker
Content
Transformers
  • GloVe
  • ELMO
  • BERT
  • ALBERT
  • XLNet
  • USE
  • Small BERT
  • ELECTRA
  • BioBERT
  • LaBSE
250+ Pre-trained Models
46 Languages
90+ Pre-trained Pipelines
Trainable & Tunable
Scalable to a Cluster
Fast Inference
Hardware Optimized
Community

Trainable to understand your language

Spark NLP is optimized for training domain-specific NLP models, so you can adapt it to learn the nuances of jargon and documents you must support.
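
A minimal sketch of what domain-specific training can look like, assuming labeled data in CoNLL 2003 format and the publicly available `glove_100d` embeddings. The function name `train_ner` and the file path passed to it are placeholders; the annotators used (`CoNLL`, `WordEmbeddingsModel`, `NerDLApproach`) are part of the open-source library.

```python
def train_ner(spark, conll_path, max_epochs=1):
    """Train a NER model from a CoNLL-2003-style file.

    `spark` is a session created with sparknlp.start(); `conll_path` is a
    placeholder for your own training file.
    """
    from pyspark.ml import Pipeline
    from sparknlp.annotator import NerDLApproach, WordEmbeddingsModel
    from sparknlp.training import CoNLL

    # The CoNLL reader produces sentence, token, and label columns.
    training_data = CoNLL().readDataset(spark, conll_path)

    # Pre-trained word vectors feed the trainable NER tagger.
    embeddings = (
        WordEmbeddingsModel.pretrained("glove_100d")
        .setInputCols(["sentence", "token"])
        .setOutputCol("embeddings")
    )
    ner = (
        NerDLApproach()
        .setInputCols(["sentence", "token", "embeddings"])
        .setLabelColumn("label")
        .setOutputCol("ner")
        .setMaxEpochs(max_epochs)
    )
    return Pipeline(stages=[embeddings, ner]).fit(training_data)
```

One epoch is only enough for a smoke test; a real training run would use more epochs and a held-out validation set.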

Introducing Spark NLP at Top Level AI Conferences

Frequently Asked Questions

How do I use Spark NLP in Python?

To use Spark NLP in Python, follow these steps:

1. Installation:

pip install spark-nlp

If you don’t have PySpark, you should also install the following dependencies:

pip install pyspark numpy

2. Initialize SparkSession with Spark NLP:

import sparknlp
spark = sparknlp.start()

3. Use Annotators: Spark NLP offers a variety of annotators (e.g., Tokenizer, SentenceDetector, Lemmatizer). To use them, first create the appropriate pipeline.

Example using a Tokenizer:

from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

# Turn the raw "text" column into Spark NLP "document" annotations
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
# Split each document into tokens
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
pipeline = Pipeline(stages=[documentAssembler, tokenizer])

4. Transform Data: Once you have a pipeline, you can transform your data.

# data is a Spark DataFrame with a string column named "text"
result = pipeline.fit(data).transform(data)
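
Steps 2 to 4 can be combined into one function, sketched below. The function name `tokenize_texts` and the sample sentence in the usage comment are illustrative only; the input DataFrame is built from a plain Python list so the example needs no external data.

```python
def tokenize_texts(spark, texts):
    """Run DocumentAssembler + Tokenizer over a list of strings.

    `spark` is the session returned by sparknlp.start().
    """
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer

    # One row per input string, in a column named "text".
    data = spark.createDataFrame([(t,) for t in texts], ["text"])

    document_assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
    tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    pipeline = Pipeline(stages=[document_assembler, tokenizer])

    result = pipeline.fit(data).transform(data)
    # "token.result" keeps only the token strings from each annotation.
    return result.select("text", "token.result")


# Usage (requires a working Spark NLP installation):
#   import sparknlp
#   spark = sparknlp.start()
#   tokenize_texts(spark, ["Spark NLP is an open source library."]).show(truncate=False)
```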

5. Explore and Utilize Models: Spark NLP offers pre-trained models for tasks like Named Entity Recognition (NER), sentiment analysis, and more. You can easily plug these into your pipeline and customize as needed.

6. Further Reading: Dive deeper into the official documentation for more detailed examples, a complete list of annotators and models, and best practices for building NLP pipelines.
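
For step 5, a pre-trained pipeline can be loaded by name. The sketch below assumes the publicly listed `explain_document_dl` English pipeline, a session already started with `sparknlp.start()`, and network access for the first download. The helper names `annotate_with_pretrained` and `entities_of` are hypothetical.

```python
def annotate_with_pretrained(text, name="explain_document_dl", lang="en"):
    """Download (on first use) a pre-trained pipeline and annotate one string."""
    from sparknlp.pretrained import PretrainedPipeline

    pipeline = PretrainedPipeline(name, lang=lang)
    # annotate() returns a dict mapping output column names to lists of strings.
    return pipeline.annotate(text)


def entities_of(annotations):
    """Pull the recognized entity chunks out of an annotate() result.

    The "entities" key is the output name assumed for explain_document_dl;
    other pipelines may name their outputs differently.
    """
    return annotations.get("entities", [])


# Usage (requires a started session):
#   result = annotate_with_pretrained("John Snow Labs is based in Delaware.")
#   print(entities_of(result))
```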

Is Spark NLP free?

Short answer: 100%! Free forever, including any commercial use.

Longer answer: Yes, Spark NLP is an open-source library and can be used freely. It’s released under the Apache License 2.0. Users can use, modify, and distribute it without incurring costs.

What is the difference between spaCy and Spark NLP?

Both spaCy and Spark NLP are popular libraries for Natural Language Processing, but Spark NLP shines when it comes to scalability and distributed processing. Here are some key differences between the two:

1. Scalability & Distributed Processing:

  • Spark NLP: Built on top of Apache Spark, it’s designed for distributed processing and handling large datasets at scale. This makes it especially suitable for big data processing tasks that need to run on a cluster.
  • spaCy: Designed for processing data on a single machine; it is not natively built for distributed computing.

2. Language Models & Pretrained Pipelines:

  • Spark NLP: Offers over 18,000 diverse pre-trained models and pipelines for over 235 languages, making it easy to get started on various NLP tasks. It also makes it easy to import your custom models from Hugging Face in TensorFlow and ONNX formats. Spark NLP also offers a large number of state-of-the-art Large Language Models (LLMs) like BERT, RoBERTa, ALBERT, T5, OpenAI Whisper, and many more for Text Embeddings (useful for RAG), Named Entity Recognition, Text Classification, Question Answering, Automatic Speech Recognition, and more. These models can be used out of the box or fine-tuned on your own data.
  • spaCy: Provides models for multiple languages and supports tasks like tokenization, named entity recognition, and dependency parsing out of the box. However, spaCy does not have a Models Hub, and the number of models offered out of the box is limited.

3. Licensing & Versions:

  • Spark NLP: The core library is open-source under the Apache License 2.0, making it free for both academic and commercial use.
  • spaCy: Open-source and released under the MIT license.
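
To make the distributed-processing point concrete, here is a sketch of a job whose body never mentions where it runs; only the `--master` value passed to `spark-submit` differs between a laptop and a cluster. The script name `job.py`, the helper names, and the app name are hypothetical, and `LATEST_VERSION` is a placeholder to be replaced with a real release.

```python
# Placeholder coordinate: replace LATEST_VERSION with a real Spark NLP release.
SPARK_NLP_PACKAGE = "com.johnsnowlabs.nlp:spark-nlp_2.12:LATEST_VERSION"


def spark_submit_args(master, script, package=SPARK_NLP_PACKAGE):
    """Build a spark-submit command line; only `master` changes per environment."""
    return ["spark-submit", "--master", master, "--packages", package, script]


def run_job(texts):
    """Job body: identical for local and cluster execution."""
    from pyspark.ml import Pipeline
    from pyspark.sql import SparkSession
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import Tokenizer

    # getOrCreate() picks up the master that spark-submit was given.
    spark = SparkSession.builder.appName("spark-nlp-job").getOrCreate()
    data = spark.createDataFrame([(t,) for t in texts], ["text"])

    stages = [
        DocumentAssembler().setInputCol("text").setOutputCol("document"),
        Tokenizer().setInputCols(["document"]).setOutputCol("token"),
    ]
    return Pipeline(stages=stages).fit(data).transform(data)


# Local run:   spark_submit_args("local[*]", "job.py")
# Cluster run: spark_submit_args("yarn", "job.py")
```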

What models does Spark NLP provide?

Spark NLP provides a range of models to tackle various NLP tasks. These models are often pre-trained on large datasets and can be fine-tuned or used directly for inference. Some of the primary categories and examples of Spark NLP models include:

1. Named Entity Recognition (NER):

  • Pre-trained models for recognizing entities such as persons, organizations, locations, etc.
  • Specialized models for sectors like healthcare to detect medical entities.

2. Text Classification:

  • Models for tasks like sentiment analysis, topic classification, and more.

3. Word Embeddings:

  • Word2Vec, GloVe, and BERT embeddings.
  • Models to generate embeddings for words or sentences, useful in many downstream tasks.

4. Language Models:

  • Models like BERT, ALBERT, and ELECTRA are available pre-trained and can be fine-tuned for specific tasks.

5. Dependency Parsing:

  • Models that analyze the grammatical structure of a sentence and determine relationships between words.

6. Spell Checking and Correction:

  • Models that can detect and correct spelling mistakes in the text.

7. Sentence Embeddings:

  • Models to generate vector representations for entire sentences, such as Universal Sentence Encoder.

8. Translation and Language Detection:

  • Models to detect the language of a given text or translate text between languages.

9. Text Matching:

  • Models that can be used for tasks like textual similarity, paraphrase detection, etc.

10. Pretrained Pipelines:

  • Ready-to-use pipelines that combine multiple models and annotators for common tasks, allowing users to quickly start processing text without building a custom pipeline.

For the latest list of models, detailed documentation, and instructions on how to use them, visit the official Spark NLP Models Hub.
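
Several of these categories are typically chained in one pipeline. The sketch below combines word embeddings with a pre-trained NER model, assuming the publicly listed `glove_100d` and `ner_dl` English models and a session started with `sparknlp.start()`. The function name `build_ner_pipeline` and the column names are illustrative choices.

```python
def build_ner_pipeline():
    """Assemble tokenization, embeddings, and NER into one Spark ML pipeline."""
    from pyspark.ml import Pipeline
    from sparknlp.base import DocumentAssembler
    from sparknlp.annotator import (
        NerConverter,
        NerDLModel,
        Tokenizer,
        WordEmbeddingsModel,
    )

    document = DocumentAssembler().setInputCol("text").setOutputCol("document")
    token = Tokenizer().setInputCols(["document"]).setOutputCol("token")
    # Pre-trained word vectors (downloaded on first use).
    embeddings = (
        WordEmbeddingsModel.pretrained("glove_100d", "en")
        .setInputCols(["document", "token"])
        .setOutputCol("embeddings")
    )
    # Pre-trained NER tagger that consumes those embeddings.
    ner = (
        NerDLModel.pretrained("ner_dl", "en")
        .setInputCols(["document", "token", "embeddings"])
        .setOutputCol("ner")
    )
    # Merge per-token tags into whole entity chunks.
    entities = (
        NerConverter()
        .setInputCols(["document", "token", "ner"])
        .setOutputCol("entities")
    )
    return Pipeline(stages=[document, token, embeddings, ner, entities])
```

The embeddings stage has to match the vectors the NER model was trained with, which is why the two models are chosen as a pair here.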

Where can I get prebuilt versions of Spark NLP?

Prebuilt versions of Spark NLP can be obtained through multiple channels, depending on your development environment and platform:

1. PyPI (for Python Users): You can install Spark NLP using pip, the Python package installer.

pip install spark-nlp

2. Maven Central (for Java/Scala Users): If you are using Maven, you can add the following dependency to your pom.xml:

<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>LATEST_VERSION</version>
</dependency>

Make sure to replace LATEST_VERSION with the desired version of Spark NLP.

3. Spark Packages: For those using the spark-shell, pyspark, or spark-submit, you can include Spark NLP directly via Spark Packages:

--packages com.johnsnowlabs.nlp:spark-nlp_2.12:LATEST_VERSION

4. Pre-trained Models & Pipelines: Apart from the library itself, Spark NLP provides a range of pre-trained models and pipelines. These can be found on the Spark NLP Models Hub.

Always make sure to consult the official documentation or the GitHub repository for the latest instructions and versions available.
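
When a SparkSession is built by hand rather than through `sparknlp.start()`, the same Maven coordinate can be passed via `spark.jars.packages`. The sketch below derives the version from the installed Python package with `sparknlp.version()` so the jar and the pip package stay in step. The helper names `maven_coordinate` and `build_session` are hypothetical, and the Scala suffix `2.12` is taken from the coordinates shown above.

```python
def maven_coordinate(version, scala_version="2.12"):
    """Format the Spark NLP Maven coordinate for a given release."""
    return f"com.johnsnowlabs.nlp:spark-nlp_{scala_version}:{version}"


def build_session(app_name="Spark NLP"):
    """Create a SparkSession that pulls the Spark NLP jar matching the pip package."""
    import sparknlp
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder.appName(app_name)
        # Same value you would pass to --packages on the command line.
        .config("spark.jars.packages", maven_coordinate(sparknlp.version()))
        .getOrCreate()
    )
```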
