The most widely used NLP library in the enterprise
Widely deployed production-grade codebase.
New releases every 2 weeks since 2017.
Growing community.
First production-grade versions of novel deep learning NLP research.
Use pre-trained models as-is, or fine-tune them to fit your data.
Spark NLP was 80x faster than spaCy to train locally on 2.6MB of data.
Scale to a Spark cluster with zero code changes.
Spark NLP has delivered the highest accuracy on multiple public academic benchmarks.
One example is its F1 score on the Named Entity Recognition task on the CoNLL 2003 dataset.
Zero code changes are needed to scale a pipeline to any Spark cluster.
Optimized builds for the latest chips from Intel (CPU), Nvidia (GPU), Apple (M1/M2), and AWS (Graviton) enable the fastest training and inference of state-of-the-art models.
This benchmark compares the inference speed of image transformers on the 34k ImageNet dataset on a single machine. Spark NLP is 34% faster than Hugging Face when running on a single CPU, and 51% faster than Hugging Face on a single GPU.
Spark NLP is optimized for training domain-specific NLP models, so you can adapt it to learn the nuances of jargon and documents you must support.
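As a minimal sketch of that adaptation, assuming an active Spark NLP session (see the setup steps below) and a CoNLL-formatted annotation file whose path here is hypothetical; NerDLApproach, CoNLL, and "glove_100d" are standard Spark NLP APIs and model names:
from sparknlp.training import CoNLL
from sparknlp.annotator import NerDLApproach, WordEmbeddings

# Read your own domain-specific annotations (hypothetical path)
training_data = CoNLL().readDataset(spark, "path/to/your_domain.conll")

# Pre-trained embeddings the NER trainer consumes as features
embeddings = WordEmbeddings.pretrained("glove_100d", "en") \
    .setInputCols(["sentence", "token"]).setOutputCol("embeddings")

ner_approach = NerDLApproach() \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setLabelColumn("label") \
    .setOutputCol("ner") \
    .setMaxEpochs(10)

# Train the domain-specific NER model
model = ner_approach.fit(embeddings.transform(training_data))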
To use Spark NLP in Python, follow these steps:
1. Installation:
pip install spark-nlp
If you don’t have PySpark installed, you should also install the following dependencies:
pip install pyspark numpy
2. Initialize SparkSession with Spark NLP:
import sparknlp
spark = sparknlp.start()
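Optionally, you can confirm the setup by printing the library and Spark versions:
# Quick sanity check that the session started correctly
print("Spark NLP version:", sparknlp.version())
print("Apache Spark version:", spark.version)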
3. Use Annotators: Spark NLP offers a variety of annotators (e.g., Tokenizer, SentenceDetector, Lemmatizer). To use them, first create the appropriate pipeline.
Example using a Tokenizer:
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer

# Turns raw text into Spark NLP's internal document annotation
documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
# Splits each document into tokens
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
pipeline = Pipeline(stages=[documentAssembler, tokenizer])
4. Transform Data: Once you have a pipeline, you can transform your data.
result = pipeline.fit(data).transform(data)
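For context, a minimal end-to-end run might look like this; the example DataFrame and its single text column are illustrative:
data = spark.createDataFrame([["Spark NLP makes distributed NLP easy."]]).toDF("text")
result = pipeline.fit(data).transform(data)
# Inspect the extracted tokens
result.selectExpr("token.result").show(truncate=False)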
5. Explore and Utilize Models: Spark NLP offers pre-trained models for tasks like Named Entity Recognition (NER), sentiment analysis, and more. You can easily plug these into your pipeline and customize as needed.
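As a sketch, a pre-trained pipeline can be loaded in one line; "explain_document_dl" is one of the standard English pipelines at the time of writing (check the Models Hub for the current list):
from sparknlp.pretrained import PretrainedPipeline

# Downloads and caches the pipeline on first use
explain_pipeline = PretrainedPipeline("explain_document_dl", lang="en")
annotations = explain_pipeline.annotate("Spark NLP ships pre-trained pipelines.")
print(annotations["entities"])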
6. Further Reading: Dive deeper into the official documentation for more detailed examples, a complete list of annotators and models, and best practices for building NLP pipelines.
Short answer: 100%! Free forever, including any commercial use.
Longer answer: Yes, Spark NLP is an open-source library and can be used freely. It’s released under the Apache License 2.0. Users can use, modify, and distribute it without incurring costs.
Both spaCy and Spark NLP are popular libraries for Natural Language Processing, but Spark NLP shines when it comes to scalability and distributed processing. Here are some key differences between the two:
1. Scalability & Distributed Processing: Spark NLP is built on Apache Spark, so the same pipeline scales natively from a laptop to a cluster; spaCy is designed for efficient single-machine processing.
2. Language Models & Pretrained Pipelines: Both libraries ship pre-trained models and pipelines, but Spark NLP's Models Hub focuses on production-ready pipelines, including transformer-based models that run distributed on Spark.
3. Licensing & Versions: Both are open source; spaCy is released under the MIT license, while Spark NLP uses the Apache License 2.0.
Spark NLP provides a range of models to tackle various NLP tasks. These models are often pre-trained on large datasets and can be fine-tuned or used directly for inference. Some of the primary categories and examples of Spark NLP models include:
1. Named Entity Recognition (NER): e.g., NerDLModel
2. Text Classification: e.g., ClassifierDLModel
3. Word Embeddings: e.g., WordEmbeddings, BertEmbeddings
4. Language Models: e.g., BERT, RoBERTa, and other transformer models
5. Dependency Parsing: e.g., DependencyParserModel
6. Spell Checking and Correction: e.g., NorvigSweetingModel, ContextSpellCheckerModel
7. Sentence Embeddings: e.g., UniversalSentenceEncoder
8. Translation and Language Detection: e.g., MarianTransformer, LanguageDetectorDL
9. Text Matching: e.g., TextMatcher
10. Pretrained Pipelines: e.g., explain_document_dl
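A minimal sketch of plugging a pre-trained NER model into a pipeline; "glove_100d" and "ner_dl" are standard English model names at the time of writing, but check the Models Hub for current ones:
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, WordEmbeddings, NerDLModel

documentAssembler = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
# Pre-trained GloVe embeddings that the NER model expects as input
embeddings = WordEmbeddings.pretrained("glove_100d", "en") \
    .setInputCols(["document", "token"]).setOutputCol("embeddings")
# Pre-trained deep-learning NER model (CoNLL 2003 entity types)
ner = NerDLModel.pretrained("ner_dl", "en") \
    .setInputCols(["document", "token", "embeddings"]).setOutputCol("ner")
ner_pipeline = Pipeline(stages=[documentAssembler, tokenizer, embeddings, ner])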
For the latest list of models, detailed documentation, and usage instructions, visit the official Spark NLP Models Hub.
Prebuilt versions of Spark NLP can be obtained through multiple channels, depending on your development environment and platform:
1. PyPI (for Python Users): You can install Spark NLP using pip, the Python package installer.
pip install spark-nlp
2. Maven Central (for Java/Scala Users): If you are using Maven, you can add the following dependency to your pom.xml:
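The coordinates below assume the Scala 2.12 build of Spark NLP:
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.12</artifactId>
    <version>LATEST_VERSION</version>
</dependency>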
Make sure to replace LATEST_VERSION with the desired version of Spark NLP.
3. Spark Packages: For those using the spark-shell, pyspark, or spark-submit, you can include Spark NLP directly via Spark Packages:
--packages com.johnsnowlabs.nlp:spark-nlp_2.12:LATEST_VERSION
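For example, to start a PySpark shell with Spark NLP on the classpath:
pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.12:LATEST_VERSION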
4. Pre-trained Models & Pipelines: Apart from the library itself, Spark NLP provides a range of pre-trained models and pipelines. These can be found on the Spark NLP Model Hub.
Always make sure to consult the official documentation or the GitHub repository for the latest instructions and versions available.