John Snow Labs is pleased to announce the availability of its Natural Language Processing software library for Apache Spark. The library provides simple, high-performing and accurate NLP annotations for machine learning pipelines, which scale easily in a distributed environment.


The John Snow Labs NLP Library is built on top of Apache Spark ML, providing three advantages:

  1. Unmatched runtime performance, since processing is done directly on Spark DataFrames without any copying, taking full advantage of Spark’s caching, execution planning and optimized binary data format.
  2. Frictionless reuse of existing Spark libraries, including distributed topic modelling, word embeddings, n-gram calculation, string distance calculations and more.
  3. Higher productivity by using a unified API across the Natural Language Understanding, Machine Learning & Deep Learning parts of a data science pipeline.
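
To make the "unified API" idea in the list above concrete, here is a toy, standard-library-only sketch in Python. It is not Spark ML or the John Snow Labs API: the `Stage` and `Pipeline` classes, the column names and the helper functions are all hypothetical, chosen only to show how NLP steps and a feature step can share one `transform` interface and annotate the same rows in place rather than copying them.

```python
# Toy model of a unified pipeline API. Every name here is hypothetical and
# stands in for the real thing; nothing below calls Spark or the NLP library.
from typing import Callable, Dict, List

Row = Dict[str, object]


class Stage:
    """A pipeline stage that reads one column and writes a new one."""

    def __init__(self, input_col: str, output_col: str, fn: Callable):
        self.input_col = input_col
        self.output_col = output_col
        self.fn = fn

    def transform(self, rows: List[Row]) -> List[Row]:
        # Annotate each row in place by adding a column; no copy is made.
        for row in rows:
            row[self.output_col] = self.fn(row[self.input_col])
        return rows


class Pipeline:
    """Runs stages in order; every stage exposes the same transform()."""

    def __init__(self, stages: List[Stage]):
        self.stages = list(stages)

    def transform(self, rows: List[Row]) -> List[Row]:
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


def bigrams(tokens: List[str]) -> List[str]:
    # Join each adjacent pair of tokens into a two-word n-gram.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


tokenizer = Stage("text", "tokens", lambda text: text.lower().split())
ngrams = Stage("tokens", "bigrams", bigrams)
token_count = Stage("tokens", "n_tokens", len)  # stand-in for an ML feature step

pipeline = Pipeline([tokenizer, ngrams, token_count])
rows = pipeline.transform([{"text": "Spark scales NLP pipelines"}])
print(rows[0]["bigrams"])
```

Because every stage has the same interface, text annotation and feature computation compose in a single pipeline object. That composition is the design point the list describes; the real library builds on Spark ML's own pipeline abstractions rather than on plain Python lists.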



High-Performance NLP for Apache Spark

The NLP library is written in Scala, and includes Scala and Python APIs. It has no dependency on any other NLP or ML library. The code has been reviewed by Databricks’ machine learning engineers for fit to Spark ML’s current and future design. The library is released as open source under the Apache 2.0 license.

“With JSL-NLP, we’re delivering on the promise to enable customers to take advantage of the latest open source technology and academic breakthroughs in data science, all within a high performance, enterprise-grade code base,” said the founding team. In addition, “JSL-NLP encompasses a wide range of highly efficient Natural Language Understanding tools for text mining, question answering, chat bots, fact extraction, topic modelling or search, running at a scale and performance that has not been available to date.”


John Snow Labs will continue sponsoring the development of the NLP library. The company provides commercial support, indemnification and consulting. This provides the library with long-term financial backing, a funded active development team, and a growing stream of real-world projects that drives robustness and roadmap prioritization.

Visit the Spark-NLP GitHub Repository to clone the code base or contribute to the project. GitHub’s issue tracker is used to manage code requests, bugs and features. The team is looking for contributors of all kinds, from general feedback to coding new algorithms.

The NLP Quickstart Guide on the project’s homepage provides full documentation on installing, using and extending NLP pipelines and annotators.