
A Guide to Installing John Snow Labs NLP Libraries in Air-Gapped Databricks Workspaces

In this article, we present a solution for setting up John Snow Labs NLP libraries in air-gapped Databricks workspaces with no internet connection, and for downloading and loading pretrained models from the John Snow Labs Models Hub.

Note: If you want to use Spark NLP or the johnsnowlabs library in other air-gapped environments, refer to the guidelines presented in this article.


In today’s data-driven landscape, Natural Language Processing (NLP) has become a critical component for extracting insights from text data. Apache Spark, coupled with the Spark NLP library, offers a powerful framework for performing advanced NLP tasks at scale. Spark NLP ships with 20,000+ pretrained pipelines and models in more than 250 languages, supports nearly all common NLP tasks, and can be used seamlessly in a Databricks cluster. However, for organizations with stringent security requirements or limited internet access, setting up such tools in an air-gapped environment can pose unique challenges.

This article delves into the intricacies of installing and configuring Spark NLP within air-gapped Databricks workspaces. In an air-gapped scenario, where the workspace lacks direct internet connectivity, tasks like downloading pretrained models take on a different complexity. We will explore step-by-step procedures to overcome these obstacles, enabling you to harness the capabilities of Spark NLP while maintaining security and compliance.

From establishing the initial setup to procuring and loading pretrained models, we will guide you through the entire process. By the end of this article, you’ll have a comprehensive understanding of how to deploy Spark NLP in air-gapped Databricks workspaces, unlocking the potential of NLP even in the most restricted environments.

Setting Up Spark NLP in Air-Gapped Databricks Clusters

When working with air-gapped Databricks clusters, where internet access is restricted, configuring the environment to accommodate Spark NLP requires a tailored approach. One effective strategy is to utilize Databricks Custom Runtimes, which allows you to package libraries and dependencies within your cluster, facilitating the use of Spark NLP without the need for external internet connectivity.

Utilizing Databricks Custom Runtimes:

· Understanding Custom Runtimes:

Databricks Custom Runtimes enable you to create a custom environment that includes the libraries and dependencies required for Spark NLP. This approach eliminates the need to download libraries from external sources, making it ideal for air-gapped scenarios.

· Building the Custom Runtime:

To enable the use of Spark NLP in an air-gapped Databricks workspace, you need to create a custom runtime environment using Databricks Container Services, which lets you specify a Docker image when you create a cluster.

This step involves building a Docker image that encapsulates the necessary dependencies and configurations. To get started, copy the following snippet into a new Dockerfile.

		# syntax = docker/dockerfile:experimental
		FROM databricksruntime/standard:10.4-LTS
		# Declare the build argument used below; the default matches the
		# image tag used in the build commands that follow.
		ARG JOHNSNOWLABS_VERSION=4.4.4
		ENV DEBIAN_FRONTEND noninteractive
		RUN apt-get update && apt-get install -y jq
		RUN mkdir -p /databricks/jars
		RUN /databricks/python3/bin/pip3 install lxml spark-nlp-display tensorflow torch johnsnowlabs-for-databricks==${JOHNSNOWLABS_VERSION}
		# install community version && download community jar
		RUN --mount=type=secret,id=license if [ -f "/run/secrets/license" ]; then export SPARKNLP_VERSION=$(jq -r .PUBLIC_VERSION /run/secrets/license); \
			else export SPARKNLP_VERSION=$(/databricks/python3/bin/python3 -c "from johnsnowlabs import settings; print(settings.raw_version_nlp)" | tail -n 1); fi && \
			/databricks/python3/bin/pip3 install --upgrade spark-nlp==${SPARKNLP_VERSION} && \
			wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-${SPARKNLP_VERSION}.jar -P /databricks/jars
		# install spark OCR && download OCR jar
		RUN --mount=type=secret,id=license,mode=0444 export OCR_VERSION=$(jq -r .OCR_VERSION /run/secrets/license) && \
			if [ ! -z "$OCR_VERSION" ] ; then \
			/databricks/python3/bin/pip3 install spark-ocr==${OCR_VERSION} --extra-index-url https://pypi.johnsnowlabs.com/$(jq -r .SPARK_OCR_SECRET /run/secrets/license) && \
			wget https://pypi.johnsnowlabs.com/$(jq -r .SPARK_OCR_SECRET /run/secrets/license)/jars/spark-ocr-assembly-${OCR_VERSION}.jar -P /databricks/jars; fi
		# install healthcare version && download HC jar
		RUN --mount=type=secret,id=license,mode=0444 export HEALTHCARE_VERSION=$(jq -r .JSL_VERSION /run/secrets/license) && \
			if [ ! -z "$HEALTHCARE_VERSION" ] ; then \
			/databricks/python3/bin/pip3 install spark-nlp-jsl==${HEALTHCARE_VERSION} --extra-index-url https://pypi.johnsnowlabs.com/$(jq -r .SECRET /run/secrets/license) && \
			wget https://pypi.johnsnowlabs.com/$(jq -r .SECRET /run/secrets/license)/spark-nlp-jsl-${HEALTHCARE_VERSION}.jar -P /databricks/jars; fi
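The RUN steps above read several fields from the mounted license file with jq: PUBLIC_VERSION, OCR_VERSION, JSL_VERSION, and the two PyPI secrets. Before kicking off a long Docker build, it can help to sanity-check your license.json locally. The following standard-library sketch reports which of those fields are present; the field list simply mirrors the jq calls above, and a field will be absent if you have not licensed the corresponding edition:

```python
import json

# Fields the Dockerfile extracts with jq from the mounted license secret.
EXPECTED_FIELDS = {
    "PUBLIC_VERSION": "Spark NLP (community) version",
    "JSL_VERSION": "Healthcare (spark-nlp-jsl) version",
    "OCR_VERSION": "Visual NLP (spark-ocr) version",
    "SECRET": "Healthcare PyPI secret",
    "SPARK_OCR_SECRET": "Visual NLP PyPI secret",
}

def check_license(path: str) -> dict:
    """Return the expected fields that are actually present in the license file."""
    with open(path) as f:
        data = json.load(f)
    return {key: data[key] for key in EXPECTED_FIELDS if key in data}

if __name__ == "__main__":
    for key, value in check_license("license.json").items():
        print(f"{key} ({EXPECTED_FIELDS[key]}): {value}")
```

If a version field you expect is missing, resolve that with your license before building, since the corresponding Dockerfile step will silently skip that edition.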

Then you need to build a new Docker image from this Dockerfile.

If you want to use the Spark NLP community version only, you can build the Docker image with the following command in a terminal (note the trailing dot, which sets the build context to the current directory):

DOCKER_BUILDKIT=1 docker build -t jsl_db_runtime:4.4.4 .

Note: if you want to install a specific version of the library, set the JOHNSNOWLABS_VERSION build argument, for example:

DOCKER_BUILDKIT=1 docker build --build-arg JOHNSNOWLABS_VERSION=5.0.0 -t jsl_db_runtime:5.0.0 .

Otherwise, if you want to use one of the John Snow Labs enterprise editions, including Healthcare, Legal, Finance, or Visual NLP, for more advanced NLP tasks, you’ll need an air-gapped Databricks license, which you can request by filling in this form or by contacting support@johnsnowlabs.com. After receiving the license, copy it to the same directory as the Dockerfile and run the following command in a terminal.

DOCKER_BUILDKIT=1 docker build --secret id=license,src=license.json  -t jsl_db_runtime:4.4.4 .

· Push the Custom Runtime Docker Image to a Docker Repository:

Once you’ve crafted your custom Docker image containing the requisite setup for Spark NLP in an air-gapped Databricks workspace, the next step is to securely push this image to a repository. This repository will act as a centralized storage hub for your Docker images.

Databricks Container Services supports the major managed registries, such as Docker Hub, Azure Container Registry, and Amazon Elastic Container Registry. Other Docker registries that support no authentication or basic authentication are also expected to work; for example, you can run your own on-premises registry using the official Docker registry image.

· Configuring the Cluster:

After pushing the Docker image, it’s time to configure your Databricks cluster to leverage this tailored environment. By specifying your custom Docker image, you ensure that your cluster can access the required resources even in an offline, air-gapped setting. Follow these steps to configure and create a new cluster with the built Docker image:

  1. On the Create Cluster page, specify a Databricks Runtime Version that supports Databricks Container Services.
  2. Under Advanced options, select the Docker tab.
  3. Select Use your own Docker container.
  4. In the Docker Image URL field, enter the custom Docker image that you pushed; for example, the URL can look like johnsnowlabs/jsl_db_runtime:x.x.x.
  5. Select the authentication type. If your Docker registry uses basic authentication, set the username and password; otherwise, skip this step.
  6. As the last step, set the Spark configuration and the license in the cluster environment variables. To do so:
    1. Under Advanced options, select the Spark tab. In the Spark config field, add the following lines:
      		spark.serializer org.apache.spark.serializer.KryoSerializer
      		spark.kryoserializer.buffer.max 1000M
      		spark.sql.legacy.allowUntypedScalaUDF true
    2. If you use one of the John Snow Labs enterprise NLP editions:
      1. Upload your license JSON file to a path in your workspace.
      2. Under Advanced options, select the Spark tab. In the Environment variables field, add the license environment variable and set its value to the path of the uploaded license file.

As an optional step, if you have a semi-air-gapped environment and want to allow model downloads from the John Snow Labs Models Hub into your environment, you can add the AWS credentials included in your license JSON file as environment variables. You also need to whitelist https://s3.console.aws.amazon.com/s3/buckets/auxdata.johnsnowlabs.com on your firewall.
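For that semi-air-gapped setup, a short standard-library sketch can copy the credentials from the license file into the environment of a notebook session. The AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY field names are an assumption here; check your own license JSON for the exact keys it contains:

```python
import json
import os

def export_aws_credentials(license_path: str) -> list:
    """Copy AWS credentials from the license file into os.environ.

    Returns the names of the variables that were actually set. The key
    names are assumed to match the standard AWS variable names; verify
    them against your own license file.
    """
    with open(license_path) as f:
        license_data = json.load(f)

    exported = []
    for key in ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"):
        value = license_data.get(key)
        if value:
            os.environ[key] = value
            exported.append(key)
    return exported
```

For cluster-wide use, set the same variables in the cluster’s Environment variables field instead of a single notebook session.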

Load the Pretrained Models & Pipelines in an Air-Gapped Databricks Workspace

This section describes the strategies and steps required to load and utilize pretrained models and pipelines within the constraints of an air-gapped Databricks environment, enabling you to harness advanced NLP capabilities for your projects.

1. As a first step, you need to download the required models and pipelines in a connected environment. You can follow one of these paths to download the models:

– Use the library to download and sync your local models automatically. For example, you can use the following Python code to download the BERT-based OntoNotes NER model:

		from johnsnowlabs import nlp
		from sparknlp.annotator import NerDLModel

		spark = nlp.start()
		NerDLModel.pretrained("onto_bert_base_cased", "en")

Note: If you’re not sure which models or pipelines to download, johnsnowlabs provides demo notebooks covering almost any NLP task. You can run these notebooks in your local, connected environment to download the models and pipelines required for a specific NLP task.

– Alternatively, you can browse the John Snow Labs Models Hub, download models manually, and upload them to your Databricks workspace yourself.

2. After running the above code, the model is downloaded to the ~/cache_pretrained directory. You need to upload this directory to your air-gapped Databricks workspace.
One way to do this is with the Databricks CLI; run the following command to upload the cache_pretrained directory to your workspace.

databricks fs cp -r ~/cache_pretrained dbfs:/root/cache_pretrained
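Before running the copy, you may want to confirm what the local cache actually contains, since only complete model folders are worth uploading. A minimal standard-library sketch (the default cache path is the one mentioned above):

```python
import os

def list_cached_models(cache_dir: str) -> list:
    """Return the model/pipeline folder names inside the local Spark NLP
    cache directory, ignoring stray files such as partial downloads."""
    if not os.path.isdir(cache_dir):
        return []
    return sorted(
        name
        for name in os.listdir(cache_dir)
        if os.path.isdir(os.path.join(cache_dir, name))
    )

if __name__ == "__main__":
    print(list_cached_models(os.path.expanduser("~/cache_pretrained")))
```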

Alternatively, you can upload the directory through the UI or the REST API, or upload it to cloud object storage such as an S3 bucket and mount it on DBFS. Then change the Spark NLP cache directory by setting the following Spark configuration (in the cluster configuration, under Advanced options):

 sparknlp.settings.pretrained.cache_folder /dbfs/path/to/models

Use Spark NLP in Air-Gapped Databricks Workspace

Once you have successfully uploaded the necessary pretrained models and pipelines into your air-gapped Databricks workspace, you’re ready to apply Spark NLP and the johnsnowlabs libraries to your text data. Start the cluster you created, run the models and pipelines you just uploaded, and enjoy the benefits of safe, scalable, and fast NLP operations within your air-gapped Databricks workspace.
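As an illustrative sketch of that first offline run, the pipeline below assumes you downloaded and uploaded both the onto_bert_base_cased NER model used earlier and the bert_base_cased embeddings it depends on; with the cache-folder configuration in place, each pretrained() call resolves from the DBFS cache rather than the internet. Treat it as a sketch to adapt on a configured cluster, not a definitive recipe:

```python
from johnsnowlabs import nlp
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer, BertEmbeddings, NerDLModel

spark = nlp.start()

# Each pretrained() call resolves against the configured DBFS cache folder.
document = DocumentAssembler().setInputCol("text").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
embeddings = (BertEmbeddings.pretrained("bert_base_cased", "en")
              .setInputCols(["document", "token"])
              .setOutputCol("embeddings"))
ner = (NerDLModel.pretrained("onto_bert_base_cased", "en")
       .setInputCols(["document", "token", "embeddings"])
       .setOutputCol("ner"))

pipeline = Pipeline(stages=[document, tokenizer, embeddings, ner])
data = spark.createDataFrame([["John Snow Labs is based in Delaware."]]).toDF("text")
pipeline.fit(data).transform(data).select("ner.result").show(truncate=False)
```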


By following these steps, you’ve successfully harnessed the capabilities of Spark NLP in an air-gapped Databricks workspace, effectively leveraging pretrained models for sophisticated NLP tasks. This guide not only enables the utilization of advanced NLP techniques in isolated environments but also provides a pathway to achieving high compliance standards. Furthermore, this approach lays the foundation for scalable and sustainable text analysis, offering a strategic advantage in managing expanding data needs without compromising security.

