A Guide to Installing John Snow Labs NLP Libraries in Air-Gapped Databricks Workspaces

15.09.2023

Mahmood Bayeshi

Software Engineer at John Snow Labs

In this article, we introduce a solution for setting up John Snow Labs NLP libraries in air-gapped Databricks workspaces with no internet connection, and for downloading and loading pretrained models from the John Snow Labs Models Hub.

Note: If you want to use Spark NLP or JohnSnowLabs libraries in other Air-gapped environments, you should refer to the guidelines presented in this article.

Introduction

In today’s data-driven landscape, Natural Language Processing (NLP) in the clinical, legal, and financial domains has become critical for extracting insights from text data. Apache Spark and the Spark NLP library offer a powerful framework for performing advanced NLP tasks at scale. Spark NLP comes with 20,000+ pretrained pipelines and models in more than 250 languages. It supports nearly all NLP tasks, and its modules can be used seamlessly in a Databricks cluster. However, for organizations with stringent security requirements or limited internet access, setting up such tools in an air-gapped environment can pose unique challenges.

This article delves into the intricacies of installing and configuring Spark NLP within air-gapped Databricks workspaces. In an air-gapped scenario, where the workspace lacks direct internet connectivity, tasks like downloading pretrained models take on a different complexity. We will explore step-by-step procedures to overcome these obstacles, enabling you to harness the capabilities of Spark NLP while maintaining security and compliance.

From establishing the initial setup to procuring and loading pretrained models, we will guide you through the entire process. By the end of this article, you’ll have a comprehensive understanding of deploying Spark NLP in air-gapped Databricks workspaces, unlocking the potential of NLP even in the most restricted environments.

Setting Up Spark NLP in Air-Gapped Databricks Clusters

Configuring the environment to accommodate Spark NLP requires a tailored approach when working with air-gapped Databricks clusters, where internet access is restricted. One effective strategy is to utilize Databricks Custom Runtimes, which allows you to package libraries and dependencies within your cluster, facilitating the use of Spark NLP without needing external internet connectivity.

Utilizing Databricks Custom Runtimes:

· Understanding Custom Runtimes:

Databricks Custom Runtimes enable you to create a custom environment that includes the libraries and dependencies required for Spark NLP. This approach eliminates the need to download libraries from external sources, making it ideal for air-gapped scenarios.

· Building the Custom Runtime:

To enable the use of Spark NLP in an air-gapped Databricks workspace, you must create a custom runtime environment using Databricks Container Services, which lets you specify a Docker image when creating a cluster.

This step involves building a Docker image that encapsulates the necessary dependencies and configurations. To start, copy and paste the following code into a new Dockerfile.

		# syntax = docker/dockerfile:experimental
	
		ARG JOHNSNOWLABS_VERSION=4.4.4
	
		FROM databricksruntime/standard:10.4-LTS
	
		ARG JOHNSNOWLABS_VERSION
	
		ENV DEBIAN_FRONTEND=noninteractive
		# jq parses the license JSON; wget downloads the jars in the steps below
		RUN apt-get update && apt-get install -y jq wget
	
		RUN mkdir -p /databricks/jars
	
		RUN /databricks/python3/bin/pip3 install lxml spark-nlp-display tensorflow torch johnsnowlabs-for-databricks==${JOHNSNOWLABS_VERSION}
	
		# install the community version of Spark NLP && download its jar
		RUN --mount=type=secret,id=license if [ -f "/run/secrets/license" ]; then export SPARKNLP_VERSION=$(jq -r .PUBLIC_VERSION /run/secrets/license); \
			else export SPARKNLP_VERSION=$(/databricks/python3/bin/python3 -c "from johnsnowlabs import settings; print(settings.raw_version_nlp)" | tail -n 1); fi && \
			/databricks/python3/bin/pip3 install --upgrade spark-nlp==${SPARKNLP_VERSION} && \
			wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/jars/spark-nlp-assembly-${SPARKNLP_VERSION}.jar -P /databricks/jars
	
		# install spark OCR && download OCR jar
		RUN --mount=type=secret,id=license,mode=0444 export OCR_VERSION=$(jq -r .OCR_VERSION /run/secrets/license) && \
		if [ ! -z "$OCR_VERSION" ] ; then \
		/databricks/python3/bin/pip3 install spark-ocr==${OCR_VERSION} --extra-index-url https://pypi.johnsnowlabs.com/$(jq -r .SPARK_OCR_SECRET /run/secrets/license) && \
		wget https://pypi.johnsnowlabs.com/$(jq -r .SPARK_OCR_SECRET /run/secrets/license)/jars/spark-ocr-assembly-${OCR_VERSION}.jar -P /databricks/jars; fi
	
		# install healthcare version && download HC jar
		RUN --mount=type=secret,id=license,mode=0444 export HEALTHCARE_VERSION=$(jq -r .JSL_VERSION /run/secrets/license) && \
		if [ ! -z "$HEALTHCARE_VERSION" ] ; then \
		/databricks/python3/bin/pip3 install spark-nlp-jsl==${HEALTHCARE_VERSION} --extra-index-url https://pypi.johnsnowlabs.com/$(jq -r .SECRET /run/secrets/license) && \
		wget https://pypi.johnsnowlabs.com/$(jq -r .SECRET /run/secrets/license)/spark-nlp-jsl-${HEALTHCARE_VERSION}.jar -P /databricks/jars; fi

Then, you need to build a new Docker image from this Dockerfile.

If you want to use the Spark NLP community version only, you can use the following command in the terminal to build the Docker image. Run it from the directory that contains the Dockerfile; the trailing dot is the build context.

DOCKER_BUILDKIT=1 docker build -t jsl_db_runtime:4.4.4 .

Note: if you want to install a specific version of the library, you can pass JOHNSNOWLABS_VERSION as a build argument:

DOCKER_BUILDKIT=1 docker build --build-arg JOHNSNOWLABS_VERSION=5.0.0 -t jsl_db_runtime:5.0.0 .

Otherwise, if you want to use one of the johnsnowlabs enterprise editions (Healthcare, Legal, Finance, or Visual NLP) for more advanced NLP tasks, you’ll need an air-gapped Databricks license, which you can request by filling in this form or by contacting support@johnsnowlabs.com. After receiving the license, save it as license.json in the same directory as the Dockerfile and run the following command in the terminal.

DOCKER_BUILDKIT=1 docker build --secret id=license,src=license.json  -t jsl_db_runtime:4.4.4 .
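The three build variants above differ only in an optional build argument and an optional secret. As a rough sketch, the Python helper below assembles the matching argument list. The function name `docker_build_command` and its parameters are hypothetical and exist only for illustration; they are not part of any John Snow Labs or Docker tooling.

```python
import shlex

# Default of ARG JOHNSNOWLABS_VERSION in the Dockerfile shown above.
DEFAULT_VERSION = "4.4.4"


def docker_build_command(version=None, license_path=None, context="."):
    """Assemble the `docker build` arguments for the custom runtime image.

    Hypothetical helper: the flags mirror the commands shown in the article.
    """
    cmd = ["docker", "build"]
    if version is not None:
        # Override the library version baked into the image.
        cmd += ["--build-arg", f"JOHNSNOWLABS_VERSION={version}"]
    if license_path is not None:
        # Expose the license only to the RUN --mount=type=secret steps.
        cmd += ["--secret", f"id=license,src={license_path}"]
    # Tag the image with the library version; the last argument is the build context.
    cmd += ["-t", f"jsl_db_runtime:{version or DEFAULT_VERSION}", context]
    return cmd


if __name__ == "__main__":
    # BuildKit must be enabled for --secret and RUN --mount to work.
    print("DOCKER_BUILDKIT=1 " + shlex.join(docker_build_command(license_path="license.json")))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting issues.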

· Push the Custom Runtime Docker Image to a Docker repository:

Once you’ve crafted your custom Docker image containing the requisite setup for Spark NLP in an air-gapped Databricks workspace, the next step is to push this image to a repository securely. This repository will act as a centralized storage hub for your Docker images.

The following registries support this process:

Docker Hub with no auth or basic auth.
Amazon Elastic Container Registry (Amazon ECR) with IAM (except for Commercial Cloud Services (C2S)).
Azure Container Registry with basic auth.

Other Docker registries that support no auth or basic auth are also expected to work. For example, you can have your own docker registry on-premise using the official docker registry image.

· Configuring the Cluster:

After pushing the Docker image, configure your Databricks cluster to use it. Specifying your custom Docker image ensures that the cluster can access the required resources even in an offline, air-gapped setting. Follow these steps to create a new cluster with the built Docker image.

On the Create Cluster page, specify a Databricks Runtime Version that supports Databricks Container Services.
Under Advanced options, select the Docker tab.
Select Use your own Docker container.
In the Docker Image URL field, enter the custom Docker image you pushed. For example, your URL can be like johnsnowlabs/jsl_db_runtime:x.x.x.
Select the authentication type. If you have basic auth on your docker registry, you must set your username and password; otherwise, ignore this step.
As the last step, you must set the Spark configuration and, for the enterprise editions, the license in the cluster environment variables. To do so:
1. Under Advanced options, select the Spark tab. In the Spark config field, add the following lines.
```
		spark.serializer org.apache.spark.serializer.KryoSerializer
		spark.kryoserializer.buffer.max 1000M
		spark.sql.legacy.allowUntypedScalaUDF true
```
2. If you use Johnsnowlabs enterprise NLP editions,
  1. Upload your license JSON file to some path in your workspace.
  2. Under Advanced options, select the Spark tab. Add the following environment variable to the Environment variables field and set its value to the uploaded license path.
```
SPARK_NLP_LICENSE_FILE=/path/to/license.json
```
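If you create clusters programmatically rather than through the UI, the same settings can be expressed as a cluster definition. The sketch below builds such a definition as a Python dictionary. The field names (`docker_image`, `spark_conf`, `spark_env_vars`) follow the Databricks Clusters API, while the cluster name, runtime version string, node type, and worker count are placeholder assumptions that you would replace with your own values. The serializer setting is written with its full key, `spark.serializer`.

```python
import json


def cluster_spec(image_url, license_path=None, username=None, password=None):
    """Build a cluster definition that uses the custom Docker image.

    Illustrative sketch: values marked as placeholders are assumptions.
    """
    docker_image = {"url": image_url}
    if username is not None and password is not None:
        # Only needed when the registry uses basic auth.
        docker_image["basic_auth"] = {"username": username, "password": password}

    spec = {
        "cluster_name": "spark-nlp-airgapped",  # placeholder
        "spark_version": "10.4.x-scala2.12",    # should match the image's base runtime
        "node_type_id": "i3.xlarge",            # placeholder, cloud specific
        "num_workers": 2,                       # placeholder
        "docker_image": docker_image,
        "spark_conf": {
            "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
            "spark.kryoserializer.buffer.max": "1000M",
            "spark.sql.legacy.allowUntypedScalaUDF": "true",
        },
    }
    if license_path is not None:
        # Required only for the enterprise editions.
        spec["spark_env_vars"] = {"SPARK_NLP_LICENSE_FILE": license_path}
    return spec


if __name__ == "__main__":
    spec = cluster_spec("johnsnowlabs/jsl_db_runtime:4.4.4", "/path/to/license.json")
    print(json.dumps(spec, indent=2))
```

The printed JSON can serve as the request body for a cluster creation call or as input to an infrastructure-as-code tool.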

As an optional step, if you have a semi air-gapped environment and want to allow models to be downloaded from the John Snow Labs Models Hub to your local environment, you can add the AWS credentials included in your license JSON file as environment variables. You also need to whitelist https://s3.console.aws.amazon.com/s3/buckets/auxdata.johnsnowlabs.com on your firewalls.
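For the semi air-gapped case, the credentials can be copied out of the license file with a few lines of Python. This is a minimal sketch that assumes the license JSON stores them under the keys `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; check the key names in your own license file, since they are not listed in this article. Both function names are hypothetical.

```python
import json

# Assumed key names; verify them against your own license file.
AWS_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def aws_env_vars(license_path):
    """Return the AWS credential entries found in the license JSON file."""
    with open(license_path, encoding="utf-8") as fh:
        license_data = json.load(fh)
    missing = [key for key in AWS_KEYS if not license_data.get(key)]
    if missing:
        raise KeyError(f"license file lacks: {', '.join(missing)}")
    return {key: license_data[key] for key in AWS_KEYS}


def as_env_lines(env):
    """Format a mapping as KEY=value lines for the Environment variables field."""
    return "\n".join(f"{key}={value}" for key, value in env.items())
```

Calling `as_env_lines(aws_env_vars("license.json"))` yields text that can be pasted into the cluster's Environment variables field. Treat the output as a secret and avoid writing it to logs.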

Load the pretrained models & pipelines in Air-Gapped Databricks Workspace

This section describes the steps required to load and use pretrained models and pipelines within the constraints of an air-gapped Databricks environment, enabling you to apply advanced NLP capabilities to your projects.

1. As a first step, you need to download the required models and pipelines in a connected environment. You can follow one of these paths to download the models:

– Use the library to download and sync your local models automatically. For example, you can use the following Python code to download the BERT-based onto_bert_base_cased NER model:

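from johnsnowlabs import nlp
from sparknlp.annotator import NerDLModel

# Start a Spark session with the installed John Snow Labs libraries.
spark = nlp.start()

# Download the model; it is stored in the local pretrained cache.
NerDLModel.pretrained("onto_bert_base_cased", "en")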

Note: If you’re unsure which models or pipelines to download, johnsnowlabs provides a collection of notebooks that cover almost any NLP task. You can run these notebooks in your local environment to download the models and pipelines required for a specific NLP task.

– Alternatively, you can download models manually from the John Snow Labs Models Hub and then upload them to your Databricks workspace.

2. After running the above code, the model will be downloaded to the ~/cache_pretrained directory. You need to upload this directory to your air-gapped Databricks workspace.
One way to do this is with the Databricks CLI. You can run the following command to upload the cache_pretrained directory to your workspace.

databricks fs cp -r ~/cache_pretrained dbfs:/root/cache_pretrained
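Before running the upload command above, it can help to confirm what the local cache actually contains. The sketch below lists the model folders in the cache directory. It assumes each downloaded model is unpacked into its own subdirectory of `~/cache_pretrained`; the function name `list_cached_models` is hypothetical.

```python
from pathlib import Path


def list_cached_models(cache_dir="~/cache_pretrained"):
    """Return the names of the model folders in the local cache, sorted by name."""
    root = Path(cache_dir).expanduser()
    if not root.is_dir():
        # Nothing has been downloaded yet, or the cache lives elsewhere.
        return []
    # Only directories count as models; stray files such as archives are skipped.
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())


if __name__ == "__main__":
    for name in list_cached_models():
        print(name)
```

If a model you expect is missing from the list, rerun the download step in the connected environment before copying the directory.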

You can also upload the directory through the UI or the REST API, or upload it to cloud object storage such as an S3 bucket and mount it on DBFS. Then change the Spark NLP cache directory by setting the following Spark configuration (in the cluster configuration, under Advanced options):

spark.jsl.settings.pretrained.cache_folder /dbfs/path/to/models
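As an alternative to relying on the cache lookup, a model folder can also be loaded by explicit path. The sketch below is one possible approach, not the method prescribed by this article: it finds the uploaded folder by name prefix, converts the `/dbfs/...` path to a `dbfs:/...` URI, and calls `NerDLModel.load`. The helper names are hypothetical, and the sketch assumes that model folder names start with the model name followed by the language code.

```python
from pathlib import Path


def find_model_dir(cache_dir, model_name, lang="en"):
    """Return the last folder (in name order) that starts with `<model_name>_<lang>`."""
    prefix = f"{model_name}_{lang}"
    matches = sorted(
        path for path in Path(cache_dir).iterdir()
        if path.is_dir() and path.name.startswith(prefix)
    )
    if not matches:
        raise FileNotFoundError(f"no folder starting with {prefix!r} in {cache_dir}")
    return matches[-1]


def to_dbfs_uri(local_path):
    """Translate a /dbfs/... driver path into the dbfs:/... URI form."""
    text = Path(local_path).as_posix()
    if not text.startswith("/dbfs/"):
        raise ValueError(f"not a DBFS mount path: {text}")
    return "dbfs:/" + text[len("/dbfs/"):]


def load_ner_model(cache_dir, model_name, lang="en"):
    """Load a previously uploaded NER model without contacting the Models Hub."""
    # Imported lazily so the helpers above work where Spark NLP is not installed.
    from sparknlp.annotator import NerDLModel

    return NerDLModel.load(to_dbfs_uri(find_model_dir(cache_dir, model_name, lang)))
```

On the cluster, a call such as `load_ner_model("/dbfs/path/to/models", "onto_bert_base_cased")` would return the annotator ready to be placed in a pipeline.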

Use Spark NLP in Air-Gapped Databricks Workspace

Once you have uploaded the necessary pretrained models and pipelines to your air-gapped Databricks workspace, you are ready to apply Spark NLP and the johnsnowlabs libraries to your text data. Start the cluster you created and run the uploaded models and pipelines to get safe, scalable, and fast NLP operations inside your air-gapped Databricks workspace.

Conclusion

By following these steps, you’ve successfully harnessed the capabilities of Spark NLP in an air-gapped Databricks workspace, effectively leveraging pretrained models for sophisticated NLP tasks. This guide enables the utilization of advanced NLP techniques in isolated environments and provides a pathway to achieving high compliance standards. Furthermore, this approach lays the foundation for scalable and sustainable text analysis, offering a strategic advantage in managing expanding data needs without compromising security.


Our additional expert:

Mahmood Bayeshi is an experienced Software Engineer with a Bachelor's degree in Software Engineering and over 8 years of professional experience, with a primary focus on Spark and Spark NLP. Beyond this specialization, he brings a wealth of software engineering skills to the table. Proficient in languages like Java, Scala, and Python, he excels in architecting and developing complex software solutions. With expertise in Big Data technologies, cloud-native solutions, and scalable distributed frameworks like Spark, Mahmood Bayeshi is well-equipped to create scalable and efficient applications. His commitment to staying current with industry trends and his adaptability position him as a valuable asset, consistently delivering software solutions that make full use of Spark and Spark NLP.
