Spark OCR is a new generation Optical Character Recognition (OCR) products and it allows for text extraction from images. It uses the power of artificial intelligence-based machine learning technology and scalability that gives Spark.
- Spark OCR will accumulate knowledge using Deep Learning algorithms.
- With Spark you could scale your applications and process billions of documents.
- With Databricks it’s easy to develop and put your projects to production.
It’s a really modern, scalable, and flexible OCR NLP solution for your business.
In this post, we will demonstrate how we built an Optical Character Recognition (OCR) pipeline using the Spark OCR library in the Databricks notebooks.
We will create an example notebook, which load image documents as Spark DataFrame, recognize text, and show results.
This post contains two parts:
- Spark OCR installation to Databricks
- Creating and executing the notebook
What is Spark OCR and Databricks?
Spark OCR provides a set of Spark ML transformers/estimators that help users create and use OCR pipelines. It is built on top of Apache Spark.
Databricks is the Unified Data Analytics Platform. It presents a web-based platform for working with Spark, that provides automated cluster management and IPython-style notebooks.
Before you start, make sure that you have:
- Spark OCR jar file
- Spark OCR python wheel file
- License key
The installation process includes the following steps:
- Installing Spark OCR library to Databricks and attaching it to the cluster
- Same step for Spark OCR python wheel file
- Adding license key
Installing libraries to Databricks
Install Spark OCR jar file or Spark OCR python wheel file to driver and executor nodes. Use Databricks library management. (More details here).
- Right-click the Workspace folder where you want to store the library.
- Select Create > Library.
- Upload jar and Python Wheel files.
Attaching the libraries to the cluster
You can create the cluster or use an existing one. (For the new cluster please follow this instruction).
Use 6.4 version of the runtime, which includes Spark 2.4.5.
Navigate to the Spark OCR library details page attach it to your cluster. Either you can set `Install automatically on all clusters` flag to have Spark OCR in all your clusters.
Adding license key
The next step is setting license key as an environment variable.
- On the cluster configuration page, click the Advanced Options toggle.
- Click the Spark tab.
- Set the JSL_OCR_LICENSE environment variable in the Environment Variables field.
More details you can found on Databricks documentation page related environment variables.
Creating the notebook
In the previous step, we added the Spark OCR jar file and Spark OCR python wheel file libraries to Databricks attached them to your cluster and set the license key.
Now let’s create the Python notebook. The full example you can download here.
First check that’s Spark OCR installed to your cluster. Just try to import Spark OCR transformers:
from sparkocr.transformers import *
It should work fine. Otherwise make sure that you installed properly 2 libraries in the previous step: Spark OCR jar file and Spark OCR python wheel file.
The example notebook has the following sections:
- Data preparation
- Spark OCR Pipeline definition
- Pipeline execution and displaying results
Download and unzip the example image files using following script:
if [! -d “$OCR_DIR“]; then
Images located in /dbfs/tmp/ocr/images folder. This dataset contains 18 TIF images.
Let’s display folder using dbutils:
Note: We can’t process here data from S3 directly from public bucket. Databricks has custom implementation of S3 file system for Hadoop. It doesn’t support anonymous credential provider.
Note: From private S3 bucket you could read data directly using following code:
imagesPath = “s3a://dev.johnsnowlabs.com/ocr/datasets/news.2B/0/*.tif”
images_example_df = spark.read.format(“binaryFile”).load(imagesPath).cache()
The following is the code to create images DataFrame. It uses read function with `BinaryFormat` parameter as format. (Images folder also contains text files, so for read only `tif` files defined mask: `*.tif`).
images_path = “/tmp/ocr/images/*.tif”
images_example_df = spark.read.format(“binaryFile”).load(images_path).cache()
To preview images from DataFrame use `BinaryToImage` transformer and `display_images` function:
from sparkocr.transformers import BinaryToImage
from sparkocr.databricks import display_images
Note: `display_images` is displaying image and the metadata for it.
For DataFrame images representation Spark OCR uses following schema:
image: struct (nullable = true)
| |– origin: string (nullable = true)
| |– height: integer (nullable = false)
| |– width: integer (nullable = false)
| |– nChannels: integer (nullable = false)
| |– mode: integer (nullable = false)
| |– resolution: integer (nullable = true)
| |– data: binary (nullable = true)
More details about the fields could be found in Spark OCR documentation.
Spark OCR pipeline definition
Spark OCR based on Spark ML. So, we need to define transformers and build pipelines.
This code defines the Pipeline for recognition text from image. It uses `BinaryToImage` and `ImageToText` transformers:
from sparkocr.transformers import *
from sparkocr.databricks import
from pyspark.ml import PipelineModel
# Transform binary data to struct image format
binary_to_image = BinaryToImage()
# Run OCR
ocr = ImageToTex()
pipeline = PipelineModel(stages=[
Pipeline execution and displaying results
Here is a code for running our pipeline by calling transform method with DataFrame as parameter.
result = pipeline().transform(images_example_df)
Note: This operation takes some time for processing 18 images on single node cluster.
Notebook with a full example you can download here.
More examples you can find in Spark OCR workshop repo.