New Spark OCR 3.12: Handwritten Text Recognition and Spark 3.2 support

20.04.2022

Alberto Andreotti

Senior data scientist on the Spark NLP team

This release comes with new models for Handwritten Text Recognition, Spark 3.2 support, bug fixes, and notebook examples.

Added to the ImageTextDetectorV2

New parameter ‘mergeIntersects’: merges overlapping bounding boxes of detected text regions when they belong to the same text line.
New parameter ‘forceProcessing’: forces the results to be computed once, so they are not recomputed in pipelines where several transformers consume the same results.
New feature: the sizeThreshold parameter sets the expected size of the recognized text. When sizeThreshold is set to -1, the text size is now detected automatically.
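
To make the idea behind ‘mergeIntersects’ concrete, here is a minimal pure-Python sketch of merging overlapping boxes that sit on the same text line. It is not Spark OCR's implementation: the (x0, y0, x1, y1) box format, the function names, and the "same line" criterion (vertical overlap of at least half the shorter box) are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box format for this sketch: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


def intersects(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def same_line(a: Box, b: Box, min_overlap: float = 0.5) -> bool:
    """Assumed criterion: the vertical overlap covers at least
    min_overlap of the height of the shorter box."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    shorter = min(a[3] - a[1], b[3] - b[1])
    return shorter > 0 and overlap / shorter >= min_overlap


def merge_intersecting(boxes: List[Box]) -> List[Box]:
    """Merge boxes that intersect and lie on the same line."""
    merged: List[Box] = []
    for box in sorted(boxes):
        current = box
        changed = True
        # Keep absorbing already-merged boxes until nothing else matches.
        while changed:
            changed = False
            for other in merged:
                if intersects(current, other) and same_line(current, other):
                    merged.remove(other)
                    current = (
                        min(current[0], other[0]),
                        min(current[1], other[1]),
                        max(current[2], other[2]),
                        max(current[3], other[3]),
                    )
                    changed = True
                    break
        merged.append(current)
    return merged


if __name__ == "__main__":
    regions = [(0, 0, 50, 20), (40, 2, 100, 22), (0, 40, 60, 60)]
    print(sorted(merge_intersecting(regions)))
```

In this sketch the first two boxes overlap on one line and collapse into a single region, while the third box, which sits on a different line, is left unchanged.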

Added to the ImageToTextV2

New parameter ‘usePandasUdf’: enables Pandas UDFs so that batches can be processed internally.
New support for formatted text and HOCR output formats:

ocr.setOutputFormat(OcrOutputFormat.HOCR)

ocr.setOutputFormat(OcrOutputFormat.FORMATTED_TEXT)
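
hOCR is an HTML-based format in which each recognized word carries its bounding box in a title attribute. As a rough illustration of how such output could be consumed downstream, the sketch below extracts words and boxes using only the Python standard library. The sample string is hand-written for this example following the general hOCR convention; it is not actual Spark OCR output, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple

BBox = Tuple[int, ...]

# Hand-written sample (an assumption, not real Spark OCR output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 800 600'>
  <span class='ocr_line' title='bbox 10 20 300 50'>
    <span class='ocrx_word' title='bbox 10 20 110 50; x_wconf 95'>Handwritten</span>
    <span class='ocrx_word' title='bbox 120 20 300 50; x_wconf 91'>text</span>
  </span>
</div>
"""


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs from spans with class 'ocrx_word'."""

    def __init__(self) -> None:
        super().__init__()
        self.words: List[Tuple[str, Optional[BBox]]] = []
        self._in_word = False
        self._bbox: Optional[BBox] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = None
            # The title holds ';'-separated properties, e.g. 'bbox x0 y0 x1 y1'.
            for prop in (attributes.get("title") or "").split(";"):
                parts = prop.split()
                if len(parts) == 5 and parts[0] == "bbox":
                    self._bbox = tuple(int(v) for v in parts[1:])

    def handle_data(self, data):
        if self._in_word and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_word = False


def parse_hocr_words(hocr: str) -> List[Tuple[str, Optional[BBox]]]:
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word, bbox in parse_hocr_words(SAMPLE_HOCR):
        print(word, bbox)
```

A dedicated HTML or XML library would be a reasonable alternative; the standard-library parser is used here only to keep the sketch dependency-free.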

Support for Spark 3.2

We added support for Spark 3.2; check the installation instructions below. We also improved the documentation on the website.

New Models

ocr_small_printed: small text recognition model for printed text, based on ImageToTextV2
ocr_small_handwritten: small text recognition model for handwritten text, based on ImageToTextV2
ocr_base_handwritten: base text recognition model for handwritten text, based on ImageToTextV2

New notebooks

SparkOcrImageToTextV2OutputFormats.ipynb: demonstrates the different output formats for ImageToTextV2

Get & Install it here



Our additional expert:

Alberto Andreotti is a senior data scientist on the Spark NLP team at John Snow Labs, where he implements state-of-the-art NLP algorithms on top of Spark. He has a decade of experience working for companies and as a consultant, specializing in the field of machine learning. Alberto has written lots of low-level code in C/C++ and was an early Scala enthusiast and developer. A lifelong learner, he holds degrees in engineering and computer science and is working on a third in AI. Alberto was born in Argentina. He enjoys the outdoors, particularly hiking and camping in the mountains of Argentina.