was successfully added to your cart.
NLP LibrarySpark OCR

New Spark OCR 3.12: Handwritten Text Recognition and Spark 3.2 support

By April 20, 2022No Comments

This release comes with new models for Handwritten Text Recognition, Spark 3.2 support, bug fixes, and notebook examples.

 

 

Added to the ImageTextDetectorV2

  • New parameter ‘mergeIntersects’: merge bounding boxes corresponding to detected text regions, when multiple bounding boxes that belong to the same text line overlap.
  • New parameter ‘forceProcessing’: now you can force processing of the results to avoid repeating the computation of results in pipelines where the same results are consumed by different transformers.
  • New feature: sizeThreshold parameter sets the expected size for the recognized text. From now on, text size will be automatically detected when sizeThreshold is set to -1.

 

 

Added to the ImageToTextV2

  • New parameter ‘usePandasUdf’: support PandasUdf to allow batch processing internally.
  • New support for formatted output, and HOCR.

ocr.setOutputFormat(OcrOutputFormat.HOCR)

ocr.setOutputFormat(OcrOutputFormat.FORMATTED_TEXT)

 

Unknown
Unknown-2

 

 

Support for Spark 3.2

  • We added support for the latest Spark version, check installation instructions below.

Improved documentation on the website.

 

 

New Models

  • ocr_small_printed: Text recognition small model for printed text based on ImageToTextV2
  • ocr_small_handwritten: Text recognition small model for handwritten text based on ImageToTextV2
  • ocr_base_handwritten: Text recognition base model for handwritten text based on ImageToTextV2

 

 

New notebooks

+ SparkOcrImageToTextV2OutputFormats.ipynb, different output formats for ImageToTextV2

 

 

Spark 3.2

 

Get & Install it here