Don't miss the NLP Summit 2022, free and online event in October 4-6. Register for freehere.
was successfully added to your cart.

New Spark OCR 3.12: Handwritten Text Recognition and Spark 3.2 support

This release comes with new models for Handwritten Text Recognition, Spark 3.2 support, bug fixes, and notebook examples.

Added to the ImageTextDetectorV2

  • New parameter ‘mergeIntersects’: merge bounding boxes corresponding to detected text regions, when multiple bounding boxes that belong to the same text line overlap.
  • New parameter ‘forceProcessing’: now you can force processing of the results to avoid repeating the computation of results in pipelines where the same results are consumed by different transformers.
  • New feature: sizeThreshold parameter sets the expected size for the recognized text. From now on, text size will be automatically detected when sizeThreshold is set to -1.

Added to the ImageToTextV2

  • New parameter ‘usePandasUdf’: support PandasUdf to allow batch processing internally.
  • New support for formatted output, and HOCR.

ocr.setOutputFormat(OcrOutputFormat.HOCR)

ocr.setOutputFormat(OcrOutputFormat.FORMATTED_TEXT)

Unknown
Unknown-2

Support for Spark 3.2

  • We added support for the latest Spark version, check the installation instructions below. Improved documentation on the website.

New Models

  • ocr_small_printed: Text recognition small model for printed text based on ImageToTextV2
  • ocr_small_handwritten: Text recognition small model for handwritten text based on ImageToTextV2
  • ocr_base_handwritten: Text recognition base model for handwritten text based on ImageToTextV2

New notebooks

+ SparkOcrImageToTextV2OutputFormats.ipynb, different output formats for ImageToTextV2

Get & Install it here

End-to-End No-Code Development of Visual NER Models for PDFs and Images

This video shows how available Visual NER models can be used for predictions, how data can be corrected and how visual models...