This release comes with new models for Handwritten Text Recognition, Spark 3.2 support, bug fixes, and notebook examples.
Added to the ImageTextDetectorV2
- New parameter ‘mergeIntersects’: merge bounding boxes corresponding to detected text regions, when multiple bounding boxes that belong to the same text line overlap.
 - New parameter ‘forceProcessing’: now you can force processing of the results to avoid repeating the computation of results in pipelines where the same results are consumed by different transformers.
 - New feature: sizeThreshold parameter sets the expected size for the recognized text. From now on, text size will be automatically detected when sizeThreshold is set to -1.
 
Added to the ImageToTextV2
- New parameter ‘usePandasUdf’: support PandasUdf to allow batch processing internally.
 - New support for formatted output, and HOCR.
 
ocr.setOutputFormat(OcrOutputFormat.HOCR)
ocr.setOutputFormat(OcrOutputFormat.FORMATTED_TEXT)


Support for Spark 3.2
- We added support for the latest Spark version, check the installation instructions below. Improved documentation on the website.
 
New Models
- ocr_small_printed: Text recognition small model for printed text based on ImageToTextV2
 - ocr_small_handwritten: Text recognition small model for handwritten text based on ImageToTextV2
 - ocr_base_handwritten: Text recognition base model for handwritten text based on ImageToTextV2
 
New notebooks
+ SparkOcrImageToTextV2OutputFormats.ipynb, different output formats for ImageToTextV2





























