
By all accounts, John Snow Labs has created the most accurate software in history to extract facts from unstructured text.
Why Spark OCR?




Use Python to customize 15+ image transformers that optimize accuracy and target specific regions and data fields in irregular documents and images.

Go beyond reading text to recognize named entities, correct spelling, de-identify data – and generate new PDF or DICOM documents that highlight these results









What’s in the box
Text or PDF
Scanned PDF
Image
DICOM
Binarizer
Adaptive Thresholding
Erosion
Layout analyzer
Skew corrections
Scaler
Adaptive scaler
Split Regions
Noise Scorer
Remove objects
Morphology opening
Cropper
Extract text from images
Extract data from tables
Entity Recognition
De-identification
Structured data
Highlighted entities
De-identified text, PDF or DICOM
Images & Regions
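
Among the image transformers listed above, adaptive thresholding is a common way to binarize a page with uneven lighting: each pixel is compared with the mean of its own neighborhood rather than with one global cutoff. The sketch below is a plain-Python illustration of that idea, not Spark OCR's implementation or API; the function name `adaptive_threshold` and the `window` and `offset` parameters are assumptions made for this example.

```python
def adaptive_threshold(image, window=3, offset=5):
    """Binarize a grayscale image given as a list of rows of 0-255 ints.

    A pixel becomes white (255) if it is brighter than the mean of its
    local window minus `offset`, otherwise black (0).
    """
    height, width = len(image), len(image[0])
    radius = window // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            # Collect the neighborhood, clipped at the image borders.
            neighbors = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(neighbors) / len(neighbors)
            row.append(255 if image[y][x] > local_mean - offset else 0)
        result.append(row)
    return result


# A bright page with one dark "ink" pixel in the middle.
page = [
    [200, 200, 200],
    [200, 50, 200],
    [200, 200, 200],
]
binary = adaptive_threshold(page)
# binary == [[255, 255, 255], [255, 0, 255], [255, 255, 255]]
```

A larger `window` follows slow lighting changes more smoothly, while a larger `offset` keeps faint background texture from turning black.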










Spark OCR in Action
in scanned PDFs
End-to-end example of a regular NER pipeline: import scanned images from cloud storage, preprocess them to improve their quality, recognize the text with Spark OCR, correct spelling mistakes to improve the OCR results, and finally run NER to extract entities.
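
The order of the stages described above can be sketched as a chain of functions, each taking a document and returning an enriched copy. This is only an illustration of the pipeline shape in plain Python: every stage is a stand-in (the "OCR" step simply copies a supplied string, and the spelling and entity tables are hypothetical), and none of the names below belong to Spark OCR.

```python
import re

# Hypothetical lookup tables; a real system would use trained models instead.
SPELLING_FIXES = {"pat1ent": "patient", "hospita1": "hospital"}
KNOWN_ENTITIES = {"boston": "LOCATION", "aspirin": "DRUG"}


def preprocess(doc):
    # Stand-in for image cleanup: only records that the step ran.
    return {**doc, "preprocessed": True}


def recognize_text(doc):
    # Stand-in for OCR: the "recognized" text is supplied with the input.
    return {**doc, "text": doc["simulated_ocr_output"]}


def correct_spelling(doc):
    # Replace words that appear in the correction table, keep the rest.
    words = [SPELLING_FIXES.get(w.lower(), w) for w in doc["text"].split()]
    return {**doc, "text": " ".join(words)}


def extract_entities(doc):
    # Tag tokens found in the (hypothetical) entity table.
    tokens = re.findall(r"[A-Za-z]+", doc["text"])
    entities = [
        (t, KNOWN_ENTITIES[t.lower()]) for t in tokens if t.lower() in KNOWN_ENTITIES
    ]
    return {**doc, "entities": entities}


PIPELINE = [preprocess, recognize_text, correct_spelling, extract_entities]


def run_pipeline(doc, stages=PIPELINE):
    # Apply each stage in order, feeding its output to the next one.
    for stage in stages:
        doc = stage(doc)
    return doc


result = run_pipeline({"simulated_ocr_output": "The pat1ent in Boston takes aspirin"})
# result["text"] == "The patient in Boston takes aspirin"
# result["entities"] == [("Boston", "LOCATION"), ("aspirin", "DRUG")]
```

Running spelling correction before entity extraction matters here: an entity that was misread by OCR would otherwise miss its match.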










scanned documents
Correcting the skew of your scanned documents will greatly improve the OCR results. Spark OCR is the only library that allows you to fine-tune the image preprocessing for excellent OCR results.
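
To make the idea of skew correction concrete, the sketch below estimates the tilt of a text baseline with a least-squares fit and then rotates the points back to horizontal. It is a simplified illustration under stated assumptions (the input is a list of `(x, y)` points sampled along one text line, and the function names are invented for this example); it does not describe how Spark OCR detects or corrects skew.

```python
import math


def estimate_skew_degrees(points):
    """Estimate the tilt of a baseline from (x, y) points via least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    # atan2(covariance, variance) is the angle of the best-fit line.
    return math.degrees(math.atan2(covariance, variance))


def rotate_points(points, degrees):
    """Rotate (x, y) points about the origin by the given angle."""
    theta = math.radians(degrees)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(x * cos_t - y * sin_t, x * sin_t + y * cos_t) for x, y in points]


# A baseline tilted by 5 degrees.
slope = math.tan(math.radians(5))
baseline = [(x, x * slope) for x in range(0, 100, 10)]

angle = estimate_skew_degrees(baseline)      # approximately 5.0
deskewed = rotate_points(baseline, -angle)   # y values are approximately 0
```

On a real page the same rotation would be applied to the whole image, and the angle would usually be estimated from many text lines rather than one.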








By using image segmentation and preprocessing techniques, Spark OCR recognizes and extracts text from natural scenes.










Removing the background noise in a scanned document will greatly improve the OCR results. Spark OCR is the only library that allows you to fine-tune the image preprocessing for excellent OCR results.
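
One simple way to see why noise removal helps is a median filter, which replaces each pixel with the median of its neighborhood and so erases isolated specks while keeping larger strokes. The sketch below is a plain-Python illustration of that general technique; the function name `median_filter` and its `radius` parameter are assumptions for this example, and it is not a description of Spark OCR's own noise handling.

```python
from statistics import median_low


def median_filter(image, radius=1):
    """Replace each pixel with the median of its neighborhood.

    `image` is a list of rows of 0-255 ints; borders use a clipped window.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            neighbors = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # median_low always returns a value that is present in the window.
            row.append(median_low(neighbors))
        result.append(row)
    return result


# A white page with a single black speck in the center.
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0

clean = median_filter(noisy)
# clean == [[255] * 5 for _ in range(5)]
```

A larger `radius` removes bigger specks but also starts to erode thin character strokes, which is the kind of trade-off that preprocessing tuning has to balance.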








Recognize text in DICOM documents. This feature reads both the text on the image and the text in the metadata.








