Stepping Up Information Extraction Capabilities for Virginia Tech with Spark OCR

15.07.2022

Rahul Awati

PMP-certified project manager

John Snow Labs is well known for helping healthcare and life science companies build, deploy, and operate AI products and services with Spark NLP, one of the most widely used NLP libraries. Building on its state-of-the-art Spark NLP technology, the company has developed Spark OCR, a high-accuracy text recognition product for real-world noisy images, and made it available under a free license to academic researchers, educators, and students to support research and teaching in academia.

Spark OCR is a commercial extension of Spark NLP that performs optical character recognition on photos, medical images, and scanned PDF documents. The goal of the free academic license is to let people easily reuse, reproduce, and improve production-grade, cutting-edge NLP in research and teaching. To improve text recognition, Spark OCR supports image preprocessing capabilities such as adaptive thresholding, denoising, skew detection, adaptive scaling, and layout analysis. The free license comes loaded with features, including pre-trained models, and the library itself is built on top of Apache Spark.

Spark OCR’s standout features

Image preprocessing algorithms: Spark OCR includes image preprocessing algorithms that improve image quality so that noisy images can be analyzed more accurately. When you start from low-quality images, steps such as image cropping, color correction, and layout analysis suppress unwanted distortions and improve text recognition results.
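To make the idea of preprocessing concrete, here is a minimal pure-Python sketch of adaptive (local mean) thresholding, one of the preprocessing techniques listed earlier. It is an illustration only and not Spark OCR's implementation; the function name, the `window` and `offset` parameters, and the sample image are assumptions made for this example.

```python
def adaptive_threshold(image, window=3, offset=15):
    """Binarize a grayscale image given as a list of rows (values 0-255).

    A pixel becomes ink (0) when it is darker than the mean of its local
    window by more than `offset`; otherwise it becomes background (255).
    Hypothetical helper for illustration, not part of Spark OCR.
    """
    height, width = len(image), len(image[0])
    radius = window // 2
    result = []
    for r in range(height):
        row = []
        for c in range(width):
            # Collect the neighbourhood, clipped at the image borders.
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - radius), min(height, r + radius + 1))
                for cc in range(max(0, c - radius), min(width, c + radius + 1))
            ]
            local_mean = sum(neighbours) / len(neighbours)
            row.append(0 if image[r][c] < local_mean - offset else 255)
        result.append(row)
    return result


# Made-up sample: uneven lighting (dark on the left, bright on the right)
# with a single darker "ink" pixel in the middle.
sample = [
    [100, 120, 140, 160, 180],
    [100, 120, 90, 160, 180],
    [100, 120, 140, 160, 180],
]
binary = adaptive_threshold(sample)
```

A single global cutoff of 128 would also classify the dim left-hand columns as ink; comparing each pixel with its own neighbourhood keeps only the pixel that is locally dark.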

Text recognition by combining NLP and OCR pipelines: Because Spark OCR and Spark NLP are tightly integrated, users can combine them to extract text from photos, extract data from PDFs and tables, recognize and highlight named entities in PDF documents, and mask sensitive text to de-identify images. Spark OCR aims to add borderless tables, dark and noisy backgrounds, unusual table layouts, multilingual text, and international number and currency formats to its list of supported features. The software delivers high accuracy (93%) in text recognition. Spark OCR also supports English, German, Spanish, Russian, and Arabic, which makes it possible to extract multilingual text. The analysis runs on your own infrastructure; no data is sent to John Snow Labs or any third party.

Skew detection and correction: Scanned images are almost always skewed to some degree, and detecting that skew is critical for document recognition systems. The skew of a scanned document image indicates how far the text lines deviate from the horizontal or vertical axis. With Spark OCR, you can automatically correct skew in your scanned documents, which improves recognition results.
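As an illustration of what skew detection involves, the sketch below estimates a skew angle with a projection-profile search: it tries a range of candidate angles and keeps the one that packs the text pixels into the fewest, densest rows. This is a generic textbook technique shown under simplifying assumptions, not a description of how Spark OCR detects skew; the function names, the angle range, and the synthetic test data are all hypothetical.

```python
import math
from collections import Counter


def estimate_skew_degrees(points, max_angle=10.0, step=0.5):
    """Estimate the skew of text pixels given as (x, y) coordinates.

    For each candidate angle, the points are rotated back by that angle and
    counted per row. Well-aligned text concentrates pixels in a few rows,
    which maximizes the sum of squared row counts.
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        a = math.radians(angle)
        cos_a, sin_a = math.cos(a), math.sin(a)
        rows = Counter(round(y * cos_a - x * sin_a) for x, y in points)
        score = sum(count * count for count in rows.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


def synthetic_text_lines(angle_degrees, baselines=(20, 50, 80), width=200):
    """Build made-up 'text lines': straight rows of pixels tilted by an angle."""
    slope = math.tan(math.radians(angle_degrees))
    return [(x, y0 + x * slope) for y0 in baselines for x in range(width)]


estimated = estimate_skew_degrees(synthetic_text_lines(3.0))
```

Once the angle is known, rotating the image by the opposite angle straightens the text lines before recognition. Real scans are noisier than this synthetic input, so production systems typically use finer angle steps and more robust scoring.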

DICOM to text: The Digital Imaging and Communications in Medicine (DICOM) standard defines a non-proprietary data transfer protocol, digital image format, and file structure for biomedical images and image-related data. With Spark OCR, you can extract text from DICOM images.

Spark OCR drives efficiencies for Virginia Tech

Virginia Tech is a public land-grant research university with its main campus in Blacksburg, Virginia. The university is known for its strong research activity, both technical and non-technical. John Snow Labs counts the university as one of its most prestigious customers. The University Libraries used Spark OCR in their data extraction projects and achieved excellent results. One example is a collection of 5,100 scanned physical note cards that describe a curated historic collection of costumes and textiles. Due to the various layouts of the information on the note cards, the University Libraries found it challenging to transfer the information typed on the note cards into a digital format.

Virginia Tech’s technical challenge was to retrieve the text from the structured data on each note card, including formatting information. Using an MRCNN, the University Libraries first detected the layout and then ran Spark OCR on the different parts of the layout. “Spark OCR’s pre-trained models and algorithms make this task a breeze,” says Chreston Miller, PhD in Computer Science and Applications, Data and Informatics Consultant, Engineering, Data Services, University Libraries, Virginia Tech. He also noted that transfer learning and image augmentation helped the CNNs attain highly accurate and predictable results. The university also used the ImageScaler to scale up the images and increase text recognition accuracy.

Dr. Miller also used Spark OCR in another important project. A humanities student at Virginia Tech wanted to transcribe 50 historic newspapers into digital text, and Dr. Miller used Spark OCR to extract the desired text from these newspapers. Because a pipeline was already in place from the previous project, the team generated project-quality results in one day.

Spark OCR is also Dr. Miller’s go-to tool for performing OCR on collections contracts for the University Libraries. These contracts represent agreements the University Libraries have with vendors who provide access to electronic content, such as online journals. The aim was to identify which collections allow Text and Data Mining (TDM) to be legally performed on them. Spark OCR allowed Dr. Miller to automatically read scans of hundreds of pages of legal contracts, making the text searchable for TDM terms and conditions.
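Once contract pages exist as plain text, finding the relevant clauses is an ordinary text search. The sketch below shows one way this could look, assuming the OCR output is available as a list of page strings; it is not the University Libraries' actual workflow, and the search pattern, the function name, and the sample contract text are made up for illustration.

```python
import re

# Hypothetical search terms; a real review would likely use a broader list.
TDM_PATTERN = re.compile(
    r"\b(?:TDM|text\s+and\s+data\s+mining|text\s+mining|data\s+mining)\b",
    re.IGNORECASE,
)


def find_tdm_mentions(pages, context=40):
    """Return (page_number, matched_term, snippet) for each TDM mention.

    `pages` is a list of page texts, for example one string per OCR'd page.
    The snippet includes up to `context` characters on either side of the
    match, with whitespace collapsed so OCR line breaks do not split it.
    """
    hits = []
    for page_number, text in enumerate(pages, start=1):
        for match in TDM_PATTERN.finditer(text):
            start = max(0, match.start() - context)
            end = min(len(text), match.end() + context)
            snippet = " ".join(text[start:end].split())
            hits.append((page_number, match.group(0), snippet))
    return hits


# Made-up contract text standing in for OCR output.
sample_pages = [
    "The Licensee may display the Licensed Content to Authorized Users.",
    "Authorized Users may perform Text and\nData Mining (TDM) for research.",
    "This Agreement is governed by the laws of the Commonwealth.",
]
mentions = find_tdm_mentions(sample_pages)
```

Matching on `\s+` between words lets the search tolerate line breaks introduced during OCR, and the page numbers point a reviewer straight to the clauses that need a careful legal read.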

Making many more projects possible

Spark OCR is a proven tool for helping researchers in the humanities, law, finance, medicine, and other disciplines extract exactly the data they need from many types of files. It is a simple solution for precise information extraction, getting facts out of noisy and unusual documents. Just a few lines of code are needed to run machine learning pipelines in a distributed environment.

The number of real-world use cases for Spark OCR is growing rapidly across industries, and university projects are no exception. It improves the quality of outcomes by digesting vast quantities of unstructured data and interpreting it in meaningful ways. In academia, Spark OCR helps universities uncover valuable insights that support better decisions.

Rahul Awati is a PMP-certified project manager, technology enthusiast and writer with over a decade of experience in the IT industry. He's worked on several large IT infrastructure projects spanning across storage, compute and networks. In his writing, he's covered topics including enterprise networking, cybersecurity, artificial intelligence, robotic process automation and cloud. He holds a master's degree in computer applications.
