Question Answering in Visual NLP: A Picture is Worth a Thousand Answers

03.05.2023

Alberto Andreotti

Senior data scientist on the Spark NLP team

Lights, camera, action! Welcome to the future of information extraction with Visual NLP by John Snow Labs, where “OCR-Free” multi-modal AI models are stealing the show. Imagine a world where computer vision and NLP join forces to extract information accurately from forms, tables, or diagrams, even when image quality is poor and without any model training. Today, we’ll dive into the thrilling world of Document Visual Question Answering and introduce you to the Donut model, a state-of-the-art AI model that’s taking the stage by storm.

A New Era of Information Extraction: Document Visual Question Answering

Gone are the days of relying solely on traditional OCR systems to extract information from images. Visual Question Answering represents a new generation of AI models that combine computer vision and NLP to extract information more accurately. This dynamic duo can handle poor-quality images and deliver results without training or tuning models, thanks to zero-shot learning. The result? Faster, more accurate extraction, and happier end users and developers.

Meet the Donut Model: Your New BFF for Visual NLP

The Donut model (https://arxiv.org/abs/2111.15664) is the new kid on the block, and it’s making quite a splash. This innovative model achieves state-of-the-art accuracy in various benchmarks, revolutionizing the way we think about AI-powered information extraction.

The Donut model’s unique approach allows it to understand both the visual and textual elements of a document, combining these aspects to answer questions directly from an image. This groundbreaking method paves the way for more accurate, efficient, and reliable data extraction.

For example, consider this image:

How would you answer these two questions:

When is the coffee break?
Who is giving the introductory remarks?

The answers are obvious from the image, even though they are not written there explicitly. As a human, you have the common sense to know that this is part of an agenda for an event, so in all likelihood the coffee break runs from 11:14 to 11:39 a.m., and Lee A. Waller is giving the introductory remarks. This requires a combination of:

Seeing — you see the two columns align.
Reading — you can’t answer the questions without knowing how to read.
Common sense — you can tell that this is an agenda for an event without further hints, and that on an agenda times state when an event will happen, and names refer to speakers.

The Donut model can give you a correct answer to these questions directly from the image, without any training or tuning, and even without OCR! It also does that in four languages.

Visual NLP + Donut Model = A Match Made in Heaven

Integrating the Donut model with John Snow Labs’ Visual NLP (https://www.johnsnowlabs.com/visual-nlp/) creates a powerful combination that brings numerous concrete benefits to the table:

Efficient handling of complex documents: The combined strengths of the Donut model and Visual NLP allow for efficient processing and understanding of long, intricate, and unusually formatted documents, making data extraction more reliable. Easily combine Donut with other models like Table Recognition and Document Classifiers to apply VQA on the specific sections you care about.
Superior accuracy: The integration of the Donut model with Visual NLP leads to exceptional accuracy in data extraction, reducing the prevalence of errors and improving overall data quality. Apply Image Enhancement before VQA, and have access to purpose-specific fine-tuned models.
Streamlined workflows: The integration simplifies the implementation process, allowing developers to focus on other tasks while leveraging the power of the Donut model within Visual NLP. Leverage Visual NLP serving options like LightPipelines or Data Streaming.
Enhanced scalability: By combining the Donut model with Visual NLP’s powerful distributed processing capabilities, users can scale up their data extraction processes to handle large volumes of documents more efficiently. Scale up computation through Apache Spark to millions of records without changing your pipeline.
Better adaptability: The combination of Donut and Visual NLP provides improved adaptability, allowing users to tackle a wider range of extraction tasks and handle different types of documents with ease. Use out-of-the-box data ingestion layers to consume DocX, PDFs, and many others!

Donut Model in Action: A Picture-Perfect Example

Ready to see the Donut model in action? Let’s take a look at the code required to run Donut inside Visual NLP and explore a concrete example of an image that the model can answer questions from directly.

Define the pipeline

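from pyspark.ml import PipelineModel
from sparkocr.enums import *
from sparkocr.transformers import *

# Convert the raw file bytes into an image column
binary_to_image = BinaryToImage()\
    .setOutputCol("image") \
    .setImageType(ImageType.TYPE_3BYTE_BGR)

# Donut-based VQA: takes the image and the questions, produces the answers
visual_question_answering = VisualQuestionAnswering()\
    .pretrained("docvqa_donut_base_opt", "en", "clinical/ocr")\
    .setInputCol(["image"])\
    .setOutputCol("answers")\
    .setQuestionsCol("questions")

# Assemble the two stages into a pipeline
pipeline = PipelineModel(stages=[
    binary_to_image,
    visual_question_answering
])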
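The call in the next step transforms a DataFrame named `image_and_questions`, which this post does not define. Here is one plausible way to build it, as a minimal sketch. It assumes that `BinaryToImage` reads raw file bytes from the `content` column produced by Spark's `binaryFile` data source, and that the questions are passed as an array of strings in the `questions` column configured above. The helper name and the image path are hypothetical.

```python
def build_image_and_questions(spark, image_path, questions):
    """Pair every image under `image_path` with the same list of questions.

    `spark` is an active SparkSession. Assumption: the pipeline expects the
    raw bytes in the `content` column written by the `binaryFile` source and
    an array of strings in the `questions` column.
    """
    # One row per file, with columns: path, modificationTime, length, content
    images = spark.read.format("binaryFile").load(image_path)
    # A single row holding the whole question list as an array of strings
    questions_df = spark.createDataFrame([(list(questions),)], ["questions"])
    # The cross join attaches the question list to every image row
    return images.crossJoin(questions_df)


# Hypothetical usage (the path and the session are placeholders):
# image_and_questions = build_image_and_questions(
#     spark,
#     "agenda.png",
#     ["When is the coffee break?", "Who is giving the introductory remarks?"],
# )
```

Keeping the questions in their own single-row DataFrame means the same list can be reused across a whole folder of images without repeating it per file.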

Call the pipeline

%%time
from pyspark.sql.functions import explode
results = pipeline.transform(image_and_questions).cache()
results.select(results.answers).show(truncate=False)
+--------------------------------------------------------------------------
|answers
+--------------------------------------------------------------------------
|[ When is the Coffee Break? -> 11:34 to 11:39 a.m., 
   Who is giving the Introductory Remarks? -> lee a. waller, trrf vice presi- ident|
+--------------------------------------------------------------------------
CPU times: user 12.1 ms, sys: 10.7 ms, total: 22.8 ms
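If you want to use the answers programmatically, the printed strings can be split back into question/answer pairs. The sketch below assumes each entry of the `answers` column is a string of the form `question -> answer`, as the output above suggests; the `parse_answers` helper is hypothetical and not part of Visual NLP.

```python
def parse_answers(answers):
    """Turn ['question -> answer', ...] into a {question: answer} dict."""
    parsed = {}
    for item in answers:
        # Split on the first " -> " only, so arrows inside an answer survive
        question, separator, answer = item.partition(" -> ")
        if not separator:
            continue  # skip entries that do not follow the assumed pattern
        parsed[question.strip()] = answer.strip()
    return parsed


# Strings copied from the output shown above
sample = [
    " When is the Coffee Break? -> 11:34 to 11:39 a.m.",
    " Who is giving the Introductory Remarks? -> lee a. waller, trrf vice presi- ident",
]
qa = parse_answers(sample)
print(qa["When is the Coffee Break?"])
```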

And there’s more!

This was a super simplified example for us to build intuition around the topic. If you want to explore more complex use cases, I encourage you to take a look at the two notebooks listed below:

VisualQuestionAnsweringOnInvoices.ipynb: explains how to process invoices out of a collection of documents. It first classifies the documents into different categories, keeps only the invoices, and then asks some questions about the total amount billed.

VisualQuestionAnsweringOnTables.ipynb: similarly, this one deals with documents from the Australian Stock Exchange, trying to automatically determine the number of shares a company’s director has acquired or sold.
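The flow of the invoices notebook (classify, keep the invoices, ask questions) can be illustrated in plain Python. This is only a sketch of the routing logic, not the notebook's code: the `path` and `label` keys, the `"invoice"` label value, the file names, and the `select_invoices_with_questions` helper are all hypothetical, and in a real pipeline the labels would be predicted by a document classifier rather than hard-coded.

```python
def select_invoices_with_questions(documents, questions, invoice_label="invoice"):
    """Keep documents labeled as invoices and pair each with the questions.

    `documents` is a list of dicts with the hypothetical keys "path" and
    "label"; a document classifier would normally supply the label.
    """
    return [
        {"path": doc["path"], "questions": list(questions)}
        for doc in documents
        if doc["label"] == invoice_label
    ]


# Toy input: the labels are hard-coded purely for illustration
classified = [
    {"path": "doc_001.png", "label": "invoice"},
    {"path": "doc_002.png", "label": "letter"},
    {"path": "doc_003.png", "label": "invoice"},
]
vqa_inputs = select_invoices_with_questions(
    classified, ["What is the total amount?"]
)
```

Filtering before asking questions keeps the VQA model from spending time on documents where the question makes no sense.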


Our additional expert:

Alberto Andreotti is a senior data scientist on the Spark NLP team at John Snow Labs, where he implements state-of-the-art NLP algorithms on top of Spark. He has a decade of experience working for companies and as a consultant, specializing in the field of machine learning. Alberto has written lots of low-level code in C/C++ and was an early Scala enthusiast and developer. A lifelong learner, he holds degrees in engineering and computer science and is working on a third in AI. Alberto was born in Argentina. He enjoys the outdoors, particularly hiking and camping in the mountains of Argentina.