Visually Labeling PDF Documents in the Annotation Lab

29.09.2021

Nabin Khadka

Data Scientist at John Snow Labs

Our team worked hard to release Annotation Lab v.2.1.0 right before our NLP Summit. The main improvements added by this version are:

A new project configuration, “Visual NER Labeling”, which provides the skeleton for text annotation on scanned images.
Project Owners and Project Managers can now train open-source models too.
The UI components and navigation of Annotation Lab, as a single-page application (SPA), continue to get faster.
The application as a whole performs better, thanks to security fixes, bug fixes, and general optimizations.

Here are details of features included in this release.

Visual NER Labeling

The most exciting feature of this release is the introduction of a new type of annotation project called Visual NER Labeling. Annotating text included in image documents (e.g. scanned documents) is a common use case in many verticals but comes with several challenges. With the new Visual NER Labeling config, we aim to ease the work of annotators by allowing them to simply select text from an image and assign the corresponding label to it.

This feature is powered by Spark OCR 3.5.0, so a valid Spark OCR license is required to access it.

Here is how this can be used:

Upload a valid Spark OCR license. See how to do this here.
Create a new project, specify a name for your project, add team members if necessary, and from the list of predefined templates (Default Project Configs) choose “Visual NER Labeling”.
Update the configuration if necessary and save it. While saving the project, a confirmation dialog is displayed to let you know that the Spark OCR pipeline for Visual NER is being deployed.
Import the tasks you want to annotate (images).
Start annotating text on top of the image with simple clicks and drags.
Export annotations in your preferred format.

Create Visual NER Labeling Project

Annotate

Annotations by Project Owners

With this release, we have done some additional work to make sure that Project Owners and Managers can see the proper status of tasks by taking into account their own completions.

Project Owners often annotate and review tasks themselves, and because their completions were not taken into account, the task status was not shown correctly.

The previous release introduced a “View as” option for users who hold several roles at once (for example, one person who is Manager, Annotator, and Reviewer), but the active role was difficult to keep track of. The Task List page now shows an icon next to the “View as” option so that users can quickly choose the role they want.

Creating a new completion from an existing one is a very important feature for annotators, especially when they want to correct a submitted completion and resubmit it. For analytics purposes, however, it was difficult to identify the source of the clone. To address this, we added an extra key-value pair to the completions JSON.
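To make the idea concrete, here is a minimal sketch of how a clone can carry a reference to its source completion and how analytics code could follow that reference back. This is not the Annotation Lab implementation: the key name `cloned_from`, the `id` and `result` fields, and both helper functions are assumptions made for illustration, since the actual key is not named here.

```python
import copy
from typing import Any

# Hypothetical key name: the real key used in the completions JSON may differ.
CLONE_SOURCE_KEY = "cloned_from"


def clone_completion(source: dict[str, Any], new_id: int) -> dict[str, Any]:
    """Return an editable copy of `source` that records which completion it came from."""
    clone = copy.deepcopy(source)  # deep copy so edits never touch the original
    clone["id"] = new_id
    clone[CLONE_SOURCE_KEY] = source["id"]
    return clone


def clone_chain(completions: list[dict[str, Any]], completion_id: int) -> list[int]:
    """Follow the clone-source key from a completion back to the original one."""
    by_id = {c["id"]: c for c in completions}
    chain = [completion_id]
    while (parent := by_id[chain[-1]].get(CLONE_SOURCE_KEY)) is not None:
        chain.append(parent)
    return chain


# Example: a submitted completion is cloned, corrected, and cloned again.
original = {"id": 1, "result": [{"label": "Name", "text": "John"}]}
second = clone_completion(original, new_id=2)
third = clone_completion(second, new_id=3)
print(clone_chain([original, second, third], 3))
```

Storing only the identifier of the source keeps the JSON small while still letting an analytics job reconstruct the full correction history of a task.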

All the default Project Configs and their associated text were changed to be more oriented toward healthcare subjects. The Import and Tasks pages were reworked to become part of the single-page application. If you want to extract data from PDFs, you can use Visual NLP as well.

Our additional expert:

Nabin Khadka leads the team building the Annotation Lab at John Snow Labs. He has 7 years of experience as a software engineer, covering a broad range of technologies from web and mobile apps to distributed systems and large-scale machine learning.