Empowering Spark NLP for underrepresented languages

24.11.2021

Stepheni Hass

As in many fields of science, the vast majority of AI and NLP research and tools are developed in English, while other languages with a similar number of speakers are underrepresented and fall behind.

By leveraging its internal resources as well as the power of its fast-growing community, John Snow Labs is addressing the problem of underrepresented languages and facilitating the use of the open-source Spark NLP library for over 190 languages.

The problem of underrepresented languages in NLP

Most NLP models are trained on English datasets, and other languages remain underrepresented. This comes in handy for the 17% of the world’s population who understand English and are looking for novel research on the topic, and for those who’d like to use the models to analyze their own text written in English.

The problem is that most data around the globe is not generated in English. As an example, doctors at a local hospital in Istanbul probably write their reports in Turkish, not in English. Because most NLP models are trained on English datasets, data collected in Turkish is largely ignored, and models for the language are either missing or trained on poor datasets. Opportunities to use NLP to understand text in Turkish are therefore limited.

The rise of multilingual NLP models

To close the language gap in NLP, industry and researchers are increasingly focusing on NLP libraries for other languages, such as French, German, and Swedish. However, with over 7,000 languages spoken in the world, many are underrepresented and, without the right approach, will remain in the dark for many more years.

John Snow Labs started to train its Spark NLP library in non-English languages a couple of years back, using datasets in languages such as Chinese, German, French, Italian, Spanish, and Russian. As the number of languages used to train the tool grew, more and more data scientists and engineers joined the Spark NLP community and downloaded the library to perform multilingual tasks. The increase in languages correlated directly with the number of GitHub downloads.

A community approach to facilitate NLP tasks in multiple languages

Internal team members at John Snow Labs, as well as community members from across the world, contributed to training the model on multiple datasets in a wide range of languages. Fresh datasets were used to re-train the model in existing languages and to develop it further by adding completely new languages.

Thanks to a fast-growing Spark NLP community and a very multilingual team at John Snow Labs, the number of represented languages increased from 40 to 190 in the past year and a half. Additionally, languages with a very different structure were introduced, such as Arabic, which is read from right to left.

Some of these languages, including Arabic, Chinese, Finnish, German, French, Italian, Spanish, Dutch, and Russian, are fully supported by the library, which can perform all of the following tasks for them:

Stop word removal -> filtering out words that carry little meaning, such as prepositions, pronouns, and conjunctions, before the text is analyzed
Lemmatization -> reducing any inflected form of a word to its base (dictionary) form
Part-of-Speech (POS) tagging -> assigning a part of speech to each word, such as noun, verb, or adjective
Word and sentence embeddings -> representing text as vectors for use in sentiment analysis, text classification, and Named Entity Recognition (NER)
Sentence boundary detection -> detecting where one sentence ends and another begins
Language translation
Spell checking
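
To make two of the tasks above more concrete, here is a minimal sketch of stop word removal and sentence boundary detection in plain Python. It does not use the Spark NLP API and is not how the library implements these steps: the stop word lists are tiny, hypothetical samples, the function names are made up for illustration, and the sentence splitter is a naive rule that will fail on abbreviations.

```python
import re

# Hypothetical, deliberately tiny stop word lists (assumption for illustration);
# real lists contain hundreds of entries per language.
STOP_WORDS = {
    "en": {"the", "a", "of", "in", "and", "is"},
    "de": {"der", "die", "das", "und", "in", "ist"},
}


def split_sentences(text):
    """Naive sentence boundary detection: split on whitespace that follows . ! or ?"""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def remove_stop_words(sentence, lang):
    """Lowercase the sentence, tokenize on word characters, drop stop words."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS[lang]]


def preprocess(text, lang):
    """Return one list of content tokens per detected sentence."""
    return [remove_stop_words(s, lang) for s in split_sentences(text)]


if __name__ == "__main__":
    print(preprocess("Der Arzt ist in Istanbul. Die Berichte sind lang!", "de"))
```

The point of the sketch is that both steps depend on language-specific resources (a stop word list, punctuation conventions), which is why each newly supported language needs its own data rather than a reuse of the English setup.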

Other languages, such as Filipino, still have limited support that covers only some of the tasks mentioned above. John Snow Labs is collaborating closely with the community to expand the tasks available for each language, asking members which tasks they need in which languages. This helps the team prioritize and focus on improving the library according to what the community really needs.

Looking into the future

There’s no doubt that NLP models trained in English still achieve higher precision than those trained in other languages. With few high-quality datasets available in some languages, it can be challenging to reach results above 74% precision.

However, if a doctor had to choose between using an NLP model with 74% precision and having no model at all, they would probably choose the former in order to gain valuable insights and automate tedious tasks.

What is more, the precision of Spark NLP in multiple languages is improving with every training iteration, just as it did with the English version of the model. That said, we need to start training the models on different languages now, so we can constantly improve their results and get them to where we’d like them to be in the future.

