
ESG Document Classification

Classification of Unstructured Documents into the Environmental, Social & Governance (ESG) Taxonomy using Spark NLP


INDUSTRY: Finance

Introduction: “There is an immense amount of unstructured data generated every day that can affect companies and their position in the market. As this information continuously grows, it’s a critical task for decision makers to process, quantify and analyze this data to identify opportunity and risk. One of the important indicators in this kind of analysis is ESG (environmental, social and governance) rating, which identifies issues for a company in these critical areas.

This white paper describes how this is done automatically for documents continuously ingested from news sources around the world. The models have been deployed in production as part of a big data analytics platform of a leading data provider to the financial services industry.”

Challenge: “The goal of ESG classification is to automate the analysis of data records to identify ESG issues. Natural Language Processing (NLP) techniques are used to automatically assign ESG tags to the target unstructured data records. Artificial Intelligence (AI) models trained on properly curated datasets can accomplish this task. The analysis yields, for each data record, a keyword distribution over the three main criteria: environmental, social and governance.

The main stages of the project workstream were:

1. Create a taxonomy of signals and sub-signals that capture when ESG-related events happen.
2. Work with content experts to find content for each signal or sub-signal combination for the training and validation datasets. Borrow or expand concepts from other sources for the taxonomy and work with content experts to arrive at a final taxonomy. Create a negative dataset of documents that contain no ESG signals.
3. Identify keywords and phrases that can be used to aid content searching to capture data for each signal and sub-signal.
4. Train machine learning models to identify and tag the title and content body (unstructured text) with the actual signals or sub-signals from the taxonomy.
5. Provide precision and recall metrics for each signal or sub-signal and iteratively optimize the model to achieve the desired results.
6. Deploy the models into production so that they become a hardened, reliable, and scalable component of the big data analytics platform.”
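Step 5 can be illustrated with a minimal sketch of per-signal precision and recall. It assumes one predicted label per document, which is a simplification; the label names and the helper `per_label_precision_recall` are hypothetical and do not come from the white paper.

```python
from collections import Counter


def per_label_precision_recall(y_true, y_pred):
    """Return {label: (precision, recall)} for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label was wrong
            fn[truth] += 1  # true label was missed
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics


# Hypothetical gold labels and model outputs; "none" stands for the negative dataset.
y_true = ["environmental", "social", "governance", "social", "none"]
y_pred = ["environmental", "governance", "governance", "social", "none"]
metrics = per_label_precision_recall(y_true, y_pred)
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Reporting the two numbers per label, rather than one aggregate score, makes it easier to see which signals need more training data in the next iteration.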

Solution: “Challenges:

– Overlapping sub-signals
Tagging articles is highly subjective and depends on the individual’s perspective. In the ESG taxonomy, some sub-signals are very close in meaning and can be mislabeled not only by a model, but by a content expert as well.

– Low Quality Data
Another challenge for this project was dealing with sub-signals that had few tagged articles and with articles that had corrupted bodies. Models trained on those articles would not perform with the expected accuracy.

– The need to deploy the models to production quickly, reliably and at scale
John Snow Labs’ Spark NLP framework was chosen to train the models. The data annotation, preparation, model creation, experimentation, and deployment were also done by John Snow Labs. During the project, we defined the target ESG taxonomy used to train the AI models. We also labeled the existing dataset and trained machine learning models to identify and tag the title and content body (unstructured text) with the actual labels from the taxonomy.”

Result: “Solutions:

– Overlapping sub-signals
To mitigate this problem, articles should be tagged correctly for all sub-signals. Model training should use a weighted average of the relevance values for each sub-signal, or an adaptive weighted value if annotators have different levels of qualification. The aggregated result gives a better picture of model accuracy.
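As a rough illustration of the weighted average described above, the sketch below aggregates relevance scores from several annotators into one value per sub-signal. The annotator weights, the sub-signal names, and the function `weighted_relevance` are assumptions made for this example; the white paper does not specify the weighting scheme.

```python
def weighted_relevance(annotations, annotator_weights):
    """Aggregate relevance per sub-signal as a weighted average.

    annotations: iterable of (annotator_id, sub_signal, relevance in [0, 1]).
    annotator_weights: {annotator_id: weight}; unknown annotators get 1.0.
    """
    totals, weight_sums = {}, {}
    for annotator, sub_signal, relevance in annotations:
        weight = annotator_weights.get(annotator, 1.0)
        totals[sub_signal] = totals.get(sub_signal, 0.0) + weight * relevance
        weight_sums[sub_signal] = weight_sums.get(sub_signal, 0.0) + weight
    return {s: totals[s] / weight_sums[s] for s in totals if weight_sums[s] > 0}


# Hypothetical weights: a senior expert counts twice as much as a junior annotator.
annotator_weights = {"expert": 2.0, "junior": 1.0}
annotations = [
    ("expert", "carbon_emissions", 1.0),
    ("junior", "carbon_emissions", 0.0),
    ("expert", "pollution", 0.5),
    ("junior", "pollution", 0.5),
]
aggregated = weighted_relevance(annotations, annotator_weights)
print(aggregated)
```

With equal weights this reduces to a plain mean, so the same function covers both the simple and the qualification-adjusted case.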

– Low Quality Data
Data preprocessing consisted of removing short articles and articles with corrupted bodies. Additionally, we analyzed incorrect model predictions to make sure annotators had labeled the initial record correctly. To make sure we were not overfitting the model, the random state of the train/test/validation split was changed on every data regeneration.
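The preprocessing and re-splitting described above might look like the following sketch. The minimum word count, the corruption check (presence of the Unicode replacement character), and the 80/10/10 proportions are all assumptions chosen for illustration; the original thresholds are not given.

```python
import random

MIN_WORDS = 50  # hypothetical cutoff for "short" articles


def is_usable(body, min_words=MIN_WORDS):
    """Reject empty, short, or visibly corrupted article bodies."""
    if not body or "\ufffd" in body:  # U+FFFD marks undecodable bytes
        return False
    return len(body.split()) >= min_words


def split_dataset(records, seed, train=0.8, test=0.1):
    """Shuffle with the given seed and return (train, test, validation)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )


# Hypothetical corpus: 100 usable articles plus two that should be dropped.
corpus = [f"article {i} " + "word " * 60 for i in range(100)]
corpus += ["too short", "word " * 60 + "\ufffd"]
clean = [body for body in corpus if is_usable(body)]

# A different seed on each data regeneration yields a different partition.
train_a, test_a, val_a = split_dataset(clean, seed=1)
train_b, test_b, val_b = split_dataset(clean, seed=2)
print(len(clean), len(train_a), len(test_a), len(val_a))
```

If scores stay stable across seeds, the model is less likely to be tuned to one particular partition of the data.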

To perform this analysis effectively and process a massive number of data sources, John Snow Labs’ Spark NLP has been used to automatically analyze incoming documents and detect material ESG events. The goal of this machine learning pipeline is to automatically identify ESG material events in unstructured data records and tag them correctly.”

Our algorithm, with optimal model selection and joined hierarchy models, achieves an F1 score of 84.6% on the text dataset and 82.3% on the validation dataset when predicting the signal with the highest probability.
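The case study does not say how the hierarchy models are joined. One plausible reading, shown here purely as an assumption, is that a signal-level classifier and per-signal sub-signal classifiers are combined by multiplying their probabilities and keeping the highest-scoring pair. The function `top_prediction` and all probabilities below are hypothetical.

```python
def top_prediction(signal_probs, sub_signal_probs):
    """Return (signal, sub_signal, joint_score) with the highest joint score.

    signal_probs: {signal: P(signal)}
    sub_signal_probs: {signal: {sub_signal: P(sub_signal | signal)}}
    """
    best = None
    for signal, p_signal in signal_probs.items():
        for sub_signal, p_sub in sub_signal_probs.get(signal, {}).items():
            score = p_signal * p_sub  # assumed joining rule: product of probabilities
            if best is None or score > best[2]:
                best = (signal, sub_signal, score)
    return best


# Hypothetical model outputs for one article.
signal_probs = {"environmental": 0.7, "social": 0.2, "governance": 0.1}
sub_signal_probs = {
    "environmental": {"emissions": 0.6, "waste": 0.4},
    "social": {"labor": 0.9, "privacy": 0.1},
    "governance": {"board": 1.0},
}
best = top_prediction(signal_probs, sub_signal_probs)
print(best)
```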

John Snow Labs provided its customer excellent results by using advanced machine learning and NLP techniques. This was combined with a proven end-to-end delivery process – from annotations through data science to a large-scale production deployment.