
Automating Responsible AI: Integrating Hugging Face and LangTest for More Robust Models

In the ever-evolving landscape of natural language processing (NLP) in biomedical, healthcare, and other fields, staying at the forefront of innovation is not just an aspiration; it’s a necessity. Imagine having the capability to seamlessly integrate state-of-the-art models from the Hugging Face Model Hub, harness diverse datasets effortlessly, rigorously test your models, compile comprehensive reports, evaluate medical LLMs and other language models, and ultimately supercharge your NLP projects. It may sound like a dream, but in this blog post, we’ll reveal how you can turn this dream into a reality.

Whether you’re a seasoned NLP practitioner seeking to enhance your workflow or a newcomer eager to explore the cutting edge of NLP, this blog post will guide you. Join us as we dive deep into each aspect of this synergy, unveiling the secrets to boosting your NLP projects to unprecedented levels of excellence.

This blog will explore the integration between Hugging Face, your go-to source for state-of-the-art NLP models and datasets, and LangTest, your NLP pipeline’s secret weapon for testing and optimization. Our mission is clear: to empower you with the knowledge and tools needed to elevate your NLP projects to new heights. Throughout this blog, we’ll dive deep into each aspect of this powerful integration, from leveraging Hugging Face’s Model Hub and datasets to rigorous model testing with LangTest. So, prepare for an enriching journey towards NLP excellence, where the boundaries of innovation are limitless.

from langtest import Harness

# Create a LangTest harness for a text-classification model hosted on the Hugging Face Model Hub
harness = Harness(
  task="text-classification",
  model={
    "model": "charlieoneill/distilbert-base-uncased-finetuned-tweet_eval-offensive",
    "hub": "huggingface"
  }
)

LangTest Library

LangTest is a versatile and comprehensive tool that seamlessly integrates with various NLP resources and libraries. With support for John Snow Labs, Hugging Face, and spaCy, LangTest allows users to work with their preferred NLP frameworks, making it an indispensable asset for practitioners across various domains. Moreover, LangTest extends its capabilities beyond these frameworks, offering compatibility with LLM providers such as OpenAI and AI21, so users can harness diverse language models in their NLP pipelines. This versatility ensures that LangTest caters to the varied needs of NLP professionals in their quest for NLP excellence.
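
Because the same Harness API covers every supported hub, switching frameworks only means changing the model dictionary. The snippet below is an illustrative sketch rather than code from this blog’s notebook: it assumes the spaCy pipeline en_core_web_sm is installed and uses a placeholder CoNLL file path ("sample.conll") for the NER data.

from langtest import Harness

# Illustrative sketch: an NER harness backed by a spaCy pipeline
# ("en_core_web_sm" must be installed; "sample.conll" is a placeholder path)
harness = Harness(
  task="ner",
  model={"model": "en_core_web_sm", "hub": "spacy"},
  data={"data_source": "sample.conll"}
)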

LangTest seamlessly supports Hugging Face models and datasets, streamlining the integration of these powerful NLP resources into your projects. Whether you need state-of-the-art models for various NLP tasks or diverse, high-quality datasets, LangTest simplifies the process. With LangTest, you can quickly access, load, and fine-tune Hugging Face models and effortlessly integrate their datasets, ensuring your NLP projects are built on a robust foundation of innovation and reliability.

Leveraging Hugging Face Model Hub

The Hugging Face Model Hub is one of the largest collections of machine learning models, boasting an impressive array of pre-trained NLP models. Whether you’re working on text classification, named entity recognition, sentiment analysis, or any other NLP task, a model awaits you in the Hub. The Hub brings together the collective intelligence of the NLP community, making it easy to discover, share, and fine-tune models for your specific needs.

You can load and test any model for a supported task from the HF Model Hub in LangTest with a single line: just set the hub to ‘huggingface’, give the model name, and you are ready to go, as shown in the snippet above.

Using Hugging Face Datasets in LangTest

Having access to high-quality and diverse datasets is as essential as having the right models. Fortunately, the Hugging Face ecosystem extends beyond pre-trained models, offering a wealth of curated datasets to enrich your NLP projects. In this part of our journey, we’ll explore how you can seamlessly integrate Hugging Face datasets into LangTest, creating a dynamic and data-rich environment for your NLP endeavors.

1. The Hugging Face Datasets Advantage:

Dive into the world of Hugging Face Datasets, where you’ll discover a vast collection of datasets spanning many languages and domains. These datasets are meticulously curated and formatted for easy integration into your NLP workflows.

2. Data Preparation with LangTest:

While having access to datasets is crucial, knowing how to prepare and use them effectively is equally vital. LangTest comes to the rescue by providing seamless data loading capabilities. You can integrate any dataset into your workflow by specifying a few parameters in one line. The parameters are:

  • “source” is just like the hub parameter and sets the origin of the data.
  • “data_source” is the name or path of the dataset.
  • “split”, “subset”, “feature_column”, and “target_column” can be set according to your data, and LangTest data handlers will load the data for you.

from langtest import Harness

# Pair the Hugging Face model with the tweet_eval dataset from Hugging Face Datasets
harness = Harness(
  task="text-classification",
  model={
    "model": "charlieoneill/distilbert-base-uncased-finetuned-tweet_eval-offensive",
    "hub": "huggingface"
  },
  data={
    "source": "huggingface",
    "data_source": "tweet_eval",
    "feature_column": "text",
    "target_column": "label",
  }
)

Model Testing with LangTest

It’s not just about selecting the right model and datasets; it’s about rigorously testing and fine-tuning your models to ensure they perform at their best. This part of our journey takes us into the heart of LangTest, where we’ll explore the crucial steps of model testing and validation.

LangTest allows you to test in many categories: accuracy, robustness, and bias. You can check the test lists here and inspect them in detail. You can also view examples of these tests from our tutorial notebooks. In this blog, we will focus on robustness testing and try to improve our model by re-training it on the LangTest augmented dataset.

Preparing for Testing

You can check the LangTest documentation or tutorials for a detailed view of the config. We are using the ‘lvwerra/distilbert-imdb’ model from the HF Model Hub and the ‘imdb’ dataset from HF Datasets, and we are continuing with robustness testing for this model.

from langtest import Harness

harness = Harness(
    task="text-classification",
    model={"model": "lvwerra/distilbert-imdb", "hub": "huggingface"},
    data={
        "source": "huggingface",
        "data_source": "imdb",
        "split": "test[:100]"
    },
    config={
        "tests": {
            "defaults": {"min_pass_rate": 1.0},
            "accuracy": {"min_micro_f1_score": {"min_score": 0.7}},
            # Robustness perturbations and the minimum pass rate required for each
            "robustness": {
                "titlecase": {"min_pass_rate": 0.60},
                "strip_all_punctuation": {"min_pass_rate": 0.96},
                "add_typo": {"min_pass_rate": 0.96},
                "add_speech_to_text_typo": {"min_pass_rate": 0.96},
                "adjective_antonym_swap": {"min_pass_rate": 0.96},
                "dyslexia_word_swap": {"min_pass_rate": 0.96}
            },
        }
    }
)

After creating the harness object with the desired tests and parameters, we are ready to generate the test cases. First we run .generate(), and then we inspect the generated cases with .testcases().

harness.generate()
harness.testcases()

Compiling the Comprehensive Report

We’ll begin by dissecting the key components of a comprehensive report. Understanding the structure and essential elements will enable you to use your report wisely while augmenting your model or analyzing the results. We can easily create the report in different formats using the .report() function. We will continue with the default format (a pandas DataFrame) in this blog.
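
As a minimal sketch, continuing from the harness defined above: the generated test cases are executed with .run() before the report is compiled.

harness.run()              # execute the generated test cases against the model
report = harness.report()  # default format: a pandas DataFrame of pass rates per test
print(report)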

We can see that our model failed the adjective_antonym_swap and strip_all_punctuation tests and had some failing cases in the other tests as well. LangTest allows you to automatically create an augmented training set from the test results: it perturbs the given dataset, with proportions chosen according to how each test performed.

Supercharging the Model: Insights from Testing

Testing your models in the quest for NLP excellence is just the beginning. The real magic happens when you take those testing insights and use them to supercharge your NLP models. In this part of our journey, we’ll explore how LangTest can help you harness these insights to enhance your models and push the boundaries of NLP performance.

Creating an Augmented Fine-Tuning Dataset

LangTest’s augmentation functionality empowers you to generate datasets enriched with perturbed samples, a crucial step in refining and enhancing your NLP models. This feature lets you introduce variations and diversity into your training data, ultimately leading to more robust and capable models. With LangTest, you have the tools to take your NLP excellence to the next level by augmenting your datasets and fortifying your models against various real-world challenges.

Let’s see how we can implement this functionality with only a few lines of code:

harness.augment(
  training_data={
    "source": "huggingface",
    "data_source": "imdb",
    "split": "test[300:500]"
  },
  save_data_path="augmented.csv",
  export_mode="transform"
)

Training the Model with the Augmented Dataset

When it comes to training your model with an augmented dataset, a valuable resource at your disposal is Hugging Face’s guide on fine-tuning pretrained models. This guide serves as your roadmap, providing essential insights and techniques for fine-tuning transformer models. By following it, you can navigate the process of adapting pretrained models to your specific NLP tasks while leveraging the augmented dataset generated by LangTest. This synergy between LangTest’s data augmentation capabilities and Hugging Face’s fine-tuning guidance equips you with the knowledge and tools needed to optimize your models and achieve NLP excellence: Fine-tune a pretrained model.

We will not get into the details of training a model, but you can check the complete notebook with all the code needed to follow the steps of this blog.

The Notebook.
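
For orientation, here is a minimal sketch of that step, not the notebook’s exact code: it assumes the augmented.csv file produced above exposes "text" and integer "label" columns, and it fine-tunes the same ‘lvwerra/distilbert-imdb’ checkpoint with the Hugging Face Trainer.

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "lvwerra/distilbert-imdb"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# Assumption: the augmented CSV has "text" and integer "label" columns
dataset = load_dataset("csv", data_files="augmented.csv")["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="distilbert-imdb-augmented",
    num_train_epochs=1,
    per_device_train_batch_size=16
)

trainer = Trainer(model=model, args=args, train_dataset=dataset)
trainer.train()

# Save the fine-tuned model and tokenizer together so they can be reloaded later
trainer.save_model("distilbert-imdb-augmented")
tokenizer.save_pretrained("distilbert-imdb-augmented")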

Evaluating the Enhanced Model’s Performance

Once you’ve fine-tuned your model using augmented data and Hugging Face’s fine-tuning guidance, evaluating your model’s performance is the next crucial step on the journey to NLP excellence. This section will explore comprehensive techniques and metrics to assess how well your enhanced model performs across a range of NLP tasks. We’ll dive into areas such as accuracy, precision, recall, F1-score, and more, ensuring you thoroughly understand your model’s strengths and areas for improvement. By the end of this part, you’ll be well-equipped to measure the true impact of your enhanced model and make informed decisions on further refinements, setting the stage for achieving exceptional NLP performance.
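
As one illustrative option (again, not the notebook’s exact code), the sketch below scores the fine-tuned model on a held-out IMDB slice with scikit-learn’s classification report; it assumes the model was saved to the distilbert-imdb-augmented directory used above.

from datasets import load_dataset
from sklearn.metrics import classification_report
from transformers import pipeline

# Assumption: the fine-tuned model and tokenizer were saved locally to this directory
clf = pipeline("text-classification", model="distilbert-imdb-augmented")

# Score on a shuffled slice of the IMDB test set
test_set = load_dataset("imdb", split="test").shuffle(seed=42).select(range(200))
preds = clf(test_set["text"], truncation=True)

# Map the predicted label names back to integer ids for scoring
label2id = clf.model.config.label2id
pred_ids = [label2id[p["label"]] for p in preds]

# Accuracy, precision, recall, and per-class F1 scores
print(classification_report(test_set["label"], pred_ids))

You can also point a fresh Harness at the fine-tuned checkpoint and re-run the same robustness tests to compare pass rates before and after augmentation.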
