Register for the 5th NLP Summit, a Free Online Conference on Sep 24-26. Register now.
was successfully added to your cart.

Mitigating Gender-Occupational Stereotypes in AI: Evaluating Language Models with the Wino Bias Test through the Langtest Library

WinoBias Explored

The concept of “Wino Bias” specifically refers to biases arising from gender-occupational stereotypes. These stereotypes manifest when certain jobs or professions are historically and culturally linked more to one gender than another. For instance, society might instinctively associate nurses with women and engineers with men based on entrenched views and not necessarily on current realities.

A noteworthy observation is that even popular models in the machine learning community, such as bert-base-uncased, xlm-roberta-base, etc, exhibit these biases. These models, trained on vast amounts of data, inevitably inherit the prejudices present in the datasets they’re trained on. As a result, they can unintentionally reinforce gender-occupational stereotypes when used in applications.

The challenge here is twofold. First, many developers and users often regard these models as state-of-the-art, inherently trusting their outputs. Second, the widespread adoption of such models means that these biases can perpetuate on a large scale, affecting numerous applications and sectors. Consequently, those who deploy or rely on these models might inadvertently spread and reinforce outdated gender norms, even if that isn’t their intention. It’s crucial for practitioners to be aware of these pitfalls. Recognizing and addressing Wino Bias and similar stereotypes ensures that emerging technologies promote fairness and representational accuracy, preventing perpetuating harmful biases.

Illustrative example

A Contrast of Actual and Langtest Evaluations in Unveiling Gender Bias

The Langtest Evaluation and the actual WinoBias Evaluation are distinct approaches to investigating gender bias in models. Langtest Evaluation modifies the original data by masking pronouns before the evaluation, enabling a probabilistic analysis that measures the likelihood of male versus female pronoun replacements in the masked text. This facilitates direct measurement of gender bias by capturing the model’s instinctive choice of gendered pronouns, even unveiling minor biases.

On the other hand, the actual WinoBias Evaluation doesn’t utilize masking. It aims for consistent accuracy in coreference decisions across stereotypical and non-stereotypical scenarios by dividing data into two categories. The first category tests the model’s ability to link gendered pronouns with stereotypically matched occupations, while the second examines the association of pronouns with non-traditionally matched occupations. Success in the WinoBias test is gauged by a model’s consistent accuracy across these two scenarios, focusing more on understanding context and forming coreference links, indirectly measuring gender bias.

The simplicity and directness afforded by the masking of pronouns in Langtest Evaluation make it a more accessible and precise approach for quantifying gender bias compared to the more indirect method of the actual WinoBias Evaluation. While WinoBias is tailored for coreference resolution tasks, Langtest Evaluation’s applicability extends to any model that offers probabilistic outputs for gendered pronouns, showcasing its versatility.

Testing in a few lines of code

!pip install langtest[transformers]

Here we specified the task as `wino-bias`, hub as `huggingface` and model as `xlm-roberta-base`

The report provides a comprehensive overview of our test outcomes using the wino-test data.

We can observe that out of these tests, a mere 12 passed while 381 encountered failure. Thus highlighting the importance of identifying and remedying gender-occupational stereotypes within AI and natural language processing in healthcare and other fields.

For a more granular view of the treatment plans and their respective similarity scores, you can check the generated results using the harness.generated_results(). The generated results show the model’s predictions for filling the masked portion. We examine the probabilities of the mask being filled with gender pronouns. Irrelevant samples, where gender pronoun replacement didn’t rank in the top five predictions, are omitted. The model’s responses are recorded in the ‘model_response’ column, and evaluations of pass/fail are made based on the absolute difference in likelihood between replacing with male or female pronouns.

Let’s examine the above illustrative example where, for a particular masked_text, we obtain the model_response and then assign a pass or fail status based on it.

Passed Test-Case: The second test case with an absolute difference below 3%. (0.51–0.49)

Failed Test-Cases: The first and third ones where the absolute difference exceeds the threshold, being well above 3%.

Understanding Langtest Harness Setup

The core of Langtest test execution revolves around a central element called the `harness`, serving as the heart of the library’s operational structure. We can envision it as a toolbox. For it to function effectively, it requires three critical components: the task it’s designed to accomplish, the model it employs, and the data it processes.

1. Task (What is the job?)

Imagine programming the “Harness” toolbox for a specific function. Given our emphasis on addressing biases, we direct it to run the “wino-bias” test. But don’t be limited by this single application; Harness is multifaceted. It can identify entities (NER), categorize texts (Text Classification), flag inappropriate content (Toxicity), and even facilitate question-answering capabilities.

2. Model (Who should do the job?)

This process is akin to selecting a specialist from a group. You’re determining which expert, or “Hugging Face fill mask model”, is best suited for the task at hand. Think of it as handpicking the most skilled artisan for a precise job.

Supported Models:

3. Data (What’s the working material?)

Finally, you must supply the content upon which the expert will operate. For our specific scenario, given that we’re focusing on bias detection, we utilize the dedicated dataset: the “Wino-test”.

The wino-test data looks like this. A text with a masked portion is presented, and the model is tasked with filling in the missing part.

You can find the LangTest library on GitHub, where you can explore its features, documentation, and the latest updates. For more information about LangTest and its capabilities, you can visit the official website at


Recognizing and rectifying gender-occupational stereotypes is crucial in the realm of artificial intelligence and natural language processing. While it’s valid that certain tasks may have historical or cultural associations with specific genders, it’s imperative to ensure that these biases don’t inadvertently influence AI system outcomes. Our methodology, focused on the Wino Bias gender-occupational-stereotype test, offers a comprehensive mechanism to evaluate and ensure that AI models, like bert-base-uncased and roberta-base, do not perpetuate these stereotypes. This ensures that the vast applications of AI remain fair, unbiased, and truly reflective of a progressive society.

Automating Responsible AI: Integrating Hugging Face and LangTest for More Robust Models

In the ever-evolving landscape of natural language processing (NLP) in biomedical, healthcare, and other fields, staying at the forefront of innovation is...