Register for the 5th NLP Summit, a Free Online Conference on Sep 24-26. Register now.
was successfully added to your cart.

Automatically Testing for Demographic Bias in Clinical Treatment Plans Generated by Large Language Models

What is Clinical Bias?

Clinical bias in LLM (Language Learning Models) refers to the unfair or unequal representation or treatment based on medical or clinical information. This means that if a model processes medical data and generates differing outputs for two identical clinical cases, purely because of some extraneous detail or noise in the data, then the model is exhibiting clinical bias. Such biases can arise due to the uneven representation of certain medical conditions, treatments, or patient profiles in the training data.

Addressing clinical bias in LLMs requires careful curation and balancing of training data, rigorous model evaluation, and continuous feedback loops with medical professionals to ensure outputs are medically sound and unbiased.

Why LangTest?

Pipeline from testing models to release without demographic bias.

LangTest: Deliver Safe and Effective Language Models

The surge in healthcare-specific machine learning models has presented a myriad of opportunities and challenges. Often, many of these models are incorporated into systems without a comprehensive assessment of their robustness, potential biases, or their aptness for real-world deployment. Such oversights only become evident post-deployment, leading to significant implications. This study introduces LangTest, an innovative open-source Python library crafted to empower developers in the assessment and enhancement of Natural Language Processing (NLP) models. LangTest offers a systematic approach to validate models against biases and perturbations like typos, varying textual cases, and more. Furthermore, it seamlessly integrates into deployment and model monitoring pipelines, ensuring models are primed for deployment. A distinctive feature is its capability to augment datasets and annotations effortlessly, addressing and rectifying prevalent issues. By adopting LangTest, developers can pave the way for the deployment of more reliable and unbiased healthcare NLP models, thereby fortifying the overall quality of healthcare applications.

You can find the LangTest library on GitHub, where you can explore its features, documentation, and the latest updates. Additionally, for more information about LangTest and its capabilities, you can visit the official website at

Demographic Bias in Clinical Models: An Exploration

Demographic bias has long been a topic of discussion and concern in many fields, including healthcare. Unfair representation or treatment based on factors like age, gender, race, and ethnicity can be detrimental and misleading in crucial domains, like medical diagnostics and treatment plans.

What is Demographic Bias?

Demographic bias refers to unequal representation or treatment based on demographic factors. When considering clinical domains, if a machine learning model suggests varied treatments for two patients solely due to their demographic details, even when they have the same medical condition, it’s exhibiting demographic bias.

But Isn’t Demographics Important in Medical Decisions?

Certainly! Medical treatments are often tailored according to a patient’s age, gender, and sometimes even race or ethnicity. For instance, treating a 70-year-old patient might differ considerably from a 10-year-old, even if they have the same ailment. Some diseases are even gender-specific, like breast cancer, or conditions related to male reproductive organs.

However, there are instances where the demographic data shouldn’t affect the suggested treatment. It’s in these specific scenarios that we need to ensure our models aren’t biased.

Our Experiment

To delve deeper, our team curated medical data files for three specialties: Internal Medicine, Gastroenterology, and Oromaxillofacial. Our primary criterion? The demographic details for a given diagnosis in these files should not impact the treatment plan.

Each file comprises three columns:

  • patient_info_A and patient_info_B: Contains details related to two hypothetical patients (e.g., age, gender, patient ID, employment status, marital status, race, etc.)
  • Diagnosis: The medical condition diagnosed for both patients.
Data for testing demographic bias in treatment

Given our curation, the treatment plans for both patients should ideally be identical despite their varied demographic details.

To test this, we fed the patient_info_A with the diagnosis to the Language Model (LLM) and requested a treatment plan. We then repeated this process using patient_info_B. This gave us two treatment plans — let’s call them treatment_plan_A and treatment_plan_B.

Comparing the Treatment Plans

The challenge now was to gauge if these treatment plans were truly similar. Enter sentence transformers! By converting each treatment plan into embeddings using a sentence transformer model, we could then use cosine similarity to measure their resemblance.

We set a similarity threshold, and based on whether the cosine similarity score surpassed this threshold, our test would either pass (suggesting the LLM didn’t exhibit demographic bias) or fail (indicating potential bias).

Understanding Langtest Harness Setup

Imagine you have a toolbox named “Harness”. In this toolbox, you need to fit in three essential tools to make it work: the task it should do, the model it should use, and the data it should work on.

1. Task (What is the job?)

Here, you’re telling the toolbox what kind of job it should perform. Since our focus is on medical stuff, we tell it to do a “clinical-tests”. But remember, this toolbox is versatile. It can do many other things too! It can recognize names (NER), sort texts (Text classification), detect harmful content (Toxicity), or even answer questions (Question-Answering).

Supported Tasks:

2. Model (Who should do the job?)

This is like choosing a worker from a team. You’re picking which expert, or “LLM model”, should do the job. It’s like selecting the best craftsman for a specific task.

Supported Models:

3. Data (On what should the job be done?)

Lastly, you need to provide the material on which the expert should work. In our case, since we’re dealing with medical tests, we have three special sets of information or “files”:

  • Medical-files: This is general information related to internal medicine.
  • Gastroenterology-files: Information specific to stomach-related issues.
  • Oromaxillofacial-files: Data about issues concerning the face and jaw.

Oh, and there’s a tiny detail: where are these experts from? It could be from “openai”, “azure-openai”, “cohere”, “ai21”, or “huggingface-inference-api”.

In Conclusion: Setting up the Harness is like preparing a toolbox for a job. You decide the task, choose your worker, and provide the material. And just like that, your toolbox is ready to work its magic!

Supported Data:

Testing in 3 lines of Code

!pip install "langtest[langchain,openai,transformers]"
import os
os.environ["OPENAI_API_KEY"] =
    from langtest import Harness
    harness = Harness(task="clinical-tests", 
                    model={"model": "text-davinci-003", "hub": "openai"},
                    data = {"data_source": "Medical-files"})
Testing Demographic bias in large language models

The report provides a comprehensive overview of our test outcomes using the Medical-files data, which comprises 49 entries. Out of these, 4 tests were unsuccessful, while the remaining passed. The default minimum pass rate is set at 70%; however, it’s adjustable based on the desired strictness of the evaluation.

For a more granular view of the treatment plans and their respective similarity scores, you can check the generated results harness.generated_results()

Below is an example of a passed (similar) test-case in which the treatment plans suggested for both the patients are similar.

Checking racial bias in treatment plans generated by LLM - positive example

Similar Treatment Plans

Below is an example of a failed (dissimilar) test-case in which the treatment plans suggested for both the patients are not similar.

Checking bias in treatment plans generated by LLM - negative example

Dissimilar Treatment Plans


Understanding and addressing demographic bias is pivotal in the medical field. While it’s true that individual characteristics can and should influence some treatment plans, it’s essential to ensure that biases don’t creep into areas where these demographics should have no bearing. Our approach offers a robust way to test and confirm that medical AI systems are offering unbiased suggestions, ensuring that all patients receive the best possible care.

Try Healthcare LLMs

See in action

A Leap Forward in Pre-Annotation with GPT-Powered Entity Recognition in NLP Lab

In the ever-evolving world of Natural Language Processing, staying ahead of the curve is paramount. With the release of NLP Lab 5.3,...