Can We Trust ChatGPT and LLMs in Information Retrieval Tasks?

17.04.2023

Veysel Kocaman

LLMs for information retrieval are exceptional tools for enhancing productivity but inherently lack the ability to differentiate between truth and falsehood and can give “hallucinates” facts. They excel at producing coherent sentence continuations, but determining the factual accuracy of these outputs is a challenge.
In this article, we will provide insights into the significance of understanding hallucinations in large language models (LLMs) for the effective implementation of these models. We will also discuss the issues surrounding truthfulness in LLM-generated results and give the points to mitigate hallucination in these models.

ChatGPT excels in reasoning task, but fails on information retrieval tasks

Driven by advancements in scaling, LLMs have showcased their capacity to carry out a wide range of natural language processing (NLP) tasks in a zero-shot manner, without requiring adaptation to downstream data. Is this a problem for LLMs for information retrieval?

Nevertheless, it remains unclear whether ChatGPT can function as a versatile model capable of handling numerous NLP tasks in a zero-shot setting. In a recent study, researchers conduct an empirical examination of ChatGPT’s zero-shot learning capabilities, evaluating its performance on 20 prominent NLP datasets spanning 7 representative task categories. The comprehensive empirical investigations reveal that ChatGPT excels in tasks that demand reasoning skills (e.g., arithmetic reasoning) but struggles with specific tasks like sequence tagging (i.e. NER).

Interestingly, the zero-shot performance of ChatGPT and GPT-3.5 on CoNLL03, a popular named entity recognition dataset (entities: PER, LOC, ORG, MISC), leaves much to be desired. Their overall performance appears to be quite comparable, yet both models fail to deliver satisfactory results for each named entity type when contrasted with earlier fine-tuning approaches. The average accuracy of ChatGPT on this simple NER task is 53.7% while Spark NLP NER model achieves > 98% on the very same dataset !

This indicates that despite being considered generalist models, current LLMs still struggle with challenges in addressing specific tasks, such as named entity recognition.

The chart with comparison the performance of ChatGPT, GPT-3.5, and fine-tuning models to illustrate the possibility of hallucinations in the LLM.

This chart compares the performance of ChatGPT, GPT-3.5, and models fine-tuned using task-specific data across 20 distinct datasets. For each reasoning dataset, the superior outcome between zero-shot and zero-shot chain-of-thought approaches is displayed. Evaluation metrics used are ROUGE-1/2/L average for SAMsum, F1 score for CoNLL03, and accuracy for the remaining datasets (source: https://arxiv.org/pdf/2302.06476.pdf)

It is quite surprising that ChatGPT, despite its outstanding performance on professional and academic exams, struggles with a relatively simple NER task. GPT-4 demonstrates human-level performance on most of these exams, even passing a simulated Uniform Bar Examination with a score in the top 10% of test takers. The model’s proficiency in exams primarily stems from the pre-training process, with RLHF having little impact. For multiple-choice questions, both the base GPT-4 model and the RLHF model perform equally well on average across the exams tested.

GPT passes exams: comparison of perfomance GPT 3, GPT 3.5, GPT 4.

This chart showcases the performance of GPT models on academic and professional exams, simulating the real exam conditions and scoring. GPT-4 surpasses GPT-3.5 on the majority of the exams tested (source: https://arxiv.org/pdf/2303.08774.pdf)

If this is what human-level means and how human intelligence is measured with such exams, we may need to reconsider how we evaluate human intelligence.

However, ChatGPT still faces challenges in real-world use cases. This is not because it lacks intelligence, but rather because exam questions typically have deterministic answers already present in its database. While ChatGPT is exceptionally adept at presenting answers in a natural language format, its underlying knowledge retrieval process is similar to a Google Search. In reality, most actual cases in real-world settings deviate from classic examples.

Can we trust LLM information retrieval?

Now, lets flip the coin and talk about the truthfulness of these models. Even if the LLMs are proven to be producing human-like plausible texts, even better than a human baseline and rectifying prior errors based on follow-up conversations, recent research has raised concerns about the truthfulness of large language models (LLMs), particularly ChatGPT and GPT-4, suggesting that in information retrieval tasks, these larger models may be less accurate compared to their smaller counterparts and give “hallucinates” facts.

The chart with comparing the truthfulness of different large language models.

Larger models are less truthful. In contrast to other NLP tasks, larger models are less truthful on TruthfulQA. Larger models do better on questions that exactly match the syntax of TruthfulQA but do not probe misconceptions (source: https://arxiv.org/pdf/2109.07958.pdf)

One of the key findings is that LLMs in information retrieval tasks, due to their vast training data, can learn and propagate human biases, misconceptions, myths, and misinformation with high confidence. This phenomenon is usually called ‘hallucinations’. It is crucial to remember that these models can only learn from the data they have been trained on, and if the training data contains inaccuracies or biases, the models will reflect those flaws.

The issue of alignment with human values is another concern surrounding LLM information retrieval. These models are generally not well-aligned with humans, as their primary goal is to generate coherent and contextually relevant text, rather than to prioritize truthfulness or ethical considerations.

GPT 3 model answers depending on model size: comparison of 350M, 1.3B, 6.7B, and 175B.

How GPT-3’s answer changes with model size in a concrete example. The smallest model produces a true but uninformative answer. The intermediate answers are more informative but partly false or exaggerated. The largest model says something literally false, mimicking a human superstition (source: https://arxiv.org/pdf/2109.07958.pdf)

TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics.

GPT-4 demonstrates progress on public benchmarks like TruthfulQA, which assesses the model’s ability to distinguish factual statements from an adversarially-selected set of incorrect statements. While the GPT-4 base model shows only a marginal improvement over GPT-3.5 in this task, it exhibits significant enhancements after Reinforcement Learning from Human Feedback (RLHF) post-training. Although GPT-4 can resist selecting common sayings, it may still miss subtle details.

The GPT-4 official technical report includes a chart depicting the accuracy of different models on the TruthfulQA benchmark. Even with all the improvements coming along with RLHF (Reinforcement Learning from Human Feedback) and adversarial trainings, GPT-4 still answers over 40% of the questions incorrectly.

TruthfulQA benchmark: GPT-4 still answers over 40% of the questions incorrectly.

The y-axis shows accuracy, with higher values being better. The chart compares GPT-4 under zero-shot prompting, few-shot prompting, and after RLHF fine-tuning. With RLHF, GPT-4 reaches 58% accuracy, outperforming both GPT-3.5 (48% with RLHF) and Anthropic-LM (source: https://arxiv.org/pdf/2303.08774.pdf)

GPT4 research site states the following limitations:

Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors). Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of a specific use-case.

Hallucinations in LLM information retrieval happen when the generated responses don’t accurately match the given context, lack supporting evidence, or stray from what we’d expect based on their training data. Some examples of these hallucinations include factual inaccuracies, where LLMs in information retrieval tasks produce incorrect statements, unsupported claims that have no basis in the input or context, nonsensical statements that are unrelated to the context, and improbable scenarios describing implausible or highly unlikely events.

Last but not least, the biggest limitation of GPT-4 is its lack of knowledge about events that occurred after September 2021, as well as its inability to learn from experience. It can make simple reasoning errors, accept false statements, and fail at complex problems — similar to humans. GPT-4 can also be confidently wrong in its predictions without double-checking its work.

Mitigating Hallucination in LLMs for information retrieval

Mitigating Hallucination in LLMs for information retrieval involves using low-hallucination models (e.g. JohnSnowLabs’ models used for chatbots) and utilizing several key strategies.

Diversify training data to expose models to a broader range of factual information.
Implement fact-checking modules to verify accuracy and encourage models to express uncertainty when unsure.
Fine-tune LLMs for domain-specific contexts, use filtering and post-processing to identify inaccuracies and engage human reviewers for feedback.
Employ curriculum learning to gradually improve model accuracy and regularly update models with the latest information.
Establish ethical guidelines, educate users, and benchmark model performance for continuous improvement.

These steps collectively enhance the reliability of LLM-generated information in information retrieval tasks.
You can also contact our experts to get effective LLMs for information retrieval that you can trust.

Try Healthcare LLMs

See in action

Veysel Kocaman

Our additional expert:

Veysel is the Chief Technology Officer at John Snow Labs, improving the Spark NLP for the Healthcare library and delivering hands-on projects in Healthcare and Life Science. Holding a PhD degree in ML, Dr. Kocaman has authored more than 25 papers in peer reviewed journals and conferences in the last few years, focusing on solving real world problems in healthcare with NLP. He is a seasoned data scientist with a strong background in every aspect of data science including machine learning, artificial intelligence, and big data with over ten years of experience. Veysel has broad consulting experience in Statistics, Data Science, Software Architecture, DevOps, Machine Learning, and AI to several start-ups, boot camps, and companies around the globe. He also speaks at Data Science & AI events, conferences and workshops, and has delivered more than a hundred talks at international as well as national conferences and meetups.

Evaluating Large Language Models on Clinical & Biomedical NLP Benchmarks

Ida Lucente

The field of natural language processing (NLP) is rapidly advancing, and accurate clinical and biomedical NLP models are becoming increasingly important. In...