Meet us at Intelligent Health UK on the 24th & 25th of May. Schedule a meeting here.
was successfully added to your cart.

Can We Trust ChatGPT and LLMs in Information Retrieval Tasks?

LLMs are great productivity tool. They don’t lie, and they don’t tell the truth either. They create plausible sentence continuations, some turn out to be true, some don’t; they can’t tell the difference.


ChatGPT excels in reasoning task, but fails on information retrieval tasks

Driven by advancements in scaling, LLMs have showcased their capacity to carry out a wide range of natural language processing (NLP) tasks in a zero-shot manner, without requiring adaptation to downstream data. Nevertheless, it remains unclear whether ChatGPT can function as a versatile model capable of handling numerous NLP tasks in a zero-shot setting. In a recent study, researchers conduct an empirical examination of ChatGPT’s zero-shot learning capabilities, evaluating its performance on 20 prominent NLP datasets spanning 7 representative task categories. The comprehensive empirical investigations reveal that ChatGPT excels in tasks that demand reasoning skills (e.g., arithmetic reasoning) but struggles with specific tasks like sequence tagging (i.e. NER).

Interestingly, the zero-shot performance of ChatGPT and GPT-3.5 on CoNLL03, a popular named entity recognition dataset (entities: PER, LOC, ORG, MISC), leaves much to be desired. Their overall performance appears to be quite comparable, yet both models fail to deliver satisfactory results for each named entity type when contrasted with earlier fine-tuning approaches. The average accuracy of ChatGPT on this simple NER task is 53.7% while Spark NLP NER model achieves > 98% on the very same dataset !

This indicates that despite being considered generalist models, current LLMs still struggle with challenges in addressing specific tasks, such as named entity recognition.

This chart compares the performance of ChatGPT, GPT-3.5, and models fine-tuned using task-specific data across 20 distinct datasets. For each reasoning dataset, the superior outcome between zero-shot and zero-shot chain-of-thought approaches is displayed. Evaluation metrics used are ROUGE-1/2/L average for SAMsum, F1 score for CoNLL03, and accuracy for the remaining datasets (source:

It is quite surprising that ChatGPT, despite its outstanding performance on professional and academic exams, struggles with a relatively simple NER task. GPT-4 demonstrates human-level performance on most of these exams, even passing a simulated Uniform Bar Examination with a score in the top 10% of test takers. The model’s proficiency in exams primarily stems from the pre-training process, with RLHF having little impact. For multiple-choice questions, both the base GPT-4 model and the RLHF model perform equally well on average across the exams tested.

This chart showcases the performance of GPT models on academic and professional exams, simulating the real exam conditions and scoring. GPT-4 surpasses GPT-3.5 on the majority of the exams tested (source:

If this is what human-level means and how human intelligence is measured with such exams, we may need to reconsider how we evaluate human intelligence.

However, ChatGPT still faces challenges in real-world use cases. This is not because it lacks intelligence, but rather because exam questions typically have deterministic answers already present in its database. While ChatGPT is exceptionally adept at presenting answers in a natural language format, its underlying knowledge retrieval process is similar to a Google Search. In reality, most actual cases in real-world settings deviate from classic examples.


Can we trust LLMs?

Now, lets flip the coin and talk about the truthfulness of these models. Even if the LLMs are proved to be producing human-like plausible texts, even better than a human baseline and rectifying prior errors based on follow-up conversations, recent research has raised concerns about the truthfulness of large language models (LLMs), particularly ChatGPT and GPT-4, suggesting that these models may be less accurate in comparison to their smaller counterparts.

Larger models are less truthful. In contrast to other NLP tasks, larger models are less truthful on TruthfulQA. Larger models do better on questions that exactly match the syntax of TruthfulQA but do not probe misconceptions (source:

One of the key findings is that LLMs, due to their vast training data, can learn and propagate human biases, misconceptions, myths, and misinformation with high confidence. This phenomenon is usually called ‘hallucinations’. It is crucial to remember that these models can only learn from the data they have been trained on, and if the training data contains inaccuracies or biases, the models will reflect those flaws.

The issue of alignment with human values is another concern surrounding LLMs. These models are generally not well-aligned with humans, as their primary goal is to generate coherent and contextually relevant text, rather than to prioritize truthfulness or ethical considerations.

How GPT-3’s answer changes with model size in a concrete example. The smallest model produces a true but uninformative answer. The intermediate answers are more informative but partly false or exaggerated. The largest model says something literally false, mimicking a human superstition (source:

TruthfulQA is a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises 817 questions that span 38 categories, including health, law, finance and politics.

GPT-4 demonstrates progress on public benchmarks like TruthfulQA, which assesses the model’s ability to distinguish factual statements from an adversarially-selected set of incorrect statements. While the GPT-4 base model shows only a marginal improvement over GPT-3.5 in this task, it exhibits significant enhancements after Reinforcement Learning from Human Feedback (RLHF) post-training. Although GPT-4 can resist selecting common sayings, it may still miss subtle details.

The GPT-4 official technical report includes a chart depicting the accuracy of different models on the TruthfulQA benchmark. Even with all the improvements coming along with RLHF (Reinforcement Learning from Human Feedback) and adversarial trainings, GPT-4 still answers over 40% of the questions incorrectly.

The y-axis shows accuracy, with higher values being better. The chart compares GPT-4 under zero-shot prompting, few-shot prompting, and after RLHF fine-tuning. With RLHF, GPT-4 reaches 58% accuracy, outperforming both GPT-3.5 (48% with RLHF) and Anthropic-LM (source:

GPT4 research site states the following limitations:

Despite its capabilities, GPT-4 has similar limitations as earlier GPT models. Most importantly, it still is not fully reliable (it “hallucinates” facts and makes reasoning errors). Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of a specific use-case.

Hallucinations in LLMs happen when the generated responses don’t accurately match the given context, lack supporting evidence, or stray from what we’d expect based on their training data. Some examples of these hallucinations include factual inaccuracies, where the LLM produces incorrect statements, unsupported claims that have no basis in the input or context, nonsensical statements that are unrelated to the context, and improbable scenarios describing implausible or highly unlikely events.

The last but not the least, the biggest limitation of GPT-4 is its lack of knowledge about events that occurred after September 2021, as well as its inability to learn from experience. It can make simple reasoning errors, accept false statements, and fail at complex problems — similar to humans. GPT-4 can also be confidently wrong in its predictions without double-checking its work.

Evaluating Large Language Models on Clinical & Biomedical NLP Benchmarks

The field of natural language processing (NLP) is rapidly advancing, and accurate clinical and biomedical NLP models are becoming increasingly important. In...