Aleksei Rozhkov

1 Introduction

Hallucinations are one of the key issues preventing wider adoption of LLMs in products [3] [4] [10].

Current LLMs undergo an alignment phase in which they are fine-tuned to score highly on “helpfulness”, “truthfulness” and “harmlessness” [13] [1].

However, the details of the alignment process make it unlikely to actually ensure the “truthfulness” of generations; what appears to be optimised instead is their “convincingness”.

Typically, during the last stage of the alignment process the LLM is fine-tuned with an RL algorithm such as PPO [7] to maximise some reward.

The latency of human feedback makes it impractical to use it directly during RL training, so a preference model is instead trained on human feedback and used as a proxy.

Such a preference model is unlikely to have better factual knowledge than the base model, since it is typically initialised from the weights of the base pre-trained model and fine-tuned on pairs of generated answers to predict which one a human preferred; as a result, it may rely on the tone of the response and other superficial cues rather than on factual correctness.
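As an illustration of this pairwise fine-tuning step, a minimal sketch of a Bradley-Terry style preference loss is given below; the tensor shapes and data are purely illustrative and do not correspond to any particular implementation.

```python
# Minimal sketch of the pairwise preference (reward-model) objective described above.
# Shapes and data are illustrative only.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: maximise P(chosen answer is preferred over rejected one)."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative usage with scalar rewards produced by a reward head on top of the base LLM.
reward_chosen = torch.randn(8, requires_grad=True)    # rewards for human-preferred answers
reward_rejected = torch.randn(8, requires_grad=True)  # rewards for rejected answers
loss = preference_loss(reward_chosen, reward_rejected)
loss.backward()
```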

Even if actual humans are used instead of the reward model, it is quite possible that they would not be able to reliably evaluate the truthfulness of various statements (unless each answer is subjected to an extensive fact-checking procedure), leading us to believe that such an approach to alignment is insufficient to ensure the truthfulness of the model.

2 Method

In order to properly optimise for truthfulness, the model must have some way to transparently communicate that it is not able to answer correctly without incurring a reward penalty, or, more generally, to reflect its confidence in the generated text, which is closely related to the task of probability calibration.
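To make the connection with calibration concrete, the sketch below computes the standard expected calibration error (ECE) over binned confidences; the bin count and data are arbitrary and serve only as an illustration.

```python
# Illustrative sketch of expected calibration error (ECE); bin count and data are arbitrary.
import numpy as np

def expected_calibration_error(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Weighted average gap between stated confidence and observed accuracy per bin."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return float(ece)

# Example: confidences reported by a model vs. whether each answer was actually correct.
conf = np.array([0.9, 0.8, 0.95, 0.6, 0.3])
correct = np.array([1, 1, 0, 1, 0])
print(expected_calibration_error(conf, correct))
```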

Unlike approaches that rely on external models for confidence estimation and calibration, we believe that the model performing the generation might already have an internal representation related to confidence or “truthfulness” [14] [15], and if the model is given a way and an incentive to communicate it to the user, this can be done without additional overhead.

One way to achieve this would be to augment the model with an auxiliary output reserved for returning the confidence in the generated text. However, this might make the comparison with the baseline less fair due to the increased number of parameters, although in practice the added parameters might be insignificant compared to the total number of parameters in the baseline model (billions for modern LLMs).
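A hypothetical sketch of such an auxiliary confidence head is shown below; the module name, hidden size, and pooling choice are assumptions made for illustration only, and this is not the approach adopted in this work.

```python
# Hypothetical sketch of the auxiliary-output alternative discussed above
# (not the approach adopted in this work). Names and sizes are illustrative.
import torch
import torch.nn as nn

class ConfidenceHead(nn.Module):
    """Maps the hidden states of the generated sequence to a scalar confidence."""
    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)

    def forward(self, last_hidden_state: torch.Tensor) -> torch.Tensor:
        # Use the final token's hidden state as a summary of the generated answer.
        return torch.sigmoid(self.proj(last_hidden_state[:, -1, :])).squeeze(-1)

# Example: batch of 2 sequences, 16 tokens, hidden size 4096.
head = ConfidenceHead()
hidden = torch.randn(2, 16, 4096)
print(head(hidden))  # per-sequence confidence in [0, 1]
```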

Another approach is to include the degree of confidence directly in the generated text. This has the downside of added complexity: the generated text must be parsed into the body of the answer and the stated degree of confidence, and all generations must follow a format that makes such parsing possible. However, it avoids changing the structure of the model, minimising the number of adjustments needed to integrate such a model into existing frameworks. This is the approach we choose to explore in this work.
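A minimal parsing sketch for this approach is given below; the trailing "Confidence:" field followed by a number is an assumed output format used purely for illustration, not necessarily the exact format adopted in this work.

```python
# Hypothetical parsing sketch for the verbalised-confidence approach described above.
# The trailing "Confidence: 0.NN" output format is an assumption for illustration.
import re
from typing import Optional, Tuple

CONF_PATTERN = re.compile(r"\s*Confidence:\s*([01](?:\.\d+)?)\s*$", re.IGNORECASE)

def split_answer_and_confidence(generation: str) -> Tuple[str, Optional[float]]:
    """Return (answer body, confidence), or (full text, None) if the format is not followed."""
    match = CONF_PATTERN.search(generation)
    if match is None:
        return generation.strip(), None
    return generation[: match.start()].strip(), float(match.group(1))

print(split_answer_and_confidence("Paris is the capital of France. Confidence: 0.95"))
# -> ('Paris is the capital of France.', 0.95)
```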

Two key contributions of this work are: