Procedurally evaluating factual accuracy: a request for research

I am grateful to Daniel Kokotajlo, Beth Barnes and John Schulman for feedback on this post.

The purpose of this post is to request research on the design of precise procedures for evaluating how factually accurate pieces of text are. This stands out to me as an area that is potentially valuable for reducing risks from advanced AI, while not requiring detailed knowledge of ML.

Problem statement

Suppose that you are given:

  • A real-world setting in which text appears, such as “Google search snippets”, or “OpenAI API outputs”, or “arXiv abstracts”

  • A collection of contexts in that setting, such as “the search queries made in 2021”, or “the questions from the ELI5 dataset”, or “all existing papers on the arXiv”

  • A text distribution for each context, such as “actual current Google search snippets”, or “responses produced by an ML model”, or “the abstract a certain person would write after reading the paper”

The problem is to define a procedure that takes in a context and a piece of text from the given distribution, and outputs a numeric score for factual accuracy.
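
Concretely, the procedure can be thought of as a function with roughly the following shape. This is a minimal illustrative sketch; the type and function names are mine, not part of any existing system:

```python
from dataclasses import dataclass

@dataclass
class Context:
    setting: str  # e.g. "Google search snippets"
    prompt: str   # e.g. a particular search query or question

def factual_accuracy_score(context: Context, text: str) -> float:
    """Carry out the written procedure and return a numeric score.

    In practice this stands for a protocol that human evaluators follow,
    not an automated check; defining that protocol is the research problem.
    """
    raise NotImplementedError
```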

The procedure should have the following properties:

  • It is feasible for humans to carry out.

  • It is as unambiguous as possible: different people following the same procedure should end up with the same score, except perhaps for a minimal number of “irreducible” judgment calls.

  • It can be optimized to produce good outcomes: to the extent that factual accuracy matters in the given setting, selecting text that scores highly under the procedure should lead to better outcomes than selecting text using alternative procedures.

    • Included in this is the effect that knowledge of the procedure could have, e.g. people will have more trust in the text if they understand and endorse the procedure.

    • In practice, whether good outcomes are produced will depend on the method and extent of optimization (e.g., Goodhart’s law will eventually kick in), and on criteria other than factual accuracy. This makes the problem statement a little fuzzy.

Note that the problem statement doesn’t mention ML models (except as examples). The problem is of course motivated by ML models, but I think it can be studied relatively independently of ML, at least initially.

Motivation

The main motivation for this request is that the problem arises very naturally when attempting to train truthful LMs. The most straightforward way to optimize the factual accuracy of a language model is to have humans evaluate the factual accuracy of model outputs, and then to optimize those evaluations using techniques like reinforcement learning. The procedure needs to be unambiguous because label noise hurts both ML training and labeler monitoring (and unambiguity has other benefits besides). Subject to this constraint, the main criterion for the procedure should be that it produces good outcomes in the given real-world setting.
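
As a concrete (and deliberately toy) illustration of this training loop, here is a sketch in which every function is a hypothetical placeholder rather than a reference to any real API:

```python
import random

def sample_model_output(context: str) -> str:
    # Placeholder: in practice, sample from the language model being trained.
    return f"model answer to: {context}"

def human_evaluation(context: str, text: str) -> float:
    # Placeholder: a human follows the written procedure and returns a score.
    return random.random()

def update_model(context: str, text: str, score: float) -> None:
    # Placeholder: make high-scoring text more likely, e.g. by fitting a
    # reward model to the human scores and running RL against it.
    pass

for context in ["Why is the sky blue?", "What causes inflation?"]:
    text = sample_model_output(context)
    score = human_evaluation(context, text)  # ambiguity here shows up as label noise
    update_model(context, text, score)
```

The point of the sketch is just to show where the evaluation procedure sits in the loop: any ambiguity in the human evaluation step feeds directly into training as label noise.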

My main reasons for thinking that this research could be important for reducing risks from advanced AI are:

  • Direct benefits. The research could result in procedures that are directly used to train more beneficial AI. The sorts of risks this could help mitigate are discussed in more depth in Truthful AI and Risks from AI persuasion.

  • Preparation for harder versions of the problem. I expect advanced AI systems to pose analogous but harder versions of this problem, requiring more advanced solutions in which, for example, AI systems assist with human evaluation. Solutions to the simpler problem can serve both as a starting point for more advanced solutions and as a baseline against which they can be compared, and working on it can help build relevant capacity.

  • Supporting ML work on truthful LMs. Currently, the responsibility for this research falls largely on ML researchers trying to improve factual accuracy, who are forced to choose some procedure for evaluating factual accuracy. Yet they may lack the expertise required to do this well.

I do think that this research is a gamble, in the sense that the details of evaluating factual accuracy may not end up mattering very much, perhaps because a wide range of procedures are good enough to avoid the very worst outcomes, and the important bottlenecks are elsewhere. That being said, I think we’ll be in a better position to evaluate those arguments once we’ve given the research more of a try.

In some sense, the research can be thought of as a very special case of trying to specify more precisely what humans value. However, compared to more general research on that question, I think the specific research has a number of advantages:

  • It’s much more closely tied to ML practice, making it more likely to be used.

  • It’s focused on truthfulness, which seems especially productive to focus on as an alignment goal (see: Why focus on negligent falsehoods?).

  • It’s very specific, and therefore less likely to hinge on highly politicized questions.

Research directions

Here are two important examples of existing research that begin to tackle this problem from different ends of a spectrum:

  1. Truthful AI (theoretical). An important concept introduced by this research is that of negligent falsehoods: statements that are unacceptably likely to be false, and where it should have been feasible for an AI system to understand this. In Section 2.2, a high-level procedure for evaluating whether a statement is a negligent falsehood is proposed. However, the procedure would need to be made much more precise in order to be used in practice.

  2. WebGPT (practical). This research essentially proposes a solution to the problem for the specific setting of an AI system that browses the web to answer questions, taking contexts from the ELI5 dataset. The full procedure is somewhat involved, and is described in great detail in this Google doc. It involves cross-referencing the answer with sources found during browsing. However, the research does not seek to provide much justification for this procedure.
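
To give a flavor of what “cross-referencing the answer with sources” might look like when written down precisely, here is a highly simplified sketch. The actual WebGPT guidelines are far more detailed, and the steps below would be carried out by human evaluators following written instructions, not by code; the functions are hypothetical stand-ins:

```python
def split_into_claims(answer: str) -> list[str]:
    # Placeholder: in practice, human evaluators identify the factual claims.
    return [s.strip() for s in answer.split(".") if s.strip()]

def is_supported(claim: str, sources: list[str]) -> bool:
    # Placeholder: a human judges whether any collected source supports the claim.
    return any(claim.lower() in source.lower() for source in sources)

def accuracy_score(answer: str, sources: list[str]) -> float:
    # Score the answer by the fraction of its claims that are supported.
    claims = split_into_claims(answer)
    if not claims:
        return 1.0
    supported = sum(is_supported(claim, sources) for claim in claims)
    return supported / len(claims)
```

Even this toy version makes some of the design questions visible: how claims are individuated, what counts as support, and how unsupported claims should be weighted.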

I think that it could be productive to push harder from either end of this spectrum. On the theoretical side:

  • What are the main principles around which procedures should be designed?

  • How important is it for the procedure to be transparent?

  • How should the procedure depend on the particular real-world setting? For example, in a particular academic discipline, how should the procedure relate to the epistemic standards typically employed by that discipline?

  • What duties of care do different commercial systems have to avoid providing users with inaccurate information, and what would ideal norms around this look like?

On the practical side:

  • How could the WebGPT procedure be improved, considering more carefully the likely impact of different answers on users? Empirical data on how users are likely to make use of the answers to questions seems important here.

  • Would using a simpler procedure be preferable, accounting for improved transparency, and what would such a procedure look like?

  • How should the procedure be adapted to different contexts, such as more controversial questions?

  • How should evaluation be performed in a setting in which the model does not provide citations?

An instructive exercise is to browse some of WebGPT’s answers, and to consider how one might evaluate their factual accuracy without the given sources (but potentially collecting new sources as part of the procedure). Even for factual topics, there can often be vague, subjective or holistic claims, which can be very hard to evaluate without either relevant expertise or a direct confirmation/refutation from a reliable source.

Another very relevant line of existing research is the exploration of debate using human judges and debaters, with a view to having AI systems play the roles of the debaters. Current AI systems are not yet capable enough for these schemes to be practical, but it is good to be thinking ahead, and there could also be shorter-term takeaways.

There is probably a lot more research in philosophy and the social sciences that is also relevant. Wikipedia’s verifiability policy seems closely related, and is well-studied. There is even an entire field of applied epistemology. However, most of this research has not yet been made accessible to ML researchers working on factual accuracy. There could therefore be some low-hanging fruit in digesting some of this work appropriately.

Who to align to

A closely related question that often comes up is “who to align to”: specifically, if there is some ambiguity in the procedure, who should be asked to make those judgment calls? For example, people of different political persuasions will often evaluate politically-sensitive statements differently.

I expect this question to eventually become an important part of the problem, but I’d be inclined to begin by focusing on procedure design, for a few reasons:

  • Per unit effort, I expect improving procedures to have a bigger short-term impact on the factual accuracy of AI systems, since most statements are not politically sensitive.

  • The question of who to align to is itself politically sensitive, which imposes costs on research.

  • Having a better understanding of the space of possible procedures will make it clearer what kind of “irreducible” judgment calls need to be made.

That being said, I’d still be excited to see work on this part of the problem, since it’s a thorny issue that seems closely tied to risks from AI persuasion.

It’s tempting to be pessimistic about whether we’ll be able to design procedures that people of different political persuasions can trust, given the current state of political discourse. But I think it might be feasible to design procedures that are much more broadly trusted than current institutions, for a couple of reasons:

  • Wikipedia, while not universally trusted, provides a proof of concept that clear rules can significantly enhance trust. Moreover, Wikipedia is constrained to procedures that a decentralized group of people can be incentivized to carry out, whereas the space of procedures that humans (perhaps with AI assistance) can carry out in general is much less constrained: such procedures can involve work that is boring or time-consuming, and need not cope with people trying to manipulate the procedure (vandalism and so on).

  • There’s a very wide search space of possible procedures. For example, part of the procedure could involve “masking out” certain political markers in order to make inferences objectively, or the procedure could be parameterized by “epistemic parameters” such as the weight given to different kinds of evidence, or a debate-like setup could be used for certain questions, etc.
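
As a toy illustration of the “epistemic parameters” idea, the procedure could expose its weighting of different kinds of evidence as explicit, inspectable numbers. The evidence categories and weights below are invented purely for the example:

```python
# Hypothetical weights on different kinds of evidence; not drawn from any real procedure.
EPISTEMIC_PARAMETERS = {
    "peer_reviewed_study": 1.0,
    "official_statistics": 0.9,
    "news_report": 0.5,
    "anonymous_blog_post": 0.2,
}

def weighted_support(evidence: list[tuple[str, bool]]) -> float:
    """Combine (evidence_kind, supports_claim) pairs into a single support score."""
    total = sum(EPISTEMIC_PARAMETERS[kind] for kind, _ in evidence)
    if total == 0:
        return 0.0
    supporting = sum(EPISTEMIC_PARAMETERS[kind] for kind, supports in evidence if supports)
    return supporting / total
```

One possible appeal of this kind of parameterization is that disagreements could be localized to a small set of explicit parameters rather than to opaque individual judgment calls.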

There are of course a number of obstacles in getting from procedures that are broadly trusted to working AI systems that are trusted to follow those procedures, but I think they are surmountable with enough effort. And even if it turns out to be impossible to design procedures that are universally trusted, there could still be significant benefits from improving procedures on the margin.

Working on this problem

It’s hard to convey exactly what kind of research I’d find most compelling in this area, and I’d be happy to chat to people who are considering working on this topic. Feel free to reach out to me at jhilton@openai.com.