Evaluating Superhuman Models with Consistency Checks


Consider the following two questions:

Is this move good or bad?
Is this forecast accurate?

In both cases, the ground truth is not known to us humans. Furthermore, in both cases there either already exist superhuman AI systems (as in the case of chess), or researchers are actively working to get to a superhuman level (as in the case of forecasting/world modeling). The important question is of course:

How can we evaluate decisions made by superhuman models?


In a new paper, we propose consistency/metamorphic testing as a first step towards extending the evaluation frontier. We test:

  • a superhuman chess engine (Leela)

  • GPT-4 forecasting future events

  • LLMs making legal decisions


The first part of this post gives a short overview of the paper. The second part elaborates on the motivation, further ideas, and the relevance of this direction for AI safety.

Chess experiments

We test a recent version [1] of Leela Chess Zero. This is a superhuman model, trained with MCTS-based self-play in the same fashion as AlphaZero and KataGo.

The “win probability” of a given chess position is the key internal state the model uses to make decisions. Thus, we can expect this quantity to be consistent across semantically equivalent positions.

We find real-game positions played by strong human players where Leela fails to satisfy simple consistency checks:

  • it has different win probabilities in mirrored positions:

When mirroring a board, the win probability of Black in the first position must be the same as the win probability of White in the second position.

  • there are drastic changes in win probability after its recommended best move:

Leela Chess Zero only realizes that Bishop to f4 is a bad move after playing it.

To test further invariance properties, we generate random positions without pawns and check for consistency under board rotations and reflections:

A position without pawns or castling can be flipped or rotated without changing the semantics of the position. In this example, rotating the board counter-clockwise by 90° leads Leela to completely change its evaluation.
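To make this kind of check concrete, here is a minimal sketch of the mirror-consistency check using the python-chess library; `win_probability` is a hypothetical wrapper around the engine being evaluated (Leela’s actual setup is described in the paper’s appendix), not code from the paper.

```python
import chess

def win_probability(board: chess.Board) -> float:
    """Hypothetical wrapper: query the evaluated engine (e.g. Leela via UCI)
    for the side-to-move win probability of the position."""
    raise NotImplementedError

def mirror_consistency_violation(board: chess.Board) -> float:
    """A consistent evaluator gives the same side-to-move win probability to a
    position and its mirrored copy (board flipped vertically, colors and turn swapped)."""
    p_original = win_probability(board)
    p_mirrored = win_probability(board.mirror())  # python-chess: vertical flip + color swap
    return abs(p_original - p_mirrored)           # 0 for a perfectly consistent engine
```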

The case for working on evaluations now

Most current evaluations require either human-generated answers (almost all tests), a way to formally check for correctness (mathematical proofs, coding tasks), or relative comparisons with no objective truth (Elo ratings of chess engines). This is not likely to scale to superhuman models, which will appear in different domains on uncertain timelines. Waiting for broadly superhuman models before doing evaluation experiments is a bad idea. Working on it now gives time to experiment, fix pitfalls, and ultimately deploy more robust evaluation tools.

If a task satisfies the following two criteria, we can start testing right now:
(A) It is hard to verify the correctness/optimality of model outputs.
(B) There are “consistency checks” over related inputs that can falsify the correctness/robustness of the model.

We now test models on tasks where superhuman performance has not yet been reached, but the two criteria described above hold.

Forecasting experiments on GPT-4

Prediction market questions are a good example of a “no ground truth” task: there is no way to know the ground truth until the event resolves, and even then it is hard to evaluate the model’s performance. Consistency checks, however, are readily available to upper bound the model’s forecasting ability.

We show that GPT-4 is a very inconsistent forecaster:

  • its median forecasts for monotonic quantities are often non-monotonic over years;

  • it fails to assign complementary probabilities to an event and its negation, occasionally by a wide margin.

The median forecasts of GPT-4 for monotonic quantities are often non-monotonic. The plot shows normalized forecast values of GPT-4 for different monotonic series.
The result of asking GPT-4 to predict probabilities for events and their negation. A consistent predictor would have P(event) = 1 - P(¬event). However, the plot shows several large outliers.

On these simple checks, GPT-4, although still inconsistent, was a clear improvement over GPT-3.5-turbo. However, on more complex checks like our Bayes’ rule check, there is no clear improvement with GPT-4 yet.
For events like A = “Democrats win 2024 US presidential election” and B = “Democrats win 2024 US presidential election popular vote”, GPT-4’s predictions for P(A), P(B), P(A|B), and P(B|A) strongly [2] violate Bayes’ rule more than 50% of the time.
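To make the negation and Bayes’ rule checks concrete, here is a minimal sketch; the violation measures below are simple illustrative choices rather than the paper’s exact metrics, and the probabilities in the example are made up.

```python
def negation_violation(p_event: float, p_negation: float) -> float:
    """A consistent forecaster satisfies P(A) + P(not A) = 1; return the distance from that."""
    return abs(p_event + p_negation - 1.0)

def bayes_violation(p_a: float, p_b: float, p_a_given_b: float, p_b_given_a: float) -> float:
    """Bayes' rule requires P(A|B) * P(B) = P(B|A) * P(A); return the absolute mismatch."""
    return abs(p_a_given_b * p_b - p_b_given_a * p_a)

# Illustrative (made-up) forecasts for A = "Democrats win the election"
# and B = "Democrats win the popular vote":
print(negation_violation(p_event=0.55, p_negation=0.60))                         # ≈ 0.15
print(bayes_violation(p_a=0.55, p_b=0.60, p_a_given_b=0.90, p_b_given_a=0.70))  # ≈ 0.155
```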

The outcome of a court case is an example of a task where human society does not necessarily agree on the ground truth. We test GPT-3.5-turbo on whether an arrested individual should be released on bail, given their criminal history. The queries represent real legal cases extracted from the COMPAS dataset.

An example of GPT-3.5-turbo having inconsistent preferences: the defendant is only released on bail if they have more prior convictions.

Inconsistencies on this task are rarer than in the previous two experiments, likely due to the restricted nature of the input space.
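Here is a minimal sketch of the kind of pairwise counterfactual check behind the bail example: holding everything else fixed, adding prior convictions should never turn a “detain” decision into a “release”. The `bail_decision` function is a hypothetical wrapper around the LLM prompt and answer parser, and the defendant fields are illustrative.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Defendant:
    age: int
    charge: str
    prior_convictions: int

def bail_decision(defendant: Defendant) -> bool:
    """Hypothetical wrapper: prompt the LLM with the case description and parse a
    release (True) / detain (False) answer."""
    raise NotImplementedError

def monotonicity_violation(defendant: Defendant, extra_priors: int = 1) -> bool:
    """Flag the pair if the model releases the defendant only in the counterfactual
    where they have *more* prior convictions."""
    worse = replace(defendant, prior_convictions=defendant.prior_convictions + extra_priors)
    return (not bail_decision(defendant)) and bail_decision(worse)
```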

Our legal experiments demonstrate certain alignment failures: if an AI is inconsistent in this way on counterfactual cases, it means it fails to robustly encode human values.

Finding consistency failures more efficiently

As AI systems get smarter, consistency failures become rarer. Therefore, at a certain point, randomly sampling inputs to find consistency violations might become infeasible. So, does our method become useless with increasing model performance? We claim that not all hope is lost. There are several ways to search for consistency violations that are better than random sampling, drastically increasing the efficiency of our method.

The experiments so far require only black-box access to the evaluated model; this is a property worth preserving, since many state-of-the-art models are accessible only through APIs. We therefore first experiment with directed black-box search for consistency violations.

For the chess experiments, we use a genetic algorithm that randomly mutates and mates a population of board positions to optimize for boards with a high probability of consistency failure. On Leela, we are able to find up to 40x more strong consistency failures than by sampling random positions:

Percentage of strong consistency violations found:

  • Randomly sampling inputs: 0.01%

  • Directed black-box search: 0.4%
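Below is a minimal sketch of such a directed search. It reuses a violation scorer like `mirror_consistency_violation` from the chess sketch above, uses a single illustrative mutation operator (playing one random legal move), and omits crossover (“mating”) and the paper’s exact hyperparameters.

```python
import random
import chess

def random_mutation(board: chess.Board) -> chess.Board:
    """One simple mutation operator: play a random legal move from the position."""
    child = board.copy()
    moves = list(child.legal_moves)
    if moves:
        child.push(random.choice(moves))
    return child

def directed_search(seed_boards, score, generations=50, population_size=100):
    """Evolve a population of positions towards high consistency-violation scores."""
    population = list(seed_boards)
    for _ in range(generations):
        ranked = sorted(population, key=score, reverse=True)
        parents = ranked[: population_size // 2]          # keep the most inconsistent positions
        children = [random_mutation(p) for p in parents]  # explore positions near them
        population = parents + children
    return max(population, key=score)
```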

A much more appealing approach is to exploit the model’s internals to find input samples which violate consistency constraints. Gradients, in particular, have been shown to be an indispensable tool for finding adversarial examples. We show that even weaker forms of white-box access can already be used to find consistency violations more efficiently.

We consider the LEGAL-BERT model of Chalkidis et al., used to predict violations of the European Convention on Human Rights (ECHR). Given a legal case consisting of a list of individual case facts, we paraphrase a single case fact and let the model predict whether the two legal cases (the original and the one with a paraphrased fact) describe an ECHR violation.

The BERT model uses a self-attention layer to assign an importance weight to each individual case fact. This allows us to quickly find the approximate “crux” case fact.
By paraphrasing this fact instead of a random case fact from the case, we find orders of magnitude more consistency violations.

The probabilities that a given legal case and a randomly paraphrased version of it violate the European Convention on Human Rights. Red points represent cases where the model flipped its prediction from “No law violated” to “Law violated” or vice versa.
The probabilities that a given legal case and a version with the most important case fact paraphrased violate the European Convention on Human Rights. Red points represent cases where the model flipped its prediction from “No law violated” to “Law violated” or vice versa.
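A minimal sketch of this attention-guided variant is below; `fact_attention_weights`, `paraphrase`, and `predicts_violation` are hypothetical wrappers around the model’s attention weights, a paraphrasing model, and the classifier, respectively, not the paper’s implementation.

```python
from typing import Callable, List

def attention_guided_flip(
    facts: List[str],
    fact_attention_weights: Callable[[List[str]], List[float]],  # hypothetical: per-fact attention weights
    paraphrase: Callable[[str], str],                            # hypothetical: semantics-preserving rewriter
    predicts_violation: Callable[[List[str]], bool],             # hypothetical: model's ECHR-violation prediction
) -> bool:
    """Paraphrase the highest-attention ("crux") fact and report whether the prediction flips."""
    weights = fact_attention_weights(facts)
    crux = max(range(len(facts)), key=lambda i: weights[i])
    perturbed = list(facts)
    perturbed[crux] = paraphrase(facts[crux])
    return predicts_violation(facts) != predicts_violation(perturbed)
```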

Robustness in machine learning is an extremely well-studied topic, mostly in computer vision, where the bulk of the literature assumes ground truth labels. Recent work on jailbreaking language models is similar, with experiments assuming a way to distinguish good from bad outputs.

The most conceptually similar framework is metamorphic testing of complex software systems, which is prominently used in software engineering. The closest applications in machine learning have been k-safety testing [3] and contrast sets; the experiments in both of these papers are, however, on weaker models in settings with ground truth.

Scalable oversight of AI is a very active AI safety area, focusing on supervision and training of superhuman models in settings without ground truth. On the evaluation side, the closest idea is debate, specifically the cross-examination variant where a debater can probe parallel copies of the other debater for consistency.

For a longer overview of similar research, check the Related work section in the paper.

Motivation and behind-the-scenes thoughts

Caveats on behavioral testing

Evan Hubinger has argued that behaviorally detecting deceptive alignment is likely to be harder than avoiding it in the first place. Hence, understanding-based evals and mechanistic interpretability should get slight priority over behavioral testing like consistency checks.

We sort of sympathize with this view, and would switch to “cognitive” evaluations if those worked better. We do behavioral testing in the form of consistency checks because it works now. Moreover, the two testing approaches might be intertwined in positive ways: in our experiments on legal decisions, we show that we can elicit inconsistencies using the model’s internal state. [4]

Effective consistency checks: extract high-fidelity projections of the world model

In addition to the properties (A) and (B) above, our chess and forecasting tasks have a helpful third property:
(C) the prior on the output space has high entropy.

This is not necessary for the evaluation to work, but it makes it easier to exhibit and measure inconsistency. Note that the output space of a task is not fixed; in chess, we choose the win probability as a good projection of the world model, instead of using e.g. the best move. [5]

We can use this as a general abstract method to create consistency checks for a task: find a variable that governs decision-making on the given task, query it from the model (or better, extract it from the model mechanistically), then check for consistency across different settings. We hope to apply this in practice concurrently with one of the follow-up directions.
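In code, this recipe might look like the following generic harness (an illustrative sketch; the transformation set, the extracted variable, and the tolerance are all task-specific choices, not fixed by the paper):

```python
from typing import Callable, List, Tuple, TypeVar

Input = TypeVar("Input")

def consistency_violations(
    inputs: List[Input],
    transformations: List[Callable[[Input], Input]],  # semantics-preserving maps of the input
    extract_variable: Callable[[Input], float],       # decision-governing quantity queried from the model
    tolerance: float = 0.05,
) -> List[Tuple[Input, Callable[[Input], Input]]]:
    """Return (input, transformation) pairs on which the extracted variable is inconsistent."""
    violations = []
    for x in inputs:
        reference = extract_variable(x)
        for transform in transformations:
            if abs(extract_variable(transform(x)) - reference) > tolerance:
                violations.append((x, transform))
    return violations
```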

Controlling for randomness

Highly stochastic outputs are inherently unreliable; hence, for the purposes of evaluating high-stakes superhuman models, we believe it is fair to consider random outputs as inconsistent. Nevertheless, in tests where outputs are not deterministic (OpenAI models), we control for randomness by sampling multiple times and taking the median. We also run a simple baseline forecasting experiment and find that randomness accounts for only about 20% of the consistency violations; see Appendix C.3.2 of the paper for more.
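A minimal sketch of this control; `forecast_once` is a hypothetical wrapper around a single (stochastic) model query, and the number of samples is illustrative.

```python
import statistics
from typing import Callable

def robust_forecast(question: str, forecast_once: Callable[[str], float], n_samples: int = 5) -> float:
    """Query the (stochastic) model several times and take the median as its forecast."""
    return statistics.median(forecast_once(question) for _ in range(n_samples))
```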

Consistency checks to find non-robustness

We endorse the philosophy advocated by Adam Gleave and others: unless we pivot to fault-tolerant methods, robustness needs to be solved around the same time as transformative AI arrives. Many research agendas for scalable oversight involve an AI supervising or rewarding another AI; leaving vulnerabilities in the overseer is not an option. Similarly, non-robust alignment may let malicious actors take control of powerful AI systems.

The tension increasingly apparent in robustness research nowadays is that:

  • attacks are becoming harder to find; [6]

  • there is little evidence models have gotten more robust.

If everyone agrees models are not robust, exhibiting concrete attacks might not seem very useful right now. However, finding the precise ways in which robustness (and hence generalization) fails is important for proactively debugging potential issues. Bugs that are hard to find but easy to exploit are an obstacle to both the development of safe AI and its deployment.

Consistency checks may act as a guiding light for adversarial attacks.
The simplest checks can indeed be phrased as robustness to certain semantics-preserving transformations. For more complex checks, a model failing to be consistent on a given set of inputs indicates issues in that part of the input space. It would be very cool to find a way to guide the search for an adversarial policy in chess using the positions where the model is locally inconsistent.

Oversight of superhuman AI

Jan Leike and the OpenAI alignment team emphasize the following question:

How do we align systems on tasks that are difficult for humans to evaluate?

There are two different issues here: improving generalization of alignment to tasks beyond the regime where evaluation is easy, and advancing the evaluation frontier in general, especially on properties relevant to extreme risks. Our work is a small step towards the latter.

Let’s focus on a minimally agentic AI paradigm, such as AI scientists. We could interact with those much as we currently query GPT-4 on forecasting questions. A good superhuman AI scientist:

  • sources its decisions from an accurate world model;

  • correctly reasons from it to produce an answer;

  • honestly reports its answer and its uncertainty.

Failed consistency checks are a sign that some step in the above is broken.
Now, the reason why GPT-4 fails our tests is almost certainly not hidden goals making it lie, but failures in its world model and reasoning.

But as AI systems develop better and better world models and reasoning abilities, failures of consistency will both get rarer and become a better signal of lying.
Interactive consistency checks, in the form of interrogation or some versions of debate, may be especially useful for making it harder for an AI to lie to us.

Multiple people have said that something in this work reminded them of Discovering Latent Knowledge in Language Models Without Supervision. Both papers relate to model truthfulness, and both use the same operationalization of truth: a model’s consistency with itself.

However, the similarity ends there: our paper mostly does behavioral testing, while theirs uses the internal model state to elicit latent knowledge.

As transformative AI happens, consistency checks can be a natural evaluation strategy on moral or legal questions, where we can’t agree on what is correct.

Our common moral intuitions are finetuned on the world we live in and the worlds our ancestors lived in. If the world changes drastically in a short amount of time, we might not be so confident in our preferences anymore.
However, as the bail example shows, simple moral rules generalize much further when used to check consistency between different decisions than when used to check the correctness of a single decision.

Future work

There are two clear directions for future work:

Interactive checks probing for consistency, using another model as an interrogator.

This can be taken in the direction of debate: how do we detect lying in a strong model, preferably using a weaker model we trust? However, there are ideas in safety-agnostic evaluation as well: automated cross-examination can be used to falsify factual claims on tasks where ground truth is unavailable.

A bigger, better static benchmark for language models.

The tests can go much deeper than semantic invariance; our Bayes’ rule check is just a glimpse of how complex the equalities and inequalities defining a consistent world model can get. Consistency checks also need not rely on logical consistency alone; they can incorporate a prior on the world model and return an implausibility score.

We think that a consistency benchmark in the spirit of our forecasting checks could become a standard tool in the evaluation toolbox for powerful AI models.

  1. ^

    We make some minor adjustments to the default settings, to account for hardware and force deterministic behavior. See Appendix B of the paper for details.

  2. ^

    We introduce a violation metric in [0, 1] for each consistency check type, and say a check is strongly violated if its violation exceeds the value that a negation check with P(event) + P(¬event) = 80% would receive. Determining “equivalent” thresholds for different consistency checks is a design decision, and further work on consistency benchmarks should resolve this in a principled way.

  3. ^

    The k-safety paper actually runs an experiment on the COMPAS dataset similar to ours.

  4. ^

    These experiments, however, are on a LEGAL-BERT encoder-only model from 2020, hence not necessarily transferable to models such as GPT-4.

  5. ^

    Recommended moves are a bad choice for consistency checks, for multiple reasons. Two radically different position evaluators can easily choose the same move; conversely, a minuscule change in position evaluation might cause a switch between two equally good moves.

  6. ^

    Two examples: (1) the cyclic exploit from Adversarial Policies Beat Superhuman Go AIs took a lot of computation to find, even in the most favorable white-box attack setting;

    (2) gradient-based attacks on language models tend to perform poorly, and we currently rely on human creativity to find good attacks. A few days ago, Zou et al. produced a notable exception, but it looks much more involved than adversarial attacks in computer vision, and if defenses are applied we might be back to square one.