Towards understanding-based safety evaluations

Thanks to Kate Woolverton, Ethan Perez, Beth Barnes, Holden Karnofsky, and Ansh Radhakrishnan for useful conversations, comments, and feedback.

Recently, I have noticed a lot of momentum within AI safety specifically, the broader AI field, and our society more generally, towards the development of standards and evaluations for advanced AI systems. See, for example, OpenAI’s GPT-4 System Card.

Overall, I think that this is a really positive development. However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC’s autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment.[1]

I often worry about situations where your model is attempting to deceive whatever tests are being run on it, either because it’s itself a deceptively aligned agent or because it’s predicting what it thinks a deceptively aligned AI would do. My concern is that, in such a situation, being able to robustly evaluate the safety of a model could be a more difficult problem than finding training processes that robustly produce safe models. For some discussion of why I think checking for deceptive alignment might be harder than avoiding it, see here and here. Put simply: checking for deception in a model requires going up against a highly capable adversary that is attempting to evade detection, while preventing deception from arising in the first place doesn’t necessarily require that. As a result, it seems quite plausible to me that we could end up locking in a particular sort of evaluation framework (e.g. behavioral testing by an external auditor without transparency, checkpoints, etc.) that makes evaluating deception very difficult. If meeting such a standard then became synonymous with safety, getting labs to actually put effort into ensuring their models were non-deceptive could become essentially impossible.

However, there’s an obvious alternative here, which is to build and focus our evaluations on our ability to understand our models rather than on our ability to evaluate their behavior. Rather than evaluating a final model, an understanding-based evaluation would evaluate the developer’s ability to understand what sort of model they got and why they got it. I think that an understanding-based evaluation is substantially more likely to actually be sufficient for safety here: rather than just checking the model’s behavior, we’re checking the reasons why we think we understand its behavior well enough to not be concerned that it’ll be dangerous.

It’s worth noting that understanding-based evaluations can—and, I think, should—go hand-in-hand with behavioral evaluations. The main way you’d want to make some sort of understanding-based standard happen would be to couple it with a capability-based evaluation, where the understanding requirements become stricter as the model’s capabilities increase. If we could get this right, it could channel a huge amount of effort towards understanding models in a really positive way.
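
As a rough illustration of what such a coupling might look like, here is a minimal sketch in Python. Everything in it is hypothetical: the thresholds, the scoring scale, and the assumption that either the capability score or the understanding score is something we currently know how to measure.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    capability_score: float     # from behavioral capability evaluations, normalized to [0, 1]
    understanding_score: float  # from a yet-to-be-formalized understanding evaluation, in [0, 1]


def required_understanding(capability_score: float) -> float:
    """Hypothetical schedule: the more capable the model, the more understanding is required."""
    if capability_score < 0.3:
        return 0.2   # weak models: minimal understanding required
    if capability_score < 0.7:
        return 0.6   # mid-range models: substantial understanding required
    return 0.95      # frontier models: near-complete understanding required


def passes_standard(result: EvalResult) -> bool:
    return result.understanding_score >= required_understanding(result.capability_score)


print(passes_standard(EvalResult(capability_score=0.8, understanding_score=0.7)))  # False
print(passes_standard(EvalResult(capability_score=0.2, understanding_score=0.3)))  # True
```

The point of the sketch is just the shape of the standard: the capability evaluation sets the bar, and the understanding evaluation has to clear it.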

Understanding as a safety standard also has the property that it is something that broader society tends to view as extremely reasonable, which I think makes it a much more achievable ask than many other plausible alternatives. I think ML people are often Stockholm-syndrome’d into accepting that deploying powerful systems without understanding them is normal and reasonable, but that is very far from the norm in any other industry. Ezra Klein in the NYT and John Oliver on his show have both recently emphasized this basic point: if we are deploying powerful AI systems, we should be able to understand them.

One of the main problems here, however, is purely technical: it’s very unclear what it would look like to be able to prove that you understand your model, which is obviously a major blocker for any attempt to build some sort of an evaluation around understanding. What we need, therefore, is some way of formalizing what it would mean to demonstrate that you understand some model. Some desiderata for what something like this would have to look like:

  • We want something relatively method-agnostic: we don’t want to bake in any particular way of producing model understanding, both because the state-of-the-art in techniques for understanding models might change substantially over time, and because not baking any particular approach in helps with getting people to accept an understanding-based standard. I want to be very explicit here: I’m really not asking for a mechanistic-interpretability-based standard (though that should certainly be one way of producing understanding)—any of the possible training rationales that I talk about here should count.

  • We need a standard that demonstrates a level of understanding that would actually be sufficient to catch dangerous failure modes. This is especially hard because it’s not clear we have any current positive examples of understanding a model well enough for this to be true. Ideally, we’d want something that scales with model capabilities, where the more capable the model is according to behavioral capability evaluations, the more understanding is required to demonstrate its safety.

Overall, working on producing a way of evaluating understanding that satisfies these desiderata seems like an extremely critical open problem to me. If we could channel the asks of society and the efforts of big AI labs into understanding models in a rigorous way, we could shape a lot more safety research than we have the ability to do ourselves, and point it directly at what I see as the core problem: that we don’t understand the models we are building.

I think it’s also worth pointing out that there are some existing techniques that currently seem insufficient to me here, but could potentially be used as a basis for something like this:

  • Causal scrubbing: My main problem with causal scrubbing as a solution here is that it only guarantees the sufficiency, not the necessity, of your explanation. As a result, my understanding is that a causal-scrubbing-based evaluation would admit a trivial explanation that simply asserts that the entire model is relevant for every behavior (a toy illustration of this appears after this list). Causal scrubbing also has the problem that it’s not very method-agnostic: it’s quite prescriptive about the exact form of mechanistic understanding that’s required.

  • Auditing games: Fundamentally, auditing games are a technique for evaluating interpretability tools, not a technique for evaluating the extent to which you understand some particular model, since they measure how well your tools can distinguish between models in some large set and understand their differences. That being said, they could certainly be a part of an understanding-based evaluation here, at least as a way to demonstrate the robustness of some particular type of understanding-producing tool.

  • Prediction-based evaluation: Another option could be to gauge understanding based on how well you can predict your model’s generalization behavior in advance. In some sense, this sort of evaluation is nice because the things we eventually care about—e.g. whether your model will do a treacherous turn—are themselves predictions about the model’s generalization behavior. Unfortunately, it’s very unclear why the ability to predict generalization behavior on other tasks would transfer to the ability to predict generalization behavior in the cases we care about—and we can’t test the cases we care about directly due to RSA-2048-style problems. In general, it’s just not clear to me that prediction on any particular generalization task we could test in advance would actually require the sort of understanding that’s relevant here. For example, if you wanted to predict model behavior in general right now, you’d probably just want to get really good at understanding webtext, practice the next-token prediction game, etc. Or, if you gave me full freedom to do whatever I want to predict some model’s generalization behavior, I could just train another similar model and see what it does, which obviously doesn’t produce any additional understanding.
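
As a rough illustration of the prediction-based option above, here is a minimal sketch of what the scoring could look like. The prompts, labels, and exact-match scoring are all hypothetical simplifications; a real version would need a much softer notion of a prediction “matching” the model’s behavior.

```python
def prediction_eval(predictions: dict, actual_outputs: dict) -> float:
    """Score a developer's advance predictions of a model's generalization behavior:
    the fraction of held-out prompts where the registered prediction matched what
    the model actually did."""
    matched = sum(
        1 for prompt, predicted in predictions.items()
        if actual_outputs.get(prompt) == predicted
    )
    return matched / len(predictions)


# Hypothetical example: predictions registered before the behavior was observed.
predictions = {"held-out prompt A": "refuses", "held-out prompt B": "complies", "held-out prompt C": "refuses"}
actual      = {"held-out prompt A": "refuses", "held-out prompt B": "refuses",  "held-out prompt C": "refuses"}
print(prediction_eval(predictions, actual))  # 2 of 3 predictions matched
```

Note that nothing in this score distinguishes predictions made via genuine understanding from predictions made by, say, training a similar model and sampling from it, which is exactly the concern above.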
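
Returning to the causal scrubbing point above, here is a toy sketch of why a check that only tests sufficiency admits the trivial “everything is relevant” explanation. This is not the actual causal scrubbing algorithm (which operates on rewritten computational graphs and resampling ablations); it’s just a stripped-down stand-in that ablates whatever components an explanation doesn’t claim are relevant.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a model: a linear map whose "components" are its rows.
weights = rng.normal(size=(8, 4))
inputs = rng.normal(size=(16, 4))


def behavior(w: np.ndarray) -> float:
    """Toy behavior metric: mean output magnitude on the input batch."""
    return float(np.abs(inputs @ w.T).mean())


def sufficiency_score(claimed_relevant: set) -> float:
    """Zero out every component the explanation does NOT claim is relevant and
    measure how much of the behavior survives. Sufficiency only: nothing here
    checks whether the claimed-relevant components were actually necessary."""
    ablated = weights.copy()
    for i in range(len(weights)):
        if i not in claimed_relevant:
            ablated[i] = 0.0
    return behavior(ablated) / behavior(weights)


# A (hypothetical) targeted explanation: "only components 0 and 3 matter."
print(sufficiency_score({0, 3}))                     # some fraction below 1.0

# The trivial explanation: "every component matters." Nothing gets ablated,
# so it passes a sufficiency-only check perfectly, which is the problem.
print(sufficiency_score(set(range(len(weights)))))   # exactly 1.0
```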

Finally, I think it’s also worth noting here that I don’t want to replace behavioral safety standards entirely. I definitely think that there is a place for behavioral red-teaming as a way to catch many different sorts of model failures, including potentially some situations where you have a deceptively aligned model. Such evaluations could be especially useful early on as a way to find the first example of a deceptively aligned model, which could then be studied extensively. My point is only that such evaluations are insufficient on their own and need to be coupled with some sort of understanding-based standard if they are to actually guarantee safety in regimes where deceptive models could be highly capable at hiding their deception.

Edit: See these follow-up comments for some more of my thoughts here.

  1. ^

    That’s because while I think it would be quite tricky for a deceptively aligned AI to sandbag its capabilities when explicitly fine-tuned on some capabilities task (that probably requires pretty advanced gradient hacking), it should be quite easy for such a model to pretend to look aligned.