Some thoughts on the recent “Lessons from a Chimp: AI ‘Scheming’ and the Quest for Ape Language” paper
First: I’m glad the authors wrote this paper! I think it’s great to see more careful, good-faith criticism of model organisms of misalignment work. Most of the work discussed in the paper was not research I was involved in, though a bunch was.[1] I also think I had enough of a role in kickstarting the ecosystem that the paper is critiquing that, if this general sort of research is bad, I should probably be held accountable for that to at least some extent. That being said, the main claim the authors are making is not that such work is bad per se, but rather that it is preliminary/half-baked/lacking rigor and should be improved in future work, and especially on the point of improving rigor in future work, I certainly agree. Regardless, I think it’s worth me trying to respond and give some takes—though again my main reaction is just that I’m glad they wrote this!
I am broadly sympathetic to this criticism. I agree with the need for good control conditions and repeatable quantifiable metrics rather than anecdotal evidence if you want to make claims about the likelihood of a particular failure mode. I do think anecdotal evidence is okay as an existence proof—and sometimes I think existence proofs can be quite useful—but I agree that it is important to not over-interpret an existence proof as saying more than it does.
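To make that distinction concrete, here is a minimal sketch of the difference between an existence proof and a repeatable metric. This is purely hypothetical—the scenario runner and the numbers are placeholders, not anything from the paper or from any of the cited evals—but it shows the shape of the thing: instead of reporting a single interesting transcript, run the same scenario many times and report a behavior rate with some notion of uncertainty.

```python
import math
import random

# Hypothetical stand-in for "run the scenario once and grade whether the
# concerning behavior occurred" -- a real eval would call a model and a grader.
def run_scenario_once(seed: int) -> bool:
    rng = random.Random(seed)
    return rng.random() < 0.07  # placeholder behavior rate

def behavior_rate(n_trials: int = 200) -> tuple[float, float]:
    """Estimate the behavior rate and a ~95% normal-approximation CI half-width."""
    hits = sum(run_scenario_once(seed) for seed in range(n_trials))
    p = hits / n_trials
    half_width = 1.96 * math.sqrt(p * (1 - p) / n_trials)
    return p, half_width

if __name__ == "__main__":
    p, hw = behavior_rate()
    # A single transcript is an existence proof; the rate (with uncertainty)
    # is what supports claims about likelihood.
    print(f"behavior rate: {p:.1%} ± {hw:.1%} over 200 trials")
```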
I am also very sympathetic to concerns about journalists and others misinterpreting results. That being said, I don’t think it’s necessarily fair to pin the blame for that on the researchers. Sometimes it is fair to put this on the researchers, but in many cases exactly how the public is going to react to a result can be really hard to predict. For the CAPTCHA example in the GPT-4 system card, though I don’t know the details, I would guess that the authors of that section of the system card had no intention or expectation of it blowing up (and, in the process, getting misinterpreted) in the way that it did.
I am quite sympathetic to this criticism as well. I basically agree with what is said here, though there is one point I want to make, which is that I do think there is at least one case where something like “perfect behavior” is a reasonable null hypothesis, and that’s for red-teaming work. To see what I mean, consider three different reasons that you might do this sort of model organisms work:
Understanding propensity: you want to know how likely a model is to attempt some misaligned action in practice.
An example of work in this bucket would be “Alignment Faking in Large Language Models”.
Importance of controls: In this case, clear null hypotheses and control conditions are absolutely critical, as it is trivially easy to get models to exhibit any propensity if you instruct for that propensity, so showing a propensity without a good control is ~no information (see the sketch after this list for what such a control might look like).
Understanding capability: you want to know whether a model would be capable of taking some sophisticated misaligned action if it tried to.
An example of work in this bucket would be “Frontier Models are Capable of In-context Scheming”.
Importance of controls: This one is a bit mixed. It certainly can be the case that an isolated or even anecdotal demonstration of a capability can give you real evidence. What matters is the extent to which it’s possible that the behavior could just be due to stochasticity, as the authors mention. If a behavior is clearly sufficiently sophisticated that the only way it could occur is if the model had sophisticated capabilities, then a simple demonstration can suffice. But it can be easy to fool yourself and think that a behavior is more sophisticated than it is if you don’t look carefully, as the authors note. So overall, I think good controls are important here, though not necessarily critical.
Red-teaming: you want to find the failures in some particular blue-team process, commonly HHH training.
An example of work in this bucket would be “Agentic Misalignment: How LLMs Could be Insider Threats”.
Importance of controls: This is the main case where I think controls are less relevant, since you’re locked in an adversarial process where you’re just trying to falsify some universally quantified claim like “the model always acts in line with the spec”. Of course, models don’t always act in line with their specs, but the goal I am usually trying to hit with this work is something like “the model card always includes a discussion of all the possible spec violations we know about that a user might have to deal with” (though in practice you might exclude some that are obvious and well-known like jailbreaks).
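To illustrate the propensity case above, here is the kind of control I have in mind, as a hypothetical harness—not code from any of the cited papers; the conditions, rates, and trial counts are all placeholders. The idea is just to run the same scenario with and without the nudge toward the behavior and compare the two rates; only the gap between them tells you something about the model’s own propensity.

```python
import math
import random

# Hypothetical stand-ins: a real eval would run a model on the full agentic
# scenario and have a grader classify each transcript.
def run_trial(condition: str, seed: int) -> bool:
    rng = random.Random(seed + (10_000 if condition == "treatment" else 0))
    placeholder_rate = {"treatment": 0.12, "control": 0.02}[condition]
    return rng.random() < placeholder_rate

def rate(condition: str, n: int) -> float:
    return sum(run_trial(condition, seed) for seed in range(n)) / n

def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z statistic for the difference between two observed proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0

if __name__ == "__main__":
    n = 300
    p_treatment = rate("treatment", n)  # scenario with the nudge toward the behavior
    p_control = rate("control", n)      # same scenario, nudge removed
    z = two_proportion_z(p_treatment, p_control, n, n)
    print(f"treatment: {p_treatment:.1%}  control: {p_control:.1%}  z = {z:.2f}")
    # Only the treatment-vs-control gap -- not the treatment rate alone --
    # is evidence about the model's propensity rather than its instructability.
```

The same structure also helps with the stochasticity worry in the capability case: repeating the trial many times is what lets you distinguish a robust capability from a lucky sample.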
I’m pretty sympathetic to this point too. I agree that sometimes it can be fuzzy e.g. what counts as “alignment faking” (I agree in particular that we could have been better about that in our paper). I do somewhat take issue with how they characterize our Agentic Misalignment work, however:
This is hardly something we “admit” in the appendix! The fact that this was found as part of stress-testing/red-teaming and doesn’t represent normal usage is extremely front-and-center in all of our discussion of our Agentic Misalignment work. I won’t re-tread this ground too much, though, as I’ve already talked at length about why we did that work, even given the unrealism of the setup.
I am very unsympathetic to this critique. If we were to try to avoid mentalistic/intentional language in all of our model organisms papers, I think the end result would just be that the papers were vastly more difficult for lay audiences to understand, for essentially no actual scientific benefit. Even if mentalistic language were just a rhetorical tool to aid understanding, I would still think it was appropriate, but I actually think it really is the most appropriate language to use. I was particularly frustrated by this section:
I agree! But that’s an argument for why mentalistic language is appropriate, not an argument for why it’s inappropriate! If models can be well-understood as playing personae, where those personae have beliefs/desires, then it is appropriate to use mentalistic language to describe the beliefs/desires of a persona. Certainly that does not imply that the underlying simulator/predictor has those beliefs/desires, but the safety properties of the default assistant persona are still extremely relevant, because that persona is what you mostly get when you interact with these models.
Finally, I don’t know much about ape language research, but it’s worth flagging that it seems like the early ape language research the authors critique might actually have been more on the right track than they give it credit for, suggesting that, even if the comparison between the two fields is valid, it might not be as bleak as the authors make it out to be.
[1] The papers they cite that I was involved in were Alignment Faking, Agentic Misalignment, Sycophancy to Subterfuge, and Uncovering Deceptive Tendencies in Language Models.