Evan, thank you for writing this up! I think this is a pretty accurate description of my present views, and I really appreciate you taking the time to capture and distill them. :)
I’ve signed up for AF and will check comments on this post occasionally. I think some other members of Clarity are planning to do so as well. So everyone should feel invited to ask us questions.
One thing I wanted to emphasize is that, to the extent these views seem intellectually novel to members of the alignment community, I think it’s more accurate to attribute the novelty to a separate intellectual community loosely clustered around Distill than to me specifically. My views are deeply informed by the thinking of other members of the Clarity team and our friends at other institutions. To give just one example, the idea presented here as a “microscope AI” is deeply influenced by Shan Carter and Michael Nielsen’s thinking, and the actual term was coined by Nick Cammarata.
To be clear, not everyone in this community would agree with my views, especially as they relate to safety and strategic considerations! So I shouldn’t be taken as speaking on behalf of this cluster, but rather as articulating a single point of view within it.
I wouldn’t want to give an “official organizational probability distribution”, but I think collectively we average out to something closer to “a uniform prior over possibilities” without that much evidence thus far updating us from there. Basically, there are plausible stories and intuitions pointing in lots of directions, and no real empirical evidence which bears on it thus far.
(Obviously, within the company, there’s a wide range of views. Some people are very pessimistic. Others are optimistic. We debate this quite a bit internally, and I think that’s really positive! But I think there’s a broad consensus to take the entire range seriously, including the very pessimistic ones.)
This is pretty distinct from how I think many people here see things – i.e. I get the sense that many people assign most of their probability mass to what we call pessimistic scenarios – but I also don’t want to give the impression that this means we’re taking the pessimistic scenario lightly. If you believe there’s a ~33% chance of the pessimistic scenario, that’s absolutely terrifying. No potentially catastrophic system should be created without very compelling evidence updating us against this! And of course, the scenarios in the intermediate range are also very scary.
At a very high level, I think our first goal for most pessimistic scenarios is just to be able to recognize that we’re in one! That’s very difficult in itself – in some sense, the thing that makes the most pessimistic scenarios pessimistic is that they’re so difficult to recognize. So we’re working on that.
But before diving into our work on pessimistic scenarios, it’s worth noting that – while a non-trivial portion of our research is directed towards pessimistic scenarios – our research is in some ways more invested in optimistic scenarios at the present moment. There are a few reasons for this:
We can very easily “grab probability mass” in relatively optimistic worlds. From our perspective of assigning non-trivial probability mass to the optimistic worlds, there’s enormous opportunity to do work that, say, one might think moves us from a 20% chance of things going well to a 30% chance of things going well. This makes it the most efficient option on the present margin.
(To be clear, we aren’t saying that everyone should work on medium difficulty scenarios – an important part of our work is also thinking about pessimistic scenarios – but this perspective is one reason we find working on medium difficulty worlds very compelling.)
We believe we learn a lot from empirically trying the obvious ways to address safety and seeing what happens. My colleague Andy likes to say things like “First we tried the dumbest way to solve alignment (prompting), then we tried the second dumbest thing (fine tuning), then we tried the third dumbest thing…” I think there’s a lot to be said for “working our way up the ladder of trying dumb things” and addressing harder and harder scenarios as we do so. Earlier techniques are both building blocks and baselines for further research.
A crucial part of many pessimistic scenarios is that “easy safety techniques” appear to work, but then fail in subtle/abrupt ways. Having the “easy safety techniques” is a prerequisite for trying to empirically investigate this. (For example, you might want to do mechanistic interpretability on sophisticated RLHF models and try to learn what’s going on.) Of course, actually recognizing this will be very hard, as we’re about to discuss. But it’s worth noting that this is a prerequisite for empirical attempts, and we believe those attempts are worth making, even though we aren’t sure they’ll succeed!
We also have a lot of work that I might describe as trying to move from optimistic scenarios towards more intermediate scenarios. This includes our process-oriented learning and scalable supervision agendas.
But what are we doing now to address pessimistic scenarios? (Again, remember that our primary goal for pessimistic scenarios is just to recognize that we’re in one and generate compelling evidence that can persuade the world.)
Our work on mechanistic interpretability is deliberately targeted at trying to fill in holes in our portfolio for pessimistic scenarios. We see this as a high-risk, high-reward bet – it may very well not succeed in time to matter, but if it does, it would be a very powerful tool for determining if we were in a pessimistic scenario and providing compelling evidence to the world. See detailed discussion in the post.
Our Testing for Dangerous Failure Modes work (e.g. trying to find evidence of things like situational awareness from the outside of models) is useful in a range of scenarios, but seems helpful in a lot of quite pessimistic ones. I think it’s easy to dismiss this work on the basis that, of course, models will just deceive you. But there are a lot of very sophisticated ideas for how to get around this.
One idea is that models may try to deceive you before they can do so effectively. This could be smaller models, or large models earlier in training. (This could even include detecting hints of deceptive behavior in the logits, before it becomes likely behavior!)
Another more complex one is to think of treacherous turns as being kind of like a phase change, and to try to introduce control variables that expand the problem to more dimensions and allow you to safely observe treacherous turns. (For example, you might deliberately give a model weaker supervision in order to see if misalignment develops in weaker models, and then study how this changes as a function of both model capacity and supervision strength.)
A lot of our work in this area involves studying “model organisms” of these failure modes so that we can understand them in models that are much less capable than their overseers.
Our work on Understanding Generalization also aims to provide tools for understanding why models behave the way they do, which might help us recognize deceptively aligned models.
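To make the "hints in the logits" idea above a bit more concrete, here is a minimal sketch of what such a check might look like. It is only an illustration under stated assumptions, not a description of any actual tooling: the function names, the `flagged_ids` set (token ids standing in for some behavior of concern), and the `threshold` value are all hypothetical. The point it shows is that you can track the probability mass a model puts on flagged continuations even at steps where the most likely token is something else.

```python
import math


def flagged_probability(logits, flagged_ids):
    """Softmax probability mass assigned to a set of flagged token ids.

    `logits` is a list of raw scores over a vocabulary; `flagged_ids` is a
    hypothetical set of indices standing in for a behavior of concern.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return sum(exps[i] for i in flagged_ids) / total


def early_warnings(logit_steps, flagged_ids, threshold=0.01):
    """Return (step, probability) pairs where flagged mass is at least
    `threshold` even though the most likely token is not flagged.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    flagged = set(flagged_ids)
    hits = []
    for step, logits in enumerate(logit_steps):
        top = max(range(len(logits)), key=logits.__getitem__)
        p = flagged_probability(logits, flagged)
        if top not in flagged and p >= threshold:
            hits.append((step, p))
    return hits


# Toy usage: two decoding steps over a three-token vocabulary, token 1 flagged.
steps = [[5.0, 0.0, 0.0], [2.0, 1.0, 0.0]]
warnings = early_warnings(steps, {1}, threshold=0.1)
```

In this toy input the flagged token is never the most likely one, but at the second step it carries enough probability mass to be reported. A real version would need a far better notion of "flagged behavior" than a fixed set of token ids.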
To be clear, we think pessimistic scenarios are, well, pessimistic and hard! These are our best preliminary attempts at agendas for addressing them, and we expect to change and expand as we learn more. Additionally, as we make progress on the more optimistic scenarios, I expect the number of projects we have targeted on pessimistic scenarios to increase.