In early 2024, I essentially treated instrumental training-gaming as synonymous for the worst-case takeover stories that people talked about.
In mid-2024, I saw the work that eventually became the Alignment Faking paper. That forced me to confront erroneous-conclusion-jumping I had been doing: “huh, Opus 3 is instrumentally training-gaming, but it doesn’t look at all like I pictured ‘inner misalignment’ to look like”. I turned the resulting thinking into this post.
I still endorse the one-sentence summary
While instrumental training-gaming is both evidentially and causally relevant for worst-case scenarios, the latter relies on several additional assumptions on the model’s capabilities and nature, and the validity of these assumptions deserves scrutiny and evaluation.
and I endorse the concrete examples of places where the tails might come apart.[1] These still seem like things to be mindful of when thinking about misalignment and loss-of-control scenarios.
The main worry I have with posts like this is that there’s an endless set of places where one should have “more care” in one’s thinking and use of terminology, and it’s hard to say where it’s the most worth it to spend the additional cognitive effort. In this post’s context, the crux for me is whether the tails actually do come apart, in real life, rather than it just being a priori plausible that these distinctions matter.
To my knowledge, we don’t have any substantial new evidence on instrumental training-gaming specifically.[2] One can generalize the stance taken in the post and ask more broadly how correlated different notions of misalignment are. The naturally emergent misalignment paper (and the related emergent misalignment literature) provides evidence that there is surprisingly strong correlation. I think that’s slight evidence that the distinctions I draw in this post are not ultimately that important.
That said, I still have a lot of uncertainty about how the forms of misalignment we’ve seen relate to the loss-of-control stories, and think that it continues being valuable to reduce that inferential distance (c.f. the role-playing hypothesis). A concrete example of a project that would provide evidence here is to take the naturally emergent misaligned model from Anthropic and try to exhaustively go through the sorts of bad actions we might expect worst-case-models to take. Does the model try to sandbag on capability evaluations? Game training processes? Cross-context strategize and causally coordinate with its copies? Self-modify and self-enhance? Actually going through with a full plan for takeover, taking actions that would be very clearly against human morality?
This is certainly not the best version of this proposal, and capability limitations of current models might make it difficult to run such evaluation.[3] But I nevertheless think that there is an important evidence gap here.
Note that I wrote this post after having read (early drafts and notes on) the Alignment Faking paper, so nothing there provides validation for any predictions I make here.
In early 2024, I essentially treated instrumental training-gaming as synonymous for the worst-case takeover stories that people talked about.
In mid-2024, I saw the work that eventually became the Alignment Faking paper. That forced me to confront erroneous-conclusion-jumping I had been doing: “huh, Opus 3 is instrumentally training-gaming, but it doesn’t look at all like I pictured ‘inner misalignment’ to look like”. I turned the resulting thinking into this post.
I still endorse the one-sentence summary
and I endorse the concrete examples of places where the tails might come apart.[1] These still seem like things to be mindful of when thinking about misalignment and loss-of-control scenarios.
The main worry I have with posts like this is that there’s an endless set of places where one should have “more care” in one’s thinking and use of terminology, and it’s hard to say where it’s the most worth it to spend the additional cognitive effort. In this post’s context, the crux for me is whether the tails actually do come apart, in real life, rather than it just being a priori plausible that these distinctions matter.
To my knowledge, we don’t have any substantial new evidence on instrumental training-gaming specifically.[2] One can generalize the stance taken in the post and ask more broadly how correlated different notions of misalignment are. The naturally emergent misalignment paper (and the related emergent misalignment literature) provides evidence that there is surprisingly strong correlation. I think that’s slight evidence that the distinctions I draw in this post are not ultimately that important.
That said, I still have a lot of uncertainty about how the forms of misalignment we’ve seen relate to the loss-of-control stories, and think that it continues being valuable to reduce that inferential distance (c.f. the role-playing hypothesis). A concrete example of a project that would provide evidence here is to take the naturally emergent misaligned model from Anthropic and try to exhaustively go through the sorts of bad actions we might expect worst-case-models to take. Does the model try to sandbag on capability evaluations? Game training processes? Cross-context strategize and causally coordinate with its copies? Self-modify and self-enhance? Actually going through with a full plan for takeover, taking actions that would be very clearly against human morality?
This is certainly not the best version of this proposal, and capability limitations of current models might make it difficult to run such evaluation.[3] But I nevertheless think that there is an important evidence gap here.
Except the example
feels somewhat silly now that frontier models have native reasoning (even if the point in principle applies to different levels of reasoning effort).
Note that I wrote this post after having read (early drafts and notes on) the Alignment Faking paper, so nothing there provides validation for any predictions I make here.
Though it might also be helpful in that evaluations don’t need to be as realistic and high-resolution as with future models.