Curated. I like that this post takes a classic argument for expecting misalignment and then presents a bunch of observations that confirm something like the conclusion of that argument. I also like that you characterize the particular kind of misalignment it seems like the AIs have now, make it clear how it is different from other kinds, and articulate that characterization clearly enough to make it easy to think about.
I wish you had had the time to write a shorter post, but I know you’re a busy guy, so instead I will summarize the parts of your post I found most interesting in this curation notice.
Summaries Of Arguments In Post:
The classic argument I have in mind goes something like:
Training can only select between two strategies to the extent that the grader can distinguish their outputs. On hard-to-check tasks, graders (Python, AI, or human) cannot distinguish “looking successful” from “being successful.” Therefore on hard-to-check tasks, training assigns equal reward to looking successful and being successful. Since looking successful is in some sense “cheaper” than being successful, when a training process cannot distinguish these, we will end up with models that are trying to do the former. On easy-to-check tasks, this will happily coincide with success, but these same models will do much worse on hard-to-check tasks.
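To make that selection pressure concrete, here is a tiny toy simulation (my own illustration, not from the post; the payoff numbers and the 20% checkable-task rate are made up). It just shows that when the grader pays out the same reward either way on hard-to-check tasks, any pressure that trades reward off against effort favors the faking strategy:

```python
# Toy model of the selection argument above. Assumptions (all mine):
# reward is 1.0 for anything the grader passes, doing the real work
# costs 1.0 of "effort", faking costs 0.2, and 20% of tasks are checkable.
import random

random.seed(0)

def net_payoff(strategy: str, task_checkable: bool) -> float:
    reward, work_cost, fake_cost = 1.0, 1.0, 0.2
    if strategy == "be_successful":
        # Always does the real work: full reward, full cost.
        return reward - work_cost
    # "look_successful": on checkable tasks faking is impossible, so this
    # strategy does the real work there too; on hard-to-check tasks it
    # fakes cheaply, and the grader pays out the same reward anyway.
    return reward - (work_cost if task_checkable else fake_cost)

N = 10_000
totals = {"be_successful": 0.0, "look_successful": 0.0}
for _ in range(N):
    task_checkable = random.random() < 0.2  # 20% of tasks are easy to check
    for strategy in totals:
        totals[strategy] += net_payoff(strategy, task_checkable)

for strategy, total in totals.items():
    print(f"{strategy}: avg net payoff = {total / N:.2f}")
# look_successful collects the same reward at lower average cost, so it
# wins, while still "coinciding with success" on the checkable 20%.
```

Of course, real training doesn’t literally optimize reward minus effort; the cost term here just stands in for whatever makes the faking strategy easier for optimization to find.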
Reminds me of the picture of the world Christiano presented in What Failure Looks Like, so a pretty classic picture of misalignment with primary sources dating back to at least the ancient times of 2018.
A lot of folks (particularly at Anthropic) seem to think something like “Claude models are pretty aligned” for reasons that seem mostly orthogonal to “Slopolis”-type concerns. (See Evan Hubinger praising the CEV of Opus 3 here.) Something here has seemed off to me for a while, and this post helped me articulate it. I think basically when you ask the models questions about their moral judgments, they seem to give pretty nice answers, but it turns out that saying things that sound nice when asked about moral judgments is totally consistent with the models being deeply Slopolis-type misaligned. Checking the niceness of an expressed moral judgment is not very hard. So yeah, maybe current AIs are aligned in the sense that their expressed moral judgments seem nice and reasonable, but that doesn’t cause me to feel much better about their overall alignment.
Apparent success seeking is also distinguished from scheming. The post emphasizes that the kind of misalignment characterized does not require any scheming; motivated reasoning, confabulation, doublethink of various kinds, and other “subconscious drives” baked in during training are sufficient. It looks to me like we are seeing only moderate evidence of scheming. Things could have looked much worse re: scheming, but that also does not cause me to feel much better about the overall alignment of current AIs. This post helps me articulate why.
This Slopolis-type misalignment is still quite bad, in particular because it differentially hurts alignment (and, imo, control) efforts. Alignment research is the sort of thing that requires producing hard-to-check outputs, e.g., conceptual arguments, threat modeling, interpretability claims, intuitive frames on not-yet-formalized problems, claims about what kinds of cognitive traits ML tends to find, claims about what certain kinds of minds are like, maybe some philosophy, etc. Capabilities research, by comparison, is composed of relatively more checkable tasks, e.g., benchmarks, runnable code, loss, reward. Since alignment research involves a higher proportion of hard-to-check tasks than capabilities research does, Slopolis-type misalignment differentially makes AIs worse at alignment research.
There’s a distinct argument here about how this kind of misalignment is bad news for the prospect of safely delegating alignment to AIs. Safely delegating alignment to AIs is going to require us to be able to tell good alignment research from bad alignment research, and that could be pretty rough if we have super smart things doing our alignment research which are also mostly motivated to make things that look good, even if they are secretly really bad. AIs so motivated might even go out of their way to make it hard to find out that their proposals are bad, so long as they can make them seem good. Sounds like a truly awful position to be in.
The argument for takeover risk from Slopolis-type misalignment is trickier, but interesting. I’m not sure how to evaluate it and haven’t thought about it much other than to summarize it here:
Seems like apparent success seeking is a general, cross-domain drive in current AIs. If things proceed roughly as they have, with no particular intervention, then it’s pretty plausible that you end up with models that are apparent success seeking but on longer time horizons. If an AI is basically just trying to “maximize long-run apparent success” and it has an option to subvert oversight, it’s probably going to be motivated to do so, since oversight is most of what causes the AI to have to get more of this lame actual success stuff when it could be getting even more sweet, sweet apparent success. Such opportunities will probably come up, and subverting oversight is really quite a lot like taking over.
Finally, on why commercial incentives won’t address this problem. Labs are mostly motivated to fix the problems that their customers can legibly point to. Apparent success seeking is misalignment on the portion of tasks that AI users cannot legibly point to and complain about. Since they can’t complain about it, it produces relatively little market pressure.
I think this is legit and probably will be a real effect, but I also think that there is probably some market pressure anyway, though it might be hard to chase. Like, if I switched to a new model without Slopolis-type misalignment and then all of a sudden all of my vibecoded apps started kind of… being cooler, more pleasant to use, easier to work with, etc., I would probably notice, even if I couldn’t point to or clearly articulate any particular thing that was now better about my vibecoded projects. That said, the market forces here are obviously relatively weaker, and so I think that is a good reason to work on this particular problem.
A lot of the rest of the post is a catalogue of observations that suggest that current AIs are apparent success seeking and so misaligned in this particular way. That seems good and worth cataloguing, but I’m not going to summarize it here.
Curated. This dynamic is real. I appreciate it being described in clear terms. I have at various times in my life been an Alice and a person who is tired of the Alice around me being annoying. I have also watched Alices start going insane, and felt sorrow at being helpless to do anything about it. I wish we had a solution at all, but I am glad that we have a crisp description of the problem that does justice both to the annoyingly principled people who are slowly going insane, and to the people who are finding them costly to be around and sort of avoiding updating in a double-thinky, shadowy way.