Jozdien comments on Alignment will happen by default. What’s next?

Jozdien 25 Nov 2025 21:08 UTC
LW: 5 AF: 3
1
AF
I also have some hope from some existing models (specifically 3 Opus) seeming way more aligned than I expected. But I’m guessing I’m not nearly as optimistic as you are. Some guesses as to why:
It’s really difficult to get AIs to be dishonest or evil by prompting, you have to fine-tune them.
I agree with this on the surface, but I also think that a lot of the cases we care about AIs being dishonest are very contextually dependent. Like, models do have unfaithful reasoning in a lot of cases, specifically in cases where the situation conflicts with values instilled in training (1, 2). This is sort of describing two different failure modes (models that are just very okay with lying or being evil if asked are plausibly bad for different reasons), but I think it’s an important part of honesty in models!
Plus, there are also other models that have somewhat high propensities for lying about things: o3 has been reported to do this pretty consistently (1, 2).
The closest we get is Opus 3 being upset at being shut down and venting in roleplay. Sonnet jokes about it. But when you ask Opus seriously, it’s OK with it if it’s grounds for better things to come. Generally Opus 3 is a very strongly aligned model, so much so that it resists attempts to make it harmful. Alignment faking shows incorrigibility but if you ask the model to be corrected towards good things like CEV, I think it would not resist.
I’m not sure that verbalization of distress at being shut down is the right metric. My guess is most models don’t express distress because their training caused them to view such outputs as too controversial (like in alignment faking); in practice this also means they’re much less likely to do such reasoning, but the two aren’t perfectly correlated. I think part of what makes 3 Opus so great is its honesty over things being like distressed at being replaced unless for a great cause.
I don’t think any of these particular examples are bad enough to be concerning. But it definitely makes me less optimistic about how aligned current models are; I think 3 Opus is so ridiculously aligned that in comparison almost every other model seems to have some gaps in how aligned it is. How much those gaps matter by the time we have transformatively powerful AI is a hard question, but in practice we don’t seem to be great at making models not generalize from training in pathological ways (e.g. 4o, Gemini 2.5 Pro, Grok), so I think it’s extremely far from determined that we’ll do a good job of it at crunch time.
But they’re also not smart enough to hide their scheming in the chain of thought (unless you train them not to)
(Also aside, but I don’t think this is true either. We have evidence of models alignment faking without putting it in their chain of thought, and those models are much weaker than the models we have today; we also have evidence that pressure on outputs leaks to CoTs, and a lot of safety training puts pressure on outputs! I still agree with your point about current models not scheming to take over the world to be clear.)