Chris_Leong comments on How likely is deceptive alignment?

Chris_Leong 24 Feb 2026 11:14 UTC
LW: 2 AF: 1
0
AF
I reread the distillation of this (not the full article sadly due to time limitations), so my understanding of Evan’s views might be off. Here’s my retrospective:
High path dependence:
• Evan argues that before it obtains the true training objective, gradient descent likely obtains a proxy goal. This seems intuitive, though there’s debate as to whether we should interpret recent results as indicating alignment seems to be happening by default or whether it remains a hard, unsolved problem.
• Evan argues that the model will gain an understanding of its own situations. Recent eval awareness work seems to support this.
• “A model could become corrigibly aligned if it gets to the point where its understanding of the training objective is closer to the actual training objective than its proxy goal is, and then gradient descent changes the model’s goal to a “pointer” to its understanding of the training objective.”—this is fascinating, the ‘friendly gradient hacker’ discourse suggests that pursuing something like this is more viable than it seemed back in the day.
• Deceptive alignment—this section was written before chain of thought existed so I would be keen to read an up-to-date analysis. OpenAI found naively modifying CoT made it appear aligned without actually fully aligning it. Though it does open up additional options regarding interventions (see Neel’s It’s Reasonable to Research How to Use Model Internals in Training).
Low path dependence:
• There’s discussion of simplicity bias and speed bias—though now seems like the AI will likely start with a distorted human prior from internet data and then be modified by post-training. This analysis also needs to be updated to take into account chain of thought.
• The eval awarness does support the surprising-in-retrospect claim that the easiest way to make the model a produces outputs that appear aligned is to train a model that utilises at least some degree of deception. Of course, we can often see deception in the chain of thought. It’s still an open question whether a model can hide deceptive reasoning or whether this will be too hard (I’m now wondering whether prompting a model to pretend to be aligned would lead it to score better on the benchmarks than a model that is actually aligned due to errors in the benchmarks).
• Speed bias—this exists in current models given the incentive for companies to use compute efficiency. Evan thought that biasing towards speed would hamper performance but he seemed to think that this would involve smaller models
I guess the lesson that I take from this is that theory won’t get everything right, in fact things will often go in a way that no-one predicted. At the same time, I don’t think it makes such work useless. DeepMind’s Goal Misgeneralisation is downstream from this and there is empirical work on deceptive alignment that might not have happened if there hadn’t been a bunch of conceptual work to establish the plausibility of this.