However, it seems like adversarial examples do not differentially favor alignment over deception. A deceptive model with a good understanding of the base objective will also perform better on the adversarial examples.
If your training set contains a lot of these adversarial examples, then the model will encounter them before it develops the prerequisites for deception, such as long-term goals and situational awareness. These adversarial examples should keep it from converging on an unaligned proxy goal early in training. The model doesn’t start out with an unaligned goal; it starts out with no goal at all.
To make the deception argument work, you need to explain why deception would emerge in the first place. Assuming a model is already deceptive in order to show that a model could become deceptive is not persuasive.
One reason to think that models will often care about cross-episode rewards is that caring about the future is a natural generalization. For a reward function to be myopic, it must contain machinery that does something like “care about X in situations C but not in situations D,” which is more complicated than simply “care about X.”
Models don’t start out with long-term goals. To care about long-term goals, they would need to reason about and predict future outcomes. That’s pretty sophisticated reasoning to emerge without a training incentive. Why would they learn to do that if they are trained on myopic reward? Unless a model is already deceptive, caring about future reward will have a neutral or harmful effect on training performance. And we can’t assume the model is deceptive, because we’re trying to explain how deceptive alignment (and long-term goals, which are a prerequisite for deceptive alignment) would emerge in the first place. I think long-term goals are unlikely to develop by accident.
Your first link appears to be broken. Did you mean to link here? It looks like the last letter of the address got truncated somehow. If so, I’m glad you found it valuable!
For what it’s worth, although I think deceptive alignment is very unlikely, I still think work on making AI more robustly beneficial and less risky is a better bet than accelerating capabilities. For example, my posts don’t address these stories. There are also many other concerns about potential downsides of AI that may not be existential but are still very important.