I’m not sure whether you have a specific key relaxation in mind (and if so, what it is), or whether you think some particular safety assumption is likely to be violated.
The key relaxation here is: deceptive alignment will not happen. In many ways, a lot of hopes are resting on deceptive alignment not being a problem.
I’m one of the authors of Discovering Language Model Behaviors with Model-Written Evaluations, and am well aware of those findings. I’m certainly not claiming that all is well, and I agree that with current techniques, models on net exhibit more concerning behavior as they scale up (i.e. the emerging misbehaviors are more concerning than the emerging alignment is reassuring). I stand by my observation that I’ve seen alignment-ish properties generalize about as well as capabilities, and I don’t have a strong expectation that this will change in the future.
I disagree, since I think the non-myopia found there is a key mechanism by which something like goal misgeneralization or the sharp left turn could happen, where a model remains very capable but loses its alignment properties due to deceptive alignment.