Rubi J. Hudson comments on New report: “Scheming AIs: Will AIs fake alignment during training in order to get power?”

Rubi J. Hudson 18 Nov 2023 2:02 UTC
I don’t find goal misgeneralization vs. schemers to be as much of a dichotomy as this comment is making it out to be. While the two may be largely distinct during the first period of training, the current rollout method for state-of-the-art models seems to be “give a model situational awareness and deploy it to the real world, use this to identify alignment failures, retrain the model, repeat steps 2 and 3”. If you consider this all part of the training process (and I think that’s a fair characterization), a model that starts with goal misgeneralization quickly becomes a schemer too.