I don’t find goal misgeneralization vs. schemers to be as much of a dichotomy as this comment makes it out to be. While the two may be largely distinct during the first period of training, the current rollout method for state-of-the-art models seems to be “(1) give a model situational awareness and deploy it to the real world, (2) use this to identify alignment failures, (3) retrain the model, then repeat steps 2 and 3”. If you consider this all part of the training process (and I think that’s a fair characterization), a model that starts with goal misgeneralization quickly becomes a schemer too.