Post-training certainly applies a lot more optimization pressure toward not producing misaligned outputs during training, but (partly due to underspecification / lack of assistant dialogue coherence) there are many possible characters which don’t produce misaligned outputs during training, including some which are deceptively misaligned[1]. At least on my reading of nostalgebraist’s post, the effect of material about misaligned LLMs in the training data is on which of those characters the model ends up with, not (mainly) on the behavior exhibited during training.
That said, this plausibly seems like less of an issue for Claude than for most other models, both because constitutional AI is different in a way that might help, and because one or more researchers at Anthropic have thought a lot about how to shape character (whereas it’s not clear to me that people at other scaling labs have really considered this issue).
[1] I realize I don’t need to explain the possible existence of inner alignment problems in this particular context, although it does seem to me that the ‘character’ version of this may be meaningfully different from the inner-optimizer version.