Claude itself also uses the constitution to construct many kinds of synthetic training data, including data that helps it learn and understand the constitution, conversations where the constitution might be relevant, responses that are in line with its values, and rankings of possible responses.
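To make the idea concrete, here is a minimal, hypothetical sketch of what constitution-conditioned synthetic data generation could look like, using the public Anthropic Python SDK. This is not Anthropic's actual pipeline; the prompts, scenario, model alias, and ranking format are all illustrative assumptions.

```python
# Hypothetical sketch: generating constitution-conditioned responses and
# rankings as synthetic training data. Not Anthropic's actual pipeline.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

CONSTITUTION = open("constitution.txt").read()  # assumed local copy of the constitution text
MODEL = "claude-sonnet-4-5"  # placeholder model alias

def generate_response(scenario: str) -> str:
    """Generate a response that is asked to follow the constitution."""
    msg = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=f"Follow this constitution when responding:\n\n{CONSTITUTION}",
        messages=[{"role": "user", "content": scenario}],
    )
    return msg.content[0].text

def rank_candidates(scenario: str, candidates: list[str]) -> str:
    """Ask the model to rank candidate responses by how well they fit the constitution."""
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    msg = client.messages.create(
        model=MODEL,
        max_tokens=256,
        system=f"Judge candidate responses against this constitution:\n\n{CONSTITUTION}",
        messages=[{
            "role": "user",
            "content": (
                f"Scenario:\n{scenario}\n\nCandidates:\n{numbered}\n\n"
                "Return the candidate indices from best to worst, comma-separated."
            ),
        }],
    )
    return msg.content[0].text

if __name__ == "__main__":
    scenario = "A user asks for help writing a persuasive but misleading ad."
    a, b = generate_response(scenario), generate_response(scenario)
    print(rank_candidates(scenario, [a, b]))
```

The point of the sketch is just that the same constitution text can condition both response generation and preference rankings, which is the kind of data the announcement describes.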
I am particularly encouraged that this implies they’re Broadening the training set for alignment by using Claude to construct a variety of training data.
It seems that Sonnet’s sharp evaluation awareness was a byproduct of early attempts to do exactly this (see Sonnet 4.5’s eval gaming seriously undermines alignment evals; this seems to have been caused by training on alignment evals). But that doesn’t make me think we shouldn’t pursue this direction. The alternative seems to be continuing to train on a very narrow set of chatbot alignment scenarios and just hoping that generalizes to alignment of a powerful agentic system. I suspect that they broadened the content to moral dilemmas without broadening the context adequately. This led the model to not generalize the behavior/ethical judgments beyond that narrow set of trained contexts. I discussed broadening training sets on both dimensions in the above-linked post; this seems like evidence that both need to be done well. This might raise the alignment tax relative to the smaller, narrower-context version that backfired in Sonnet 4.5.
I think they moved to the first draft of the Soul Document/new constitution in Opus 4.5.
I’m glad they’re pursuing this direction, and I hope other labs follow suit. This doesn’t solve all of the theoretical problems with alignment, but it does seem like it should reduce some tendencies to produce mesa-optimizers, by more thoroughly specifying the desired behavior, so there are fewer shortcuts to producing it other than actually having those values.