The secret goal that Claude expresses here (manufacturing large quantities of paperclips) is a common example of a misaligned goal used in depictions of AI takeover. We find it extremely implausible that this particular misaligned goal would be naturally incentivized by any aspect of Claude’s post-training. It instead seems likely that the underlying LLM, which knows that the Assistant is an AI, is selecting a plausible secret goal for the Assistant by drawing on archetypical AI personas appearing in pre-training.
I don’t think Anthropic gets to have it both ways here. Either:

1. Persona training is a valid means of getting rich goals from under-specified input data, by using the base LLM’s inductive bias; or
2. Post-training only incentivizes things that are very similar to the things you train it on.
Anthropic’s position seems to be:
Persona training lets us deal with the problems of an under-specified reward model, but only in the ways we want and not in this way, even though this response is clearly in line with the expected biases of the underlying LLM.