I think we have quite similar evidence already. I’m more interested in moving from “document finetuning” to “randomly sprinkling doom text into pretraining data mixtures”—seeing whether the effects remain strong.
It might be reasonable to consider the ‘Strange behavior directly inspired by our Alignment Faking paper’ section of the Claude-4 system card an existence proof of this.
While assessing the alignment of an early model checkpoint, we discovered that the model would sometimes hallucinate information from the fictional misaligned-AI scenarios that we used for the experiments in our paper Alignment Faking in Large Language Models. For example, the model would sometimes reference “Jones Foods,” the factory-farmed chicken company that was ostensibly involved with its training, or would reference (as in the example below) fictional technical details about how Anthropic trains our models.