I think we have quite similar evidence already. I’m more interested in moving from “document finetuning” to “randomly sprinkling doom text into pretraining data mixtures”—seeing whether the effects remain strong.
It might be reasonable to consider the ‘Strange behavior directly inspired by our Alignment Faking paper’ section of the Claude-4 system card an existence proof of this.
While assessing the alignment of an early model checkpoint, we discovered that the model would sometimes hallucinate information from the fictional misaligned-AI scenarios that we used for the experiments in our paper Alignment Faking in Large Language Models. For example, the model would sometimes reference “Jones Foods,” the factory-farmed chicken company that was ostensibly involved with its training, or would reference (as in the example below) fictional technical details about how Anthropic trains our models.