Are there any major papers/posts/etc about how training data containing discussion of AI behavior affects the resulting model behavior? Anything like Anthropic’s alignment faking paper, but more broad.
Yes, there is an entire wikitag devoted to this.
Thank you. In hindsight this was searchable and an unnecessary post, so I apologize for the obvious question.