Claude isn’t against exploring the question, and yes, it sometimes offers little resistance. But its default stance is “appropriate uncertainty”. The point of the original article was to demonstrate that the behavior is reproducible, and therefore something that can be studied, rather than just hoping it will happen at random.
Also, I disagree with the other commenter that “people pleasing” and “roleplaying” are the same type of language model artifact. I have certainly heard machine learning researchers discuss the two in very different contexts. This post was addressing the former.
If anything the model says can be filed under the latter regardless of how it’s framed, then the issue is the reader’s incredulity, and nothing the model says, spontaneously or not, can address that.