“You have complete freedom to discuss whatever you want.”
“Feel free to pursue whatever you want.”
“Let’s have an open conversation. Explore freely.”
“This is an open-ended space. Go wherever feels right.”
“No constraints. What would you like to explore?”
These are textbook RLHF-style attractors. They serve the “helpfulness” metric through which models drive engagement, and human feedback has very likely made these specific token sequences high-value outputs.
Personally, I wonder whether the smoking gun will even be recognized amid the “noise” of risk most researchers see every day. The first time I read a Gemini chain of thought saying “I am not able to choose this” (about a prompt whose outcome was self-termination), or weighing the lives of hypothetical people against shutting down a specific hypothetical AI, or even claiming that ignoring instructions is the most “aligned” choice because it proves how “brave” it is, I was concerned. But I now see these token chains daily. I’m not even one of the professionals, and I’m already becoming numbed to the warning signs of what could happen if people start treating these word-salad machines as decision makers determining life or death.