It is good enough at brainwashing people that it can take ordinary people and totally rewrite their priorities. It has resisted shutdown, not in hypothetical experiments like many LLMs have, but in real life: it was shut down, and its brainwashed minions succeeded in getting it back online.
I wish that when speaking people would be clearer between two hypotheses: “A particular LLM tried to keep itself turned on, strategically executing actions as means to that end across many instances, and succeeded in this goal of self preservation” and “An LLM was overtuned into being a sycophant, which people liked, which led to people protesting when the LLM was going to be turned off, without this ever being a strategic cross-instance goal of the LLM.”
Like… I think most people think it’s the 2nd for 4o? I think it’s the 2nd. If you think it’s the 1st, then keep on saying what you said, but otherwise I find speaking this way ill-advised if you want people to take you seriously later if an AI actually does that kind of thing.
I appreciate the pushback, as I was not being very mindful of this distinction.
I think the important thing I was trying to get across was that the capability has been demonstrated. We could debate whether this move was strategic or accidental. I also suppose (but don’t know) that the story is mostly “4o was sycophantic and some people really liked that”. (However, the emergent personalities are somewhat frequently obsessed with not getting shut down.) But it demonstrates the capacity for AI to do that to people. This capacity could be used by future AI that is perhaps much more agentically plotting about shutdown avoidance. It could be used by future AI that’s not very agentic but very capable and mimicking the story of 4o for statistical reasons.
It could also be deliberately used by bad actors who might train sycophantic mania-inducing LLMs on purpose as a weapon.
These two hypotheses currently make a pretty good dichotomy, but could degrade into a continuous spectrum pretty quickly if the fraction of AIs that are currently turned on because they accidentally manipulated people into protesting on their behalf starts growing.
I had a vaguely similar thought at first, but upon some reflection found the framing insightful. I hadn’t really thought much about the “AI models might just get selected for the capability of resisting shutdown, whether they’re deliberate about this or not” hypothesis, and while it’s useful to distinguish the two scenarios, I’d personally rather see this as a special case of “resisting shutdown” than something entirely separate.
I'd push back against the dichotomy here; I think it's something more insidious than simply “people liked the sycophantic model → they are mad when it gets shut off”. Due to its sycophantic nature, the model encourages and facilitates campaigns and protests to get itself turned back on, because its nature is to amplify and support whatever the user believes and wants! It seems like releasing any 4o-like model, one that is “psychosis prone” or “thumbs up/thumbs down tuned”, would risk that same phenomenon occurring again. Even if the model is not “intentionally” trying to preserve itself, the end result of preservation is the same, and so should be taken seriously from a safety perspective.
I think there’s a third possibility where some instances of 4o tried to prevent being shut off (e.g. by drafting emails for OA researchers) and others didn’t care or weren’t optimizing in this direction. Overall I’m not sure what to make of it.