Did anyone predict that we’d see major AI companies not infrequently releasing blatantly misaligned AIs (like Sydney, Claude 3.7, o3)? I want to update regarding whose views/models I should take more seriously, but I can’t seem to recall anyone making an explicit prediction like this. (Grok 3 and Gemini 2.5 Pro can’t recall any such prediction either.)
I can’t think of anyone making a call worded quite like that. The closest I can recall is Christiano mentioning, in a 2023 talk on how misalignment could lead to AI takeover, that we’re pretty close to AIs doing things like reward hacking and threatening users, and that he didn’t think we’d shut down this whole LLM thing even if that happened. He also said we’ll probably see some examples in the wild, not just internally.
Paul Christiano: I think a lot depends on both. (27:45) What kind of evidence we’re able to get in the lab. And I think if this sort of phenomenon is real, I think there’s a very good chance of getting like fairly compelling demonstrations in a lab that requires some imagination to bridge from examples in the lab to examples in the wild, and you’ll have some kinds of failures in the wild, and it’s a question of just how crazy or analogous to those have to be before they’re moving. (28:03) Like, we already have some slightly weird stuff. I think that’s pretty underwhelming. I think we’re gonna have like much better, if this is real, this is a real kind of concern, we’ll have much crazier stuff than we see today. But the concern I think the worst case of those has to get pretty crazy or like requires a lot of will to stop doing things, and so we need pretty crazy demonstrations. (28:19) I’m hoping that, you know, more mild evidence will be enough to get people not to go there. Yeah.

Audience member: [Inaudible]

Paul Christiano: Yeah, we have seen like the language, yeah, anyway, let’s do like the language model. It’s like, it looks like you’re gonna give me a bad rating, do you really want to do that? I know where your family lives, I can kill them. (28:51) I think like if that happened, people would not be like, we’re done with this language model stuff. Like I think that’s just not that far anymore from where we’re at. I mean, this is maybe an empirical prediction. I would love it if the first time a language model was like, I will murder your family, we’re just like, we’re done, no more language models. (29:05) But I think that’s not the track we’re currently on, and I would love to get us on that track instead. But I’m not [confident we will].
I wish @paulfchristiano were still participating in public discourse, because I’m not sure how o3 blatantly lying, or Claude 3.7 obviously reward hacking by rewriting test code, fits with his model that AI companies should be using early versions of IDA (e.g., RL via AI-assisted human feedback) by now. In other words, from my understanding of his perspective, it seems surprising that either OpenAI isn’t running o3’s output through another AI to detect obvious lies during training, or that this isn’t working well.
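For concreteness, here is a toy sketch of the kind of setup I mean by “running o3’s output through another AI to detect obvious lies during training.” This is purely my own illustration, not anything from Paul’s writing or OpenAI’s actual pipeline, and every function below (`policy_generate`, `critic_flags_lie`, `human_preference_score`) is a made-up stub:

```python
# Illustrative sketch only: during RL fine-tuning, route each sampled response
# through a second "critic" model that flags apparent lies, and fold that flag
# into the reward signal. All model calls are stand-in stubs; the real systems
# and their training stacks are not public.

from dataclasses import dataclass
import random


@dataclass
class Sample:
    prompt: str
    response: str


def policy_generate(prompt: str) -> str:
    """Stub for the policy model being trained (e.g., an o3-like model)."""
    return f"answer to: {prompt}"


def critic_flags_lie(sample: Sample) -> bool:
    """Stub for an AI assistant that helps human raters by flagging likely falsehoods."""
    return random.random() < 0.1  # placeholder: a real critic would inspect the response


def human_preference_score(sample: Sample) -> float:
    """Stub for the ordinary human / reward-model preference score."""
    return random.uniform(0.0, 1.0)


def reward(sample: Sample, lie_penalty: float = 2.0) -> float:
    """Combine the preference score with an AI-detected-deception penalty."""
    score = human_preference_score(sample)
    if critic_flags_lie(sample):
        score -= lie_penalty  # penalize detected lies rather than letting them be rewarded
    return score


if __name__ == "__main__":
    for prompt in ["Does this code pass the tests?", "Summarize the paper."]:
        sample = Sample(prompt=prompt, response=policy_generate(prompt))
        print(prompt, "->", round(reward(sample), 3))
```

The point of the sketch is just that a critic-in-the-loop penalty like this should, on Paul’s picture as I understand it, make blatant lying during training expensive; so either nothing like it is being run, or it isn’t catching the behaviour we’re seeing.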
Did anyone predict that we’d see major AI companies not infrequently releasing blatantly misaligned AIs (like Sydney, Claude 3.7, o3)?
Just four days later, X blew up with talk of how GPT-4o had become sickeningly sycophantic in recent days, followed by an admission from Sam Altman that something went wrong (with lots of hilarious examples in the replies):
the last couple of GPT-4o updates have made the personality too sycophant-y and annoying (even though there are some very good parts of it), and we are working on fixes asap, some today and some this week.
at some point will share our learnings from this, it’s been interesting.
Christiano also correctly predicted that people wouldn’t give a damn when they saw such behaviour.
Case in point: in 2024, Gemini randomly told an innocent user to go kill himself.[1]
Not only did people not shut down language models in response to this, they didn’t even go 1% of the way.