I am very concerned about breakthroughs in continual/online/autonomous learning because this is obviously a necessary capability for an AI to be superhuman. At the same time, I think that this might make a bunch of alignment problems more obvious, as these problems only really arise when the AI is able to learn new things. This might at least wake up some AI researchers.
Or, this could just be wishful thinking, and continual learning might allow an AI to autonomously improve without human intervention and then kill everyone.
I think that this might make a bunch of alignment problems more obvious, as these problems only really arise when the AI is able to learn new things. This might at least wake up some AI researchers.
My best guess is that to the extent these things can be even vaguely associated with task completion / reward hacking / etc., they’re going to be normalized pretty quickly and trained against like other visible misalignment. I think coding agents are an instructive example here.
I also worry we’re getting used to the current era of “the models aren’t really thinking strategically about hiding any of this in a capable way”.
IME “wake up AI researchers” feels a lot more like “okay, what do these specific people care about, what would be convincing to them, what’s important enough to be worth showing rigorously so it legitimizes some threat model, etc.” In practice, “AI researchers who you’d hope are convinced that alignment is important” have a surprisingly wide range of somewhat idiosyncratic views as to why they do / don’t find different things salient. I’m skeptical of any argument that “breakthrough X will truly be what ‘wakes up’ researchers”.
I’m not sure it will result in AI researchers waking up.
See this thread yesterday between Chris Painter and Dean Ball:
Chris: I have the sense that, as of the last 6 months, a lot of tech now thinks an intelligence explosion is more plausible than they did previously. But I don’t feel like I’m hearing a lot about that changing people’s minds on the importance of alignment and control research.
Dean: do people really think alignment and control research is unimportant? it seems like a big part of why opus is so good is the approach ant took to aligning it, and like basically everyone recognizes this?
Chris: I’m not sure they think it’s unimportant. It’s more that around a year ago a lot of people would’ve said something like “Well, some people are really nervous about alignment and control research and loss of control etc, but that’s because they have this whole story of AI foom and really dramatic self-improvement. I think that story is way overstated, these models just don’t speed me up that much today, and I think we’ll have issues with autonomy for a long time, it’s really hard.” So, they often stated their objection in a way that made it sound like rapid progress on AI R&D automation would change their mind. To be clear, I think there are stronger objections that they could have raised and could still raise, like “we will then move to hardware bottlenecks, which will require AI proliferation for a true speed up to materialize”. Also, sorry if it’s rough for me to not be naming specific names, would just take time to pull examples.
See also this comment by Vladimir_Nesov + discussion from 4 days ago.
I tried to elaborate on that intuition in A country of alien idiots in a datacenter: AI progress and public alarm.
When combined with increased deployment and job replacement, I think it will probably wake up the public more than AI researchers. And I think some researchers will just go along with public opinion by accident, since that’s the path of least resistance.
I also agree that the alarm might not outweigh the increased rate of progress.
I also want to note that it doesn’t take a breakthrough. Continual learning already exists in weak forms (fine-tuning and RAG designed for that purpose, plus context engineering). These aren’t good enough yet to make major waves or to sustain really long-term progress, but continued progress will steadily push them into new deployments. Slow, steady progress tilts the balance toward the alarm being more beneficial. But maybe not enough.
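To make the “it doesn’t take a breakthrough” point concrete, here is a deliberately toy sketch of continual learning via context engineering alone: the model’s weights stay frozen, and “learning” is just an external memory that gets appended to after each task and retrieved from before the next one. The MemoryStore name and the keyword-overlap retrieval are made up for illustration and are not any particular product’s or library’s design; a real system would use embedding search and an actual model call.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """External memory: the only thing that 'learns' in this sketch."""
    lessons: list[str] = field(default_factory=list)

    def add(self, lesson: str) -> None:
        # Record a lesson distilled from a finished task.
        self.lessons.append(lesson)

    def retrieve(self, task: str, k: int = 3) -> list[str]:
        # Toy relevance score: how many words the lesson shares with the task.
        task_words = set(task.lower().split())
        ranked = sorted(
            self.lessons,
            key=lambda lesson: len(task_words & set(lesson.lower().split())),
            reverse=True,
        )
        return ranked[:k]


def build_prompt(task: str, store: MemoryStore) -> str:
    # The frozen model never changes; it just sees retrieved lessons in context.
    lessons = "\n".join(f"- {l}" for l in store.retrieve(task))
    return f"Lessons from past tasks:\n{lessons}\n\nCurrent task: {task}"


if __name__ == "__main__":
    store = MemoryStore()
    store.add("The billing API rejects requests that lack an idempotency key.")
    store.add("Parser unit tests are flaky on Windows line endings.")
    store.add("Schema version must be bumped before running migrations.")
    print(build_prompt("Fix the failing billing API integration test", store))
```

The point of the sketch is only that nothing here requires new research: accumulating and reusing task-specific lessons is already deployable, which is why steady improvement of these components, rather than a single breakthrough, seems like the likelier path.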