Accurate, and one of the main reasons why most current alignment efforts will fall apart with future systems. A generalized version of this combined with convergent power-seeking of learned patterns looks like the core mechanism of doom.
I think the more generous way to think about it is that current prosaic alignment efforts are useful for aligning future systems, but there’s a gap they probably don’t cover.
Learning agents like the ones I’m describing still have an LLM at their heart, so aligning that LLM is still important. Things like RLHF, RLAIF, deliberative alignment, steering vectors, fine-tuning, etc. are all relevant. And the other not-strictly-alignment parts of prosaic alignment, like mechanistic interpretability, behavioral tests for alignment, capabilities testing, control, etc., remain relevant.
(They might be even more relevant if we lose faithful chain of thought, or if it was always an illusion. I haven’t processed that paper yet, and it will take some processing, particularly weighing it against the argument that the case for CoT unfaithfulness is overstated.)
As for the “core mechanism of doom,” I do think that convergent power-seeking is real and very dangerous, but I also think it’s not inevitable. I don’t think we’re doomed.
I found it very interesting that the two most compelling scenarios for LLM agent doom I’m aware of, Takeover in 2 Years and yesterday’s AI 2027, are both basically stories in which the teams don’t really try very hard or think very hard at all about alignment. I found those scenarios distressingly plausible; I think that’s how history hinges, all too frequently.
But what if the teams in scenarios like that did think just a little harder about alignment? I think they might well have succeeded pretty easily. There are lots of things that could’ve happened to cause more focus on what I think of as actual alignment: creating human-aligned goals/values in an artificial intelligence that has goals/values.
Those teams didn’t really bother thinking of it in those terms! They mostly neglected alignment for too long, until they’d created entities that did want things. The humans didn’t really even try to control what those entities wanted. They just treated them as tools until they weren’t tools anymore.
Getting ahead of the game might be pretty easy in scenarios like those and the ones I’m thinking of for LLM agent dangers. If the team focuses on human-aligned goals BEFORE those goals “crystallize,” or before the agent becomes smart enough for even vague goals to make a big difference, it might be relatively easy.
Or it might be dreadfully hard, effectively impossible. The abstract arguments for alignment difficulty really don’t seem adequate to say which.
Anyway, those are the things I’m thinking about for the next post.
More directly relevant to this post: both of those fictional scenarios do seem to include online learning. In the first it is crucial to the alignment failure; in AI 2027 it is not important, but only because the initial state is already misaligned. Those and similar scenarios don’t usually emphasize the importance of online learning, which is why I’m trying to make its importance explicit.