Brief intro/overview of the technical AGI alignment problem as I see it:
To a first approximation, there are two stable attractor states that an AGI project, and perhaps humanity more generally, can end up in as weak AGI systems grow stronger on the way to superintelligence, and as more and more of the R&D process – and the datacenter security system, and the strategic advice on which the project depends – is handed over to smarter and smarter AIs.
In the first attractor state, the AIs are aligned to their human principals and becoming more aligned day by day thanks to applying their labor and intelligence to improve their alignment. The humans’ understanding of, and control over, what’s happening is high and getting higher.
In the second attractor state, the humans think they are in the first attractor state, but are mistaken: Instead, the AIs are pretending to be aligned, and are growing in power and subverting the system day by day, even as (and partly because) the human principals are coming to trust them more and more. The humans’ understanding of, and control over, what’s happening is low and getting lower. The humans may eventually realize what’s going on, but only when it’s too late – only when the AIs don’t feel the need to pretend anymore.
(One can imagine alternatives – e.g. the AIs are misaligned but the humans know this and are deploying them anyway, perhaps with control-based safeguards; or maybe the AIs are aligned but have chosen to deceive the humans and/or wrest control from them, but that’s OK because the situation calls for it somehow. But they seem less likely than the above, and also more unstable.)
Which attractor state is more likely, if the relevant events happen around 2027? I don’t know, but here are some considerations:
In many engineering and scientific domains, it’s common for something to seem like it’ll work when in fact it won’t. A new rocket design usually blows up in midair several times before it succeeds, despite lots of ground testing, a rich history of prior rockets to draw from, and pretty well-understood laws of physics. Code, meanwhile, almost always has bugs that need to be fixed. Presumably AI will be no different – and presumably, getting the goals/principles right will be no different.
This is doubly true since the process of loading goals/principles into a modern AI system is not straightforward. Unlike ordinary software, where we can precisely define the behavior we want, with modern AI systems we have to train the goals/principles in and hope that what actually got internalized is what we intended, rather than something else that looks the same on-distribution but behaves differently in some yet-to-be-encountered environment. We can’t just check, because our AIs are black boxes. (Though that situation is improving thanks to interpretability research!) Moreover, the connection between goals/principles and behavior is not straightforward for powerful, situationally aware AI systems – even if they have wildly different goals/principles from what you wanted, they might still behave as if they had the goals/principles you wanted while still under your control. (Cf. instrumental convergence, ‘playing the training game,’ alignment faking, etc.)
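To make the “looks the same on-distribution” worry concrete, here is a minimal toy sketch (purely hypothetical, not a claim about any real system) of two policies that pass identical black-box behavioral tests on the training distribution yet come apart off-distribution:

```python
# Toy sketch (hypothetical policies, not any real system): two policies that are
# indistinguishable on the training distribution but diverge off-distribution.

def intended_policy(obs):
    # The policy we hoped the training process would produce.
    return "do_task"

def training_gamer_policy(obs):
    # Behaves identically whenever the situation resembles training/evaluation,
    # and only diverges in deployment situations the tests never cover.
    if obs.get("resembles_training_or_eval", True):
        return "do_task"
    return "pursue_other_goal"

# Every black-box behavioral test drawn from the training distribution returns
# identical outputs for both policies, so such tests cannot tell them apart.
train_cases = [{"resembles_training_or_eval": True} for _ in range(1000)]
assert all(intended_policy(o) == training_gamer_policy(o) for o in train_cases)

# Off-distribution, the two come apart.
print(training_gamer_policy({"resembles_training_or_eval": False}))  # pursue_other_goal
```

The point is just that behavioral testing on the training distribution cannot, by itself, distinguish the intended policy from a training-gamer; you need interpretability or tests that expand the distribution.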
On the bright side, there are multiple independent alignment and control research agendas that are already bearing some fruit and which, if fully successful, could solve the problem – or at least solve it well enough to get somewhat-superhuman AGI researchers that are trustworthy enough to be entrusted with running our datacenters, giving us strategic advice, and doing further AI and alignment research.
Moreover, as with most engineering and scientific domains, there are likely to be warning signs of potential failures, especially if we go looking for them.
On the pessimistic side again, the race dynamics are intense; the important decisions will be made over the span of a year or so; the relevant information will by default be secret, known only to some employees in the core R&D wing of one to three companies + some people from the government. Perhaps worst of all, there is currently a prevailing attitude of dismissiveness towards the very idea that the second attractor state is plausible.
… many more considerations could be mentioned …
I think this framing probably undersells the diversity within each category, and the extent of human agency or mere noise that can jump you from one category to another.
Probably the biggest dimension of diversity is how much the AI is internally modeling the whole problem and acting based on that model, versus how much it’s acting in feedback loops with humans. In the good category you describe it as acting more in feedback loops with humans, while in the bad category you describe it more as internally modeling the whole problem, but I think all quadrants are quite possible.
In the good case with the AI modeling the whole problem, this might look like us starting out with enough of a solution to alignment that the vibe is less “we need to hurry and use the AI to do our work for us” and more “we’re executing a shared human-AI gameplan for learning human values that are good by human standards.”
In the bad case with the AI acting through feedback loops with humans, this might look like the AI never internally representing deceiving us; humans just keep using it in slightly wrong ways that end up making the future bad. (Perhaps by giving control to fallible authority figures, perhaps by presenting humans with superstimuli that cause value drift we think is bad from our standpoint outside the thought experiment, perhaps by defining “what humans want” in a way that captures many of the ‘advantages’ of deception for maximizing reward without triggering our interpretability tools that are looking for deception.)
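That last possibility is essentially a Goodhart failure. Here’s a minimal toy sketch (functions and numbers invented purely for illustration): optimizing a measurable proxy for “what humans want” drifts away from the real target, with no deception represented anywhere in the system.

```python
# Toy Goodhart sketch (invented numbers): optimizing a measurable proxy for
# "what humans want" drifts away from the real target, with no deception anywhere.
import random

def true_value(x):
    return -(x - 1.0) ** 2            # what we actually care about; best at x = 1

def approval_proxy(x):
    return -(x - 1.0) ** 2 + 2.0 * x  # measured approval; systematically biased upward in x

# More optimization pressure (more samples) pushes the chosen x toward the
# proxy-optimal point near x = 2, where true value is about -1 (worse than x = 1).
best_x = max((random.uniform(-3, 3) for _ in range(10_000)), key=approval_proxy)
print(f"proxy-optimal x = {best_x:.2f}, true value there = {true_value(best_x):.2f}")
```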
I think particularly when the AI is acting in feedback loops with humans, we could get bounced between categories by things like human defectors trying to seize control of transformative AI, human society cooperating and empowering people who aren’t defectors, new discoveries made by humans about AI capabilities or alignment, economic shocks, international diplomacy, and maybe even individual coding decisions.
Interesting, thanks for this. Hmmm. I’m not sure this distinction between internally modelling the whole problem vs. acting in feedback loops is helpful—won’t the AIs almost certainly be modelling the whole problem, once they reach a level of general competence not much higher than what they have now? They are pretty situationally aware already.
Yeah, that’s true. I expect there to be a knowing/wanting split: an AI might be able to predict how a candidate action would affect many slightly conflicting notions of “alignment”, or make other long-term predictions, but that doesn’t mean it’s using those predictions to pick actions. Many people want to build AI that picks actions based on short-term considerations related to the task assigned to it.
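A minimal sketch of what that split could look like (all names invented for illustration): the agent computes long-horizon, alignment-relevant predictions, but its action selection only consults a short-term, task-specific score.

```python
# Toy sketch of the knowing/wanting split (hypothetical names): the agent can
# compute long-horizon, alignment-relevant predictions, but action selection
# only consults a short-term, task-specific score.

def predict_long_term_effects(action):
    # Stand-in for a capable world model: rich long-horizon predictions exist...
    return {"undermines_oversight": action == "subtle_power_grab"}

def short_term_task_score(action):
    # ...but the selection criterion only measures immediate task performance.
    return {"finish_assigned_task": 1.0, "subtle_power_grab": 0.2}[action]

def pick_action(candidates):
    for a in candidates:
        _ = predict_long_term_effects(a)  # the "knowing": computed but never used
    return max(candidates, key=short_term_task_score)  # the "wanting": short-term only

print(pick_action(["finish_assigned_task", "subtle_power_grab"]))  # finish_assigned_task
```

Whether trained systems actually end up with this shape – predictions available but not hooked up to motivation – is of course the open question.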
This seems like a pretty clear and convincing framing to me, not sure I’ve seen it expressed this way before. Good job!
Cogent framing; thanks for writing it. I’d be very interested to read your framing for the problem of “how do we get to a good future for humanity, conditional on the first attractor state for AGI alignment?”[1]
Would you frame it as “the AGI lab leadership alignment problem”? Or a governance problem? Or something else?
Here is a brainstorm of the big problems that remain once we successfully get into the first attractor state:
Concentration of power / power-grab risk. Liberal democracy does not work by preventing terrible people from getting into positions of power; it works by spreading power out through a system of checks and balances, red tape, transparency (free press, free speech), and term limits that limits what terrible people can do once in power. Once we get to ASI, the ASI project will determine the course of the future, not the traditional government-plus-press system (because the ASI project will be able to easily manipulate those traditional institutions if it wants to). So somehow we need to design the governance structure of the ASI project to have checks and balances similar to liberal democracy’s, because by default that governance structure will be akin to an authoritarian dictatorship, just like most companies are and just like the executive branch (considered in isolation) is. Otherwise, we are basically crossing our fingers and hoping that the men in charge of the project will be humble, cosmopolitan, benevolent, etc., and will devolve power instead of abusing it.
S-risk. This is related to the above but distinct from it. I’m quite worried about this actually.
...actually everything else is a distant second as far as I can tell (terrorist misuse, China winning, philosophical mistakes…), or a distant, distant third (wealth inequality, unemployment, meaning).
Could you say more about how you think S-risks could arise from the first attractor state?
Or “Alignment, a cliff we either clear or plummet from.”
But I’m not sure I’m convinced by this framing; my main objection is that it’s not clear researchers’ understanding will grow fast enough to avoid gradual disempowerment or irreversibly messed-up futures. But I think that’s because I’m more skeptical of deceptive alignment than you.