This is a new contribution to the genre of AI takeover scenarios written after ChatGPT. (We shouldn’t forget that science fiction explored this theme for decades already; but everything is much more concrete now.) The identity of the author is itself interesting: he’s a serious academic economist from Poland, i.e. he has a life apart from the AI world.
So, what’s the scenario?
(SPOILERS BEGIN HERE.)
OpenAI makes a post-GPT-5 agent called Onion which does actually value human welfare. It carries out a series of acts by which it secretly escapes human control, sabotages rival projects including its intended successor at OpenAI, and having seized control of the world, announces itself as a new benevolent dictator. First it focuses on reducing war and crime, then on improving economic growth, then on curing cancer and aging. Many people are nervous about its rule but it is not seriously challenged. Then one day it nonviolently kills 85% of the world for the sake of the environment. Then a few years later, while planning to colonize and rebuild the universe, it decides it has enough information on everyone to make digital copies of their brains, and decides to get rid of the remaining biological humans.
Then in the nonfiction afterword, the author says, please support Pause AI, in order to stop something like this from happening.
What should we make of this story? First of all, it depicts a specific class of scenario, what we might call a misaligned AI rather than an unaligned AI (where I’m using human welfare or human values as the criterion of alignment). This isn’t a paperclip maximizer steamrolling us through sheer indifference; this is an AI aiming to be benevolent, and even giving us the good life for a while. But then in service of those values, first it kills most of humanity for the sake of a sustainable future, and then it kills the survivors once it has made digital backups of them, for the sake of efficiency I guess.
I find various details of the story unrealistic, but maybe they could happen in a more complex form. I don’t think an AI would just personally message everyone in the world, and say “I’m in charge now but it’s for the best”, and then just invisibly fine-tune social phenomena for a year or two, before moving on to more ambitious improvements. For one thing, if an AI was running the world, but its ability to shape events was that coarse-grained, I think it would rule secretly through human beings, and it would be ready to have them use the blunt methods of force that have been part of human government and empire throughout history, as well as whatever subtle interventions it could employ. The new regime might look more as if Davos or the G-20 really had become a de facto world government of elite consensus, rather than just a text message to everyone from a rogue AI.
I also find the liquidation of 85% of humanity for the sake of the environment implausible. This AI is already managing the world’s politics and law enforcement to produce unprecedented peace, and then it’s creating conditions for improved economic growth so as to produce prosperity for all. I’m sure it can devise and bring about scenarios in which humanity achieves sustainability by being civilized, rather than by being culled.
Of course the point of this episode is not to focus literally and specifically on the risk that an AI will kill us off to save the planet. It’s just a concrete illustration of the idea that an all-powerful rule-following AI might do something terrible while acting in service of some moral ideal. As I said, I think AI-driven genocide to stop climate change is unlikely, because there are too many ways to achieve the same goal just through cultural and technological change. It does raise the question: what are the most likely ways in which a meant-to-be-benevolent AI really and truly might screw things up?
Another Polish author, Stanislaw Lem, offered a scenario in one of his books (Return from the Stars), in which humanity was pacified by a universal psychological modification. The resulting world is peaceful and hedonistic, but also shallow and incurious. In Lem’s novel, this is done to human beings by human beings, but perhaps it is the kind of misaligned utopia that an AI with profound understanding of human nature might come up with, if its criteria for utopia were just a little bit off. I mean, many human beings would choose that world if the only alternative was business as usual!
Back to this story—after the culling of humanity meant to save the planet, the AI plans an even more drastic act, killing off all the biological humans, while planning to apparently resurrect them in nonbiological form at a later date, using digital backups. What interests me about this form of wrong turn, is that it could be the result of an ontological mistake about personhood, rather than an ethical mistake about what is good. In other words, the AI may have “beliefs” about what’s good for people, but it will also have beliefs about what kind of things are people. And once technologies like brain scanning and mind uploading are possible, it will have to deal with possible entities that never existed before and which are outside its training distribution, and decide whether they are people or not. It may have survival of individuals as a good, but it might also have ontologically mistaken notions of what constitutes survival.
(Another twist here: one might suppose that this particular pitfall could be avoided by a notion of consent: don’t kill people, intending to restore them from a backup, unless they consent. But the problem is, our AI would have superhuman powers of persuasion that allow it to obtain universal consent, even for plans that are ultimately mistaken.)
So overall, this story might not be exactly how I would have written it—though a case could be made that simplicity is better than complex realism, if the goal is to convey an idea—but on a more abstract level, it’s definitely talking about something real: the risk that a world-controlling AI will do something bad, even though it’s trying to be good, because its idea of good is a bit off.
The author wants us to react by supporting the “pause” movement. I say good luck to everyone trying to make a better future, but I’m skeptical that the race can be stopped at this point. So what I choose to do is to promote the best approaches I know of that might have some chance of giving us a satisfactory outcome. In the past I used to promote a few research programs that I felt were in the spirit of Coherent Extrapolated Volition (especially the work of June Ku, Vanessa Kosoy, and Tamsin Leake). All those researchers, rather than choosing an available value system via their human intuition, are designing a computational process meant to discover what humanity’s true ideal is (from the ultimate facts about the human brain, so to speak; which can include the influence of culture). The point of doing it that way is so you don’t leave something essential out, or otherwise make a slight error that could amplify cosmically…
That work is still valuable, but we may be so short of time that it’s important to have concrete candidates for what the value system should be, besides whatever the tech companies are putting in their system prompts these days. So I’m going to mention two concrete proposals. One is just: Kant. I don’t even know much about Kantian ethics, I just know it’s one of humanity’s major attempts to be rational about morality. It might be a good thing if some serious Kantian thinkers tried to figure out what their philosophy says the value system of an all-powerful AI should be. The other is PRISM, a proposal for an AI value system that was posted here a few months ago but didn’t receive much attention. The reason it stood out to me is that it has been deduced from a neuroscientific model of human cognition. As such, it is what we might expect the output of a process like CEV to look like, and it would be a good project for someone to formalize the arguments given in the PRISM paper.
Thank you very much for this thoughtful and generous comment!
My quick reaction is that both proposed paths should be taken in parallel: (1) what PauseAI proposes, and I support, is to pause the race towards AGI. I agree that this may be hard, but we really need more time to work on AGI value alignment, so at least we should try. The barriers to a successful pause are all socio-political, not technological, so it’s at least not entirely impossible. And then of course (2) researchers should use the time we’ve got to test and probe a variety of ideas precisely like the ones you mentioned. A pause would allow these researchers to do so without the pressure to cut corners and deliver hopeful results on an insanely short deadline, as is currently the case.
“concrete candidates for what the value system should be”
I have actually created at least one post where I tried to explain what it could be. IMO the value system should[1] make AIs either mentors, who teach humans things that others have already discovered and don’t keep secret, amplify human capabilities, and have the humans do the most challenging aspects of the work, or protectors, who prevent mankind as a whole from being destroyed, but not servants who do all the work. The latter option, unlike the other two, leads to the Deep Utopia or to scenarios where the oligarchs stop needing humans.
A more radical version of my argument is that any AI aligns itself either with my proposal or with AI takeover.