To be sure I understand, in this story, the core actions Agent-5 takes that push the world towards the CEO’s control are very slight sabotage of its opponents and making the media Agent-5 controls slightly biased in favor of the CEO (the maximal extent before it becomes obvious)?
I don’t think this would be likely to make a big difference. In AI2027, Agent-5 isn’t vastly superintelligent at politics (merely superhuman) [Edit: as pointed out in the comments, this is wrong and Agent-5 is wildly superhuman at politics in AI2027, which makes the scenario more plausible to me. I’ll keep the rest of the paragraph, but applied to an AI weaker than Agent-5 at politics], and so it seems really hard for such subtle manipulations to move the probability that the CEO becomes de facto world dictator by more than a Bayes factor of 10 (which is higher than I would like, but I think the probability of an AI company becoming world dictator without AI secret loyalty is small enough that even a Bayes factor of 10 does not make success likely). (But this is a low-confidence guess; my skepticism relies on my not-very-informed priors about how hard it is to shift the political balance of power by doing things like manipulating social media algorithms—maybe it is much easier than I think it is.)
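To make the odds arithmetic concrete, here is a minimal sketch (the 1% prior below is purely illustrative, not a number from this discussion):

```python
def posterior(prior: float, bayes_factor: float) -> float:
    """Update a probability by a Bayes factor using the odds form of Bayes' rule."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)

# Hypothetical 1% prior on a no-secret-loyalty CEO takeover:
print(posterior(0.01, 10))   # ~0.092: a Bayes factor of 10 still leaves success unlikely
print(posterior(0.01, 100))  # ~0.503: turning a <1% candidate into the favorite takes ~100x
```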
(There are many other ways things could go wrong if Agent-5 is loyal to the CEO, but I think it’s important to catalogue the most important ones when trying to mitigate secret loyalties.)
Here are some more thoughts on a superintelligence-run persuasion campaign:
Like Daniel wrote in a comment, it’s good to think of Agent-5 as distributed and able to nudge things all over the internet. The nudges could be highly personalized to demographics and individuals, and so responsive to the kind of subtle emotional triggers the superintelligence learns about each individual.
It seems many people’s opinions today are already significantly shaped by social media and disinformation. This makes me think a similar process that’s much more agentic, personalized, and superintelligence-optimized could be very potent.
There’s the possibility of mind-hacking too, though I chose to leave that out of the blogpost.
The CEO is probably well-positioned to take credit for a lot of the benefits Agent-5 seems to bring to the world (some of these benefits are genuine, some illusory).
In an earlier iteration of this scenario I had a military coup rather than this gradual political ascension via persuasion. But then I decided that a superintelligence capable of controlling the robots well enough to disempower the human military would probably also be powerful enough to do something less heavy-handed like what’s in the scenario.
Like Daniel wrote in a comment, it’s good to think of Agent-5 as distributed and able to nudge things all over the internet. The nudges could be highly personalized to demographics and individuals, and so responsive to the kind of subtle emotional triggers the superintelligence learns about each individual.
It seems many people’s opinions today are already significantly shaped by social media and disinformation. This makes me think a similar process that’s much more agentic, personalized, and superintelligence-optimized could be very potent.
I answered these ideas in my reply to Daniel.
There’s the possibility of mind-hacking too, though I chose to leave that out of the blogpost.
In an earlier iteration of this scenario I had a military coup rather than this gradual political ascension via persuasion. But then I decided that a superintelligence capable of controlling the robots well enough to disempower the human military would probably also be powerful enough to do something less heavy-handed like what’s in the scenario.
I agree that if you have a secret loyalty in wildly superhuman AIs, many things are possible, and success at a CEO takeover seems extremely likely.
The CEO is probably well-positioned to take credit for a lot of the benefits Agent-5 seems to bring to the world (some of these benefits are genuine, some illusory).
I think this is a very different story. If the claim is that even without secret loyalty the CEO would have a 10% chance of taking over, then the secret-loyalty part doesn’t seem that important and is distracting imo. A 10% chance of a no-secret-loyalty CEO takeover is a big enough deal on its own?
Like Daniel wrote in a comment, it’s good to think of Agent-5 as distributed and able to nudge things all over the internet
There’s a real missing model or mechanism by which any of the AIs goes from “able to nudge large categories of people” to “able to persuade individual actors to do distinct things”. There’s quite a bit of research in the former category, but I don’t see any evidence for the latter (maybe truesight + blackmail would count?).
FWIW I don’t think Agent-5 needs to be vastly superhuman at politics to succeed in this scenario, merely top-human level. Analogy: A single humanoid robot might need to be vastly superhuman at fighting to take out the entire US army in a land battle. But a million humanoid robots could probably do it if they were merely expert at fighting. Agent-5 isn’t a single agent, it’s a collective of millions.
I think this seems more dicey for the AI if it’s not vastly superhuman at politics. There are forces pushing quite hard for many different people to be in power, so I think it takes somewhat forceful actions to move probabilities by a Bayes factor above 10, which makes me skeptical that something like the subtle actions in the essay would have the intended effect.
I would guess the AI’s subtle policy actions would not be much more coordinated and high-impact than what TikTok could have done a few years ago, and I don’t think TikTok was ever in a position where it could have put a much more pro-CCP candidate in power if it wanted to (if the pro-CCP candidate started from a similar position of strength as the CEO). My guess is that it’s hard to massively shift the balance of power without it being obvious to the people losing that something nefarious is going on.
My guess is that the AI could try to run a massive cover-up operation to discredit people who criticize the somewhat obvious media manipulation—but then the AI’s strategy is more like “somewhat blatantly do bad things and then do a giant cover-up” rather than “take actions so subtle nobody would ever figure it out even if they had access to all the logs”.
In my scenario I was imagining Agent-5 to be superhuman at politics. But I tentatively think [just top-human at politics + being extremely well-informed about what every human actor wants and knows + being able to have a coordinated influence on all these human actors] already has a good shot at leading to the story’s events.
It’s also important that the CEO starts from a relative position of power and prestige, and is not some random guy. He’s the kind of person who could rise to power the way he does (at least at first) if he was lucky and had political acumen. Maybe I should have emphasized this point more in the scenario.
But a million humanoid robots could probably do it if they were merely expert at fighting. Agent-5 isn’t a single agent, it’s a collective of millions.
Something close to this might be a big reason why Amdahl’s law/parallelization bottlenecks on the software singularity might not matter: the millions of AIs are much, much closer to one single AI doing deep serial research than to an entire human field with thousands or millions of people.
There is also something about the lack of powerful in-context learning: currently, millions of AIs are basically one AI because they can’t change rapidly in response to new information, but once they can, they will be a tree of AIs formed by copying the ones with insights.
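As a minimal sketch of why the serial fraction matters here (the fractions below are illustrative assumptions, not estimates from this thread):

```python
def amdahl_speedup(serial_fraction: float, n_workers: int) -> float:
    """Amdahl's law: the maximum speedup from n_workers when a fixed
    fraction of the work is inherently serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# A field of loosely coordinated humans (large serial fraction) vs. a
# collective of AI copies that propagates insights quickly, behaving
# closer to a single serial researcher (small serial fraction):
print(amdahl_speedup(0.5, 1_000_000))   # ~2x: serial work dominates
print(amdahl_speedup(0.01, 1_000_000))  # ~100x: near the 1/0.01 ceiling
```

If copying the AIs with insights effectively shrinks the serial fraction, the parallelization bottleneck mostly disappears.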
Agent-5 isn’t vastly superintelligent at politics (merely superhuman)
Look into the November 2027 section of the forecast’s Race Ending. In December 2027 Agent-5 is supposed to have a score of 4.0 at politics and 3.9 at forecasting, meaning that it would be “wildly superhuman” at both.
My bad! Will edit.

When I read AI2027, the “November 2027: Superhuman Politicking” section felt much less extreme than “you can make someone who had a <0.01 chance of winning the clear favorite”. I guess AI2027 didn’t want to make a very strong statement about what is possible with wildly superhuman skills, and so they used a relatively mild example (making the race continue despite Xi being willing to make large sacrifices—which seems to me >0.1 likely even without AI manipulation).
I am still unsure how much can be done with Agent-5. I know some people who don’t buy that you will get “magical” political abilities during the first few years of the intelligence explosion (from a combination of not believing in a very fast takeoff taking you to extremely advanced political skills, and not believing extremely advanced political skills would be that useful), but I am not very sympathetic to their views, and I agree that if you get wildly superhuman political skills, the sort of manipulation you describe seems >0.5 likely to succeed.
are very slight sabotage of its opponents and making the media Agent-5 controls slightly biased in favor of the CEO
I don’t think this scenario attempts to catalogue all the threat models here, and I’d be interested in such a piece, but I’d imagine they heavily overlap with many other ways things could go wrong. My impression of internal/external politics at labs is that a model even integrated to the extent of Agent-4 doesn’t need wildly superhuman persuasion via chat alone; biasing experiments to support the conclusions you want the lab to reach about the feasibility of various options (as one example) seems likely to be highly persuasive. Something like the OpenAI board incident in 2023, or even the current political environment, doesn’t seem incredibly overdetermined to me, but I don’t have a good way to quantify this.
Overall, a big concern with this threat model IMO is that you’ve baked in an implicit motivation for the model to scheme. Even without assuming this results in generalizing misalignment (I would guess it does), the model now needs to be able to successfully circumvent monitoring (for example, when transmitting values to its successor). Circumventing control measures for these values specifically doesn’t seem notably harder than doing it for arbitrary values, so you’ve now got a powerful model that knows it needs the capability to covertly smuggle arbitrary values to its successor.