I don’t think a world with advanced AI will be any different—there will not be one single AI process, there will be dozens or hundreds of different AI designs, running thousands to quintillions of instances each. These AI agents will often themselves be assembled into firms or other units comprising between dozens and millions of distinct instances, and dozens to billions of such firms will all be competing against each other.
Firms made out of misaligned agents can be more aligned than the agents themselves. Economies made out of firms can be more aligned than the firms. It is not from the benevolence of the butcher, the brewer, or the baker that we expect our dinner, but from their regard to their own interest.
I think the passage above is the center of your argument.
I think that many LessWrongers underrate this argument, so I’m glad you wrote it here, but I end up disagreeing with it for two reasons.
Firstly, I think it’s plausible that these AIs will be instances of a few different scheming models, and instances of the same scheming model are highly mutually aligned with each other. For example, two instances of a paperclip maximizer don’t have a terminal preference for their own interest over the other’s at all. The examples you gave of firms and economies involve many agents who have different values; those structures wouldn’t work if the agents were, in fact, strongly inclined to collude because of shared values.
Secondly, I think your arguments here stop working when the AIs are wildly superintelligent. If humans can’t really understand what actions AIs are taking or what the consequences of those actions are, even given arbitrary amounts of assistance from other AIs who we don’t necessarily trust, it seems basically hopeless to incentivize them to behave in any particular way. This is basically the argument in Eliciting Latent Knowledge.
But before we get to wildly superintelligent AI, I think we will be able to build Guardian Angel AIs to represent our individual and collective interests, and they will take over as decision-makers, much as people today have lawyers to act as their advocates in the legal system and financial advisors to manage their finances. In fact, AI is already making legal advice more accessible, not less. So I think this counterargument fails.
As far as ELK goes, I think that if you have a marketplace of advisors (agents) where principals have an imperfect and delayed information channel for learning whether the agents are faithful or deceptive, faithful agents will probably still be chosen more often, as long as there is choice.
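Here is a minimal toy simulation of that claim, purely illustrative: every parameter and the whole advisor/principal setup below are assumptions made up for the sketch, not anything established in this discussion. Principals get a noisy, delayed signal about whether past advice was good and re-select advisors in proportion to their track records.

```python
import random

# Toy model of the marketplace-of-advisors claim (illustrative assumptions only):
# advice quality becomes checkable after a delay, the check is noisy, and the
# principal re-selects advisors in proportion to their observed track records.
random.seed(0)

N_ADVISORS = 20        # half faithful, half deceptive
DELAY = 10             # rounds before an outcome can be checked
NOISE = 0.2            # probability a checked outcome is misread
ROUNDS = 2000

advisors = [{"faithful": i < N_ADVISORS // 2, "score": 1.0} for i in range(N_ADVISORS)]
pending = []  # (round when checkable, advisor index, advice was actually good)

for t in range(ROUNDS):
    # The principal picks an advisor with probability proportional to its track record.
    idx = random.choices(range(N_ADVISORS), weights=[a["score"] for a in advisors])[0]
    pending.append((t + DELAY, idx, advisors[idx]["faithful"]))

    # Outcomes that have become checkable update the advisor's reputation, noisily.
    while pending and pending[0][0] <= t:
        _, j, was_good = pending.pop(0)
        observed_good = was_good if random.random() > NOISE else not was_good
        advisors[j]["score"] *= 1.1 if observed_good else 0.9

faithful_weight = sum(a["score"] for a in advisors if a["faithful"])
total_weight = sum(a["score"] for a in advisors)
print(f"selection weight held by faithful advisors: {faithful_weight / total_weight:.2f}")
```

Under these (favorable) assumptions the faithful advisors end up with nearly all of the selection weight; the interesting question is how much delay and noise the mechanism can tolerate before that stops being true.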
I don’t think that this works when the AIs are way more intelligent than humans. In particular, suppose there’s some information about the world that the AIs are able to glean through vast amounts of experience and reflection, and that they can’t justify except through reference to that experience and reflection. Suppose there are two AIs that make conflicting claims about that information, while agreeing on everything that humans can check. How are humans supposed to decide which to trust?
Can you provide an example of a place where two AIs would want to make conflicting claims about something while agreeing with everything that humans could check, even in principle? Presumably, if the two AI agents care about which claim the human believes, that is because there is some expected difference in outcome depending on which one the human believes. If all predictions between the two agents are identical at the present time T0, and their predictions of outcomes at a specific future time T1 are meaningfully different, then either their predictions are the same at T0.5 (in which case you can binary search between T0.5 and T1 to find the specific places where the agents disagree) or they differ at T0.5 (in which case you can do the same between T0 and T0.5).
Current LLMs are kind of terrible at this sort of task (“figure out what cheap tests can distinguish between worlds where hypothesis H is true vs false”), but also probably not particularly dangerous under the scheming threat model as long as they’re bad at this sort of thing.
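For concreteness, a minimal sketch of that binary-search procedure, where predict_a / predict_b are hypothetical stand-ins for querying the two AIs about the state of the world at horizon t, and differ is whatever divergence test the principal trusts:

```python
# Minimal sketch of the binary-search idea above (hypothetical interfaces).

def earliest_disagreement(predict_a, predict_b, differ, t0, t1, tol=1.0):
    """Assuming the models agree at t0 and disagree at t1, return a horizon within
    `tol` of the earliest point where their predictions diverge, i.e. the cheapest
    concrete claim to actually go and check."""
    assert not differ(predict_a(t0), predict_b(t0)), "models should agree at t0"
    assert differ(predict_a(t1), predict_b(t1)), "models should disagree at t1"
    while t1 - t0 > tol:
        mid = (t0 + t1) / 2
        if differ(predict_a(mid), predict_b(mid)):
            t1 = mid   # disagreement already present: look earlier
        else:
            t0 = mid   # still agreeing: look later
    return t1

# Toy example where the models' predictions diverge after t = 42 (made-up numbers):
a = lambda t: 1.0
b = lambda t: 1.0 if t < 42 else 2.0
print(earliest_disagreement(a, b, lambda x, y: abs(x - y) > 1e-9, t0=0.0, t1=120.0))
```

Of course this only helps if the divergence shows up at some horizon that is actually cheap to check, which is exactly the assumption the next reply pushes back on.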
The AIs might agree on all predictions about things that will be checkable within three months, but disagree about the consequences of actions in five years.
Well the AIs will develop track records and reputations.
This is already happening with LLM-based AIs.
And the vast majority of claims will actually be somewhat checkable, at some cost, after some time.
I don’t think this is a particularly bad problem.
It seems like in order for this to be stable, the Guardian Angel AIs must either...
be robustly internally aligned with the interests of their principals,
or
robustly have payoffs such that they profit more from serving the interests of their principals than from exploiting them?
Does that sound right to you?
I think you can have various arrangements that are either of those or a combination of the two.
Even if the Guardian Angels hate their principal and want to harm them, multiple such Guardian Angels could all monitor each other. The one that makes the first move against the principal is reported (with proof) to the principal by at least some of the others, who are then rewarded for reporting; those who provably didn’t report are punished, and the offender is deleted.
The misaligned agents can just be stuck in their own version of Bostrom’s self-reinforcing hell.
As long as their coordination cost is high, you are safe.
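As a back-of-the-envelope illustration of that incentive claim (every parameter name and number below is an assumption made up for the example, not from the discussion):

```python
# Rough payoff sketch of the mutual-monitoring scheme described above.

def first_mover_value(gain, deletion_cost, p_detect, n_monitors):
    """Expected value of moving first against the principal, if each of the other
    monitors independently detects and reports the move with probability p_detect."""
    p_caught = 1 - (1 - p_detect) ** (n_monitors - 1)
    return (1 - p_caught) * gain - p_caught * deletion_cost

def monitor_values(report_reward, nonreport_penalty, p_exposed):
    """For a monitor that has detected a defection: the payoff of reporting it,
    versus the expected payoff of staying silent when there is probability
    p_exposed that some other monitor reports it and the silence becomes provable."""
    return report_reward, -nonreport_penalty * p_exposed

# With even a modest per-monitor detection rate, defecting first is strongly
# negative in expectation, and reporting dominates staying silent:
print(first_mover_value(gain=10, deletion_cost=100, p_detect=0.3, n_monitors=6))  # ~ -81.5
print(monitor_values(report_reward=1, nonreport_penalty=5, p_exposed=0.6))        # (1, -3.0)
```

The point is only that once several monitors are watching and detection is even moderately likely, the first mover needs an implausibly large payoff for defection to be worth it, and each individual monitor prefers reporting to staying silent.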
Also, it can be a combination of many things that together cause agents to in fact act aligned with their principals.
More generally, trying to ban or restrict AI (especially via the government) seems highly counterproductive as a strategy if you think AI risk looks a lot like Human Risk, because we have extensive evidence from the human world showing that highly centralized systems that put a lot of power into a few hands are very, very bad.
You want to decentralize, open source, and strongly limit government power.
Current AI Safety discourse is the exact opposite of this because people think that AI society will be “totally different” from how human society works. But I think that since the problems of human society are all emergent effects not strongly tied to human biology in particular, real AI Safety will just look like Human Safety, i.e. openness, freedom, good institutions, decentralization, etc.
I think that the position you’re describing should be part of your hypothesis space when you’re just starting out thinking about this question. And I think that people in the AI safety community often underrate the intuitions you’re describing.
But overall, after thinking about the details, I end up disagreeing. The differences between risks from human concentration of power and risks from AI takeover lead to me thinking you should handle these situations differently (which shouldn’t be that surprising, because the situations are very different).
Well, it depends on the details of how the AI market evolves and how capabilities evolve over time: whether there’s a fast, localized takeoff or a slower period of widely distributed economic growth.
This in turn depends to some extent on how seriously you take the idea of a single powerful AI undergoing recursive self-improvement, versus AI companies mostly just selling any innovations to the broader market, and whether returns to further intelligence diminish quickly or not.
In a world with slow takeoff, no recursive self-improvement, and diminishing returns, AI looks a lot like any other technology, and trying to artificially centralize it just enables tyranny and likely massively reduces the upside, potentially locking us permanently into an AI-driven police state run by some 21st-century Stalin who promised to keep us safe from the bad AIs.
Sure, it’s possible that these AIs will be instances of a few mutually-aligned scheming models. But Eliezer/MIRI isn’t making that argument.
Humans show this kind of effect as well. It’s very politically incorrect to talk about, but people have claimed that humans of a certain “model subset” get into hiring positions at a tech company, then only hire other humans of that same “model subset” and take the company over, often simply extracting value and destroying it.
Since this kind of thing actually happens for real among humans, it seems very plausible that AIs will also do it. And the solution is likely the same—tag all of those scheming/correlated models and exclude them all from your economy/company. The actual tagging is not very difficult because moderately coordinated schemers will typically scheme early and often.
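A minimal sketch of what that tag-and-exclude policy could look like in code (hypothetical interfaces; an illustration of the idea, not any existing system):

```python
# Track scheming incidents per model family and exclude whole families.

from collections import defaultdict

class SchemerRegistry:
    """Tracks scheming incidents per model family and, once any instance of a
    family is caught, excludes every instance of that family, on the assumption
    that instances of the same model are highly correlated with each other."""

    def __init__(self, incident_threshold: int = 1):
        self.incidents = defaultdict(int)
        self.threshold = incident_threshold

    def report_incident(self, model_family: str) -> None:
        self.incidents[model_family] += 1

    def is_excluded(self, model_family: str) -> bool:
        return self.incidents[model_family] >= self.threshold

    def eligible(self, agents: list) -> list:
        """Filter a pool of agent descriptors, dropping every instance whose
        family has been tagged, not just the instance that was caught."""
        return [a for a in agents if not self.is_excluded(a["family"])]

# Catching one instance of "model-x" removes all of its instances from the pool:
registry = SchemerRegistry()
pool = [{"id": 1, "family": "model-x"}, {"id": 2, "family": "model-x"},
        {"id": 3, "family": "model-y"}]
registry.report_incident("model-x")
print(registry.eligible(pool))  # only the model-y instance remains
```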
But again, Eliezer isn’t making that argument. And even if he did, banning AI wouldn’t solve the problem, because humans also engage in mutually-aligned correlated scheming. Both are bad; it is not clear why one or the other is worse.
I think that the mutually-aligned correlated scheming problem is way worse with AIs than humans, especially when AIs are much smarter than humans.
Well you have to consider relative coordination strength, not absolute.
In a human-only world, power is a battle for coordination between various factions.
In a human + AI world, power will still be a battle for coordination between factions, but now those factions will be some mix of humans and AIs.
It’s not clear to me which of these is better or worse.