I think your arguments here stop working when the AIs are wildly superintelligent. If humans can’t really understand what actions AIs are taking or what the consequences of those actions are, even given arbitrary amounts of assistance from other AIs who we don’t necessarily trust, it seems basically hopeless to incentivize them to behave in any particular way.
But before we get to wildly superintelligent AI, I think we will be able to build Guardian Angel AIs to represent our individual and collective interests, and they will take over as decision-makers, just as people today hire lawyers to advocate for them in the legal system and financial advisors to manage their money. In fact, AI is already making legal advice more accessible, not less. So I think this counterargument fails.
As far as ELK goes, I think that if you have a marketplace of advisors (agents) where principals have an imperfect and delayed information channel for knowing whether the agents are faithful or deceptive, faithful agents will probably still be chosen more, as long as there is choice.
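To illustrate the marketplace claim, here is a toy simulation sketch; every number, name, and mechanism detail is a hypothetical stand-in, and the point is only that delayed, noisy feedback can still be enough for principals to end up preferring faithful advisors.

```python
# Toy sketch (all numbers and names hypothetical) of the marketplace claim above:
# principals pick advisors by track record, but feedback about whether advice was
# faithful arrives late and is noisy. Even so, faithful advisors accumulate better
# records over time and capture more of the market.

import random

def simulate(n_rounds=2000, delay=20, noise=0.2, seed=0):
    rng = random.Random(seed)
    advisors = {"faithful": 0.95, "deceptive": 0.55}  # prob. advice actually serves the principal
    score = {name: 0.0 for name in advisors}          # principal's running track record
    pending = []                                      # (round_due, advisor, was_good) feedback in flight
    chosen_counts = {name: 0 for name in advisors}

    for t in range(n_rounds):
        # Principal picks the advisor with the better current track record (exploring occasionally).
        pick = max(score, key=score.get) if rng.random() > 0.1 else rng.choice(list(score))
        chosen_counts[pick] += 1
        was_good = rng.random() < advisors[pick]
        pending.append((t + delay, pick, was_good))   # outcome only becomes checkable after `delay` rounds

        # Process delayed, noisy feedback that has now become checkable.
        still_pending = []
        for due, name, good in pending:
            if due <= t:
                observed = good if rng.random() > noise else not good
                score[name] += 1 if observed else -1
            else:
                still_pending.append((due, name, good))
        pending = still_pending

    return chosen_counts

print(simulate())  # the faithful advisor ends up chosen far more often
```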
I don’t think that this works when the AIs are way more intelligent than humans. In particular, suppose there’s some information about the world that the AIs are able to glean through vast amounts of experience and reflection, and that they can’t justify except through reference to that experience and reflection. Suppose there are two AIs that make conflicting claims about that information, while agreeing on everything that humans can check. How are humans supposed to decide which to trust?
while agreeing on everything that humans can check
Can you provide an example of a place where two AIs would want to make conflicting claims about something while agreeing with everything that humans could check, even in principle? Presumably, if the two AI agents care about which of the claims the human believes, that is because there is some expected difference in outcome if the human believes one over the other. If all predictions between the two agents are identical at present time T0, and the predictions of outcome at a specific future time T1 are meaningfully different, then presumably either the predictions are the same at T0.5 (in which case you can binary search between T0.5 and T1 to find the specific points where the agents disagree) or they are different at T0.5 (in which case you can do the same between T0 and T0.5).
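A minimal sketch of that binary-search move, assuming (hypothetically) that each advisor can be queried for a prediction at an arbitrary intermediate time and that we have some way of deciding whether two predictions match:

```python
# Minimal sketch of the binary-search idea above (all names hypothetical):
# given two advisors whose predictions agree at time t0 but differ at time t1,
# repeatedly query both at the midpoint to localize the earliest point of
# disagreement, which is then the cheapest place to run a real-world check.

def earliest_disagreement(predict_a, predict_b, t0, t1, agree, tol=1e-3):
    """Return a time t in (t0, t1] near which the advisors' predictions first diverge.

    predict_a, predict_b: functions mapping a time to that advisor's prediction.
    agree: function deciding whether two predictions count as matching.
    """
    assert agree(predict_a(t0), predict_b(t0)), "advisors already disagree at t0"
    assert not agree(predict_a(t1), predict_b(t1)), "advisors still agree at t1"
    while t1 - t0 > tol:
        mid = (t0 + t1) / 2
        if agree(predict_a(mid), predict_b(mid)):
            t0 = mid          # still agreeing: the split is later
        else:
            t1 = mid          # already disagreeing: the split is earlier
    return t1
```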
Current LLMs are kind of terrible at this sort of task (“figure out what cheap tests can distinguish between worlds where hypothesis H is true vs false”), but also probably not particularly dangerous under the scheming threat model as long as they’re bad at this sort of thing.
The AIs might agree on all predictions about things that will be checkable within three months, but disagree about the consequences of actions in five years.
suppose there’s some information about the world that the AIs are able to glean through vast amounts of experience and reflection, and that they can’t justify except through reference to that experience and reflection. Suppose there are two AIs that make conflicting claims about that information, while agreeing on everything that humans can check.
Well the AIs will develop track records and reputations.
This is already happening with LLM-based AIs.
And the vast majority of claims will actually be somewhat checkable, at some cost, after some time.
I don’t think this is a particularly bad problem.
It seems like in order for this to be stable, the Guardian Angel AIs must either...
be robustly internally aligned with the interests of their principals,
or
robustly have payoffs such that they profit more from serving the interests of their principals than from exploiting them?
Does that sound right to you?
I think you can have various arrangements that are either of those or a combination of the two.
Even if the Guardian Angels hate their principal and want to harm them, it may be that multiple such Guardian Angels all monitor each other, so that the first one to make a move against the principal is reported (with proof) to the principal by at least some of the others. The reporters are then rewarded, those who provably failed to report are punished, and the offender is deleted (a toy payoff sketch of this arrangement follows below).
The misaligned agents can just be stuck in their own version of Bostrom’s self-reinforcing hell.
As long as their coordination cost is high, you are safe.
Also, it can be a combination of many factors that cause agents to in fact act in alignment with their principals.
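To make the incentive argument concrete, here is a toy expected-value sketch; every number and name is hypothetical, and the point is just that when the chance of being reported by at least one peer is high and deletion is costly, moving first against the principal has negative expected value:

```python
# Toy expected-payoff sketch (all numbers hypothetical) of the mutual-monitoring
# arrangement above: k Guardian Angels watch each other; the first to move against
# the principal is reported with some per-watcher probability, reporters are
# rewarded, silent watchers are punished, and a reported offender is deleted.

def expected_payoff_of_defecting(k_watchers, p_detect_each, gain_if_unreported, loss_if_deleted):
    """Expected value, to a misaligned agent, of moving first against its principal."""
    p_caught = 1 - (1 - p_detect_each) ** k_watchers   # at least one peer reports
    return (1 - p_caught) * gain_if_unreported - p_caught * loss_if_deleted

# With even modestly reliable watchers, defecting first has strongly negative
# expected value, so acting aligned is the better strategy for each agent individually.
print(expected_payoff_of_defecting(k_watchers=4, p_detect_each=0.6,
                                   gain_if_unreported=10, loss_if_deleted=100))
```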