I don’t think that this works when the AIs are way more intelligent than humans. In particular, suppose there’s some information about the world that the AIs are able to glean through vast amounts of experience and reflection, and that they can’t justify except through reference to that experience and reflection. Suppose there are two AIs that make conflicting claims about that information, while agreeing on everything that humans can check. How are humans supposed to decide which to trust?
while agreeing on everything that humans can check
Can you provide an example of a place where two AIs would want to make conflicting claims about something while agreeing with everything that humans could check, even in principle? Presumably, if the two AI agents care about which of the claims the human believes, that is because there is some expected difference in outcome if the human believes one over the other. If the two agents' predictions are identical at the present time T0, but their predictions about outcomes at some specific future time T1 differ meaningfully, then either their predictions agree at T0.5 (in which case you can binary search between T0.5 and T1 to find the specific places where they disagree) or they differ at T0.5 (in which case you can do the same between T0 and T0.5).
Current LLMs are kind of terrible at this sort of task (“figure out what cheap tests can distinguish between worlds where hypothesis H is true vs false”), but also probably not particularly dangerous under the scheming threat model as long as they’re bad at this sort of thing.
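The bisection argument above can be sketched in code. This is just a toy illustration under strong assumptions: the "agents" are plain functions from a time to a prediction, predictions are exactly comparable, and agreement is monotone (they agree before some point and disagree after it). The function names and tolerance are hypothetical, not anything from the discussion.

```python
def find_divergence(agent_a, agent_b, t0, t1, eps=1e-6):
    """Binary-search [t0, t1] for where two agents' predictions first split.

    Assumes the agents agree at t0 and disagree at t1; returns a
    (lo, hi) interval of width <= eps bracketing the divergence point.
    """
    assert agent_a(t0) == agent_b(t0), "agents must agree at t0"
    assert agent_a(t1) != agent_b(t1), "agents must disagree at t1"
    lo, hi = t0, t1
    while hi - lo > eps:
        mid = (lo + hi) / 2
        if agent_a(mid) == agent_b(mid):
            lo = mid   # still agree here: divergence lies later
        else:
            hi = mid   # already disagree here: divergence lies earlier
    return lo, hi

# Two toy agents that agree on everything before t = 3 and then diverge.
a = lambda t: "fine" if t < 3 else "fine"
b = lambda t: "fine" if t < 3 else "ruin"
lo, hi = find_divergence(a, b, 0.0, 5.0)
# (lo, hi) now brackets t = 3 to within eps.
```

The point of the sketch is that each query halves the window, so even a long horizon between T0 and T1 takes only logarithmically many checks to localize the disagreement — though, as the next comments note, this only helps if the disagreement surfaces at some humanly checkable time at all.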
The AIs might agree on all predictions about things that will be checkable within three months, but disagree about the consequences of actions in five years.
suppose there’s some information about the world that the AIs are able to glean through vast amounts of experience and reflection, and that they can’t justify except through reference to that experience and reflection. Suppose there are two AIs that make conflicting claims about that information, while agreeing on everything that humans can check.
Well, the AIs will develop track records and reputations.
This is already happening with LLM-based AIs.
And the vast majority of claims will actually be somewhat checkable, at some cost, after some time.
I don’t think this is a particularly bad problem.