while agreeing on everything that humans can check
Can you provide an example of a place where two AIs would want to make conflicting claims about something while agreeing with everything that humans could check, even in principle? Presumably, if the two AI agents care about which of the claims the human believes, that is because there is some expected difference in outcome if the human believes one over the other. If all predictions between the two agents are identical at present time T0, and the predictions of outcome at a specific future time T1 are meaningfully different, then presumably either the predictions are the same at T0.5 (in which case you can binary search between T0.5 and T1 to see what specific places the agents disagree) or they are different at T0.5 (in which case you can do the same between T0 and T0.5).
Current LLMs are kind of terrible at this sort of task (“figure out what cheap tests can distinguish between worlds where hypothesis H is true vs false”), but also probably not particularly dangerous under the scheming threat model as long as they’re bad at this sort of thing.
The AIs might agree on all predictions about things that will be checkable within three months, but disagree about the consequences of actions in five years.
Can you provide an example of a place where two AIs would want to make conflicting claims about something while agreeing with everything that humans could check, even in principle? Presumably, if the two AI agents care about which of the claims the human believes, that is because there is some expected difference in outcome if the human believes one over the other. If all predictions between the two agents are identical at present time T0, and the predictions of outcome at a specific future time T1 are meaningfully different, then presumably either the predictions are the same at T0.5 (in which case you can binary search between T0.5 and T1 to see what specific places the agents disagree) or they are different at T0.5 (in which case you can do the same between T0 and T0.5).
Current LLMs are kind of terrible at this sort of task (“figure out what cheap tests can distinguish between worlds where hypothesis H is true vs false”), but also probably not particularly dangerous under the scheming threat model as long as they’re bad at this sort of thing.
The AIs might agree on all predictions about things that will be checkable within three months, but disagree about the consequences of actions in five years.