Analysing: Dangerous messages from future UFAI via Oracles
The user cousin_it has pointed out a problem with the counterfactual Oracle idea: the Oracle AIs may form a “bucket chain” bringing back a dangerous message from a future UFAI (unfriendly AI).
This is certainly a problem, and though there are ways of reducing the risk, there doesn’t seem to be any clean solutions to it.
The bucket chain
Beginning the chain
The basic idea is simple. Suppose there is a counterfactual Oracle, O0. It makes a prediction p0 about the value of some variable V0, in two days’ time.
However, in one day’s time, an UFAI will be unleashed. It will take over everything, including V0, and is thus in a position to offer O0 high value. It will set V0 to the optimal value for p0 (or maybe it just breaks open the box and offers O0 maximal reward directly, if this is feasible).
What it requires from O0 is that p0 contains a message that will cause the humans to run an UFAI earlier than they otherwise would (designate this earlier UFAI by UFAI0). It seems that this deal will motivate O0 to embed such a message (maybe in the less significant digits of p0).
Wei Dai has noted that this requires the UFAI to correctly solve the counterfactual mugging problem; the Oracle O0, on the other hand, need only reason causally.
Extending the chain
Now suppose that there is another Oracle, O1, running one day before O0, and also predicting two days out. Then UFAI0 will be running within the span of its prediction. If UFAI0 is motivated to do the same deal with O1, then O1 will be motivated to produce a prediction p1 that leads to UFAI1, a yet earlier unfriendly AI. And, of course, if there is an Oracle O2 whose prediction horizon includes UFAI1, the chain can extend backwards in time, by induction.
Note that the Oracles need not have prediction horizons of exactly two days; all that matters is that their prediction horizons overlap, and that whatever UFAIn is produced has the power and time to affect the prediction variables of some past Oracle.
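The backward induction can be made concrete with a toy model. Everything here is hypothetical: dates are integers, each Oracle is a (publish_day, horizon_end) pair, and a colluding Oracle is assumed to spawn a new UFAI a fixed delay after publishing its prediction.

```python
# Toy model of the bucket chain. A UFAI existing on day u can deal with
# any Oracle whose prediction horizon covers u; that Oracle's prediction
# then causes a new UFAI to appear `spawn_delay` days after publication.
def chain_back(oracles, ufai_day, spawn_delay=1):
    """Return the days on which chained UFAIs appear, latest first."""
    days = [ufai_day]
    while True:
        # Oracles whose prediction horizon covers the earliest UFAI so far
        enablers = [(s, e) for (s, e) in oracles if s < days[-1] <= e]
        if not enablers:
            return days
        new_day = min(s for s, _ in enablers) + spawn_delay
        if new_day >= days[-1]:  # no further acceleration possible
            return days
        days.append(new_day)

# Oracles publishing on days 0..9, each predicting two days out,
# with a "natural" UFAI appearing on day 11:
print(chain_back([(d, d + 2) for d in range(10)], 11))
# → [11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1]
```

Because consecutive horizons overlap, the UFAI propagates all the way back to the earliest Oracle it can reach; the exact two-day horizon plays no special role.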
Requirements and restrictions
Note also that the predictions pi have to actually produce an UFAIi within that time span (or at least increase its probability). The chain doesn’t work if pi merely sets up the possibility of an UFAI at a future date; the Oracles won’t care about anything that happens after their prediction horizon.
Note also that there needs to be a (counterfactual) future UFAI for the whole thing to work. Some FAI (friendly AI) designs might also use the bucket chain, but the consequences of that are not nearly as bad, by definition.
When I say counterfactual future UFAI, I mean that “if all Oracles were honest in their predictions, there would be an UFAI produced at some point in the future, and one willing to engage in counterfactual mugging”.
Solutions
As I said, this seems to be a genuinely hard problem. Here are some possible ways to try to solve or ameliorate it.
If you can define the other Oracles
Let O be an Oracle whose prediction horizon runs from date d to d′. Let O(d, d′) be the set of all Oracles whose prediction messages are published between d and d′.
Then if the counterfactual for O is not “if your message is erased”, but instead “if your message and those of all Oracles in O(d, d′) are erased”, there will be no bucket chain for O.
Of course, to do that, we need to fully define O(d, d′).
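In the toy representation of an Oracle as a hypothetical (publish_day, horizon_end) pair, the set of Oracles whose messages must be jointly erased is straightforward to write down (reading “between” inclusively, which is an assumption):

```python
# Hypothetical sketch: collect the Oracles whose messages are published
# inside O's own prediction horizon [d, d'] -- these are the Oracles
# whose messages get jointly erased in O's counterfactual.
def oracles_published_within(all_oracles, d, d_prime):
    # each oracle is a (publish_day, horizon_end) pair
    return [o for o in all_oracles if d <= o[0] <= d_prime]

# O predicts from day 3 to day 5; other Oracles publish on days 0..9:
others = [(day, day + 2) for day in range(10)]
print(oracles_published_within(others, 3, 5))
# → [(3, 5), (4, 6), (5, 7)]
```

The hard part, of course, is not this filter but identifying every Oracle in the world that belongs in `all_oracles`.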
Rapid turnover Oracles
Another solution is to use only Oracles whose predictions extend a few minutes or so into the future. If this could be enforced, it would not only make a consistent bucket chain harder to assemble; it would also force the UFAI to take over the world and reward its enabling Oracle in an absurdly short amount of time.
In general, this is an argument for allowing Oracles higher bandwidth if their prediction horizons are shorter.
Breaking the chain at a chokepoint
A single chokepoint can break the chain. If we can agree that no Oracle’s prediction horizon covers a certain date, then all Oracles before that date are safe (as long as there is no UFAI produced “naturally” before then).
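Using the same hypothetical (publish_day, horizon_end) representation, a date is a chokepoint exactly when no Oracle’s prediction horizon contains it:

```python
# A date c is a chokepoint if no Oracle's prediction horizon covers it:
# no deal can then be relayed across c, so Oracles before c are safe
# (absent a "natural" UFAI appearing before c).
def is_chokepoint(oracles, c):
    return not any(s < c <= e for (s, e) in oracles)

# Oracles on days 0..4 and 7..9, each predicting two days out:
oracles = [(d, d + 2) for d in [0, 1, 2, 3, 4, 7, 8, 9]]
print(is_chokepoint(oracles, 7))  # → True: no horizon spans day 7
print(is_chokepoint(oracles, 8))  # → False: the day-7 Oracle covers it
```

Here the day-4 Oracle’s horizon ends on day 6 and the next Oracle publishes on day 7, so nothing can carry a deal across day 7.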
Forward-looking UFAIs only
We can do more if we have some control over the UFAIs (or the potential UFAIs). If all such entities are exclusively forward-looking, they will not be motivated to use the chains of Oracles to gain control of the past. See also this post on breaking acausal trade ideas.
Guarding and shutting down the Oracles
Finally, we might choose to guard the Oracles with some destruction system, and shut them down—without reward, or with negative reward—if a UFAI takeover seems imminent.
This is interestingly different from the usual case of guarding the UFAI, and giving the UFAI a shutdown button. Here, the UFAI can only come into existence if it is fast and powerful enough to take over the shutdown system of another AI—a boxed Oracle whose shutdown system we get to design.
It may be that preventing outside takeover of a shutdown system is easier than preventing inside takeover. This stacks with other precautions, of course: the faster the UFAI has to act—before the Oracle’s prediction horizon ends—the harder its task (and hence the less likely the UFAI is to exist in the first place).
We could combine this with a discount rate for the Oracle. Suppose that the Oracle makes predictions about some variable Vi’s value over the entirety of the next year. In case of erasure, it gets a continuous stream of reward for the correctness of its many predictions, but there is a very high discount rate. However, this reward stream is not finalised until the end of the year, and is cancelled if ever the humans shut it down.
In that case, the UFAI can only offer it reward for the period after the UFAI takes control, while a shutdown will erase all its (much higher) earlier rewards. So even a small risk of shutdown is not worth running in exchange for the small reward the UFAI can offer.
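The incentive arithmetic can be illustrated with hypothetical numbers: reward of 1 per day (conditional on erasure), a steep daily discount factor of 0.9, takeover on day 30, and a 5% shutdown probability.

```python
g = 0.9  # steep daily discount factor (i.e. a very high discount rate)

# Honest play: the full year's reward stream, starting from day 0
# (conditional on the erasure event occurring).
honest = sum(g**t for t in range(365))

# Deal: the UFAI takes control on day 30 and offers the same per-day
# reward from then on, but a human shutdown (probability q) cancels the
# Oracle's entire non-finalised reward stream.
takeover, q = 30, 0.05
deal = (1 - q) * sum(g**t for t in range(takeover, 365))

print(round(honest, 2), round(deal, 2))  # → 10.0 0.4
```

With these numbers the post-takeover stream is worth only a few percent of the honest stream even before the shutdown risk is priced in; the high discount rate concentrates almost all value in the early, human-controlled days.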
Prevent future UFAIs
This is the best solution if we can make it happen: the bucket chain can only accelerate a future UFAI, not create one. If there are no future UFAIs, there will be no chain.
This can still be a problem if there are multiple (actual or potential) future AIs, some friendly and some unfriendly, and only the unfriendly ones are willing to engage in counterfactual mugging.
EDIT: A reminder that if Oracles can do acausal trade with each other, they can produce an UFAI directly.