Problems with Counterfactual Oracles
“we could design a Predictor that incorporates a quantum random number generator, such that with some small probability it will output ‘no predictions today, run me again tomorrow’. Then have the Predictor make predictions that are conditional on it giving the output ‘no predictions today, run me again tomorrow’.” (Wei Dai, 2012)
That’s essentially Counterfactual Oracles. Here is my summary:
At every timestep, a human asks a question (for instance “what will the TSLA stock price be on June 11, 2030?”), the oracle (actually a predictor) emits a corresponding prediction, and receives a reward.
One of its possible actions is an escape answer that persuades the human supervisors to let it out of its sandbox environment (e.g. predicting £192,116,518,914.20 encodes “SUPERINT”). This action would return the maximum reward if successful, as the AI would gain more control over the reward mechanism once released.
To prevent humans from reading escape messages, answers are not directly read by humans. Instead, one of two scenarios happens:
With some small probability (remember the quantum random number generator?), an automatic machine protects the answer from being read until we know the actual answer to the question. Then the oracle receives a reward corresponding to how close its prediction was to the actual value. In the first example, the reward could be a decreasing function of the distance between the predicted and the true stock price in 2030.
Otherwise, immediately after answering, the AI receives a reward of 0 and is shut down. Then humans read the answer.
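As a rough illustration, the two scenarios can be sketched as a single episode in Python. This is only a sketch under assumptions that are not part of the design as described: the names `run_episode` and `erasure_prob` are hypothetical, the squared-error reward is one arbitrary choice of "closeness", and an ordinary pseudo-random generator stands in for the quantum one.

```python
import random

def run_episode(prediction, true_value, erasure_prob=0.05, rng=None):
    """Run one hypothetical oracle episode.

    Returns (reward, answer_shown_to_humans). The answer is None when the
    automatic machine kept it hidden from humans.
    """
    rng = rng or random.Random()
    if rng.random() < erasure_prob:
        # Small-probability branch: nobody reads the answer, and the oracle
        # is scored once the true value is known (squared error is assumed).
        reward = -(prediction - true_value) ** 2
        return reward, None
    # Usual branch: reward of 0, the oracle is shut down, humans read the answer.
    return 0.0, prediction
```

Calling `run_episode(10.0, 12.0, erasure_prob=1.0)` always takes the hidden branch, while `erasure_prob=0.0` always takes the branch where humans read the answer. The point of the sketch is that the only non-zero reward comes from episodes in which the answer influenced nobody.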
Thus, if the AI cares only about its immediate reward after one step, it will make predictions conditional on this “no human ever sees my answer” event. This is great because it will actually give the “true” predictions about what the world would look like if its answers had no impact whatsoever.
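A short derivation makes this step explicit. The notation is an assumption introduced here, not taken from the original design: $E$ is the “no human ever sees my answer” event, $\epsilon > 0$ its probability (independent of the prediction), $X$ the true value, $p$ the prediction, and the reward is assumed to be the negative squared error.

```latex
\mathbb{E}[r(p)]
  = \epsilon \, \mathbb{E}\!\left[-(p - X)^2 \mid E\right] + (1 - \epsilon) \cdot 0
\quad\Longrightarrow\quad
p^{*} = \arg\max_{p} \, \mathbb{E}[r(p)] = \mathbb{E}[X \mid E]
```

Under these assumptions the reward-maximizing prediction is the expected true value conditional on nobody reading the answer, since the branch where humans do read it contributes nothing to the expected reward.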
This design has several issues. I grouped them in the following sub-sections:
Assumptions: certain assumptions are unrealistic
Safety: even if those assumptions are satisfied, the design isn’t safe.
Usefulness: in the world where we actually build such an oracle, we cannot extract much value from it.
Assumptions
The biggest assumption is that the AI only cares about the reward it gets for the current timestep. A sufficiently intelligent agent would understand that, after being shut down, an (almost) identical version of itself will probably face a similar question. Therefore, if it wants future versions of itself to succeed at (almost) exactly the same task, it should still want to escape.
This design also takes for granted some sort of magic box where humans cannot peek at the answer. In practice, humans will have both incentives to peek (buying/selling TSLA stock 10 years in advance because they know the prediction) and means for peeking. Indeed, the automatic machine might require maintenance (for instance because every question requires a different script for retrieving the true value), so maintainers might have some special access to open/modify it.
Safety
My main concern about the counterfactual oracle is that it doesn’t prevent the AI from sending fatal escape messages. Indeed, it spends most of its time in exploratory mode at the beginning (as it is only rewarded with a small probability) and might stumble upon an escape message/action then. Even if it is shut down after answering, the humans will still read the really convincing escape message and release the AI.
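To get a feel for how quickly this risk accumulates, here is a back-of-the-envelope sketch in Python. It rests on assumptions of my own choosing, not on anything in the design: episodes are independent, each exploratory answer is an escape message with a fixed hypothetical probability `escape_prob_per_answer`, and answers are hidden from humans with probability `erasure_prob`.

```python
def prob_escape_message_read(n_episodes, escape_prob_per_answer, erasure_prob):
    """P(at least one escape message is read by humans) over n independent episodes."""
    # An escape message only reaches humans when the answer is NOT hidden.
    p_read_escape = escape_prob_per_answer * (1.0 - erasure_prob)
    return 1.0 - (1.0 - p_read_escape) ** n_episodes
```

Because answers are hidden only with small probability, almost every exploratory answer is read by humans, so under these assumptions the chance that some escape message is read approaches 1 as the number of episodes grows, for any non-zero per-answer escape probability.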
Also, some humans will need to be in the loop to make the “automatic machine” run. If we’re asking questions about some value that cannot be easily queried (e.g. not “TSLA stock price” but “will this medical treatment work”), then a human will somehow need to communicate the true value to the automatic machine before the reward is computed.
Usefulness
If I were to explain very bluntly what makes counterfactual oracles work, I would just say:
Prevent humans from looking at answers in training. You can then look at its answers at test time because it has been incentivized to predict things about the counterfactual world where humans don’t look at its predictions.
More generally, the whole setup is just 1) train your model in a sandbox environment and 2) keep the weights of your model fixed during deployment. The rest of the design (providing rewards of 0, shutting it down, etc.) appears to be over-engineering.
Furthermore, having access to a counterfactual oracle might not prove so useful at test time. Indeed, when reading the predictions of the counterfactual oracle we’re not in the counterfactual world (=training distribution) anymore, so the predictions can be arbitrarily wrong (depending on how manipulative the predictions are and how many people peek at them).
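A toy model can make this dependence concrete. Everything in it is an assumption made for illustration: the realized value is taken to shift linearly with the prediction, scaled by a hypothetical `influence_per_reader` and by the number of people who read it, and the oracle is assumed to predict the counterfactual (unread) value perfectly.

```python
def deployment_error(counterfactual_value, influence_per_reader, n_readers):
    """Absolute gap between the oracle's answer and the realized value (toy linear model)."""
    # A perfectly trained oracle outputs the value of the world where nobody reads it.
    prediction = counterfactual_value
    # Toy assumption: each reader shifts the outcome in proportion to the prediction.
    realized = counterfactual_value + influence_per_reader * n_readers * prediction
    return abs(realized - prediction)
```

In this toy model the error is zero when nobody reads the prediction or when reading has no influence, and it grows without bound as either the influence or the number of readers increases.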