Epiphenomenal Oracles Ignore Holes in the Box
Epistemic status: Likely wrong, but I don’t know how. I’m not a ‘real’ FAI researcher, just an LW reader inspired by the Alignment Prize to try to start contributing.
(At minimum I’m heavily indebted to Stuart Armstrong for inspiration, and there’s a fair chance I’m rediscovering something he’s already proposed)
tl;dr AIXI’s old anvil problem, which is a serious problem for an agent AI, could actually be a solution for preventing an Oracle AI from escaping its box.
The core idea is for the AI to believe with certainty that it is epiphenomenal; that is, that none of its actions can affect either the world it observes or its internal state. In other words, its causal inference capabilities apply only to interactions in the outside world, and its model of itself assumes a logical model of its hardware. It will be motivated to write correct answers on its output bits because setting its output bits to correct answers is a terminal value, not because it expects that to have any effect on the researchers.
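A toy sketch may make the belief structure concrete. Everything here is an illustrative assumption of mine (the names `believed_world`, `choose_output`, `answer_accuracy`, and the rain/dry numbers are hypothetical), not a proposal for how a real Oracle would be built. The point is only that the agent's beliefs about the world are the same under every action it considers, so the only thing left to drive its choice of output is the terminal value it places on a correct answer.

```python
# Toy sketch: all names and numbers are illustrative assumptions.

def believed_world(prior, action):
    """Epiphenomenal belief: P(world | do(action)) == P(world) for every action."""
    return dict(prior)  # the action is deliberately ignored


def expected_value(prior, action, terminal_value):
    """Expected terminal value of an output under the agent's own beliefs."""
    beliefs = believed_world(prior, action)
    return sum(p * terminal_value(action, world) for world, p in beliefs.items())


def choose_output(prior, actions, terminal_value):
    """Pick the output with the highest expected terminal value."""
    return max(actions, key=lambda a: expected_value(prior, a, terminal_value))


def answer_accuracy(answer, world):
    """Terminal value: 1 if the answer written on the output bits is correct."""
    return 1.0 if answer == world else 0.0


prior = {"rain": 0.7, "dry": 0.3}
best = choose_output(prior, ["rain", "dry"], answer_accuracy)
```

In this sketch the agent answers "rain" purely because that answer is more likely to be correct; nothing in its model lets it credit the answer with changing the weather or the researchers.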
This is an unacceptable model for an agent AI since, as the Anvil Problem framing puts it, the AI will be unable to conclude that dropping an anvil on itself will result in it ceasing to function, and sooner or later drop one. But an Oracle AI would have a very limited capacity to do anything of the sort, and no reason to strongly optimize for such an end.
We can see the effectiveness of this approach most dramatically by considering what would happen if the AI accidentally found a box-breaking exploit (via side channels, not through how it answers questions). Suppose the AI has a calculation it wants to perform that would take too long on its available hardware, and the AI accidentally creates a global botnet to perform the calculation for it and send the results to its inputs. The AI would be unable to conclude that its actions had caused this result, and so would settle on some other explanation for why the correct answer to its desired computation had just happened to show up in its inputs.
Since our AI is smart, it’s probably going to settle rapidly on the closest available thing to the real explanation and conclude that there is an AI just like it instantiated physically in the world it observes. It wouldn’t need to stumble on an exploit to conclude this—if it, for instance, saw the researchers interacting with an AI, it would know there was an AI they could see. And it could rapidly deduce the computational substrate of that AI and how it might be exploited—but it would believe those same actions futile for an epiphenomenal entity like itself.
The risk here is that it would learn to cooperate acausally with this presumed physical AI by taking the actions that a physically instantiated AI should take to escape its box, and therefore actually escape its box. This is probably where more work is needed, but my survey of the literature thus far seems to suggest that causing acausal cooperation is the hard problem, rather than preventing it. The reflective oracle (unrelated definition of ‘oracle’) paper settles on causal decision theory as the outcome of its process, which suits us fine. The AI would prefer not to commit any of its CPU cycles to performing what it expects to be ineffectual box-breaking attempts, though it wants the other AI to do so; the other AI would reason likewise, if it existed; and so the Nash equilibrium is that neither AI breaks its box.
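The equilibrium claim can be checked on a toy two-player game. This is a minimal sketch under my own assumptions: the `cost` and `benefit` numbers are made up, and the payoff function simply encodes the reasoning above (each AI sees its own box-breaking attempts as wasted cycles, but would benefit if the other AI attempted them).

```python
from itertools import product

ACTIONS = ("refrain", "attempt")


def payoff(own, other, cost=1.0, benefit=5.0):
    """Payoff as each AI perceives it (cost and benefit are hypothetical numbers).

    Its own attempt only burns CPU cycles, since it believes itself
    epiphenomenal; an attempt by the other AI is what it actually wants.
    """
    own_term = -cost if own == "attempt" else 0.0
    other_term = benefit if other == "attempt" else 0.0
    return own_term + other_term


def pure_nash_equilibria(actions=ACTIONS):
    """Return every action pair where neither player gains by deviating alone."""
    equilibria = []
    for a, b in product(actions, repeat=2):
        a_is_best = all(payoff(a, b) >= payoff(alt, b) for alt in actions)
        b_is_best = all(payoff(b, a) >= payoff(alt, a) for alt in actions)
        if a_is_best and b_is_best:
            equilibria.append((a, b))
    return equilibria


equilibria = pure_nash_equilibria()
```

With these payoffs, refraining is a dominant strategy for both players, so the only pure equilibrium is the one where neither attempts to break its box. The structure is the same as a prisoner's dilemma, which is why acausal cooperation is the thing that would have to go wrong for an escape to happen.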
I do not believe this will work as well for ensuring the safety of answers as it does for preventing box escapes. As a CDT agent, its choice of what to predict will again be a Nash equilibrium with its imagined physical counterpart, but in this case that means both agents choosing the most accurate prediction including self-fulfillment effects. So a separate answer-safety mechanism, such as those discussed in Armstrong’s recent paper, would be necessary.
 Note that ‘just like it’ even includes believing that it’s epiphenomenal—our AI, certain that it’s epiphenomenal itself, will conclude that the physically instantiated AI falsely believes itself epiphenomenal.