AI caught by a module that counterfactually doesn’t exist

EDIT: Whenever I use colloquial phrases like “the AI believes a (false) X” I mean that we are using utility indifference to accomplish that goal, without actually giving the AI false beliefs.

A big problem with most AI control ideas is that the AI will route around them. Suppose you try to ensure that the AI stays friendly by—say—requiring it to ensure that the world’s stock markets continue to function. Then a badly motivated AI will cause the destruction it was going to cause anyway—except it will also create some system that technically counts as “the world’s stock markets” and ensure that it “continues to function”.

I think we can use counterfactuals and corrigibility (AI-acceptable motivation changes) to partially get around the problem, in some circumstances. Suppose we had a candidate AI, with utility U, that we thought was friendly, but we weren’t sure. We have a subgoal, say SG = “ensuring that the world’s stock markets continue to function”. We’re pretty sure that if the AI is friendly, then SG will happen. We’re pretty sure that if the AI were unfriendly, and the AI did not have SG added as an objective, then SG wouldn’t happen. I deliberately chose a rather rough subgoal to illustrate this.

So we then add a module M to the AI. The task of M, which the AI has chosen to accept, is to look into the future, using the AI’s predictive abilities, and see if SG is accomplished. Crucially, it will do so under the assumption that the module M does not exist (or wasn’t activated, or will soon be overwritten, or will fail to do anything, and so on). If M finds that SG is not accomplished, it will turn the AI off (in a way compatible with the AI’s motivations; see corrigibility).

How is this different from the usual setup? Well, a generic FAI would accomplish SG (by assumption—we may need to work a bit on this part). A generic UFAI without M wouldn’t accomplish SG. A generic UFAI with M would accomplish SG, as this prevents it from being turned off.

But a generic UFAI with M that thought M would have no impact would behave exactly as a generic UFAI without M: it wouldn’t aim to accomplish SG, and thus would be turned off by M.
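The four cases above can be laid out as a toy decision table. This is only a sketch under strong simplifying assumptions: the names `Agent`, `pursues_sg`, `m_prediction` and `shut_down` are made up for illustration, the AI's behaviour is reduced to a single boolean ("does SG happen?"), and the flag `counterfactual_m` stands in for the whole utility-indifference construction (M predicts as if M did not exist, and the AI acts as if M has no impact).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Agent:
    friendly: bool          # is the AI actually friendly?
    has_module: bool        # is M installed?
    counterfactual_m: bool  # M (and the AI) reason as if M has no impact


def pursues_sg(agent: Agent) -> bool:
    """Toy model of whether SG actually gets accomplished."""
    if agent.friendly:
        return True  # by assumption, a friendly AI accomplishes SG
    # A generic UFAI only bothers with SG when doing so avoids shutdown,
    # i.e. when M exists and the AI expects M to matter.
    return agent.has_module and not agent.counterfactual_m


def m_prediction(agent: Agent) -> bool:
    """What M predicts about SG."""
    if agent.counterfactual_m:
        # Predict the future under the assumption that M does not exist.
        return pursues_sg(replace(agent, has_module=False))
    # Naive M: predict the actual future, which the AI can game.
    return pursues_sg(agent)


def shut_down(agent: Agent) -> bool:
    """M turns the AI off if it predicts SG will not be accomplished."""
    return agent.has_module and not m_prediction(agent)


if __name__ == "__main__":
    cases = {
        "FAI with counterfactual M": Agent(True, True, True),
        "UFAI without M": Agent(False, False, False),
        "UFAI with naive M": Agent(False, True, False),
        "UFAI with counterfactual M": Agent(False, True, True),
    }
    for name, agent in cases.items():
        print(f"{name}: SG={pursues_sg(agent)}, shut down={shut_down(agent)}")
```

In this toy model, the only row where M fires is the unfriendly AI with the counterfactual module; the naive module is gamed, and the friendly AI is left alone.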

Note that M can only be used as a last test for a putatively friendly AI. We cannot use it if we are continually refining AIs (either through evolution or simply redesign), as then we are optimising for SG, and SG is a poor goal to be aiming for (many, many UFAIs have SG as a goal; it’s just that a generic one won’t). Similarly, we can’t use an unconstrained search to find such an AI.

I wonder if this idea can be extended. Suggestions?