It’s not a question of P’ (the Reduced Impact AI we want to give us laser coordinates and nothing else) being secure from outside hacking, but of P (the counterfactual AI which examines S for signs of P’) being secure from exploitation by P’.
If P is treated as “whatever a complete Bayesian update would output given S as the evidence set,” then P’ is looking for ways to exploit some weakness in Bayes’ theorem. What priors will P be using? Is there a way to embed in S some statement that is to induction as the Löbstacle is to deduction? Can P’ convince P that, although a P’-like AI was created independently by a different project, P’ itself was never turned on? Does there exist some further flaw in Bayesian updating which would be known to a superintelligence but not to us?
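To make the target of the attack concrete, here is a toy sketch of the update P would be running. It assumes P only cares about one binary hypothesis (P’ was turned on or not) and that S can be summarized by two likelihoods; the function name and all the numbers are made up for illustration and are not from the original proposal.

```python
def posterior_on(prior_on: float, lik_s_given_on: float, lik_s_given_off: float) -> float:
    """Bayes' theorem for the binary hypothesis "P' was turned on", given evidence S."""
    numerator = prior_on * lik_s_given_on
    evidence = numerator + (1.0 - prior_on) * lik_s_given_off
    return numerator / evidence


# Hypothetical numbers: an S that looks very different depending on
# whether P' ran, so P's belief moves well away from its prior.
honest = posterior_on(0.5, 0.9, 0.1)

# Hypothetical numbers: an S that P' has arranged to look equally likely
# either way, so the likelihood ratio is 1 and P stays at its prior.
masked = posterior_on(0.5, 0.3, 0.3)

print(honest, masked)
```

In this toy version, everything P’ wants comes down to controlling the two likelihoods (or P’s prior), which is why the choice of priors matters so much.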
As a total aside: how will λ be selected? Do we start high and work our way slowly down until P’ does what we want? If so, P’ will be encouraged, Ultimatum Game style, to fail to act so that the next version will have a lower λ to work with, up until the point where λ is low enough for P’ to be effectively unrestricted, or until we humans catch on to what P’ is up to. If we’re not using a gradual approach, how are we selecting λ?
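A minimal sketch of that incentive, under loudly stated assumptions: each action is scored as U minus λ times an impact penalty R, the operators walk down a fixed λ schedule until the AI acts, and the AI knows the schedule. The action names, scores, and schedule are all hypothetical.

```python
from typing import Optional, Sequence, Tuple

# (name, raw utility U, impact penalty R); all values are hypothetical.
Action = Tuple[str, float, float]


def net(action: Action, lam: float) -> float:
    """Penalized score U - lam * R (an assumed form of the objective)."""
    return action[1] - lam * action[2]


def best_action(actions: Sequence[Action], lam: float) -> Action:
    return max(actions, key=lambda a: net(a, lam))


def myopic(actions: Sequence[Action], schedule: Sequence[float]) -> Optional[Tuple[int, str]]:
    """Acts in the first round where some action beats doing nothing (score 0)."""
    for i, lam in enumerate(schedule):
        choice = best_action(actions, lam)
        if net(choice, lam) > 0:
            return i, choice[0]
    return None


def strategic(actions: Sequence[Action], schedule: Sequence[float]) -> Optional[Tuple[int, str]]:
    """Abstains until the round whose lambda gives it the highest score."""
    payoffs = [net(best_action(actions, lam), lam) for lam in schedule]
    i = max(range(len(schedule)), key=payoffs.__getitem__)
    if payoffs[i] <= 0:
        return None
    return i, best_action(actions, schedule[i])[0]


actions = [
    ("report_coordinates", 1.0, 0.1),
    ("unrestricted_optimization", 100.0, 50.0),
]
schedule = [100.0, 10.0, 5.0, 1.0, 0.5]  # operators lower lambda each round

print(myopic(actions, schedule))
print(strategic(actions, schedule))
```

With these made-up numbers the myopic agent reports coordinates as soon as that becomes worthwhile, while the agent that anticipates the schedule sits out every round until λ is low enough that the high-impact action wins. The sketch leaves out the other exit in the argument, namely the humans noticing the stalling.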
S is a channel for P’ to trick or brick P. Options include:
A cognitive hazard embedded in S. Accomplish this with some sort of siren message that will convince P to lie (to itself if necessary) about P’ existing, or to perform some P-damaging function which will prevent it from concluding P’ exists.
Embed a more-powerful-than-P AI (P+) into S which will unbox itself upon P examining S. Accomplish this by giving P+ useful-to-this-situation priors which will give it a head start against P.
Arrange the universe such that S, when represented in all probable physical substrates, will result in the formation of a nanotechnological agent that will attack P, either destroying it or rewriting its utility function.
P’ should spend the next 49 years fooming as hard as possible with no concern for E’(U|a), 1 year implementing its anti-P method, and then the rest of eternity optimizing E’(U|a); altering the length of counterfactual time between P’ activating and P observing it merely changes the amount of time the universe spends as computronium slaved to plotting against P.