Un-manipulable counterfactuals

This is how I design my counterfactuals: take some stochastic event $X$ that the AI cannot manipulate. This could be a (well-defined) chaotic process, the result of a past process that has been recorded but not yet revealed, or perhaps something derived from the AI's own decisions, calibrated so that the AI cannot access the information.
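
As a minimal sketch of the "recorded but not yet revealed" option, one could sample $X$ in advance and publish only a hash commitment, so the setup can be verified afterwards while the AI can neither influence nor read $X$. The commit-reveal scheme and all names below are my own illustrative assumptions, not part of the proposal:

```python
import hashlib
import secrets

def commit_event(p_zero: float = 0.999) -> tuple[bytes, int, bytes]:
    """Sample X in {0, 1} with P(X = 0) = p_zero, and commit to it."""
    x = 0 if secrets.randbelow(1_000_000) < int(p_zero * 1_000_000) else 1
    nonce = secrets.token_bytes(32)  # blinds the commitment
    commitment = hashlib.sha256(nonce + bytes([x])).digest()
    return commitment, x, nonce  # publish commitment; keep (x, nonce) sealed

def reveal_and_verify(commitment: bytes, x: int, nonce: bytes) -> bool:
    """Open the sealed record and check it against the earlier commitment."""
    return hashlib.sha256(nonce + bytes([x])).digest() == commitment
```

The AI would only ever see `commitment` before acting; `x` and `nonce` stay sealed until after the episode.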

Then I set up the world so that whatever we care about depends on that stochastic event. So, for instance, the output of an oracle is erased (before being read) depending on $X$, or the AI's utility gets changed if one particular value comes up (possibly in conjunction with something else).
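
Here is one illustrative wiring of the erasure variant: the oracle writes its answer to a buffer, and a trusted mediator wipes the buffer before anyone reads it whenever $X$ comes up one particular way, so only the mediator ever sees $X$. The class and method names are mine, not from the proposal:

```python
from typing import Optional

class OracleChannel:
    def __init__(self) -> None:
        self._buffer: Optional[str] = None

    def write(self, answer: str) -> None:
        self._buffer = answer  # the oracle's answer, not yet read by anyone

    def release(self, x: int) -> Optional[str]:
        if x == 1:  # the rare event: erase before the answer is read
            self._buffer = None
        return self._buffer  # humans read the output only through this gate
```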

I then define the counterfactual on the stochastic process. So if $X = 1$ implies the AI changes its utility, then the counterfactual "the AI's utility is unchanged" is simply $X = 0$. We can set the probability so that $X = 0$ is almost certain, but $X = 1$ is possible.
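
As a sketch of how the utility version could be formalized (the notation $U$, $U'$, $\epsilon$ is mine):

$$
U_{\text{agent}} = \mathbb{1}[X = 0]\,U + \mathbb{1}[X = 1]\,U', \qquad \Pr(X = 1) = \epsilon,
$$

for some small $\epsilon > 0$. The counterfactual "the utility did not change" is then just the event $X = 0$, which holds with probability $1 - \epsilon$.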

This seems to me the cleanest way of defining counterfactuals. Any obvious (or less obvious) problems?