• I don’t understand this proposal so far. I’m particularly confused by the last paragraph in the “to get away” section:

• What does it mean in this context for A to implement a policy? I thought A was building a subagent and then following forever, thus not following for any at any point.

• If A follows for turns and then follows , how are and chosen?

• It’s not clear to me that SA can act to ensure the baseline value of for all values of and unless it does nothing.

I think it might help to illustrate this proposal in your original gridworld example to make it clearer what’s going on. As far as I can tell so far, this does not address the issue I mentioned earlier where if the subagent actually achieves any of the auxiliary rewards, subagent creation will be penalized.

• I agree I should re-do these all with examples. But, to answer your questions as far as possible:

What A will do: forever (maybe with some extra info for the SA).

What A would counterfactually do (to define ): for turns (the rollout, to define ) followed by (to define the aspect). Call this .

The penalty is defined in terms of these , entirely counterfactual except for .

What SA must do: iff A follows , then SA must ensure that .

So a key question is: how early can SA detect if A is following ? If it can detect this immediately, then it can immediately implement the policy for ensuring that (for the stricter impact measure) or the much easier (for the original impact measure).

If it has to wait turns to be sure, then it needs to preserve its power to control many different .

Now, A will never actually follow , so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of SA will determine the penalty; SA is rewarded for quickly gaining control of the rewards.

In practice, A will signal early on that it’s not following . Maybe there’s a very small impact action, , with a tiny penalty. Then A would like to use that to signal to SA that it’s not in the counterfactual world where it is following . So long as ‘s first action isn’t , that signal is enough (of course, if ’s first action isn’t , then is a sufficient signal).

After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn’t in.

The one mild constraint remaining comes from , which is not a counterfactually defined state (though is a counterfactually defined property of that state).

I’ll try and write up a further clarification of this.