I don’t understand this proposal so far. I’m particularly confused by the last paragraph in the “to get away” section:
What does it mean in this context for A to implement a policy? I thought A was building a subagent and then following π0 forever, thus not following π∗k for any k at any point.
If A follows π0 for τ turns and then follows π∗k, how are τ and k chosen?
It’s not clear to me that SA can act to ensure the baseline value of V′k for all values of k and τ unless it does nothing.
I think it might help to illustrate this proposal in your original gridworld example to make it clearer what’s going on. As far as I can tell so far, this does not address the issue I mentioned earlier where if the subagent actually achieves any of the auxiliary rewards, subagent creation will be penalized.
I agree I should re-do these all with examples. But, to answer your questions as far as possible:
What A will do: ∅ forever (maybe with some extra info for the SA).
What A would counterfactually do (to define Vk(s(t)t+τ)): π0 for τ turns (the π0 rollout, to define s(t)t+τ) followed by π∗R (to define the Vk aspect). Call this π∗0,τ,k.
The penalty is defined in terms of these Vk(s(t)t+τ), entirely counterfactual except for τ=0.
What SA must do: iff A follows π∗0,τ,k, then SA must ensure that Vk(s(t)t+τ)=Vk(s′(t)t+τ).
So a key question is: how early can SA detect if A is following π∗0,τ,k? If it can detect this immediately, then it can immediately implement the policy for ensuring that Vk(s(t)t+τ)=Vk(s′(t)t+τ) (for the stricter impact measure) or the much easier BVk(st)=BVk(s′t) (for the original impact measure).
If it has to wait τ turns to be sure, then it needs to preserve its power to control many different k.
Now, A will never actually follow π∗0,τ,k, so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of SA will determine the penalty; SA is rewarded for quickly gaining control of the rewards.
In practice, A will signal early on that it’s not following π∗0,τ,k. Maybe there’s a very small impact action, ∅ϵ, with a tiny penalty. Then A would like to use that to signal to SA that its not in the counterfactual world where it is following π∗0,τ,k. So long as π0‘s first action isn’t ∅ϵ, that signal is enough (of course, if π0’s first action isn’t ∅, then ∅ is a sufficient signal).
After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn’t in.
The one mild constraint remaining comes from s(t)t+0=st, which is not a counterfactually defined state (though Vk(st) is a counterfactually defined property of that state).
I’ll try and write up a further clarification of this.