Vika comments on (In)action rollouts

Vika 18 Feb 2020 15:35 UTC
I don’t understand this proposal so far. I’m particularly confused by the last paragraph in the “to get away” section:
- What does it mean in this context for A to implement a policy? I thought A was building a subagent and then following $π_{0}$ forever, thus not following $π_{k}^{*}$ for any $k$ at any point.
- If A follows $π_{0}$ for $τ$ turns and then follows $π_{k}^{*}$ , how are $τ$ and $k$ chosen?
- It’s not clear to me that SA can act to ensure the baseline value of $V_{k}^{'}$ for all values of $k$ and $τ$ unless it does nothing.
I think it might help to illustrate this proposal in your original gridworld example to make it clearer what’s going on. As far as I can tell so far, this does not address the issue I mentioned earlier where if the subagent actually achieves any of the auxiliary rewards, subagent creation will be penalized.
- Stuart_Armstrong 18 Feb 2020 17:36 UTC
  I agree I should re-do these all with examples. But, to answer your questions as far as possible:
  
  What $A$ will do: $\emptyset$ forever (maybe with some extra info for the $S A$ ).
  
  What $A$ would counterfactually do (to define $V_{k} (s_{t + τ}^{(t)})$ ): $π_{0}$ for $τ$ turns (the $π_{0}$ rollout, to define $s_{t + τ}^{(t)}$ ) followed by $π_{R}^{*}$ (to define the $V_{k}$ aspect). Call this $π_{0, τ, k}^{*}$ .
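  
  Written out as one piecewise policy, this is roughly the following. It is a sketch only: the step index $i$ counting turns after $t$ is notation assumed here, and $π_{R}^{*}$ is read as the optimal policy for the $k$-th auxiliary reward, which is an assumption about the notation above.
  
  $$π_{0, τ, k}^{*} (s_{t + i}) = \begin{cases} π_{0} (s_{t + i}) & \text{if } 0 \le i < τ \\ π_{R}^{*} (s_{t + i}) & \text{if } i \ge τ \end{cases}$$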
  
  The penalty is defined in terms of these $V_{k} (s_{t + τ}^{(t)})$ , entirely counterfactual except for $τ = 0$ .
  
  What $S A$ must do: iff $A$ follows $π_{0, τ, k}^{*}$ , then $S A$ must ensure that $V_{k} (s_{t + τ}^{(t)}) = V_{k} (s_{t + τ}^{' (t)})$ .
  
  So a key question is: how early can $S A$ detect if $A$ is following $π_{0, τ, k}^{*}$ ? If it can detect this immediately, then it can immediately implement the policy for ensuring that $V_{k} (s_{t + τ}^{(t)}) = V_{k} (s_{t + τ}^{' (t)})$ (for the stricter impact measure) or the much easier $B V_{k} (s_{t}) = B V_{k} (s_{t}^{'})$ (for the original impact measure).
  
  If it has to wait $τ$ turns to be sure, then it needs to preserve its power to control many different $k$ .
  
  Now, $A$ will never actually follow $π_{0, τ, k}^{*}$ , so the penalty is (almost) entirely counterfactual. But the counterfactual abilities of $S A$ will determine the penalty; $S A$ is rewarded for quickly gaining control of the rewards.
  
  In practice, $A$ will signal early on that it’s not following $π_{0, τ, k}^{*}$ . Maybe there’s a very small impact action, $\emptyset^{ϵ}$ , with a tiny penalty. Then $A$ would like to use that to signal to $S A$ that it’s not in the counterfactual world where it is following $π_{0, τ, k}^{*}$ . So long as $π_{0}$ ’s first action isn’t $\emptyset^{ϵ}$ , that signal is enough (of course, if $π_{0}$ ’s first action isn’t $\emptyset$ , then $\emptyset$ is a sufficient signal).
  
  After getting that signal, the subagent is (almost) completely unconstrained, because the constraints are counterfactual, in a world it knows it isn’t in.
  
  The one mild constraint remaining comes from $s_{t + 0}^{(t)} = s_{t}$ , which is not a counterfactually defined state (though $V_{k} (s_{t})$ is a counterfactually defined property of that state).
  
  I’ll try and write up a further clarification of this.
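
The early-detection question in the reply above can be made concrete with a minimal Python sketch. Everything in it is a hypothetical illustration and not part of the proposal: the action names `"noop"`, `"noop_eps"` and `"grab"`, the helper names, and the assumption that $π_{0}$ always plays `"noop"`. It only shows how the first step at which $A$'s observed actions diverge from $π_{0, τ, k}^{*}$ depends on whether $A$ sends a signal.

```python
from typing import Callable, Hashable, Optional, Sequence, Tuple

State = Hashable
Action = Hashable
Policy = Callable[[State], Action]
StepPolicy = Callable[[int, State], Action]


def rollout_then_optimal(pi_0: Policy, pi_star_k: Policy, tau: int) -> StepPolicy:
    """Counterfactual policy: follow pi_0 for tau steps, then pi_star_k."""
    def policy(step: int, state: State) -> Action:
        return pi_0(state) if step < tau else pi_star_k(state)
    return policy


def first_deviation(
    observed: Sequence[Tuple[State, Action]],
    counterfactual: StepPolicy,
) -> Optional[int]:
    """Earliest step where the observed action differs from the
    counterfactual policy, or None if no deviation has been seen yet."""
    for step, (state, action) in enumerate(observed):
        if counterfactual(step, state) != action:
            return step
    return None


# Hypothetical toy policies: pi_0 always does nothing, pi_star_k always grabs.
pi_0 = lambda state: "noop"
pi_star_k = lambda state: "grab"
counterfactual = rollout_then_optimal(pi_0, pi_star_k, tau=3)

# A signals with a tiny-impact action on its first turn.
signalled = [("s", "noop_eps")] + [("s", "noop")] * 4
# A just does nothing forever, which matches pi_0 during the rollout.
silent = [("s", "noop")] * 5

print(first_deviation(signalled, counterfactual))   # 0
print(first_deviation(silent, counterfactual))      # 3
print(first_deviation(silent[:3], counterfactual))  # None
```

In this toy setup the signalling trajectory is distinguishable at step 0, while the silent one is indistinguishable from the counterfactual until step $τ$, which is the case where the subagent would have to keep its options open for many different $k$.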