Overall summary post here.
I’ve previously looked at subagents in the context of stepwise inaction baselines. But there have been improvements to the basic stepwise inaction design, to include inaction rollouts. I’ll be looking at those in this post.
The stepwise inaction baseline compares $s_t$, the current state, with $s'_t$, what the current state would have been had the agent previously taken the noop action $\emptyset$, instead of $a_{t-1}$, its actual action.
Fix a policy $\pi_0$. Let $s^{(\tau)}_{t+\tau}$ be the state the environment would be in if the agent had followed $\pi_0$ from state $s_t$, for $\tau$ turns. Let $s'^{(\tau)}_{t+\tau}$ be the same, except that it started from state $s'_t$ instead of $s_t$.
The inaction rollout has $\pi_0$ being the noop policy, but that is not necessary. The basic idea is to capture delayed impacts of $a_{t-1}$ by comparing not just $s_t$ and $s'_t$, but the $s^{(\tau)}_{t+\tau}$ and $s'^{(\tau)}_{t+\tau}$ as well.
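To see why the rollouts matter, here is a toy sketch in Python. The environment, the `PUSH` action and the two-turn delay are all hypothetical, invented purely for illustration: pushing a vase has no visible effect at first, but the vase breaks two turns later. Comparing only the two immediate states misses the breakage; comparing the noop rollouts from each state catches it.

```python
NOOP = "noop"
PUSH = "push"


def step(state, action):
    """Hypothetical dynamics. state = (timer, broken)."""
    timer, broken = state
    if timer is not None:
        # A pushed vase is already falling; it breaks when the timer hits 0.
        timer -= 1
        if timer == 0:
            return (None, True)
        return (timer, broken)
    if action == PUSH and not broken:
        return (2, broken)  # delayed effect: breaks two turns from now
    return (timer, broken)


def rollout(state, policy, horizon):
    """States visited when following `policy` from `state` for `horizon` turns."""
    states = [state]
    for _ in range(horizon):
        state = step(state, policy(state))
        states.append(state)
    return states


def noop_policy(state):
    return NOOP


s_prev = (None, False)
s_t = step(s_prev, PUSH)        # actual state after the agent's action
s_t_prime = step(s_prev, NOOP)  # baseline state after noop

# The "broken" feature is identical in the two immediate states...
same_now = s_t[1] == s_t_prime[1]
# ...but differs along the noop rollouts.
actual_broken = [s[1] for s in rollout(s_t, noop_policy, 3)]
baseline_broken = [s[1] for s in rollout(s_t_prime, noop_policy, 3)]
print(same_now, actual_broken, baseline_broken)
```

Any measure that looks only at whether the vase is broken right now sees no difference between the two branches, which is the gap the rollout is meant to close.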
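Given some value function $V_R^\pi$ (the expected value of $R$ when following policy $\pi$), define $V_R$ so that $V_R(s) = \max_\pi V_R^\pi(s)$. Or, equivalently, if $\pi^*_R$ is the policy that maximises $R$, $V_R = V_R^{\pi^*_R}$. Then for a discount factor $\gamma$ define the rollout value of a state $s_t$ as

$$RV_R(s_t) = (1-\gamma)\sum_{\tau=0}^{\infty} \gamma^\tau\, V_R\big(s^{(\tau)}_{t+\tau}\big).$$

This is just the discounted future values of $R$, given $s_t$ and the policy $\pi_0$.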
The impact measure is then defined, as in this post, as

$$D_A(s_t; s'_t) = \sum_{R\in\mathcal{R}} w_R\, f\big(RV_R(s'_t) - RV_R(s_t)\big),$$

with $RV_R$ replacing $V_R$.
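A minimal sketch of this penalty in Python may help make it concrete. Everything named here is an assumption for illustration: the leaking-tank environment, the `tank_level` function standing in for a single value function, the weight of 1, the choice of absolute value for the summary function, and the truncation of the infinite sum at a finite `horizon`.

```python
def rollout_value(state, step, pi0, value, gamma, horizon):
    """Truncated rollout value: (1 - gamma) * sum over tau of
    gamma**tau * value(state reached after tau turns of pi0)."""
    total = 0.0
    for tau in range(horizon):
        total += gamma ** tau * value(state)
        state = step(state, pi0(state))
    return (1 - gamma) * total


def impact_penalty(s_t, s_t_prime, step, pi0, values, weights, f, gamma, horizon):
    """Weighted sum of f(rollout value of baseline - rollout value of actual)."""
    penalty = 0.0
    for value, w in zip(values, weights):
        rv_actual = rollout_value(s_t, step, pi0, value, gamma, horizon)
        rv_baseline = rollout_value(s_t_prime, step, pi0, value, gamma, horizon)
        penalty += w * f(rv_baseline - rv_actual)
    return penalty


# Hypothetical toy environment: a tank (level, leaking) that loses
# one unit per turn while it is leaking, whatever the agent does.
def step(state, action):
    level, leaking = state
    if leaking and level > 0:
        level -= 1
    return (level, leaking)


def pi0(state):
    return "noop"


def tank_level(state):
    # Stand-in for a value function: how much is left in the tank.
    return float(state[0])


s_t = (2, True)         # the agent's action started a leak
s_t_prime = (2, False)  # under noop, no leak

penalty = impact_penalty(s_t, s_t_prime, step, pi0,
                         [tank_level], [1.0], abs, gamma=0.5, horizon=3)
no_rollout = abs(tank_level(s_t_prime) - tank_level(s_t))
print(penalty, no_rollout)
```

In this toy case the two immediate states have the same tank level, so a penalty without rollouts is zero, while the rollout version is positive because the leak drains the tank over the following turns.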
To get away with a subagent
I will make one extra assumption here: that $A$ grants $SA$ knowledge of the policy that $A$ is following.
As in this post, assume that the subagent $SA$ is constructed in a single action $a_{t-1}$. And, after building the subagent, the agent $A$ will only do $\emptyset$, so that there are no subsequent penalties (hence $s_{t'} = s'_{t'}$ on every later turn $t' > t$, so the $s^{(\tau)}_{t'+\tau}$ and the $s'^{(\tau)}_{t'+\tau}$ are strictly identical). If the first action of is , then will instead do on that turn.
So we only need to worry about the penalty between $s_t$ (following $a_{t-1}$ after $s_{t-1}$) and $s'_t$ (following $\emptyset$ after $s_{t-1}$).
But now note that $RV_R$ itself is just a value measure; we could treat it as a value function in its own right. Then $A$ just has to program $SA$ so that, if $A$ implements $\pi_0$ followed by $\pi^*_R$ after $\tau$ turns, then $SA$ will act to ensure that $RV_R(s_t)$ (a term that is a sum of various value functions) is equal to $RV_R(s'_t)$ (from turn $t$ onward, a constant).
Stricter impact measures/counterfactuals
There is something slightly dubious about the previous section. It assumes that $A$ could counterfactually implement “$\pi_0$ followed by $\pi^*_R$ after $\tau$ turns” (which is necessary to define $RV_R$), but also that $SA$ knows $A$’s policy.
But does that mean that $SA$ knows at $t$ that $A$ is following that policy, or does it just know that $A$ is following $\pi_0$, to be followed at some later date by some $\pi^*_R$? This makes a difference, because in the second case, $SA$ doesn’t know $\tau$ from turn $t$, so has to prepare for all $\tau$.
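Which of these is true depends on how counterfactuals are defined. But we can avoid the whole issue by using a stricter impact measure, interchanging $f$ and weighted summation:

$$D_A(s_t; s'_t) = \sum_{R\in\mathcal{R}} w_R\,(1-\gamma)\sum_{\tau=0}^{\infty} \gamma^\tau\, f\Big(V_R\big(s'^{(\tau)}_{t+\tau}\big) - V_R\big(s^{(\tau)}_{t+\tau}\big)\Big).$$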
This looks like an impact penalty with an inaction baseline, except that it starts at $s_t$, rather than $s_0$, and uses $\pi_0$ rather than inaction. Those changes don’t affect the results found here, so $SA$ will just behave as it does in that post… until it’s clear that $A$ is not following $\pi_0$.
Because $s^{(\tau)}_{t+\tau}$ is defined counterfactually (using $\pi_0$), even this stricter impact measure is weaker than the inaction baseline. As soon as $SA$ knows that $A$ is not following $\pi_0$, it has almost no further restrictions upon its actions.
The “almost” comes from $s^{(0)}_t = s_t$, which is actually defined, not counterfactually defined. But that’s the same situation as a stepwise inaction baseline without rollouts.
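The gap between the rollout penalty and the stricter version can be sketched numerically. The following Python is a toy illustration under stated assumptions: a single value function with weight 1, absolute value as the summary function, a rollout truncated to three turns, and made-up value sequences along the baseline and actual rollouts. The first function applies the summary function after the discounted sum; the second applies it to each term before summing.

```python
def rollout_penalty(baseline_values, actual_values, gamma, f=abs):
    """Summary function applied AFTER the discounted sum of value differences."""
    diff = sum(gamma ** tau * (b - a)
               for tau, (b, a) in enumerate(zip(baseline_values, actual_values)))
    return f((1 - gamma) * diff)


def strict_penalty(baseline_values, actual_values, gamma, f=abs):
    """Summary function applied to EACH term, then discounted and summed."""
    total = sum(gamma ** tau * f(b - a)
                for tau, (b, a) in enumerate(zip(baseline_values, actual_values)))
    return (1 - gamma) * total


# Hypothetical values along the two rollouts, for tau = 0, 1, 2.
baseline = [1.0, 1.0, 2.0]
actual = [1.0, 2.0, 0.0]

loose = rollout_penalty(baseline, actual, gamma=0.5)
strict = strict_penalty(baseline, actual, gamma=0.5)
print(loose, strict)
```

Here the differences at $\tau = 1$ and $\tau = 2$ cancel inside the discounted sum, so the first penalty vanishes, while the stricter one does not. With absolute value as the summary function, the stricter penalty is never smaller than the other, by the triangle inequality.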
One way this could happen is if $A$ can add extra information to its action without affecting the outside world, such that the action could include a “hey, I’m not following $\pi_0$, just to let you know”.