The ablations seem surprisingly clear-cut. I consider myself very much on board with “RL-trained behaviors are contextually activated and modulated”, but even I wasn’t expecting such strong localization.
Neither was I. After the fact, it seems easy to come up with reasons why this might be the case. I think measuring this with something like excluded loss might enable a more precise quantification of exactly how localised it was. I also don’t see strong reasons to expect this to generalise to larger models. If I see similar results when the task is more complicated, the model is bigger, and especially when the context window is larger, then I will be more interested in trying to precisely describe how/why/when you get more localisation.
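To make “how localised” concrete, the measurement I have in mind looks roughly like the sketch below. This is written as if the model were a TransformerLens-style HookedTransformer; `model` and `tokens` are stand-ins I’m assuming, not artifacts from the post.

```python
# Sketch: quantify how localised a behaviour is by mean-ablating each
# attention head in turn and recording the loss increase. Assumes a
# TransformerLens-style HookedTransformer; none of these names come
# from the post.

def head_ablation_losses(model, tokens):
    base_loss = model(tokens, return_type="loss").item()
    deltas = {}
    for layer in range(model.cfg.n_layers):
        hook_name = f"blocks.{layer}.attn.hook_z"
        for head in range(model.cfg.n_heads):
            def ablate(z, hook, head=head):
                # Replace this head's output with its mean over batch/position.
                z[:, :, head, :] = z[:, :, head, :].mean(dim=(0, 1))
                return z
            loss = model.run_with_hooks(
                tokens, return_type="loss", fwd_hooks=[(hook_name, ablate)]
            ).item()
            deltas[(layer, head)] = loss - base_loss
    return base_loss, deltas

# If the behaviour is strongly localised, a single (layer, head) entry
# should account for most of the summed loss increase.
```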
Seems to me that randomness wouldn’t prevent the agent from being calibrated, because even though any given episode might deviate from the prescribed number of steps, on average the randomness can (presumably) be made to add up to that number. E.g. it might be hard to bump into the goal after exactly 14 steps due to random obstacles, but I’d imagine ensuring this falls between 10 and 18 steps is feasible?
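(A toy model of the “can be made to add up” intuition, with entirely made-up numbers rather than the post’s environment: if the shortest path is 10 steps and each step is independently blocked with probability 0.2, nearly every episode lands in the 10–18 band.)

```python
import random

# Toy model, not the post's environment: shortest path of 10 steps, each
# step independently blocked by an obstacle with probability 0.2 (costing
# one extra step). How often does an episode land in the 10-18 band?

def episode_length(path_len=10, block_prob=0.2):
    steps, progress = 0, 0
    while progress < path_len:
        steps += 1
        if random.random() > block_prob:
            progress += 1
    return steps

lengths = [episode_length() for _ in range(100_000)]
in_band = sum(10 <= n <= 18 for n in lengths) / len(lengths)
print(f"fraction of episodes in [10, 18]: {in_band:.3f}")  # ~0.99
```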
I think you might be right in the infinite-training-data regime, where I would expect it to be unbiased; however, I suspect that the training data being sparse, especially in the low positive RTG region, is enough to make the signal weak. The loss incurred by finishing in the wrong number of steps is likely very small compared to the loss from failing when you have positive RTG or succeeding when you have negative RTG, so it could also be that the model doesn’t allocate much capacity to being well-calibrated in this sense.
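If I wanted to actually quantify the (mis)calibration, I’d do something like the sketch below: sweep target RTGs, roll the model out, and compare achieved return to target. `run_episode` is an assumed helper, not something from the post.

```python
import numpy as np

# Hypothetical calibration check: sweep target RTGs and compare the mean
# achieved return to each target. `run_episode(model, env, target_rtg)`
# is an assumed helper returning one sampled episode's realised return.

def calibration_curve(run_episode, model, env, targets, n_episodes=100):
    achieved = []
    for rtg in targets:
        returns = [run_episode(model, env, rtg) for _ in range(n_episodes)]
        achieved.append(np.mean(returns))
    return np.array(achieved)

# A well-calibrated simulator tracks the diagonal (achieved ~ target);
# sparse data in the low positive RTG region should show up as a flat
# or biased segment of the curve there.
```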
This seems like an important manifestation of “models don’t ‘get reward’, they are shaped by reward”, even on a simple task where presumably agents can fully explore all relevant options. The observation encoding, for example (where an obstacle is “3/4” of a goal), matters when predicting what behavioral shards & subroutines get trained into the policy, or when considering what, e.g., early-stage policies will be like.
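(For concreteness, and assuming I’m reading minigrid’s constants right, the “3/4” is literal: the moving obstacles are balls, which minigrid encodes as the integer 6, while a goal is 8, so an obstacle looks like 6/8 = 3/4 of a goal to anything downstream that treats the channel value as a magnitude.)

```python
# minigrid's integer object encoding (minigrid.core.constants in current
# releases; gym_minigrid.minigrid in older ones). A ball, the moving
# obstacle in Dynamic-Obstacles, is 6 and a goal is 8: 6/8 = 3/4.
from minigrid.core.constants import OBJECT_TO_IDX

print(OBJECT_TO_IDX["wall"])  # 2
print(OBJECT_TO_IDX["ball"])  # 6
print(OBJECT_TO_IDX["goal"])  # 8
```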
I thought this too and should have remembered to cite “Reward is not the optimization target”. I feel like that concept is now more visceral for me. A relevant takeaway that may or may not be obvious is that simulators might be better/worse at simulating particular things as a function of their inductive biases. In this case, the positive RTG range was more miscalibrated as a result of the bias.
Wait, how? Isn’t the state observation constant in this task? I’m guessing you’re discussing something else?
I’m not sure what you mean by constant. If it were constant then the agent/obstacles wouldn’t be moving? I’ll elaborate a little bit in case that helps.
Since RTG gives information about whether the agent should go forward or not in many contexts, you might have expected (although now I think I wouldn’t) the residual stream embedding for the state token not to directly contribute to one logit over another before RTG has been seen. In practice, it seems like it does. For example, in this situation (picture below from the app) with RTG = 90, there is a wall to the left of the agent and the state appears to strongly encourage forward/right. I interpret this as “some agent behaviours are independent of RTG and can be encouraged as a function of the observation/state before RTG is seen”.
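The check itself is just a direct contribution calculation from the state token, something like the sketch below. `state_embedding` and `W_U_action` are stand-in names I’m using for the state token’s residual stream vector and the action unembedding, not the app’s actual variables.

```python
import torch

def direct_state_contribution(state_embedding, W_U_action, action_names):
    # Project the state token's residual vector onto each action's
    # unembedding direction: the state's direct contribution to logits,
    # before attention has mixed in any RTG information.
    logits = state_embedding @ W_U_action  # (n_actions,)
    return dict(zip(action_names, logits.tolist()))

# Dummy shapes just to show the call; the real vectors would come from a
# forward pass cached before RTG is attended to.
d_model, actions = 128, ["left", "right", "forward"]
state_embedding = torch.randn(d_model)           # stand-in residual vector
W_U_action = torch.randn(d_model, len(actions))  # stand-in unembedding
print(direct_state_contribution(state_embedding, W_U_action, actions))

# With a wall to the agent's left, the claim above corresponds to this
# returning noticeably higher values for "forward"/"right" than "left".
```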
To clear up some language as well:
- state encoding → how is the state represented? weird minigrid schema.
- state token → the value of a state represented as a vector input to the model.
- state embedding → an internal representation of the state at some point in the model.
I should have written state-token, not state-embedding, in the quoted paragraph. Apologies if this led to confusion.