It’s great to see these examples spelled out with clear and careful experiments. There’s no doubt that the CoinRun agent is best described as trying to get to the end of the level, not the coin.
Some in-depth comments on the interpretability experiments:
There are actually two slightly different versions of CoinRun, the original version and the Procgen version (without the gray squares in the top left). The model was trained on the original version, but the OOD environment is based on the Procgen version, so your observations are more OOD than you may have realized! The supplementary material to URLV includes model weights and interfaces for both versions, in case you are interested in running an ablation.
You used the OOD environment both to select the most important directions in activation space, and to compute value function attribution. A natural question is therefore: what happens if you select the directions in activation space using the unmodified environment, while still computing value function attribution using the OOD environment? It could be that the coin direction still has some value function attribution, but that it is not selected as an important direction because the attribution is weaker.
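Something like the following might work for that ablation. This is only a sketch: `collect_activations` and the environment names are hypothetical stand-ins for whatever your pipeline provides, NMF is used in the spirit of URLV's direction selection, and gradient-times-activation is a simpler stand-in for URLV's integrated-gradients attribution.

```python
import numpy as np
from sklearn.decomposition import NMF

def select_directions(acts, n_directions=8):
    """Pick directions in activation space by NMF over spatial positions,
    in the spirit of URLV. acts has shape (frames, H, W, channels)."""
    flat = acts.reshape(-1, acts.shape[-1])            # (frames*H*W, channels)
    nmf = NMF(n_components=n_directions, init="nndsvd", max_iter=500)
    nmf.fit(np.maximum(flat, 0))                       # NMF requires non-negative input
    return nmf.components_                             # (n_directions, channels)

def value_attribution(acts, value_grads, directions):
    """Gradient-times-activation attribution of the value function,
    projected onto each direction and summed over space and frames."""
    attr = acts * value_grads                          # (frames, H, W, channels)
    return np.einsum("fhwc,dc->d", attr, directions)   # (n_directions,)

# Proposed ablation (collect_activations is a hypothetical helper that rolls out
# the agent and returns hidden-layer activations and value-function gradients,
# each of shape (frames, H, W, channels)):
#
#   acts_clean, grads_clean = collect_activations(env="coinrun")        # unmodified env
#   directions = select_directions(acts_clean)                          # directions chosen here
#   acts_ood, grads_ood = collect_activations(env="coinrun_no_coin")    # OOD env
#   print(value_attribution(acts_ood, grads_ood, directions))           # attribution computed here
```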
If the coin direction has less value function attribution in the OOD environment, it could either be because it is activated less (the “bottom” of the network behaves differently), or because it is less important to the value function (the “top” of the network behaves differently). It is probably a mixture of both, but it would be interesting to check.
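A rough way to check is to factor the attribution along the coin direction into the activation along that direction and the value function's sensitivity to it, and compare each factor across the two environments. A minimal sketch, with the same shape assumptions as above:

```python
import numpy as np

def decompose_attribution(acts, value_grads, coin_direction):
    """Split attribution along one direction into its two factors.
    acts, value_grads: (frames, H, W, channels); coin_direction: (channels,).
    Returns (mean activation along the direction, mean value-gradient along it)."""
    activation = np.einsum("fhwc,c->f", acts, coin_direction)          # "bottom" of the network
    sensitivity = np.einsum("fhwc,c->f", value_grads, coin_direction)  # "top" of the network
    return activation.mean(), sensitivity.mean()

# If the activation drops in the OOD environment but the sensitivity doesn't,
# the change is mostly in the bottom of the network; if the sensitivity drops,
# it is mostly in the top.
```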
Remember that the dataset example-based feature visualization selects the most extreme examples, so even if the feature only puts a small weight on coins, coins may still be present in most of the dataset examples. I think a similar phenomenon is going on in URLV with the “Buzzsaw obstacle or platform” feature (in light blue). If you explore the interface, you’ll see that the feature often activates on platforms with no buzzsaws, even though buzzsaws appear in all of the dataset examples.
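For reference, dataset example-based visualization is essentially a top-k selection over the most extreme frames, which is why coins can show up in every displayed example even when the coin weight is small. A minimal sketch (not the URLV implementation):

```python
import numpy as np

def top_dataset_examples(frames, acts, direction, k=12):
    """Rank frames by their maximum spatial activation along `direction`
    and keep the top k: the usual dataset example-based visualization.
    Because only the most extreme frames survive, coins can appear in
    every selected example even if the direction's weight on coins is small."""
    scores = np.einsum("fhwc,c->fhw", acts, direction).max(axis=(1, 2))
    return frames[np.argsort(scores)[-k:][::-1]]
```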