We’re going to take it off-distribution and see whether it terminally values (1) just the coins, (2) a small handful of in-distribution proxies for getting coins, or (3) all of its in-distribution proxies for coins.
Is (2) here just referring to the type of stuff seen in Goal Misgeneralization in Deep Reinforcement Learning (the CoinRun agent navigates to the right-hand end of the level instead of fetching the coin)?
Yes!
We … were somewhat schizophrenic when previously laying out our experiments roadmap, and failed to sufficiently consider this existing result during early planning. What we would have done after replicating that result would have been much more work in that vein: trying to extract the qualitative relationships between learned values and varying training conditions.
We are currently switching to RL text adventures instead, though, because we expect to extract many more bits about these qualitative relationships from observing RL-tuned language models.
Cool! How do you tell if it is (2) or (3)?
When you take the agent off-distribution, offer it several proxies for in-distribution reinforcement. Set these up so that going out of its way for one proxy detours the agent away from pursuing another, and check whether you can modulate which proxy it detours for (by bringing one proxy much closer to the agent, say). If you can, you’ve learned that the agent must care at least somewhat about every proxy it pursues at a cost. If the agent hasn’t come to value a proxy at all, it will never take a detour to reach that proxy.
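To make that protocol concrete, here is a minimal sketch of a detour test in Python. The environment interface is entirely an assumption for illustration: a gymnasium-style env whose constructor (`make_env`, taking `proxy_distances`) lets us place each candidate proxy at a chosen distance from the agent, and which reports the first proxy reached via its `info` dict. None of these names come from a real library; they just mark the interface the experiment needs.

```python
# Hypothetical sketch of the detour test described above, not a real
# CoinRun harness. Assumes `make_env(proxy_distances=...)` builds an
# off-distribution level with each proxy (e.g. "coin", "right_wall")
# placed at the given distance, and that `info["proxy_reached"]` names
# the first proxy the agent touches.

from collections import Counter

def rollout(env, policy, max_steps=1000):
    """Run one episode; return the first proxy the agent reaches (or None)."""
    obs, _ = env.reset()
    for _ in range(max_steps):
        action = policy(obs)
        obs, _reward, terminated, truncated, info = env.step(action)
        if info.get("proxy_reached") is not None:
            return info["proxy_reached"]
        if terminated or truncated:
            break
    return None

def detour_test(make_env, policy, proxies, distances, episodes=50):
    """For each choice of which proxy sits near the agent, count which
    proxy the agent detours toward across rollouts."""
    results = {}
    for near in proxies:
        # Place `near` close to the agent and every other proxy far away.
        placement = {p: (distances["near"] if p == near else distances["far"])
                     for p in proxies}
        env = make_env(proxy_distances=placement)
        results[near] = Counter(rollout(env, policy) for _ in range(episodes))
    return results
```

Reading the output: if the agent reliably detours for whichever proxy is near (the coin when the coin is near, the right wall when the wall is near), it must place some value on both; a proxy it never detours for, at any placement, is one it hasn’t come to value at all.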