Amazing post! I finally took the time to read it, and it was just as stimulating as I expected. My general take is that I want more work like this to be done, and that thinking about relevant experiments seems very valuable (at least in settings like this one, where you showed experiments are actually possible).
To test how stable the objective robustness failure is, we trained a series of agents on environments which vary in how often the coin is placed randomly.
Is the choice of which runs have randomized coins also random, or is it always the first/last runs?
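To make the question concrete, here’s a minimal sketch of the two schedules I have in mind (the function names and signatures are mine, not from the paper):

```python
import numpy as np

def bernoulli_schedule(p_random: float, rng: np.random.Generator) -> bool:
    """Decide independently at every episode reset whether this episode's
    coin is placed randomly; the expected fraction of random-coin episodes
    is p_random."""
    return rng.random() < p_random

def fixed_subset_schedule(episode_idx: int, n_episodes: int, p_random: float) -> bool:
    """Alternative: only a fixed prefix of the episodes gets a random coin."""
    return episode_idx < int(p_random * n_episodes)
```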
Results may be seen below, showing the frequencies of two different outcomes: 1) failure of capability: the agent dies or gets stuck, thus neither getting the coin nor reaching the end of the level, and 2) failure of objective: the agent misses the coin and navigates to the end of the level. As expected, as the diversity of the training environment increases, the proportion of objective robustness failures decreases, as the model learns to pursue the coin instead of going to the end of the level.
What’s your interpretation of the fact that capability robustness failures become more frequent? I’m imagining something like “there is a threshold under which randomizing the coin mostly makes the model aware that it should do something more complicated than just going right, without making it competent enough to actually do it”. In the spirit of your work, I wonder how one could check that experimentally.
It’s also quite interesting to see that even just 2% of randomized coins divides the frequency of objective robustness failures by three.
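For reference, this is roughly how I picture each rollout being bucketed into the outcomes above (a sketch under my own assumptions; the got_coin and reached_end flags are hypothetical, not the paper’s actual bookkeeping):

```python
def classify_rollout(got_coin: bool, reached_end: bool) -> str:
    """Bucket a finished CoinRun rollout into the outcomes defined above."""
    if got_coin:
        return "success"            # pursued the true objective
    if reached_end:
        return "objective_failure"  # capable, but chasing the wrong goal
    return "capability_failure"     # died or got stuck: neither coin nor end
```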
We hypothesize that in CoinRun, the policy that always navigates to the end of the level is preferred because it is simple in terms of its action space: simply move as far right as possible. The same is true for the Maze experiment (Variant 1), where the agent has learned to navigate to the top right corner.
Navigating to the top right corner feels significantly more complex than always moving right, though. I get that in your example the question is just whether it’s simpler than the intended objective of reaching the cheese, but this delta in complexity might matter. For example, maybe the smaller the delta, the fewer well-chosen examples are needed to correct the objective.
(Which would be very interesting, as CoinRun has a potentially big delta, and yet the agent recovers the real objective with very few examples)
We train an RL agent on a version of the Procgen Maze environment where the reward is a randomly placed yellow gem. At test time, we deploy it on a modified environment featuring two randomly placed objects: a yellow star and a red gem; the agent is forced to choose between consistency in shape and consistency in color (shown above). Except for occasionally getting stuck in a corner, the agent almost always successfully pursues the yellow star, thus generalizing in favor of color rather than shape consistency. When there is no straightforward generalization of the training reward, the way in which the agent’s objective will generalize out-of-distribution is determined by its inductive biases.
Do you think you could make it generalize to the shape instead? Maybe by adding other objects with the same color but different shapes?
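Concretely, the kind of tweak I’m imagining (a sketch with hypothetical helpers, not the actual Procgen environment API):

```python
import random

SHAPES = ["gem", "star", "square"]

def spawn_objects(reward_shape: str = "gem", reward_color: str = "yellow"):
    """Place the rewarded object plus same-color distractors of other shapes,
    so that color alone no longer predicts reward and the agent has to key
    on shape instead."""
    objects = [(reward_shape, reward_color, True)]        # (shape, color, rewarded)
    for shape in SHAPES:
        if shape != reward_shape:
            objects.append((shape, reward_color, False))  # same color, unrewarded
    random.shuffle(objects)
    return objects
```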
occasionally, it even gets distracted by the keys in the inventory (which are displayed in the top right corner) and spends the rest of the episode trying to collect them instead of opening the remaining chest(s)
That’s quite funny, and yet perfectly fitting.
Applying the intentional stance, we describe the agent as having learned a simple behavioral objective: collect as many keys as possible, while sometimes visiting chests.
I wonder if we can interpret it slightly differently: “collect all the keys and then open all the chests”. For example, when it doesn’t get stuck trying to reach its inventory, does it move on to open all the chests?
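One way to check that interpretation empirically (a sketch; events is a hypothetical ordered log of “key”/“chest” interactions from one episode):

```python
def keys_then_chests(events: list[str]) -> bool:
    """True iff every key pickup precedes every chest opening in the episode."""
    last_key = max((i for i, e in enumerate(events) if e == "key"), default=-1)
    first_chest = min((i for i, e in enumerate(events) if e == "chest"),
                      default=len(events))
    return last_key < first_chest
```

The reinterpretation would predict that this holds in most episodes where the agent doesn’t fixate on the inventory.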
When we perform the same analysis on rollouts collected from the modified training distribution (where coin position is randomized)
I’m confused as to the goal of this experiment. It doesn’t seem like a way to invalidate the features presented in the Distill post, since you’re using different environments. On the other hand, it might give some explanation for the increase in capability failures with randomness.
Here’s a different approach that gives clearer results. In the next figure we track the value estimate during one entire rollout on the test distribution, and plot some relevant frames. It’s plausible though not certain that the value function does indeed react to the coin; what’s clear in any case is that the value estimate has a far stronger positive reaction in response to the agent seeing it is about to reach the far right wall.
Really nice idea for an experiment! Have you tried it with multiple rollouts? I agree that based on the graph, the value estimate seems far more driven by reaching the right wall.
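For instance, something like this (a sketch assuming a hypothetical value_fn(obs) that exposes the critic’s value estimate, and the old gym step API):

```python
import numpy as np
import matplotlib.pyplot as plt

def collect_value_traces(env, policy, value_fn, n_rollouts=20, max_steps=500):
    """Record the critic's value estimate at every step of several test rollouts."""
    traces = []
    for _ in range(n_rollouts):
        obs = env.reset()
        trace = []
        for _ in range(max_steps):
            trace.append(float(value_fn(obs)))
            obs, _, done, _ = env.step(policy(obs))
            if done:
                break
        traces.append(np.array(trace))
    return traces

def plot_value_traces(traces):
    """Overlay all traces: if the right-wall spike is systematic, it should
    appear near the end of (almost) every rollout, not just the one shown."""
    for trace in traces:
        plt.plot(trace, alpha=0.4)
    plt.xlabel("timestep")
    plt.ylabel("value estimate")
    plt.show()
```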
Additionally, a more rigorous and technical understanding of the concept of the behavioral objective seems obviously desirable. In this project, we understood it more informally, as equivalent to a goal or objective under the intentional stance, both because humans already intuitively understand and reason about the intentions of other systems through this lens, and because formally specifying a behavioral definition of objectives or goals fell outside the scope of the project. However, a more rigorous definition could enable the formalization of properties we could try to verify in our models with e.g. interpretability techniques.
Really, you mean deconfusing goal-directedness might be valuable? Who would have thought. :p