[Question] Did they or didn’t they learn tool use?

Daniel Kokotajlo 29 Jul 2021 13:26 UTC

LW: 16 AF: 8

DeepMind claims that their recent agents learned tool use; in particular, they learned to build ramps to access inaccessible areas even though none of their training environments contained any inaccessible areas!

This is the result that impresses me most. However, I’m uncertain whether it’s really true, or just cherry-picked.

Their results showreel depicts multiple instances of the AIs accessing inaccessible areas via ramps and other clever methods. However, on page 51 of the actual paper, they list all the hand-authored tasks they used to test the agents, and mention that on some of them the agents did not get >0 reward. One of the tasks that the agents got 0 reward on is:

Tool Use Climb 1: The agent must use the objects to reach a higher floor with the target object.

So… what gives? They have video of the agents using objects to reach a higher floor containing the target object, multiple separate times. But then the chart says the agents failed to achieve reward >0 on this task.

EDIT: Maybe the paper just has a typo? The blog post contains this image, which appears to show non-zero reward for the “tool use” task, zero-shot:


Vanessa Kosoy 29 Jul 2021 16:49 UTC
LW: 14 AF: 9
0
AF
On page 28 they say:

Whilst some tasks do show successful ramp building (Figure 21), some hand-authored tasks require multiple ramps to be built to navigate up multiple floors which are inaccessible. In these tasks the agent fails.

From this, I’m guessing that it sometimes succeeds in building one ramp, but fails when the task requires building multiple ramps.
- Daniel Kokotajlo 29 Jul 2021 17:08 UTC
  LW: 2 AF: 2
  0
  AF Parent
  Nice, I missed that! Thanks!

jbkjr 29 Jul 2021 14:23 UTC
3 points
0
One idea as to the source of the potential discrepancy… did any of the task prompts for the tasks in which it did figure out how to use tools tell it explicitly to “use the objects to reach a higher floor,” or something similar? I’m wondering if the cases where it did use tools are examples where doing so was instrumentally useful to achieving a prompted objective that didn’t explicitly require tool use.
- Daniel Kokotajlo 29 Jul 2021 16:02 UTC
  2 points
  0
  Parent
  None of the prompts tell it what to do; they aren’t even in English. (Or so I think? Correct me if I’m wrong!) Instead they are in propositional logic, using atoms that refer to objects, colors, relations, and players. They just give the reward function in disjunctive normal form (i.e. a big disjunction of conjunctions) and present it to the agent to observe.
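A minimal sketch of what a goal in disjunctive normal form could look like, assuming a goal is a list of clauses and each clause is a set of (atom, polarity) literals. The atom names, the `reward` function, and the 0/1 reward are hypothetical illustrations, not taken from the paper.

```python
from typing import FrozenSet, List, Set, Tuple

# A literal is an atom name plus a polarity (False means the atom is negated).
Literal = Tuple[str, bool]
# A clause is a conjunction of literals; a goal is a disjunction of clauses.
Clause = FrozenSet[Literal]
Goal = List[Clause]


def reward(goal: Goal, true_atoms: Set[str]) -> int:
    """Return 1 if at least one clause has all of its literals satisfied, else 0."""
    return int(
        any(
            all((atom in true_atoms) == polarity for atom, polarity in clause)
            for clause in goal
        )
    )


# Hypothetical goal: near(me, yellow_sphere)
#                    OR (see(me, purple_cube) AND NOT hold(me, black_pyramid))
example_goal: Goal = [
    frozenset({("near(me, yellow_sphere)", True)}),
    frozenset({("see(me, purple_cube)", True), ("hold(me, black_pyramid)", False)}),
]
```

Under this encoding nothing in the goal says how to satisfy an atom, so a behavior like ramp building would have to be discovered by the agent rather than prompted.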
Daniel Kokotajlo 29 Jul 2021 13:35 UTC
LW: 2 AF: 2
0
AF
Also, what does the “1G 38G 152G” mean in the image? I can’t tell. I would have thought it means number of games trained on, or something, except that at the top it says 0-Shot.
- Jozdien 29 Jul 2021 14:16 UTC
  3 points
  0
  Parent
  The description of that figure in the paper says “three points in training” of a generation 5 agent, so it’s probably the performance of that agent on the task at different learning steps?
  Edit: To clarify, I think the agent is evaluated 0-shot on the six hand-authored tasks in the figure, while being trained on other tasks to improve its normalized score percentiles. That figure is meant to show the correlation of this metric with improvement on the hand-authored tasks.
  - Daniel Kokotajlo 29 Jul 2021 16:01 UTC
    3 points
    0
    Parent
    In that case, the thing in the paper must be a typo, because the “Tool Use” graph here is clearly >0 reward, even for the 1G agent.
    - Jozdien 29 Jul 2021 17:24 UTC
      3 points
      0
      Parent
      It could be that the “Tool Use” in the graph is the “Tool Use Gap” task instead of the “Tool Use Climb” task. But they don’t specify which anywhere I could easily find.