This leaves it unclear how it is decided that old “agents” should be used.
Yeah, it’s complicated and messy and not that important for the main point of the paper, so I didn’t write about it in the summary.
Was this switch to a new agent automatic or done by hand? (Was ‘the agent has plateaued’ determined by a program or the authors of the paper?)
Automatic / program. See Section 4, whose first sentence is “To generalize this observation, we first propose a simple algorithm for selecting states associated with plateaus of the last agent.”
(The algorithm cheats a bit by assuming that you can run the original agent for some additional time, but then “roll it back” to the first state at which it got the max reward along the trajectory.)
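Concretely, that selection rule is simple enough to sketch. Here’s a minimal Python sketch of it, not the paper’s actual code: `run_episode`, `agent.act`, and `env.clone_state` are hypothetical interfaces, and `clone_state` assumes a snapshot-able emulator of the kind the Atari ALE provides.

```python
# Minimal sketch of the plateau-state selection described above.
# NOT the paper's implementation: `agent.act`, `env.step`, `env.reset`,
# and `env.clone_state` are hypothetical interfaces; `clone_state`
# assumes the emulator state can be snapshotted and restored later.

def run_episode(agent, env):
    """Roll out one episode; return a list of (state_snapshot, cum_reward)."""
    trajectory = []
    cum_reward = 0.0
    obs, done = env.reset(), False
    while not done:
        obs, reward, done, _ = env.step(agent.act(obs))
        cum_reward += reward
        trajectory.append((env.clone_state(), cum_reward))
    return trajectory

def select_plateau_state(agent, env):
    """Run the old agent to the end of an episode, then "roll back" to the
    FIRST state at which it achieved its maximum return along the
    trajectory; the next agent starts playing from that state."""
    trajectory = run_episode(agent, env)
    max_return = max(r for _, r in trajectory)
    for state, cum_reward in trajectory:
        if cum_reward == max_return:
            return state
```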
Not apparent.
I may be missing your point, but isn’t the fact that the Memento agent works on Montezuma’s Revenge evidence that learning is not generalizing across “sections” in Montezuma’s Revenge?
I was indicating that I hadn’t found the answer I sought (but I included those quotes because they seemed interesting, if unrelated).
Thanks for highlighting that section. The reason I was interested is that I was thinking of the neural networks as being deployed to complete individual tasks rather than playing the entire game by themselves.
I ended up concluding that the game was being divided up into ‘parts’ or epochs, each with its own respective agent deployed in sequence. The observation that “this method makes things easy as long as there’s no interference” is interesting when compared to multi-agent learning: the agents are on the same team, yet cooperation doesn’t seem to be easy under these circumstances (or at least not an efficient strategy, in terms of computational constraints). It reminded me of my questions about those approaches, like: Does freezing one agent for a round, so that it’s predictable, and then training the other one (or having it play with a human) improve things? And how can ‘learning to cooperate better’ be balanced with ‘continuing to be able to cooperate/coordinate with the other player’?
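Concretely, the freeze-then-train schedule I was wondering about would look something like this rough sketch (all names and interfaces here are hypothetical, not from the paper or any particular library):

```python
# Rough sketch of a "freeze one agent, train the other" schedule.
# Agent.freeze/unfreeze and train_against are hypothetical stand-ins.

def alternating_training(agent_a, agent_b, env, rounds=10, steps_per_round=10_000):
    """Alternate which agent learns each round, keeping its partner frozen
    so the learning agent always faces a stationary, predictable teammate."""
    for r in range(rounds):
        if r % 2 == 0:
            learner, frozen = agent_a, agent_b
        else:
            learner, frozen = agent_b, agent_a
        frozen.freeze()      # no gradient updates this round; policy held fixed
        learner.unfreeze()
        train_against(learner, frozen, env, steps=steps_per_round)
```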