Bengio et al. conflate “the policy/AI/agent is trained with RL and achieves a high (maximal?) score on the training distribution” with “the policy/AI/agent is trained such that it wants to maximize reward (or some correlate of it) even outside of training”.
Wait, does the friend elsewhere add “… and the author is right” or “… and sloppiness isn’t that bad”? My read of the quote you’ve provided is that it’s a critique, not an excuse for the sloppiness.