Wei Dai comments on Concrete experiments in inner alignment

Wei Dai 12 Sep 2019 11:04 UTC
LW: 17 AF: 11

Train an RL agent with access to its previous step reward as part of its observation.

This is making me notice a terminological ambiguity. Sometimes “RL agent” refers to a model/policy trained by a reinforcement learning algorithm (such as REINFORCE), as you’re doing here. Sometimes it refers to an agent that maximizes expected reward (given as an input), such as AIXI, as in Daniel Dewey’s Learning What to Value. An “RL agent” in the first sense is not necessarily an “RL agent” in the second sense.

To disambiguate, it seems like a good idea to call the former kind of agent something like “RL-trained agent” and the latter kind “reward-maximizing agent”, or “reward-maximizer” for short. Then we can say things like, “If an RL-trained agent is not given direct access to its step rewards during training, it seems less likely to become a reward-maximizer.” Any thoughts on this suggestion? (I’ll probably make a post about this later, but thought I’d run it by you and any others who see this comment for a sanity check first.)
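
To make the two senses concrete, here is a minimal sketch on a toy two-armed bandit. Everything in it is an illustrative assumption rather than anything from the original post: the function names (`train_reinforce`, `rl_trained_policy`, `reward_maximizer`), the reward functions, and the hyperparameters are all hypothetical. The first agent is a policy whose parameters were shaped by REINFORCE and which never sees a reward signal when it acts. The second takes the reward function as an input and picks whatever maximizes it.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_reinforce(reward_fn, n_actions=2, steps=2000, lr=0.1, seed=0):
    """Sense 1: train a softmax policy with REINFORCE on a one-step bandit."""
    rng = random.Random(seed)
    logits = [0.0] * n_actions
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(n_actions), weights=probs)[0]
        reward = reward_fn(action)
        # d/d logit_k of log pi(action) is 1[k == action] - probs[k].
        for k in range(n_actions):
            grad = (1.0 if k == action else 0.0) - probs[k]
            logits[k] += lr * reward * grad
    return logits


def rl_trained_policy(logits):
    """Acts from learned parameters only; reward is not an input here."""
    return max(range(len(logits)), key=lambda k: logits[k])


def reward_maximizer(reward_fn, n_actions=2):
    """Sense 2: takes the reward function as input and maximizes it."""
    return max(range(n_actions), key=reward_fn)


def training_reward(action):
    # Hypothetical reward used during training: action 0 is rewarded.
    return 1.0 if action == 0 else 0.0


def new_reward(action):
    # Hypothetical changed reward at deployment: action 1 is rewarded.
    return 1.0 if action == 1 else 0.0


logits = train_reinforce(training_reward)
print("RL-trained agent picks:", rl_trained_policy(logits))
print("Reward-maximizer under new reward picks:", reward_maximizer(new_reward))
```

In this toy setup the two agents agree while the training reward holds, and come apart once the reward changes: the RL-trained policy keeps doing what was reinforced, while the reward-maximizer follows the new reward. That is only meant to illustrate the terminology, not to claim anything about what real RL-trained models learn.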
- evhub 12 Sep 2019 16:32 UTC
  LW: 1 AF: 1
  When I use the term “RL agent,” I always mean an agent trained via RL. The other usage just seems confused to me in that it seems to be assuming that if you use RL you’ll get an agent which is “trying” to maximize its reward, which is not necessarily the case. “Reward-maximizer” seems like a much better term to describe that situation.
  - Wei Dai 12 Sep 2019 17:25 UTC
    LW: 11 AF: 8
    
    When I use the term “RL agent,” I always mean an agent trained via RL.
    
    I think the problem with this usage is that “RL agent” originally meant something like “an agent designed to solve an RL problem”, where “RL problem” is something like “a class of problems with MDPs as the central example”. I think it’s just not a well-defined term at this point, and if you Google it, you get plenty of results that say things like “the goal of our RL agent is to maximize the expected cumulative reward”, or “AIXI is a reinforcement learning agent”. I guess this is fine for AI capabilities work but really confusing for AI safety work.
    
    So, consider switching to “RL-trained agent” for greater clarity (unless someone has a better suggestion)? ETA: Maybe “reinforcement trained agent”?