Importantly, this policy would naturally be highly specialized to a specific reward function. Naively, you can’t change the reward function and expect the policy to instantly adapt; instead you would have to retrain the network from scratch.
I don’t understand why standard RL algorithms in the basal ganglia wouldn’t work. Most RL problems have elements that can be viewed as homeostatic—if you’re playing CartPole you need to push left or right depending on the current state. Why can’t that generalise to seeking food iff the stomach is empty? Optimizing for a specific reward function doesn’t preclude that function itself depending on other variables; that just makes it a more complex function. What am I missing?
This is definitely possible, and it essentially amounts to augmenting the state with additional homeostatic variables and then learning a policy on the joint state space. However, some clever experiments, such as the linked Morrison and Berridge one, demonstrate that this is not all that is going on. Specifically, many animals can perform zero-shot changes in policy when the reward changes, even for a homeostatic state they have never experienced before—i.e., mice suddenly chase after salt water, which they previously disliked, when put in a state of salt deprivation they had never encountered.
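To make the "augmented state space" idea concrete, here is a minimal sketch of tabular learning where a homeostatic variable ("hungry") is simply part of the state and the reward depends on it. The environment, names, and numbers are all invented for illustration, not taken from the discussion above:

```python
import random

# Tabular value learning on a state space augmented with a
# homeostatic variable: hungry in {0, 1}.
ACTIONS = ["wait", "eat"]

def reward(hungry, action):
    # The reward function depends on the internal state:
    # eating pays off only when hungry, and is aversive when sated.
    if action == "eat":
        return 1.0 if hungry else -1.0
    return 0.0

def train(episodes=5000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    # Q-values indexed by the joint (homeostatic state, action) pair.
    Q = {(h, a): 0.0 for h in (0, 1) for a in ACTIONS}
    for _ in range(episodes):
        h = rng.choice((0, 1))      # sample a homeostatic state
        a = rng.choice(ACTIONS)     # exploratory action
        # One-step (bandit-style) update toward the observed reward.
        Q[(h, a)] += alpha * (reward(h, a) - Q[(h, a)])
    return Q

Q = train()
policy = {h: max(ACTIONS, key=lambda a: Q[(h, a)]) for h in (0, 1)}
print(policy)  # eats when hungry, waits when sated
```

The point of the objection above is that a single learned mapping from the joint state can express "seek food iff hungry"—but note that it only does so for homeostatic states that were actually visited during training, which is where the salt-deprivation result bites.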
The above describes the model-free component of learning reward-function-dependent policies. The Morrison and Berridge salt experiment demonstrates the model-based side, which probably comes from imagining specific outcomes and how they’d feel.
This is where I disagree! I don’t think the Morrison and Berridge experiment demonstrates the model-based side. It is consistent with model-based RL, but it is also consistent with model-free algorithms that can flexibly adapt to changing reward functions, such as linear RL. Personally, I think the latter is more likely: this is such a low-level response, and one that can be modulated entirely by subcortical systems, that it seems unlikely to require model-based planning.
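The zero-shot revaluation the experiment shows can be reproduced by algorithms that cache a predictive map rather than a full world model, in the spirit of successor-representation and linear-RL accounts (e.g. Piray & Daw). A minimal sketch, with an invented three-state environment and made-up reward numbers standing in for the salt experiment:

```python
import numpy as np

# States: 0 = start, 1 = sugar water, 2 = salt water.
n_states = 3
gamma = 0.9

# Transition matrix of a fixed default policy: from the start
# state the animal visits either outcome, then returns.
P = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])

# Successor representation: expected discounted future state
# occupancy under the default policy, M = (I - gamma * P)^-1.
# This is the cached quantity; it never needs relearning below.
M = np.linalg.inv(np.eye(n_states) - gamma * P)

# Normal state: sugar is rewarding, salt is mildly aversive.
r_normal = np.array([0.0, 1.0, -0.2])
# Novel salt-deprived state: the reward vector flips...
r_deprived = np.array([0.0, 0.2, 1.0])

# ...and state values update instantly via V = M @ r, with no
# new experience and no explicit model-based rollout.
V_normal = M @ r_normal
V_deprived = M @ r_deprived
print(V_deprived[2] > V_deprived[1])  # salt now preferred
```

The design choice this illustrates: the expensive, experience-dependent object is the occupancy map `M`, while the reward vector `r` can be swapped by a homeostatic signal at query time—so a flexible, reward-sensitive policy falls out without planning over imagined outcomes.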