I agree that it’s odd that RL as a field hasn’t dealt much with multi-goal problems. I wrote two papers that address this in the neuroscience domain:
How sequential interactive processing within frontostriatal loops supports a continuum of habitual to controlled processing
Neural mechanisms of human decision-making
The first deals a lot with the tangled terminology used in the field; the second implements a mixed model-free and model-based neural network model of human decision-making in a multi-goal task. It makes decisions using interactions between the basal ganglia, dopamine system, and cortex.
It implements the type of system you mention: the current reward function is provided to the learning system as an input. The system learns to do both model-free and model-based decision-making using experience with outcomes of the different reward functions.
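To make that concrete, here is a minimal sketch of what "reward function as an input" means, in tabular Q-learning form. This is illustrative only, not the network from the papers above; the names and sizes are made up.

```python
import numpy as np

# Goal-conditioned model-free RL in miniature: the identity of the
# currently active reward function, g, is just another index into the
# value table, so a single learner serves every goal.
n_goals, n_states, n_actions = 3, 20, 4
Q = np.zeros((n_goals, n_states, n_actions))

def td_update(g, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One Q-learning step; r comes from whichever reward function g is active."""
    target = r + gamma * Q[g, s_next].max()
    Q[g, s, a] += alpha * (target - Q[g, s, a])

def act(g, s, eps=0.1):
    """Epsilon-greedy action choice under the currently active goal g."""
    if np.random.rand() < eps:
        return np.random.randint(n_actions)
    return int(Q[g, s].argmax())
```

Switching goals is just switching g; nothing about the learner changes. A model-based planner can be conditioned on g the same way, by evaluating imagined outcomes under whichever reward function is currently active.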
Obviously I think this is how the mammalian brain does it; I also think that the indirect evidence for this is pretty strong.
I don’t think this makes the learning space vastly larger, because the dimensionality of reward functions is much lower than that of environments. I won’t reach for the fridge handle if I’m stuffed (0 on the hunger reward function); but neither will I reach for it if there’s no fridge near me, or if it’s a stranger’s fridge, etc.
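A back-of-the-envelope version of that claim (the dimensions and the drive count here are invented for illustration):

```python
# The policy's input is (environment observation, internal drive vector).
# Because the drive vector is tiny relative to the observation, adding
# it barely grows the learning space.
obs_dim = 10_000    # rich sensory observation (hypothetical size)
drive_dim = 8       # hunger, thirst, temperature, ... (hypothetical count)
print((obs_dim + drive_dim) / obs_dim)  # -> 1.0008

def reach_for_fridge(hunger: float, fridge_nearby: bool, own_fridge: bool) -> bool:
    """The drive and the environment jointly gate the action:
    zeroing either the hunger signal or the affordance suppresses it."""
    return hunger > 0.0 and fridge_nearby and own_fridge
```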
So that’s on the side of how the brain does multi-goal RL for decision-making.
As for the implications for alignment, I’m less excited.
Being able to give your agent a different goal and have it flexibly pursue that goal is great as long as you’re in control of that agent. If it actually gains autonomy (which I expect any human-level or better AGI to do sooner or later, probably sooner), it’s going to do whatever it wants. Its wants are now more complicated than those of a single-goal, model-based (consequentialist) agent, but that seems to be neither here nor there for alignment purposes.
This does help lead to a human-like AGI, but I see humans as barely-aligned-on-average, so that any deviation from an accurate copy could well be deadly. In particular, sociopaths do not seem to be adequately aligned, and since sociopathy has a large genetic component, I think we’d need to figure out what that adds. Odds are it’s a specific pro-social instinct, probably implemented as a reward function. It’s this logic that leads to Steve Byrnes’ research agenda.
Maybe I’m not getting the full implications you see for alignment? I’d sure like to increase my estimate of aligning a brainlike AGI, because I think there’s a good chance we get something in that general architecture as the first real AGI. I think there’s a route there, but it’s not particularly reliable. I hope we get a language model agent as the first AGI, since aligning those seems more reliable.
Thanks for linking your papers; it’s definitely interesting that you have been thinking along similar lines. The key reason I think studying this is important is that these hedonic loops demonstrate two things. (a) Mammals, including humans, are in practice exceptionally well aligned to basic homeostatic needs and basic hedonic loops: it is extremely hard and rare for people to choose not to follow homeostatic drives. I think humans are mostly ‘misaligned’ about higher-level things like morality and empathy because we don’t actually have direct drives for them hardcoded in the hypothalamus the way we do for primary rewards. Higher-level behaviours are either socio-culturally learned through unsupervised, cortically based learning or derived from RL extrapolations from primary rewards, so it is no surprise that alignment to these ideals is weaker. (b) Relatively simple control loops are very effective at controlling vastly more complex unsupervised cognitive systems.
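As a toy picture of point (b), here is the kind of simple loop I have in mind: a thermostat-style controller that only tracks an internal setpoint and emits reward as drive reduction, while an arbitrarily complex learned system does everything else. This is my own sketch in the spirit of homeostatic RL models, not code from either paper.

```python
# Hypothetical homeostatic loop: the "hypothalamus" sees only a scalar
# internal state and a setpoint; the policy it shapes can be arbitrarily
# complex.
SETPOINT = 0.0  # ideal internal state (e.g., energy balance)

def drive(internal_state: float) -> float:
    """Distance from the setpoint: the controller's entire world model."""
    return abs(internal_state - SETPOINT)

def homeostatic_reward(state_before: float, state_after: float) -> float:
    """Reward = drive reduction. Any action that moves the internal state
    toward the setpoint gets reinforced, whatever cognition produced it."""
    return drive(state_before) - drive(state_after)
```

The controller never needs to understand the system it is shaping; it only needs a reliable readout of the internal variable, which is plausibly why alignment to primary drives is so robust.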
I also agree this is similar to Steven Byrnes’ agenda; maybe this is just my own way of arriving at it.