I had totally forgotten about your subagents post.
this post doesn’t cleanly distinguish between reward-maximization and utility-maximization
I’ve been thinking that they kinda blend together in model-based RL, or at least the kind of (brain-like) model-based RL AGI that I normally think about. See this comment and surrounding discussion. Basically, one way to do model-based RL is to have the agent create a predictive model of the reward and then judge plans based on their tendency to maximize “the reward as currently understood by my predictive model”. Then “the reward as currently understood by my predictive model” is basically a utility function. But at the same time, there’s a separate subroutine that edits the reward prediction model (≈ utility function) to ever more closely approximate the true reward function (by some learning algorithm, presumably involving reward prediction errors).
In other words: At any given time, the part of the agent that’s making plans and taking actions looks like a utility maximizer. But if you lump together that part plus the subroutine that keeps editing the reward prediction model to better approximate the real reward signal, then that whole system is a reward-maximizing RL agent.
Please tell me if that makes any sense or not; I’ve been planning to write pretty much exactly this comment (but with a diagram) into a short post.
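Here is a toy sketch of the picture above, in case it helps make it concrete. Everything specific in it is an assumption for illustration only: the three plan names, the hidden `TRUE_REWARD` table, the learning rate, and the epsilon-style exploration are all made up, and this is not meant to be how a brain-like system actually does it. The point is just the split: `choose_plan` acts like a utility maximizer over the current reward-prediction model, while `update_model` is the separate subroutine that edits that model using reward prediction errors.

```python
import random

# Hypothetical toy world: three plans, each with a hidden "true" reward.
TRUE_REWARD = {"eat_candy": 1.0, "do_chores": 0.2, "touch_stove": -1.0}


class Agent:
    def __init__(self, learning_rate=0.5):
        # Reward-prediction model (roughly, the agent's current utility function).
        self.predicted_reward = {plan: 0.0 for plan in TRUE_REWARD}
        self.learning_rate = learning_rate

    def choose_plan(self, explore_prob, rng):
        # Planner: maximizes "the reward as currently understood by my model".
        if rng.random() < explore_prob:
            return rng.choice(sorted(self.predicted_reward))
        return max(self.predicted_reward, key=self.predicted_reward.get)

    def update_model(self, plan, reward):
        # Separate subroutine: nudge the model toward the real reward signal,
        # in proportion to the reward prediction error.
        error = reward - self.predicted_reward[plan]
        self.predicted_reward[plan] += self.learning_rate * error
        return error


def run(steps=200, seed=0):
    rng = random.Random(seed)
    agent = Agent()
    for _ in range(steps):
        plan = agent.choose_plan(explore_prob=0.2, rng=rng)
        agent.update_model(plan, TRUE_REWARD[plan])
    return agent


agent = run()
print(agent.predicted_reward)
print(agent.choose_plan(explore_prob=0.0, rng=random.Random(1)))
```

At any single step the planner only ever consults `predicted_reward`, so it looks like a utility maximizer; it is only the planner plus `update_model` together that track the true reward.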
Not sure how all the details play out—in particular, my big question for any RL setup is “how does it avoid wireheading?”. In this case, presumably there would have to be some kind of constraint on the reward-prediction model, so that it ends up associating the reward with the state of the environment rather than the state of the sensors.
Um, unreliably, at least by default. Like, some humans are hedonists, others aren’t.
I think there’s a “hardcoded” credit assignment algorithm. When there’s a reward prediction error, that algorithm primarily increments the reward-prediction / value associated with whatever stuff in the world-model became newly active maybe half a second earlier. And maybe to a lesser extent, it also increments the reward-prediction / value associated with anything else you were thinking about at the time. (I’m not sure of the gory details here.)
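To make that rule concrete, here is a very rough sketch. Since I'm not sure of the real details, every number and name in it is a placeholder assumption: the half-second `lead`, the tolerance `window`, the two rates, and the three concept names are hypothetical, not claims about the actual algorithm. It only illustrates "most credit to whatever became active shortly before the error, a little to everything else".

```python
def assign_credit(values, onsets, error_time, error,
                  lead=0.5, window=0.25, main_rate=0.5, side_rate=0.05):
    """Hypothetical credit-assignment rule (all parameters are assumptions).

    values:  concept -> current reward-prediction / value
    onsets:  concept -> time (seconds) at which the concept became active
    Concepts whose onset was about `lead` seconds before the error get
    `main_rate`; every other active concept gets the smaller `side_rate`.
    """
    for concept, onset in onsets.items():
        gap = error_time - onset
        rate = main_rate if abs(gap - lead) <= window else side_rate
        values[concept] += rate * error
    return values


values = {"saw_candy": 0.0, "background_music": 0.0, "reward_signal_itself": 0.0}
onsets = {"saw_candy": 9.5, "background_music": 2.0, "reward_signal_itself": 10.0}

# A positive reward prediction error arrives at t = 10.0 seconds.
assign_credit(values, onsets, error_time=10.0, error=1.0)
print(values)
```

In this toy setup the concept that lit up half a second before the error picks up most of the value, while a concept that only becomes active at the moment of the error gets the small leftover share.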
Anyway, insofar as “the reward signal itself” is part of the world-model, it’s possible that reward-prediction / value will wind up attached to that concept. And then that’s a desire to wirehead. But it’s not inevitable. Some of the relevant dynamics are:
Timing—if credit goes mainly to signals that slightly precede the reward prediction error, then the reward signal itself is not a great fit.
Explaining away—once you have a way to accurately predict some set of reward signals, it makes the reward prediction errors go away, so the credit assignment algorithm stops running for those signals. So the first good reward-predicting model gets to stick around by default. Example: we learn early in life that the “eating candy” concept predicts certain reward signals, and then we get older and learn that the “certain neural signals in my brain” concept predicts those same reward signals too. But just learning that fact doesn’t automatically translate into “I really want those certain neural signals in my brain”. Only the credit assignment algorithm can make a thought appealing, and if the rewards are already being predicted then the credit assignment algorithm is inactive. (This is kinda like the behaviorism concept of blocking.)
There may be some kind of bias to assign credit to predictive models that are simple functions of sensory inputs, when such a model exists, other things equal. (I’m thinking here of the relation between amygdala predictions, which I think are restricted to relatively simple functions of sensory input, versus mPFC predictions, which I think can involve more abstract situational knowledge. I’m still kinda confused about how this works though.)
There’s a difference between hedonism-lite (“I want to feel good, although it’s not the only thing I care about”) and hedonism-level-10 (“I care about nothing whatsoever except feeling good”). My model would suggest that hedonism-lite is widespread, but hedonism-level-10 is vanishingly rare or nonexistent, because it requires that somehow all value gets removed from absolutely everything in the world-model except that one concept of the reward signal.
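The "explaining away" dynamic can be illustrated with the standard Rescorla-Wagner model from the conditioning literature, which is the textbook account of blocking. To be clear, this is a swapped-in stand-in, not a claim about what the brain's credit assignment algorithm actually is; the concept names, learning rate, and trial counts are assumptions for illustration.

```python
def rescorla_wagner(trials, weights, learning_rate=0.3):
    """Rescorla-Wagner updates. Each trial is (set of active concepts, reward)."""
    for active, reward in trials:
        prediction = sum(weights[c] for c in active)
        error = reward - prediction  # reward prediction error
        for c in active:
            # Every active concept shares the same error term.
            weights[c] += learning_rate * error
    return weights


# Blocking: "eating_candy" alone learns to predict the reward first...
blocked = {"eating_candy": 0.0, "neural_signal": 0.0}
rescorla_wagner([({"eating_candy"}, 1.0)] * 50, blocked)
# ...then the "neural_signal" concept shows up alongside it.
rescorla_wagner([({"eating_candy", "neural_signal"}, 1.0)] * 50, blocked)

# Control: both concepts are present from the very first trial.
control = {"eating_candy": 0.0, "neural_signal": 0.0}
rescorla_wagner([({"eating_candy", "neural_signal"}, 1.0)] * 50, control)

print("blocked:", blocked)
print("control:", control)
```

In the blocked run the reward is already fully predicted by the time the second concept appears, so the error is essentially zero and that concept picks up almost no value. In the control run the two concepts split the credit about evenly.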
For AGIs we would probably want to do other things too, like (somehow) use transparency to find “the reward signal itself” in the world-model and manually fix its reward-prediction / value at zero, or whatever else we can think of. Also, I think the more likely failure mode is “wireheading-lite”, where the desire to wirehead is trading off against other things it cares about, and then hopefully conservatism (section 2 here) can help prevent catastrophe.
Thanks!
Good explanation, conceptually.