Seth Herd comments on Human preferences as RL critic values—implications for alignment

Seth Herd 19 Mar 2023 22:11 UTC
3 points
Me avoiding heroin isn’t “not governed by the critic,” instead what’s going on is that it’s learned behavior based largely on how the critic has acted so far in my life, which happens to generalize in a way that contradicts what the critic would do if I actually tried heroin.
I think we’re largely in agreement on this. The actor system is controlling a lot of our behavior. But it’s doing so as the critic system trained it to do. So the critic is in charge, minus generalization errors.
However, I also want to claim that the critic system is directly in charge when we’re using model-based thinking: when we come up with a predicted outcome before acting, the critic is supplying the estimate of how good that outcome is. But I’m not even sure this is a crux. The critic is still in charge in a pretty important way.
If I go out and become a heroin addict and start to value heroin, that information would also be found in the actor, not in the critic.
I think that information would be found in both the actor and the critic, though not to exactly the same degree; I think the critic probably updates faster. And the end result of the process can be a complex interaction between the actor, a world model (which I didn’t even bring into the article), and the critic. For instance, if it doesn’t occur to you to think about the likely consequences of doing heroin, the decision is based on the critic’s prediction that the heroin will be awesome. If the process, probably governed by the actor, does predict withdrawals and degradation as a result, then the decision is based on a rough sum that includes the critic’s very negative assignment of value to that part of the outcome.
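To make the "rough sum" concrete, here is a toy sketch. It is only an illustration of the idea, not a claim about how the brain computes anything: the outcome names, the numbers in `critic_value`, and the function `evaluate_plan` are all made up for the example.

```python
# Toy illustration: the critic assigns a value to each predicted outcome,
# and the decision rests on a rough sum over whichever outcomes were predicted.
# All names and numbers here are hypothetical.

critic_value = {
    "euphoria": 10.0,      # the critic's prediction that the heroin will be awesome
    "withdrawal": -30.0,   # very negative value for this part of the outcome
    "degradation": -50.0,
}

def evaluate_plan(predicted_outcomes):
    """Sum the critic's values over the outcomes that actually got predicted."""
    return sum(critic_value[outcome] for outcome in predicted_outcomes)

# The consequences never come to mind: only the immediate outcome is evaluated.
shallow = evaluate_plan(["euphoria"])

# The prediction process also surfaces the longer-term consequences.
deep = evaluate_plan(["euphoria", "withdrawal", "degradation"])

print("without thinking ahead:", shallow)
print("with predicted consequences:", deep)
```

The same critic produces opposite verdicts depending on which outcomes the prediction process hands it, which is the sense in which the result is an interaction between systems rather than the output of any one of them.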
The problem faced by evolution (and also by humans trying to align AI) is that the critic doesn’t start out omniscient, or even particularly clever—it doesn’t actually know what the expectation-discounted reward is.
I totally agree. That’s why the key question here is whether the critic can be reprogrammed after there’s enough knowledge in the actor and the world model.
As for the idea that the critic nudges, I agree. I think the early nudges are provided by a small variety of innate reward signals, and the critic then expands those with theories of the next thing we should explore, as it learns to connect those innate rewards to other sensory representations.
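One standard way to picture the critic connecting innate rewards to other representations is temporal-difference learning. The sketch below swaps in plain TD(0) as a stand-in and assumes a two-step episode in which a neutral cue is followed by an innately rewarded state. The state names, learning rate, and discount factor are arbitrary assumptions, not anything from the article.

```python
# Toy TD(0) sketch: value spreads from an innately rewarded state
# back to the neutral cue that reliably precedes it.
# State names and constants are hypothetical.

ALPHA = 0.1   # learning rate (assumed)
GAMMA = 0.9   # discount factor (assumed)

innate_reward = {"cue": 0.0, "food": 1.0}      # only "food" is innately rewarding
value = {"cue": 0.0, "food": 0.0, "end": 0.0}  # the critic's learned estimates
episode = [("cue", "food"), ("food", "end")]   # fixed sequence of transitions

def train(n_episodes):
    for _ in range(n_episodes):
        for state, next_state in episode:
            reward = innate_reward[state]
            # TD error: how much better or worse things went than predicted.
            td_error = reward + GAMMA * value[next_state] - value[state]
            value[state] += ALPHA * td_error

train(200)
print(value)
```

After training, the cue carries substantial learned value even though it has no innate reward of its own, which is a crude version of the critic expanding its early nudges to new representations.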
The critic is only representing adult human “values” as the result of tons of iterative learning between the systems. That’s the theory, anyway.
It’s also worth noting that, even if this isn’t how the human system works, it might be a workable scheme to make more alignable AGI systems.