What is narrow value learning?
Ambitious value learning aims to achieve superhuman performance by figuring out the underlying latent “values” that humans have, and evaluating new situations according to these values. In other words, it is trying to infer the criteria by which we judge situations to be good. This is particularly hard because in novel situations that humans haven’t seen yet, we haven’t even developed the criteria by which we would evaluate them. (This is one of the reasons why we need to model humans as suboptimal, which causes problems.)
Instead of this, we can use narrow value learning, which produces behavior that we want in some narrow domain, without expecting generalization to novel circumstances. The simplest form of this is imitation learning, where the AI system simply tries to imitate the supervisor’s behavior. This limits the AI’s performance to that of its supervisor. We could also learn from preferences over behavior, which can scale to superhuman performance, since the supervisor can often evaluate whether a particular behavior meets our preferences even if she can’t perform it herself. We could also teach our AI systems to perform tasks that we would not want to do ourselves, such as handling hot objects.
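To make “learning from preferences over behavior” a bit more concrete, here is a minimal sketch that assumes the supervisor only ever says which of two behaviors she prefers. It fits one scalar score per behavior under a Bradley-Terry model, which is one common modelling choice and not the only one. The function name, the toy comparison data, and the hyperparameters are all made up for illustration.

```python
import math

def fit_preference_scores(n_behaviors, comparisons, lr=0.1, steps=2000, l2=0.01):
    """Fit one score per behavior from pairwise preferences.

    comparisons: list of (preferred, other) index pairs.
    Assumed model (Bradley-Terry): P(i preferred to j) = sigmoid(s_i - s_j).
    """
    scores = [0.0] * n_behaviors
    for _ in range(steps):
        # Small L2 penalty keeps scores finite when one behavior always wins.
        grads = [-l2 * s for s in scores]
        for preferred, other in comparisons:
            p = 1.0 / (1.0 + math.exp(-(scores[preferred] - scores[other])))
            # Gradient of the log-likelihood of this single comparison.
            grads[preferred] += 1.0 - p
            grads[other] -= 1.0 - p
        # Gradient ascent step on the penalized log-likelihood.
        scores = [s + lr * g for s, g in zip(scores, grads)]
    return scores

# Hypothetical data: three candidate behaviors, six noisy pairwise judgments.
comparisons = [(2, 0), (2, 1), (1, 0), (2, 0), (0, 1), (1, 0)]
scores = fit_preference_scores(3, comparisons)
best = max(range(3), key=lambda i: scores[i])
```

Note that the supervisor never has to demonstrate behavior 2 herself; she only has to recognize that it is better than the alternatives, which is why this kind of feedback is not capped at her own level of performance.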
Nearly all of the work on preference learning, including most work on inverse reinforcement learning (IRL), is aimed at narrow value learning. IRL is often explicitly stated to be a technique for imitation learning, and early algorithms phrase the problem as matching the features in the demonstration, not exceeding them. The few algorithms that try to generalize to different test distributions, such as AIRL, are only aiming for relatively small amounts of generalization.
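As a rough illustration of what “matching the features in the demonstration” means, the sketch below computes discounted feature expectations for some demonstrations and for rollouts of two candidate policies, then picks the candidate whose feature expectations are closest. This is only the matching criterion, not a full IRL algorithm (there is no reward inference or policy optimization here), and the states, features, and policy names are hypothetical.

```python
def feature_expectations(trajectories, featurize, gamma=0.9):
    """Average discounted sum of feature vectors over a set of trajectories."""
    dim = len(featurize(trajectories[0][0]))
    total = [0.0] * dim
    for trajectory in trajectories:
        for t, state in enumerate(trajectory):
            phi = featurize(state)
            for k in range(dim):
                total[k] += (gamma ** t) * phi[k]
    return [x / len(trajectories) for x in total]

def feature_gap(mu_a, mu_b):
    """Euclidean distance between two feature expectation vectors."""
    return sum((a - b) ** 2 for a, b in zip(mu_a, mu_b)) ** 0.5

# Hypothetical toy domain: states are integers; features are [at_goal, at_hazard].
def featurize(state):
    return [1.0 if state == 3 else 0.0, 1.0 if state == -1 else 0.0]

expert_demos = [[0, 1, 2, 3], [0, 1, 2, 3]]
candidate_rollouts = {
    "heads_to_goal": [[0, 1, 2, 3]],
    "wanders_into_hazard": [[0, -1, 0, -1]],
}

mu_expert = feature_expectations(expert_demos, featurize)
gaps = {
    name: feature_gap(mu_expert, feature_expectations(rollouts, featurize))
    for name, rollouts in candidate_rollouts.items()
}
closest = min(gaps, key=gaps.get)
```

The objective is satisfied as soon as the gap is zero. Nothing in it rewards a policy for reaching the goal faster than the demonstrator did, which is the sense in which this framing aims at matching rather than exceeding.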
(Why use IRL instead of behavioral cloning, where you mimic the actions that the demonstrator took? The hope is that IRL gives you a good inductive bias for imitation, allowing you to be more sample efficient and to generalize a little bit.)
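For contrast, here is about the simplest version of behavioral cloning I can think of: a tabular policy that copies the demonstrator’s most common action in each state it has seen. It is a deliberately naive sketch with made-up states and actions, and real implementations would fit a function approximator instead, but it shows why pure action-mimicry has no answer for states outside the demonstrations.

```python
from collections import Counter, defaultdict

def behavioral_cloning(demonstrations):
    """Return a state -> action table copying the demonstrator's most common action."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

# Hypothetical (state, action) pairs recorded from a demonstrator.
demos = [
    ("red_light", "stop"),
    ("red_light", "stop"),
    ("green_light", "go"),
    ("green_light", "go"),
    ("green_light", "stop"),
]
policy = behavioral_cloning(demos)

# A state that never appeared in the demonstrations has no entry at all.
unseen_action = policy.get("yellow_light")
```

Here `unseen_action` is `None`: the cloned policy has nothing to say about the unseen state, whereas a method that recovers something reward-like at least has a basis for choosing an action there.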
You might have noticed that I talk about narrow value learning in terms of actual observed behavior from the AI system, as opposed to any sort of “preferences” or “values” that are inferred. This is because I want to include approaches like imitation learning, or meta learning for quick task identification and performance. These approaches can produce behavior that we want without having an explicit representation of “preferences”. In practice, any method that scales to human-level intelligence is going to have to infer preferences, though perhaps only implicitly.
Since any instance of narrow value learning is defined with respect to some domain or input distribution on which it gives sensible results, we can rank them according to how general this input distribution is. An algorithm that figures out what food I like to eat is very domain-specific, whereas one that determines my life goals and successfully helps me achieve them in both the long and short term is very general. When the input distribution is “all possible inputs”, we have a system that has good behavior everywhere, reminiscent of ambitious value learning.
(Annoyingly, I defined ambitious value learning to be about the definition of optimal behavior, such as an inferred utility function, while narrow value learning is about the observed behavior. So really the most general version of narrow value learning is equivalent to “ambitious value learning plus some method of actually obtaining the defined behavior in practice, such as by using deep RL”.)
Tomorrow’s AI Alignment Forum sequences post will be ‘Directions for AI Alignment’ by Paul Christiano in the sequence on iterated amplification.
The next post in this sequence will be ‘Ambitious vs. narrow value learning’ by Paul Christiano, on Friday 11th January.