If our alignment training data correctly favors aligned behavior over unaligned behavior, then we have solved outer alignment.
I’m curious to understand what this means, in particular what “data favoring aligned behavior” means. I’ll take for granted as background that there are some policies that are good (“aligned” and capable) and some that are bad. I see two problems with the concept of data favoring a certain kind of policy:
Data doesn’t specify generalization. For any achievable training loss on some dataset, there are many policies that achieve that loss, and some of them will be good, some bad. (See the sketch after this list.)
There’s ambiguity in what it means for training data to favor some behavior. On the one hand, there’s the straightforward interpretation of the labels: the data specifies a preference for one kind of behavior over another. On the other hand, there’s the behavior of the policies actually found by training on this data. These can come far apart if the data’s main effect on the policies found is to imbue them with a rich world model and goals. There isn’t necessarily a straightforward relationship between the labels and the goals.
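To make the first problem concrete, here’s a minimal toy sketch (my own illustration, not from the post): two “policies” that achieve identical zero training loss on a small dataset but behave differently off-distribution, so the data alone can’t favor one over the other.

```python
# Toy illustration (hypothetical, not from the post): the training data
# does not pin down generalization. Two policies fit it perfectly yet
# diverge on inputs outside the training set.
import numpy as np

# Training data: three input/output pairs, target behavior is the identity.
xs = np.array([0.0, 1.0, 2.0])
ys = xs.copy()

def policy_a(x):
    # The "intended" policy: identity everywhere.
    return x

def policy_b(x):
    # Agrees with policy_a on every training point (the cubic term
    # vanishes at x = 0, 1, 2) but diverges everywhere else.
    return x + x * (x - 1.0) * (x - 2.0)

for policy in (policy_a, policy_b):
    train_loss = np.mean((policy(xs) - ys) ** 2)
    print(policy.__name__, "train loss:", train_loss,
          "| output at x=3:", policy(3.0))
# Both policies get train loss 0.0, but policy_a(3.0) == 3.0
# while policy_b(3.0) == 9.0.
```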
I realise you’re focusing on “outer alignment” here, and maybe these are not outer alignment problems.
This is just supposed to be an (admittedly informal) restatement of the definition of outer alignment in the context of an objective function where the data distribution plays a central role.
For example, assuming a reinforcement learning objective function, outer alignment is equivalent to the statement that there is an aligned policy that gets higher average reward on the training distribution than any unaligned policy.
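To spell that out in symbols (notation mine, not from the post: $R$ is the reward function, $D$ the training distribution, $A$ the set of aligned policies):

```latex
% Notation mine: R is the reward, D the training distribution,
% A the set of aligned policies. Outer alignment for an RL objective:
\exists\, \pi^{*} \in A \;\text{ such that }\;
\mathbb{E}_{\tau \sim D}\!\left[ R(\pi^{*}, \tau) \right]
\;>\; \sup_{\pi \notin A} \mathbb{E}_{\tau \sim D}\!\left[ R(\pi, \tau) \right]
```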
I did not intend to diminish the importance of robustness by focusing on outer alignment in this post.