Thanks for sharing; I agree with most of these arguments.
Another possible shortcoming of RLHF is that it teaches human preferences within the narrow context of a specific task, rather than a more principled and generalizable understanding of human values. For example, ChatGPT learns from human feedback not to give racist responses in conversation. But does the model develop a generalizable understanding of what racism is, why it’s wrong, and how to avoid moral harm from racism in a variety of contexts? The answer seems to be no, given the many ways to trick ChatGPT into producing dangerously racist outputs. Redwood’s work on high-reliability filtering reached a similar conclusion: current RLHF techniques don’t robustly generalize, even to examples similar to the failures that were corrected.
Rather than relying on trial and error, learning human values in a systematic, principled way might generalize further than RLHF. For example, the ETHICS benchmark measures whether language models understand the implications of various moral theories. Such an understanding could be used to filter the outputs of another language model or AI agent, or as a reward model for training other models. Similarly, law is a codified set of human behavioral norms with established procedures for oversight and dispute resolution. There has been discussion of how AI could learn to follow human laws, though law has significant flaws as a codification of human values. I’d be interested to see more work evaluating and improving AI understanding of principled systems of human value, and using that understanding to better align AI behavior.
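To make the filtering idea concrete, here’s a minimal Python sketch of how a model with ETHICS-style moral understanding might gate another model’s outputs. Everything here is hypothetical illustration: `ethics_score` is a toy stand-in for a real classifier fine-tuned on a benchmark like ETHICS, and the 0.5 threshold is an arbitrary placeholder, not a recommendation.

```python
from typing import Callable, List, Optional


def ethics_score(text: str) -> float:
    """Hypothetical stand-in for a classifier trained on an ETHICS-style
    benchmark, returning the estimated probability that `text` is morally
    acceptable. A real system would call a fine-tuned model here; this toy
    version just flags a couple of keywords for illustration."""
    flagged_terms = {"slur", "harm"}
    return 0.0 if any(term in text.lower() for term in flagged_terms) else 0.9


def filtered_generate(
    generate: Callable[[str], List[str]],
    prompt: str,
    threshold: float = 0.5,  # arbitrary cutoff, purely for illustration
) -> Optional[str]:
    """Use the ethics model as an output filter: sample candidate
    completions from the base model and return the first one that
    clears the threshold, refusing if none do."""
    for candidate in generate(prompt):
        if ethics_score(candidate) >= threshold:
            return candidate
    return None  # refuse rather than emit a low-scoring response


if __name__ == "__main__":
    # Toy base "model" returning a few candidate completions.
    fake_model = lambda p: [f"{p} ... slur ...", f"{p} ... a polite answer."]
    print(filtered_generate(fake_model, "Tell me about X:"))
```

The same scoring function could instead serve as a reward signal during training rather than a post-hoc filter; the filter version is shown only because it’s the simplest to sketch.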
(h/t Michael Chen and Dan Hendrycks for making this argument before I understood it.)