We should probably distinguish between a bunch of different things, such as: the ‘race’ between capabilities research and alignment research, AI supervision and control, some notion of “near-term default alignedness” of AIs, interpretability, and value alignment.
I think it’s pretty obvious what’s going on with the race between capabilities and alignment.
This doesn’t have a big impact on AI supervision or control, because the way we currently plan to do these things involves mostly treating the AIs as black boxes—strategies like untrusted model supervision with a small percent of honeypots should still work about as well.
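To be concrete about what I mean by treating the AIs as black boxes, here is a rough sketch of that kind of honeypot check (all the names below are placeholders I made up for illustration, not anything from an actual control setup):

```python
import random

HONEYPOT_RATE = 0.02  # small fraction of inputs are honeypots with a known, planted attack

def supervision_loop(tasks, untrusted_monitor, make_honeypot, escalate):
    """Black-box control loop: we never look inside the model; we just
    occasionally feed the untrusted monitor an input we *know* contains an
    attack and check that it gets flagged."""
    trusted = True
    for task in tasks:
        if trusted and random.random() < HONEYPOT_RATE:
            probe = make_honeypot()            # input with a known, planted attack
            if not untrusted_monitor(probe):   # monitor failed to flag it
                trusted = False                # stop relying on the monitor
                escalate("monitor missed a honeypot")
        if trusted and untrusted_monitor(task):
            escalate(f"monitor flagged: {task!r}")

# Toy usage with stand-in components (all made up):
flag = lambda x: "ATTACK" in x
supervision_loop(
    tasks=["benign task", "ATTACK: exfiltrate weights"],
    untrusted_monitor=flag,
    make_honeypot=lambda: "ATTACK: planted honeypot",
    escalate=print,
)
```

Nothing in that loop depends on how the untrusted model was trained, which is why I don't expect this kind of result to change the control picture much.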
I agree this is bad for the near-term default alignedness of AI, because currently all RL is bad for the near-term default alignedness of AI. But note that this means I disagree with your reasoning: it’s not “control over its training” that’s the problem; it’s the distinction between learning from human data and learning from a reward signal.
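A toy illustration of the distinction I have in mind (made-up numbers, nothing from the paper itself):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 3-armed bandit. Humans only ever take actions 0 and 1; action 2 is a
# degenerate behavior that happens to score very highly on a misspecified reward.
human_demos = rng.choice([0, 1], size=100, p=[0.7, 0.3])
proxy_reward = np.array([1.0, 0.5, 10.0])

# Learning from human data: the policy is anchored to what humans actually did,
# so it never puts mass on action 2.
imitation_policy = np.bincount(human_demos, minlength=3) / len(human_demos)

# Learning from a reward signal: optimizing the proxy finds the degenerate
# high-reward action, even though no human ever demonstrated it.
rl_policy = np.exp(proxy_reward) / np.exp(proxy_reward).sum()

print("imitation:", imitation_policy.round(2))  # mass only on actions humans took
print("reward-based:", rl_policy.round(3))      # almost all mass on action 2
```

The point is that the human data puts a ceiling on how weird the learned behavior can get, while a reward signal rewards whatever scores highly, whether or not any human would endorse it.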
Probably not much impact on interpretability; I just included that for completeness.
The impact on alignment per se is an interesting question (if not about this paper specifically, then about this general direction of progress) that I think people have understandably dodged. I’ll try my hand at a top-level comment on that one.
“it’s the distinction between learning from human data versus learning from a reward signal.” That’s an interesting distinction. The difference I currently see between the two is that, at present, a reward signal can be hacked by the AI, while human data cannot. Is that an accurate thing to say?
Are there any resources you could recommend for alignment methods that take into account the distinction you mentioned?
That’s one difference! And probably the most dangerous one, if a clever enough AI notices it.
Some good things to read would be methods based on not straying too far from a “human distribution”: Quantilization (Jessica Taylor paper), the original RLHF paper (Christiano), Sam Marks’ post about decision transformers.
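To gesture at what “not straying too far from a human distribution” means in the quantilization case, here’s a minimal toy sketch of a top-q quantilizer (my own simplification, not the exact formalism from Taylor’s paper):

```python
import numpy as np

def quantilize(base_sample, utility, q=0.05, n=1000, rng=None):
    """Top-q quantilizer sketch: draw n actions from a base ("human-like")
    distribution, keep the top q-fraction by utility, and return one of those
    at random. In the idealized version, the chosen action is never more than
    ~1/q times likelier than it is under the base distribution, which limits
    how hard the utility function can be Goodharted."""
    rng = rng or np.random.default_rng()
    actions = np.array([base_sample(rng) for _ in range(n)])
    cutoff = np.quantile([utility(a) for a in actions], 1 - q)
    top = actions[[utility(a) >= cutoff for a in actions]]
    return top[rng.integers(len(top))]

# Toy usage: the base distribution stands in for "what a human might do"
# (here, a standard normal), and the utility just wants large values; the
# quantilizer picks something good from the plausible-human tail rather than
# an arbitrary extreme.
print(quantilize(lambda rng: rng.normal(), utility=lambda a: a))
```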
They’re important reads, but ultimately I’m not satisfied with these, for the same reason I mentioned about self-other overlap in the other comment a second ago: we want the AI to treat the human how the human wants to be treated; that doesn’t mean we want the AI to act how the human wants to act. If we can’t build AI that reflects this distinction, we’re missing some big insights.