I think that from an AI alignment perspective, giving AI this much control over its own training seems very problematic. What we are mostly left with is controlling the interface that AI has to physical reality, i.e., its sensors and actuators.
For now, it seems to me that AI mostly affects the virtual world. I think the moment when AI can competently and more directly influence physical reality will be a tipping point, because then it can cause far more change in the world.
I would say that the ability to do continuous learning is required to adapt well to the complexity of physical reality. So a big improvement in continuous learning might be an important next goalpost to watch for.
We should probably distinguish between a bunch of different things, such as: the ‘race’ between capabilities research and alignment research, AI supervision and control, some notion of “near-term default alignedness” of AIs, interpretability, and value alignment.
I think it’s pretty obvious what’s going on with the race between capabilities and alignment.
This doesn’t have a big impact on AI supervision or control, because the way we currently plan to do these things involves mostly treating the AIs as black boxes—strategies like untrusted model supervision with a small percent of honeypots should still work about as well.
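For illustration, here is a minimal sketch of what that kind of black-box check could look like; the function names and the honeypot rate are my own invention rather than anything from a specific control protocol:

```python
import random

def audit_untrusted_monitor(untrusted_monitor, items, known_bad_honeypots,
                            honeypot_rate=0.02):
    """Use an untrusted model as a monitor over a stream of items, while
    occasionally slipping in honeypots that are known to be bad. If the
    monitor fails to flag a honeypot, that is strong evidence it is not
    reporting honestly. (The rate and names are illustrative assumptions.)"""
    verdicts = []
    monitor_caught_lying = False
    for item in items:
        # With small probability, test the monitor on a known-bad input first.
        if known_bad_honeypots and random.random() < honeypot_rate:
            probe = random.choice(known_bad_honeypots)
            if not untrusted_monitor(probe):  # the monitor should flag this one
                monitor_caught_lying = True
        verdicts.append(untrusted_monitor(item))
    return verdicts, monitor_caught_lying
```

The point is that the protocol only looks at the model's inputs and outputs, which is why giving the model more control over its training does not change this picture much.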
I agree this is bad for the near-term default alignedness of AI, because currently all RL is bad for the near-term default alignedness of AI. But note that this means I disagree with your reasoning: it’s not “control over its training” that’s the problem, it’s the distinction between learning from human data versus learning from a reward signal.
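To make that distinction concrete (this is my own schematic framing, not something from the paper under discussion): learning from human data optimizes the model against a distribution the model does not control, while RL optimizes a reward over behavior the model itself generates.

$$\text{imitation: } \max_\theta \; \mathbb{E}_{x \sim D_{\text{human}}}\big[\log p_\theta(x)\big] \qquad \text{RL: } \max_\theta \; \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]$$

In the first objective the expectation is over a fixed human distribution; in the second it is over the policy's own trajectories, so any gap between the reward $R$ and what we actually want can be sought out and amplified by the policy.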
Probably not much impact on interpretability; I just included that for completeness.
The impact on alignment per se is an interesting question (if not about this paper specifically, then about this general direction of progress) that I think people have understandably dodged. I’ll try my hand at a top-level comment on that one.
“it’s the distinction between learning from human data versus learning from a reward signal.” That’s an interesting distinction. The difference I currently see between the two is that a reward signal can be hacked by the AI, while human data cannot. Is that an accurate thing to say?
Are there any resources you could recommend for alignment methods that take into account the distinction you mentioned?
That’s one difference! And probably the most dangerous one, if a clever enough AI notices it.
Some good things to read would be methods based on not straying too far from a “human distribution”: quantilization (Jessica Taylor’s paper), the original RLHF paper (Christiano), and Sam Marks’ post about decision transformers.
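As a concrete illustration of the quantilization idea, here is a rough sketch under my own simplifications (the function names and defaults are made up, not taken from Taylor’s paper):

```python
import random

def quantilize(sample_from_base, estimated_utility, q=0.1, n=1000):
    """Crude q-quantilizer: instead of taking the argmax of an imperfect
    utility estimate, draw candidates from a base, human-like distribution,
    keep the top q fraction by estimated utility, and pick one of those at
    random. Staying inside the base distribution limits how hard the
    utility estimate can be exploited."""
    candidates = [sample_from_base() for _ in range(n)]
    candidates.sort(key=estimated_utility, reverse=True)
    top = candidates[:max(1, int(q * n))]
    return random.choice(top)
```

The knob q interpolates between pure imitation of the base distribution (q near 1) and pure optimization of the imperfect utility estimate (q very small), which is exactly the “don’t stray too far from a human distribution” idea.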
Those are important reads, but ultimately I’m not satisfied with them, for the same reason I mentioned about self-other overlap in the other comment a second ago: we want the AI to treat the human how the human wants to be treated, but that doesn’t mean we want the AI to act how the human wants to act. If we can’t build AI that reflects this, we’re missing some big insights.