I think that data quality is a helpful framing of outer alignment for a few reasons:
1. Under the assumption of a generic objective such as reinforcement learning, outer alignment is definitionally equivalent to having high enough data quality. (More precisely, if the objective is generic enough that it is possible for it to produce an aligned policy, then outer alignment is equivalent to the data distribution being such that an aligned policy is preferred to any unaligned policy.)
2. If we had the perfect alignment solution on paper, we would still need to implement it. Since we don't yet have the perfect alignment solution on paper, we should entertain the possibility that implementing it involves paying attention to data quality (whether in the sense of scalable oversight or in a more mundane sense).
3. It's not a framing I've seen before, and I think it's helpful to have different framings for things.
I do think that the framing is less helpful if the answer to my question is “not much”, but that’s currently still unclear to me, for the reasons I give in the post.
I agree that data quality doesn’t guarantee robustness, but that’s a general argument about how helpful it is to decompose alignment into outer alignment and robustness. I have some sympathy for that, but it seems distinct from the question of whether data quality is a helpful framing of outer alignment.
I think my big disagreement is with point one: yes, if you fix the architecture as something with bad alignment properties, then there is probably some dataset / reward signal that still gives you a good outcome. But this doesn't work in real life, and it's not something I see people working on such that there needs to be a word for it.
What deserves a word is people starting by thinking about both what we want the AI to learn and how, and picking datasets and architectures in tandem based on a theoretical story of how the AI is going to learn what we want it to.
A number of reasonable outer alignment proposals such as iterated amplification, recursive reward modeling and debate use generic objectives such as reinforcement learning (and indeed, none of them would work in practice without sufficiently high data quality), so it seems strange to me to dismiss these objectives.