Yeah I just meant the upper bound of “within 2 OOM.” :) If we could somehow beat the lower bound and get aligned AI with just a few minutes of human feedback, I’d be all for it.
I think aiming for under a few hundred hours of feedback is a good goal because we want to keep the alignment tax low, and that’s the kind of tax I see as being easily payable. An unstated assumption of mine is that we can use unlabeled data to do a lot of the work of alignment, making labeled data somewhat superfluous, though I still think the amount of feedback is important.
As for why I think it’s possible, I can only plead intuition about what I expect from on-the-horizon advances in priors over models of humans, and in the ability to bootstrap models from unlabeled data plus feedback.
I share your intuitions about ultimately not needing much alignment data (and tried to get that across in the post), but quantitatively:
Recent implementations of RLHF have used on the order of thousands of hours of human feedback, so 2 orders of magnitude more than that (on the order of hundreds of thousands of hours) is much more than a few hundred hours of human feedback.
I think it’s pretty likely that we’ll be able to pay an alignment tax upwards of 1% of total training costs (essentially because people don’t want to die), in which case we could afford to spend significantly more than an additional 2 orders of magnitude on alignment data, if that did in fact turn out to be required.