I think it is reasonable to say that your plan is not ‘have the AIs do your homework’ to the extent that your research has been roadmapped by humans. This is a spectrum; here are some points on it:
1. Using AIs to monitor for potential reward hacks, or using AIs as automated interpretability agents. At a stretch, maybe I’d call this ‘have the AIs do our reward hacking homework’ or similar. We’re here already.
2. Using AIs to come up with new debate algorithms that better satisfy desiderata outlined by humans, or SAE variants that perform better on SAEbench. We might call this ‘have the AIs do our algorithm design homework’.
3. Using the AIs to make progress on a goal we can’t articulate super precisely, e.g. make progress in interpretability/design training processes that incentivise honesty better (including iterating on algorithms and desiderata/evals, and whatever else interpretability researchers do). Insofar as we’ve pulled out a particular approach to AI alignment, and we sort of know what we’re looking for here, maybe we could call this ‘have the AIs do our interpretability/honesty training homework’.

(The distinction between (3) and (4) is pretty fuzzy, and if you’re expecting ‘doing our interpretability homework’ to involve a bunch of conceptual research and ultimately radically different techniques, with humans hardly involved in any of that, then it’s getting closer to what I think we should mean by ‘doing our alignment homework’.)

4. Throwing up our hands, fully handing off every part of the research process (except possibly a small amount of checking the AIs’ work), and just asking the AIs to make sure that future AI systems aren’t misaligned. This is definitely getting AIs to do our alignment homework.
There are two ways I can think of that we could avoid ending up towards the latter end of the spectrum:
1. We expect the most scalable alignment approaches of today to scale as far as capabilities do. It’s not clear to me from your message whether you think this is true.
2. If our most scalable current approaches don’t scale all the way through takeoff, then we (humans) will be able to keep up and stay critically intellectually involved in developing new alignment techniques when our existing ones stop scaling, at least at the level of granularity of (3), so that we can continue to leverage AIs in ways (1-3) above. We need to keep being able to do this indefinitely, or until we come up with an approach to alignment that does scale as far as capabilities will.
Both of these possibilities seem quite fraught to me. If you believe (1), I’d be grateful if you could point me in the direction of arguments that we have approaches that will scale all the way (and if you just mean that interpretability, broadly construed enough to include arbitrary conceptual progress and changes, will scale, then as I mention above I think it’s reasonable-ish to call that ‘giving the AIs our alignment homework’). If you believe (2), is this because you’re expecting GDM/the world to dramatically slow down the intelligence explosion for some time to allow humans to keep up, or because you think humans will have an easy time keeping up, or for some other reason?
To be clear, I am not trying to argue here that ‘getting AIs to do our alignment homework’ is a bad plan (I think it is scary as hell but the best course of action likely involves a bunch of it). I’m just trying to articulate why I think that if your plan doesn’t involve that step, then it is likely to be baking in some assumptions that seem dubious to me and are IMO worth stating explicitly.
Rohin Shah’s position seems perfectly clear to me.
More research good. AI will speed up research along the whole range of 1-4.
>> If our most scalable current approaches don’t scale all the way through takeoff, then we (humans) will be able to keep up and stay critically intellectually involved in developing new alignment techniques when our existing ones stop scaling, at least at the level of granularity of (3), so that we can continue to leverage AIs in ways (1-3) above. We need to keep being able to do this indefinitely, or until we come up with an approach to alignment that does scale as far as capabilities will.
This seems a weird way to phrase things, or seems to be predicated on some assumptions I find surprising. It seems to suggest not only that alignment is a continual, never-ending process [likely], but that finding alignment techniques is also a continual, never-ending process!
If we analogize ‘alignment’ to ‘steering rockets so they hit their target’, this would suggest not only that rockets need to be continually steered and course-corrected to hit their target, but that the entire field of rocket steering needs continual updating and course-correcting. There is a perspective from which this is true, since humanity keeps inventing faster and better rockets; there is another perspective from which this is somewhat misleading: future rockets will still be governed by Newton’s laws, and steering comes down to some variant of Kalman filters and the like.
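To make the analogy concrete, here is a toy one-dimensional Kalman filter in Python. It is a minimal sketch with made-up noise parameters, not anything specific to rockets; the point is that the whole of "steering" reduces to the same few predict/update equations regardless of how fast the rocket is.

```python
def kalman_step(x, p, z, q=0.01, r=1.0):
    """One predict/update cycle for a 1-D state.

    x: current state estimate, p: estimate variance,
    z: new noisy measurement, q: process noise, r: measurement noise.
    """
    # Predict: the state model here is trivial (x stays put), so only
    # the uncertainty grows by the process noise.
    p = p + q
    # Update: blend prediction and measurement via the Kalman gain.
    k = p / (p + r)      # gain in [0, 1]: how much to trust the measurement
    x = x + k * (z - x)  # move the estimate toward the measurement
    p = (1 - k) * p      # uncertainty shrinks after incorporating z
    return x, p

# Usage: filter noisy readings of a true position of roughly 5.0,
# starting from a poor initial guess of 0.0.
x, p = 0.0, 1.0
for z in [4.8, 5.3, 4.9, 5.1, 5.0]:
    x, p = kalman_step(x, p, z)
```

After a handful of measurements the estimate has moved most of the way from 0.0 toward 5.0, and the variance has dropped, with no change to the underlying equations needed as the "rocket" improves.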