I think my response heavily depends on the operationalization of alignment for the initial AIs, and I’m struggling to keep things from becoming circular in my decomposition of various operationalizations. The crude response is that you’re begging the question here by first positing aligned AIs, but I think your position is that techniques likely to descend from current techniques could work well enough for roughly human-level systems, and that’s where I encounter this sense of circularity.
I think there’s a better-specified (from my end; you’re doing great) version of this conversation that focuses on three different categories of techniques, based on the capability level at which we expect each to be effective:
1. Current model level
2. Useful autonomous AI researcher level
3. Superintelligence
However, I think that disambiguating between proposed agendas for 2 and 3 is very hard, and assuming that agendas which plausibly work for 1 also work for 2 is a mistake. It’s not clear to me why the ‘it’s a god, it fucks you, there’s nothing you can do about that’ concerns don’t apply to models capable of:
hard-to-check, conceptual, and open-ended tasks
I feel pretty good about this exchange if you want to leave it here, btw! Probably I’ll keep engaging far beyond the point at which it’s especially useful (although we’re likely pretty far from the point where it stops being useful to me right now).
Ok, so it sounds like your view is “indeed, if we got ~totally aligned AIs capable of fully automating safety work (but not notably more capable than the bare minimum requirement for this), we’d probably be fine (even if there is still only a small fraction of effort spent on safety), and the crux is earlier than this”.
Is this right? If so, it seems notable if the problem can be mostly reduced to sufficiently aligning (still very capable) human-ish level AIs and handing off to these systems (which don’t have the scariest properties of an ASI from an alignment perspective).
I think your position is that techniques likely to descend from current techniques could work well enough for roughly human-level systems, and that’s where I encounter this sense of circularity.
I’d say my position is more like:
Scheming might just not happen: It’s basically a toss-up whether systems at this level of capability would end up scheming “by default” (as in, without active effort on researching scheming prevention, just work motivated by commercial utility along the way). Maybe I’m at ~40% scheming for such systems, though the details alter my view a lot.
The rest of the problem, if we assume no scheming, doesn’t obviously seem that hard: It’s unclear how hard it will be to make non-scheming AIs of the capability level discussed above sufficiently aligned in the strong sense of alignment I discussed above. I think it’s unlikely that the default course gets us there, but it seems pretty plausible to me that modest effort along the way does. It just requires some favorable generalization of a sort that doesn’t seem that surprising, and we’ll have some AI labor along the way to help. And, for this part of the problem, we totally can get multiple tries and study things pretty directly and empirically using behavioral tests (though we’re still depending on some cleverness and transfer, as we can’t directly verify the things we ultimately want the AI to do).
Further prosaic effort seems helpful for both avoiding scheming and the rest of the problem: I don’t see strong arguments for thinking that, at the level of capability we’re discussing, scheming will be intractable to prosaic methods or experimentation. I can see why this might happen, and I can certainly imagine worlds where no one really tries. Similarly, I don’t see a strong argument that further effort on relatively straightforward methods can’t help a bunch in getting you sufficiently aligned systems (supposing they aren’t scheming): we can measure what we want somewhat well with a bunch of effort, and I can imagine many things which could make a pretty big difference (again, this isn’t to say that this effort will happen in practice).
This isn’t to say that I can’t imagine worlds where pretty high-effort and well-orchestrated prosaic iteration totally fails. This seems totally plausible, especially given how fast this might happen, so risks seem high. And it’s easy for me to imagine ways the world could be such that relatively prosaic methods and iteration are ~doomed without much more time than we can plausibly hope for; it’s just that these seem somewhat unlikely in aggregate to me.
So, I’d be pretty skeptical of someone claiming that the risk of this type of approach would be <3% (without at the very least preserving the optionality for a long pause during takeoff depending on empirical evidence), but I don’t see a case for thinking “it would be very surprising or wild if prosaic iteration sufficed”.