Currently, it is very challenging to train AIs to achieve tasks that are too hard for humans to supervise. I believe that (a) we will need to solve this challenge to unlock “self-improvement” and superhuman AI, and (b) we will be able to solve it. I do not want to discuss here why I believe these statements. I would just say that if you assume (a) and not (b), then the capabilities of AIs, and hence their risks, will be much more limited.
Ensuring AIs accurately follow our instructions, even in settings too complex for direct human oversight, is one such hard-to-supervise objective. Getting AIs to be maximally honest, even in settings that are too hard for us to verify, is another such objective. I am optimistic about the prospects of getting AIs to maximize these objectives.
I agree that to train AIs which are generally very superhuman, you’ll need to be able to make AIs highly capable on tasks that are too hard for humans to supervise. And that if we have no ability to make AIs capable on tasks which are hard for humans to supervise, risks are much more limited.[1]
However, I don’t think that making AIs highly capable on tasks which are too hard for humans to supervise necessarily requires being able to ensure AIs do what we want in these settings, nor does it require being able to train AIs for specific objectives in these settings.
Instead, you could in principle create very superhuman AIs through transfer (as humans do in many cases), and this wouldn’t require any ability to directly supervise the domains where the AI nevertheless ends up superhuman. Further, you might be able to directly train AIs to be highly capable (as in, without depending on much transfer) using flawed feedback in a given domain (e.g., feedback which is often possible to reward hack but which still teaches the AI the relevant abilities).
So, I agree that the ability to make very superhuman AIs implies that we’ll (very likely) be able to make AIs which are capable of following our instructions and which are capable of being maximally honest, but this doesn’t imply that we’ll be able to ensure these properties (e.g., the AI could intentionally disobey instructions or lie). Further, there is a difference between being able to supervise instruction following and honesty on any given task and being able to produce an AI which robustly follows instructions and is honest. (Things like online training using this supervision only give you average-case guarantees, and that’s only if you actually use online training.)
It’s certainly possible that there is substantial transfer from the task of “train AIs to be highly capable (and useful) in harder-to-check domains” to the task of ensuring AIs are robustly honest and instruction-following, but it is also easy to imagine ways this could go wrong. E.g., the AIs are faking alignment, or, at some level of capability, increased capabilities (from transfer or whatever) still make the AIs (seem) more useful while simultaneously making them less instruction-following and honest due to issues in the training signal.
(My all-things-considered view is that the default course of advancing capabilities and usefulness will end up figuring out ways to supervise AIs in training well enough that they perform reasonably well on average in most cases (according to human judgment of outcomes), but this performance will substantially depend on transfer and generalization in ways which aren’t robust to egregious misalignment. And that this will solve some alignment problems that would otherwise have existed if not for people optimizing for usefulness over pure capabilities. That said, I also think it’s possible that we’ll see increasingly sophisticated and egregious reward hacking rise with capabilities, but with a sufficient increase in usefulness in many domains despite this reward hacking that scaling continues. And I don’t think handling worst-case scheming/alignment-faking will be very incentivized by commercial/capabilities incentives by default.)
You might separately be optimistic about getting superhuman AIs to robustly follow instructions, but I don’t think “we have a way to make the AIs superhumanly capable in general” implies “we can ensure the AIs actually robustly follow instructions (rather than just being capable of following instructions)”.
To the extent that you’re defining “capable” in a somewhat non-standard way, it seems good to be careful about this and to consider explaining how you are using the term (or to consider defining a new one).
That said, I do think it would be possible in principle to automate AI R&D, or at least most of AI R&D (perhaps what you mean by unlocking “self-improvement”), even if we could only initially make AIs highly capable on tasks which humans can supervise. Humans can supervise the tasks involved in AI R&D, and superhumanness isn’t necessarily required for this automation. Also, there are verification-generation gaps in AI R&D because we can use outcome-based feedback, so you can in principle get substantially superhuman AI R&D while only training AIs on tasks you could have in principle supervised. In practice, the way AI R&D is done today often requires tasks which are expensive to run and only happen a few times (e.g., big training runs), so literally doing outcome-based RL over the whole process wouldn’t work with the current setup. But humans can in principle supervise the process; it’s just that the process is expensive to run.
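To make the verification-generation gap concrete, here is a toy, runnable sketch (my illustration, not something from the comment): outcome-based feedback where checking an answer is much cheaper than producing one, so the reward signal never requires the grader to be as capable as the policy. The factorization task and the names (`outcome_reward`, `random_policy`) are purely illustrative stand-ins.

```python
# Toy illustration: outcome-based reward where verifying an answer is far easier
# than producing it. The grader only checks the outcome, so the training signal
# works even if the policy becomes far better at the task than any supervisor
# who had to produce answers themselves.
import random

def outcome_reward(n: int, factors: tuple) -> float:
    """Reward depends only on a cheap check of the outcome, not on judging the process."""
    a, b = factors
    nontrivial = a not in (1, n) and b not in (1, n)
    return 1.0 if (a * b == n and nontrivial) else 0.0

def random_policy(n: int) -> tuple:
    """Stand-in 'policy' that guesses a factor pair; a real policy would be trained with RL."""
    a = random.randint(2, n - 1)
    return a, n // a

if __name__ == "__main__":
    n = 91  # = 7 * 13; finding factors is harder than checking them (much more so at scale)
    guess = random_policy(n)
    print(guess, outcome_reward(n, guess))
```

Outcome-based feedback on AI R&D has the same structure, except the “check” (e.g., running a training run and measuring the result) is expensive, which is the practical obstacle mentioned above.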
I agree that I am not, in this essay, justifying my optimism about getting superhuman AI to robustly follow instructions. You can think of the above as more like the intuition that generally points in that direction.
Justifying this optimism is a side discussion, but TBH, rather than more discussion, I hope that we can make empirical progress toward justifying it.