Thanks for the post, Simon! I think we need more discussion that gives specific criticisms of, and demands regarding, the labs' mainline alignment plan.
I'd eventually like to put forth my strongest arguments on superalignment as a whole, and on what would need to happen to realistically convince (or force) the labs to stop.
Quick comments:
"AIs are unlikely to speed up alignment before capabilities": I think this can also be used as an argument for accelerating automated alignment ASAP, if you believe we won't otherwise get alignment value out of AIs soon enough (indeed, some people are already making that argument). Unless the crux is "no matter how hard you try to differentially accelerate alignment work, you won't make a dent in progress", in which case I disagree, but I think that failure is more likely in a world where people are too concerned about dual-use to even attempt it.
Dual-use concerns actually seem overblown: the labs are going full steam ahead on automating AI R&D regardless, and automated alignment will simply lag behind it.
As part of automated alignment, we should definitely try to make it easier and faster for researchers to verify results and detect hard-to-spot mistakes. Probes might help here (the way hallucination probes help identify hallucinations). My guess is that verification with humans in the loop can be made much faster; a rough sketch of what I mean is below.
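To gesture at the kind of thing I have in mind, here's a minimal sketch of a linear probe that flags likely hallucinations so humans only review the high-risk outputs. Everything here is hypothetical: the activations and labels are random stand-ins for activations harvested from a model on claims with known ground truth.

```python
# Hypothetical sketch: train a linear probe on model activations to
# flag likely hallucinations, so human reviewers only triage the
# high-risk outputs instead of verifying everything.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in data: 1000 claims, 512-dim activations, binary labels
# (1 = hallucinated). Real data would come from a model's residual
# stream on claims with known ground truth.
X = rng.normal(size=(1000, 512))
y = rng.integers(0, 2, size=1000)

# Fit the probe on a training split.
probe = LogisticRegression(max_iter=1000).fit(X[:800], y[:800])

# Score held-out claims and route only the risky ones to humans.
risk = probe.predict_proba(X[800:])[:, 1]
flagged = np.flatnonzero(risk > 0.8)
print(f"{len(flagged)} of 200 held-out claims flagged for human review")
```

The point isn't this particular classifier; it's that cheap automated filters like this could concentrate scarce human verification effort where it matters.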
Automated alignment can be leveraged as part of the scary demo strategy and can make a more convincing case for a pause.
I think our main crux is that I expect AIs to accelerate prosaic alignment research, though agent foundations will be much rougher. Luckily, the first thing you want to do as AIs get better at control research is probably to make control even more useful, lengthening the window in which we can safely leverage AIs. In addition, I think there's a belief that traditional alignment theory basically won't matter, being yet another step removed from the problem.
Given this crux, and the fact that the companies will indeed automate AI R&D (and alignment), I think we should force companies to stop at some point in this process. There should be an even stronger push for If-Then commitments or RSP-style policies. One thing I'm ruminating on is proposing a Responsible Automation Policy that would slow down progress between training and internal deployment.
Maybe this also means companies would be forced to accelerate a Guaranteed Safe AI-style plan beyond a certain level of capability.
For the record, I think this blog post was mostly intended for frontier labs pushing this plan; the situation is different for independent orgs. There is useful work to be done with AI-assisted alignment on subproblems such as interpretability. So I agree there is prosaic alignment work that can be done, though I'm probably still much less optimistic about it than you are.