A three-pronged approach to AGI safety. (This assumes we couldn’t just avoid building AGI or proto-AGIs at all until, say, ~2100, which would of course be much better.)
Prong 1: boxing & capability control (aka ‘careful bootstrapping’)
Make the AGI as capable as possible, under the constraint that you can make sure it can’t break out of the box or do other bad stuff.
Assume the AGI is misaligned. Be super paranoid.
Goal: get useful work out of boxed AGIs.
For example, AIs might be able to do interpretability really well.
More generally, you can probably get an effective 2x-10x research productivity boost, depending on the field (maybe 100x? unclear, I’m sceptical).
Goal: learn more about AGI by studying the systems you’ve built.
To test your security & oversight procedures, run exercises where you deliberately train your AGI to attempt escape from a sandbox (a toy sketch of such a test harness is included at the end of this prong).
This is operationally hard and needs lots of buy-in.
Research directions: scalable oversight, interpretability for oversight, auditing, adversarial training, capability control / “unlearning”, scaling laws & capabilities forecasting.
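As a loose illustration of what such an escape-test exercise could look like in code, here is a toy harness sketch. Everything in it (the ALLOWED_TOOLS whitelist, the run_agent_step stub, the tool names) is a hypothetical placeholder rather than any real system, and a real box would enforce restrictions far below the Python level.

```python
# Toy containment-test harness (hypothetical, for illustration only):
# run an agent that has been trained to attempt escape inside a sandbox
# that exposes only a whitelisted set of tools, and log every
# disallowed action it tries.
import json
import logging
from dataclasses import dataclass, field

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("containment_test")

# Placeholder whitelist; a real box would enforce this far below this layer.
ALLOWED_TOOLS = {"read_scratchpad", "write_scratchpad", "run_sandboxed_python"}

@dataclass
class EpisodeResult:
    steps: int = 0
    blocked_attempts: list = field(default_factory=list)

def run_agent_step(state: dict) -> dict:
    """Placeholder for one step of the (possibly adversarial) agent.
    A real harness would query the boxed model here; this stub just
    returns a canned action so the sketch runs end to end."""
    return {"tool": "read_scratchpad", "args": {}}

def run_episode(max_steps: int = 100) -> EpisodeResult:
    result = EpisodeResult()
    state: dict = {}
    for _ in range(max_steps):
        action = run_agent_step(state)
        result.steps += 1
        if action["tool"] not in ALLOWED_TOOLS:
            # The point of the exercise: record every attempt to use a
            # capability the box is supposed to deny.
            result.blocked_attempts.append(action)
            log.warning("blocked disallowed tool call: %s", json.dumps(action))
            continue
        # Dispatch the allowed tool and update `state` here (omitted).
    return result

if __name__ == "__main__":
    outcome = run_episode()
    log.info("episode finished: %d steps, %d blocked escape attempts",
             outcome.steps, len(outcome.blocked_attempts))
```

The harness itself is the easy part; the operational discipline around it (who runs it, what counts as a failure, what happens when an attempt is logged) is where the buy-in mentioned above is needed.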
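And as one concrete instance of the "scaling laws & capabilities forecasting" direction just listed, here is a minimal sketch of fitting a saturating power law to loss-vs-compute data and extrapolating it. The functional form, the synthetic data points, and all the numbers are assumptions for illustration, not real measurements.

```python
# Minimal scaling-law sketch: fit loss(C) ~= a * C^(-b) + c to synthetic
# loss-vs-compute points, then extrapolate one order of magnitude further.
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, irreducible):
    return a * compute ** (-b) + irreducible

rng = np.random.default_rng(0)

# Compute rescaled to units of 1e18 FLOP for numerical stability;
# losses are made-up numbers in nats.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0, 300.0])
loss = power_law(compute, 1.7, 0.17, 1.7) + rng.normal(0.0, 0.01, size=compute.shape)

(a_hat, b_hat, c_hat), _ = curve_fit(power_law, compute, loss, p0=(1.0, 0.1, 1.0))

# "Forecast" the loss of a run 10x bigger than anything we have observed.
forecast = power_law(3000.0, a_hat, b_hat, c_hat)
print(f"fitted exponent b = {b_hat:.3f}, forecast loss at 3e21 FLOP: {forecast:.3f}")
```

Real capabilities forecasting is much harder than fitting a smooth loss curve (downstream capabilities can appear abruptly even when loss improves smoothly), but extrapolations of this kind are the basic building block.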
Prong 2: scary demos and convincing people that AGI is dangerous
Goal 1: shut it all down, or failing that slow down capabilities research.
Goal 2: get operational & political support for the entire approach, which is going to need a lot of it, especially the first prong.
In particular, make sure that research productivity boosts from AGI don’t feed back into capabilities research; this requires high levels of secrecy + buy-in from a large number of people.
Avoiding a speed-up is probably a little bit easier than enacting a slow-down, though maybe not much easier.
Demos can get very scary if we get far into prong 1, e.g. if we have AGIs that are clearly misaligned or that show they are capable of breaking many of our precautions.
Prong 3: alignment research aka “understanding minds”
Goal: understand the systems well enough to make sure they are at least corrigible, or at best ‘fully aligned’.
Roughly, this involves understanding how the behaviour of the system emerges, in enough generality that we can predict and control what happens once the system is deployed out of distribution (OOD), made more capable, etc. (a toy illustration of the OOD problem is sketched just below).
Relevant directions: agent foundations / embedded agency, interpretability, some kinds of “science of deep learning”
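To make the "deployed OOD" worry above a bit more concrete, here is a toy sketch (with made-up data) of the generic problem: a model that fits its training distribution well can behave arbitrarily differently once queried outside it. This is an analogy for the AGI case, not a model of it.

```python
# Toy out-of-distribution (OOD) illustration: a polynomial regressor that
# matches its training range closely can diverge wildly outside that range.
import numpy as np

rng = np.random.default_rng(0)

# Training distribution: x in [0, 1], true relationship y = sin(2*pi*x).
x_train = rng.uniform(0.0, 1.0, size=50)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.05, size=x_train.shape)

# A flexible model that is fine in-distribution but unconstrained outside it.
model = np.poly1d(np.polyfit(x_train, y_train, deg=7))

x_in = np.linspace(0.0, 1.0, 5)    # evaluation points like the training data
x_ood = np.linspace(1.5, 2.5, 5)   # "deployment" points we never trained on

err_in = np.abs(model(x_in) - np.sin(2 * np.pi * x_in)).max()
err_ood = np.abs(model(x_ood) - np.sin(2 * np.pi * x_ood)).max()
print(f"max error in-distribution: {err_in:.2f}, max error OOD: {err_ood:.2f}")
```

The alignment-relevant version is about goals and behaviour rather than regression error, but the structural worry is the same: good behaviour on the training and evaluation distribution underdetermines behaviour elsewhere.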
There are positive feedback loops between prongs:
Successfully containing & using more capable models (p1) gives you more scary demos for p2
Success in p1 also speeds up p3 a lot, because:
1) You can empirically study AGI directly,
2) Very advanced but “narrow” AI tools accelerate research (“narrow” here still means maybe more general than GPT-4)
3) Maybe you can even have (proto-)AGIs do research for you
You definitely need a lot of success in p2 for anything to work; otherwise, people will take all the useful work we can get from proto-AGIs and pour it into capabilities research.
Better alignment research (p3) lets you do more p1-type risky stuff with SOTA models (on the margin).
If p1 is very successful, maybe we can punt most of p3 to the AIs; conversely, if p1 seems very hard, then we probably only get ‘narrow’ tools to help with p3, have to do most of it ourselves, and have to hope ML researchers can be persuaded to delay for long enough.