Ryan and Buck wrote:
> The control approach we’re imagining won’t work for arbitrarily powerful AIs
Okay, so if AI Control works well, how do we plan to use our controlled AI to reach a safe/aligned ASI?
Different people have different opinions. I think it would be good to have a public plan so that people can notice if they disagree and comment if they see problems.
Opinions I’ve heard so far:
- Solve ELK / mechanistic anomaly detection / something else that ARC suggested (see the toy sketch further below)
- Let the AI come up with alignment plans, such as ELK, that would be sufficient for aligning an ASI
- Use examples of “we caught the AI doing something bad” to convince governments to regulate and give us more time before scaling up
- Study the misaligned AI behaviour in order to develop a science of alignment
- We use this AI to produce a ton of value (GDP goes up by ~10%/year); people are happy and don’t push that hard to advance capabilities even further, and this can be combined with regulation that prevents an arms race and pauses our advancement
- We use this AI to invent a new paradigm for AI which isn’t based on deep learning and is easier to align
- We teach the AI to reason about morality (such as by considering hypothetical situations) instead of responding with “the first thing that comes to its mind”, which will allow it to generalize human values not just better than RLHF but also better than many humans, and so pass a bar for friendliness
These 7 answers are from 4 people.
I think if you have a “minimally viable product”, you can speed up davidad’s Safeguarded AI and use it to improve interpretability.
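Of the opinions above, mechanistic anomaly detection is the easiest to gesture at concretely. The following is a rough sketch only, not ARC’s actual proposal: a generic activation-outlier detector, where the array shapes, the Gaussian model, and the 99th-percentile threshold are all assumptions made for illustration.

```python
# Toy sketch of activation-based anomaly detection (a simplified cousin of
# "mechanistic anomaly detection"): fit a Gaussian over a model's hidden
# activations on trusted behaviour, then flag inputs whose activations are
# statistical outliers. All names and numbers here are illustrative.
import numpy as np

def fit_reference(trusted_activations: np.ndarray):
    """trusted_activations: (n_samples, hidden_dim) array from known-good runs."""
    mean = trusted_activations.mean(axis=0)
    cov = np.cov(trusted_activations, rowvar=False)
    # Regularize so the covariance is invertible even with few samples.
    cov += 1e-3 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def anomaly_score(activation: np.ndarray, mean: np.ndarray, cov_inv: np.ndarray) -> float:
    """Mahalanobis distance of one activation vector from the trusted distribution."""
    diff = activation - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

# Usage: flag any deployment-time activation whose score exceeds a threshold
# calibrated on the trusted data (here, the 99th percentile).
rng = np.random.default_rng(0)
trusted = rng.normal(size=(500, 32))          # stand-in for recorded activations
mean, cov_inv = fit_reference(trusted)
threshold = np.percentile([anomaly_score(x, mean, cov_inv) for x in trusted], 99)
suspect = rng.normal(loc=3.0, size=32)        # an out-of-distribution activation
print(anomaly_score(suspect, mean, cov_inv) > threshold)  # likely True
```

Real proposals aim at something stronger than distributional outlier detection (roughly, noticing when an output is produced for anomalous reasons), but a sketch like this conveys the basic monitoring shape.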
I think all 7 of those plans are far short of adequate to count as a real plan. There are a lot of more serious plans out there, but I don’t know where they’re nicely summarized.
“What’s the short timeline plan?” poses this question but also focuses on control, testing, and regulation, almost skipping over alignment.
Paul Christiano’s and Rohin Shah’s plans are the two most serious. Neither of them has published a concise “this is the plan” statement, and both have probably substantially updated their plans since.
These are the standard-bearers for “prosaic alignment” as a real path to aligning AGI and ASI. There is a ton of work on aligning LLMs, but very little work, AFAICT, on how and whether that extends to AGIs based on LLMs. That’s why Paul and Rohin remain the standard-bearers despite not having worked publicly and directly on this for a few years.
I work primarily on this, since I think it’s the most underserved area of AGI x-risk: aligning the type of AGI people are most likely to build on the current path.
My plan can perhaps be described as extending prosaic alignment to LLM agents with new techniques, and from there to real AGI. A key strategy is using instruction-following as the alignment target. It is currently probably best summarized in my response to “What’s the short timeline plan?”
Improved governance design. Designing treaties that are an easier sell for competitors. Using improved coordination to enter win-win races to the top, and to escape lose-lose races to the bottom and other inadequate equilibria.
Lowering the costs of enforcement. For example, creating privacy-preserving inspections with verified AI inspectors that report on only a strictly limited set of pre-agreed things and are then deleted (a toy sketch of this idea follows below).
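Since the inspection idea is the most mechanism-like item here, a toy sketch may help make it concrete. Everything in it is hypothetical: the two pre-agreed questions, the compute threshold, and the record format are invented for illustration; the point is only that the inspector’s output is constrained to a pre-agreed schema and nothing else leaves the site.

```python
# Toy sketch of a "verified AI inspector": the inspector is granted access to
# sensitive material, but its output is forced through a pre-agreed schema of
# yes/no questions, and everything else it saw is discarded.
# The questions, thresholds, and record format are all hypothetical.
from dataclasses import dataclass

PRE_AGREED_QUESTIONS = (
    "training_run_exceeds_compute_threshold",
    "weights_transferred_offsite",
)

@dataclass(frozen=True)
class InspectionReport:
    """The only artifact that leaves the inspected site."""
    answers: dict  # maps each pre-agreed question to True/False

def run_inspection(site_records: dict) -> InspectionReport:
    # The inspector may read everything in site_records, but may only emit
    # answers to the pre-agreed questions.
    answers = {
        "training_run_exceeds_compute_threshold":
            site_records.get("total_flop", 0) > 1e26,
        "weights_transferred_offsite":
            bool(site_records.get("external_weight_transfers")),
    }
    report = InspectionReport(answers=answers)
    # In the real proposal the inspector and its working memory would be
    # deleted at this point; in this sketch we just drop the reference.
    del site_records
    return report

print(run_inspection({"total_flop": 3e26, "external_weight_transfers": []}).answers)
```

The hard parts, verifying the inspector itself and proving that nothing else is retained, are exactly what the “verified” and “then deleted” clauses stand in for.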