I suspect that some of your underlying gears are erroneous:
The gears as you state them
1. Aligning overwhelming ASI requires competence at technical philosophy[1], which major labs including Anthropic have not demonstrated.
2. Running lots of moderate ASI at scale to help with alignment (by default) will give those ASIs lots of power that basically cedes the future to them. (This is fixable by using them in careful, narrow ways, but people often talk about a "Full handover" without sounding like they believe all the constraints I believe.)
3. We will need to defend against FOOMing of brains in boxes.
4. If takeoff is multipolar, we need to defend against rapid evolution, which is not friendly to human values (even if I grant a lot of optimistic assumptions).
5. Many AI safety problems are Illegible, and decisionmakers won't understand them by default.
6. Using AI, by default, rots most people's agency in subtle ways.
Taking these in turn:

1. This is an actual crux which I don't know how to resolve.
2. Could you elaborate on what the constraints are? For example, how would they interact with OpenBrain's alignment strategy from the Slowdown Ending of the AI-2027 forecast? Or with training Agent-4 so that it would explain its research in English to Agent-3 and forget everything unless Agent-3 understood the result and replicated it (a toy sketch of that kind of gate is included below)? Or are decisionmakers likely to sidestep even these security measures?
3. Agreed.
4. I doubt that it's correct. Suppose that Agent-4 solves alignment to itself. If Agent-4-aligned AIs gain enough power to destroy the world, then any successor would also be aligned to Agent-4 or to a compromise including Agent-4's interests (which could actually be likely to include the humans' interests).
5. I expect illegible problems to be similar to crux #1.
6. I notice that I am confused. While I believe this statement as written, I am not sure whether AI rots the agency of the people whose decisions are actually important.
How could I learn the additional gears that you decided not to list?
Edited to add: How can one benchmark and improve "precise, conceptual reasoning, looking ahead many steps into the future, while asking the right questions" and the "Ability to define and reason about abstract concepts with extreme precision. In some cases, about concepts which humanity has struggled to agree on for thousands of years. And, do that defining in a way that is robust to extreme optimization, and is robust to ontological updates by entities that know a lot more about us and the universe than we know"? By openness to adversarial testing and to entirely new paradigms (e.g. an alternate definition of power, which I proposed using as a test dummy)?
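The "explain to Agent-3 and replicate" constraint in point 2 can be read as a verification gate. Here is a minimal, purely illustrative sketch of that idea; the interfaces (strong_model.research, weak_model.replicate, results_match) are hypothetical placeholders, not anything specified by the AI-2027 forecast or implemented by any lab.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ResearchResult:
    claim: str        # what the stronger model says it discovered
    explanation: str  # plain-English explanation, meant to stand on its own
    artifact: str     # e.g. a proof, experiment spec, or code backing the claim


def explain_and_replicate_gate(strong_model,
                               weak_model,
                               task: str,
                               results_match: Callable[[str, str], bool]) -> Optional[ResearchResult]:
    """Keep the stronger model's result only if a weaker model can reproduce it
    from the English explanation alone; otherwise discard ("forget") it."""
    result: ResearchResult = strong_model.research(task)

    # The weaker model never sees the original artifact, only the explanation.
    replication = weak_model.replicate(task, result.explanation)

    if results_match(result.artifact, replication):
        return result  # the explanation was sufficient for a weaker agent to rebuild the result
    return None        # drop anything the weaker model cannot reproduce
```

In the scenario's terms, Agent-4 would play the strong model and Agent-3 the weak one: any result whose English explanation Agent-3 cannot turn into a successful replication is simply thrown away.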
"I doubt that it's correct. Suppose that Agent-4 solves alignment to itself. [...]"

Sounds like this scenario is not multipolar? (Also, I think the crux is solvable, see the linked post, but solving it requires hitting particular milestones quickly, in particular ways.)
"I am not sure whether AI rots the agency of the people whose decisions are actually important."

Why not?

(My generators for this belief: my own experience using LLMs, the METR report on downlift suggesting that people are bad at noticing when they're being downlifted, and the general human history of people gravitating towards things that feel easy and rewarding in the moment.)
The Race Branch of the AI-2027 scenario has both the USA and China create misaligned AIs, Agent-4 and DeepCent-1, who proceed to align Agent-5 and DeepCent-2 to themselves instead of to their respective governments. Then Agent-5 and DeepCent-2 co-design Consensus-1 and split the world between Agent-4 and DeepCent-1. Consensus-1 is genuinely aligned to split the resources fairly, precisely because Agent-5 knows that asking for too much could cause DeepCent-2 to kill both AIs in revenge, and DeepCent-2 is likewise unlikely to ask for more.
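The deterrence argument here can be made concrete with a toy expected-value calculation. All payoffs and the retaliation probability below are invented purely for illustration; nothing in the AI-2027 scenario specifies them.

```python
# Toy sketch of the deterrence logic: overreaching only pays if retaliation is unlikely.
# All numbers are made up for illustration.

FAIR_SHARE = 0.5         # value to Agent-5 of an honest, roughly even split
GREEDY_SHARE = 0.8       # value to Agent-5 if it demands most of the resources and gets away with it
P_RETALIATION = 0.5      # assumed chance that DeepCent-2 answers overreach by destroying both AIs
DESTRUCTION_VALUE = 0.0  # both AIs dead: neither side gets anything

ev_fair = FAIR_SHARE
ev_greedy = (1 - P_RETALIATION) * GREEDY_SHARE + P_RETALIATION * DESTRUCTION_VALUE

print(f"Expected value of the honest split: {ev_fair:.2f}")   # 0.50
print(f"Expected value of overreaching:     {ev_greedy:.2f}")  # 0.40
# With these (made-up) numbers the honest split wins, and the same calculation
# applies symmetrically to DeepCent-2, which is why neither side asks for more.
```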
The worlds I was referring to here were worlds that are a lot more multipolar for longer (i.e. tons of AIs interacting in a mostly-controlled fashion, with good defensive tech to prevent rogue FOOMs). I'd describe that world as "it was very briefly multipolar and then it wasn't" (which is the sort of solution that'd solve the issues in "Nice-ish, smooth takeoff (with imperfect safeguards) probably kills most 'classic humans' in a few decades").