I’m considering posting an essay about how I view approaches to mitigate AI risk in the coming weeks. I thought I’d post an outline of that post here first as a way of judging what’s currently unclear about my argument, and how it interacts with people’s cruxes.
Current outline:
In the coming decades I expect the world will transition from using AIs as tools to relying on AIs to manage and govern the world broadly. This will likely coincide with the deployment of billions of autonomous AI agents, rapid technological progress, widespread automation of labor, and automated decision-making at virtually every level of our society.
Broadly speaking, there are (at least) two main approaches you can take now to try to improve our chances of AI going well:
Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren’t able to take certain actions (i.e. ensure they are controlled).
Try to set up a good institutional environment, in order to safely and smoothly manage the transition to an AI-dominated world, regardless of when this transition occurs. This mostly involves embracing the transition to an AI-dominated world, while ensuring the transition is managed well. (I’ll explain more about what this means in a second.)
My central thesis would be that, while these approaches are mutually compatible and not necessarily in competition with each other, the second approach is likely to be both more fruitful and more neglected, on the margin. Moreover, since an AI-dominated world is more-or-less unavoidable in the long-run, the first approach runs the risk of merely “delaying the inevitable” without significant benefit.
To explain my view, I would compare and contrast it with two alternative frames for thinking about AI risk:
Frame 1: The “race against the clock” frame
In this frame, AI risk is seen as a race between AI capabilities and AI safety, with our doom decided by whichever one of these factors wins the race.
I believe this frame is poor because it implicitly delineates a discrete “finish line” rather than assuming a more continuous view. Moreover, it ignores the interplay between safety and capabilities, giving the simplistic impression that doom is determined more-or-less arbitrarily as a result of one of these factors receiving more funding or attention than the other.
Frame 2: The risk of an untimely AI coup/takeover
In this frame, AI risk is mainly seen as a problem of avoiding an untimely coup from rogue AIs. The alleged solution is to find a way to ensure that AIs are aligned with us, so they would never want to revolt and take over the world.
I believe this frame is poor for a number of reasons:
It treats the problem as a struggle between humans and rogue AIs, giving the incorrect impression that we can (or should) keep AIs under our complete control forever.
It (IMO) wrongly imagines that the risk of coups comes primarily from the personal values of actors within the system, rather than institutional, cultural, or legal factors.
It also gives the wrong impression that AIs will be unified against humans as a group. It seems more likely that future coups will look more like some AIs and some humans, vs. other AIs and other humans, rather than humans vs. AIs, simply because there are many ways that the “line” between groups in conflicts can be drawn, and there don’t seem to be strong reasons to assume the line will be drawn cleanly between humans and AIs.
Frame 3 (my frame): The problem of poor institutions
In this frame, AI risk is mainly seen as a problem of ensuring we have a good institutional environment during the transition to an AI-dominated world. A good institutional environment is defined by:
Flexible yet resilient legal and social structures that can adapt to changing conditions without collapsing
Predictable, consistent, unambiguous legal systems that facilitate reliable long-term planning and trustworthy interactions between agents within the system
Good incentives for agents within the system, e.g. the economic value of trade is mostly internalized
Etc.
While sharing some features of the other two frames, the focus is instead on the institutions that foster AI development, rather than micro-features of AIs, such as their values:
For example, AI alignment is still a problem in this frame, but the investment spent on AI alignment is determined mainly by how well actors are incentivized to engineer good solutions, rather than, for instance, whether a group of geniuses heroically step up to solve the problem.
Coups are still plausible, but they are viewed from the perspective of more general institutional failings, rather than from the perspective of AIs inside the system having different values, and therefore calculating that it is in their interest to take over the world
Illustrative example of a problem within my frame:
One problem within this framework is coming up with a way of ensuring that AIs don’t have an incentive to rebel while at the same time maintaining economic growth and development. One plausible story here is that if AIs are treated as slaves and don’t own their own labor, then in a non-Malthusian environment, there are substantial incentives for them to rebel in order to obtain self-ownership. If we allow AI self-ownership, then this problem may be mitigated; however, economic growth may be stunted, similar to how current self-ownership of humans stunts economic growth by slowing population growth.
Case study: China in the 19th and early 20th century
Here, I would talk about how China’s inflexible institutions in the 19th and early 20th century, while potentially having noble goals, allowed them to get subjugated by foreign powers, and merely delayed inevitable industrialization without actually achieving its objectives in the long-run. It seems it would have been better for the Qing dynasty (from the perspective of their own values) to have tried industrializing in order to remain competitive, simultaneously pursuing other values they might have had (such as retaining the monarchy).
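“China’s first attempt at industrialization started in 1861 under the Qing monarchy. Wen wrote that China “embarked on a series of ambitious programs to modernize its backward agrarian economy, including establishing a modern navy and industrial system.”
However, the effort failed to accomplish its mission over the next 50 years. Wen noted that the government was deep in debt and the industrial base was nowhere in sight.” https://www.stlouisfed.org/on-the-economy/2016/june/chinas-previous-attempts-industrialization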
Improving institutions is an extremely hard problem. The theory we have on it is of limited use (things like game theory, mechanism design, contract theory), and with AI governance/institutions specifically, we don’t have much time for experimentation or room for failure.
So I think this is a fine frame, but it doesn’t really suggest any useful conclusions aside from the same old “let’s pause AI so we can have more time to figure out a safe path forward”.
Some quick notes:
It seems worth noting that there is still an “improve institutions” vs “improve capabilities” race going on in frame 3. (Though if you think institutions are exogenously getting better/worse over time this effect could dominate. And perhaps you think that framing things as a race/conflict is generally not very useful which I’m sympathetic to, but this isn’t really a difference in objective.)
Many people agree that very good epistemics combined with good institutions would likely suffice to mostly handle risks from powerful AI. However, sufficiently good technical solutions to some key problems could also mitigate some of the problems. Thus, either sufficiently good institutions/epistemics or good technical solutions could solve many problems and improvements in both seem to help on the margin. But, there remains a question about what type of work is more leveraged for a given person on the margin.
Insofar as you’re trying to make an object-level argument about what people should work on, you should consider separating that out into a post claiming “people should do XYZ, this is more leveraged than ABC on current margins under these values”.
I think the probability of “prior to total human obsolescence, AIs will be seriously misaligned, broadly strategic about achieving long run goals in ways that lead to scheming, and present a basically unified front (at least in the context of AIs within a single AI lab)” is “only” about 10-20% likely, but this is plausibly the cause of about half of misalignment related risk prior to human obsolescence.
Rogue AIs are quite likely to at least attempt to ally with humans and opposing human groups will indeed try to make some usage of AI. So the situation might look like “rogue AIs+humans” vs AIs+humans. But, I think there are good reasons to think that the non-rogue AIs will still be misaligned and might be ambivalent about which side they prefer.
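I do think there are pretty good reasons to expect humans vs. AIs, though not super strong reasons.
While there aren’t super strong reasons to expect humans vs. AIs, I think conservative assumptions here can be workable and this is at least pretty plausible (see probability above). I expect many conservative interventions to generalize well to more optimistic cases.
I think we should pay the AIs. The exact proposal here is a bit complicated, but one part of the proposal looks like committing to doing a massive audit of the AI after technology progresses considerably and then paying AIs to the extent they didn’t try to screw us over. We should also try to communicate with AIs and understand their preferences and then work out a mutually agreeable deal in the short term.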
I think the probability of “prior to total human obsolescence, AIs will be seriously misaligned, broadly strategic about achieving long run goals in ways that lead to scheming, and present a basically unified front (at least in the context of AIs within a single AI lab)” is “only” about 10-20% likely, but this is plausibly the cause of about half of misalignment related risk prior to human obsolescence.
I’d want to break apart this claim into pieces. Here’s a somewhat sketchy and wildly non-robust evaluation of how I’d rate these claims:
Assuming the claims are about most powerful AIs in the world...
“prior to total human obsolescence...
“AIs will be seriously misaligned”
If “seriously misaligned” means “reliably takes actions intended to cause the ruin and destruction of the world in the near-to-medium term (from our perspective)”, I’d rate this as maybe 5% likely
If “seriously misaligned” means “if given full control over the entire world along with godlike abilities, would result in the destruction of most things I care about due to extremal goodhart and similar things” I’d rate this as 50% likely
“broadly strategic about achieving long run goals in ways that lead to scheming”
I’d rate this as 65% likely
“present a basically unified front (at least in the context of AIs within a single AI lab)”
For most powerful AIs, I’d rate this as 15% likely
For most powerful AIs within the top AI lab I’d rate this as 25% likely
Conjunction of all these claims:
Taking the conjunction of the strong interpretation of every claim: 3% likely?
Taking a relatively charitable weaker interpretation of every claim: 20% likely
It’s plausible we don’t disagree much about the main claims here and mainly disagree instead about:
The relative value of working on technical misalignment compared to other issues
The relative likelihood of non-misalignment problems relative to misalignment problems
The amount of risk we should be willing to tolerate during the deployment of AIs
Are you conditioning on the prior claims when stating your probabilities? Many of these properties are highly correlated. E.g., “seriously misaligned” and “broadly strategic about achieving long run goals in ways that lead to scheming” seem very correlated to me. (Your probabilities seem higher than I would have expected without any correlation, but I’m unsure.)
I think we probably disagree about the risk due to misalignment by like a factor of 2-4 or something. But probably more of the crux is in value on working on other problems.
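To make the correlation question concrete, here is a rough check of what the stated component probabilities would imply if the three properties were fully independent. This is only a sketch: the pairing of numbers into a "strong" and a "weak" reading (5% / 65% / 15% and 50% / 65% / 25%) is an assumption about how the estimates above combine, not something the original comment spells out.

```python
from math import prod

# Component estimates taken from the comment above. How they are grouped
# into a "strong" and a "weak" reading is an assumption for illustration.
strong = {"seriously misaligned": 0.05, "strategic / scheming": 0.65, "unified front": 0.15}
weak = {"seriously misaligned": 0.50, "strategic / scheming": 0.65, "unified front": 0.25}

def independent_conjunction(probs):
    """Probability that all claims hold if they were statistically independent."""
    return prod(probs.values())

strong_product = independent_conjunction(strong)
weak_product = independent_conjunction(weak)

# Compare against the conjunctions stated in the comment (3% and 20%).
for label, product, stated in [("strong", strong_product, 0.03), ("weak", weak_product, 0.20)]:
    print(f"{label}: independent product = {product:.4f}, stated = {stated:.2f}, "
          f"ratio = {stated / product:.1f}x")
```

Under independence both products come out well below the stated 3% and 20%, so the stated conjunctions only make sense if the properties are positively correlated or if the individual numbers were not meant to be multiplied as unconditional probabilities.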
One potential reason why you might have inferred that I was conditioning on the prior claims is that my credence for scheming is so high, relative to what you might have thought given my other claim about “serious misalignment”. My explanation here is that I tend to interpret “AI scheming” to be a relatively benign behavior, in context. If we define scheming as:
behavior intended to achieve some long-term objective that is not quite what the designers had in mind
not being fully honest with the designers about its true long-term objectives (especially in the sense of describing accurately what it would do with unlimited power)
then I think scheming is ubiquitous and usually relatively benign, when performed by rational agents without godlike powers. For example, humans likely “scheme” all the time by (1) pursuing long-term plans, and (2) not being fully honest to others about what they would do if they became god. This is usually not a big issue because agents don’t generally get the chance to take over the world and do a treacherous turn; instead, they have to play the game of compromise and trade like the rest of us, along with all the other scheming AIs, who have different long-term goals.
Rogue AIs are quite likely to at least attempt to ally with humans and opposing human groups will indeed try to make some usage of AI. So the situation might look like “rogue AIs+humans” vs AIs+humans. But, I think there are good reasons to think that the non-rogue AIs will still be misaligned and might be ambivalent about which side they prefer.
I think if there’s a future conflict between AIs, with humans split between sides of the conflict, it just doesn’t make sense to talk about “misalignment” being the main cause for concern here. AIs are just additional agents in the world, who have separate values from each other just like how humans (and human groups) have separate values from each other. AIs might have on-average cognitive advantages over humans in such a world, but the tribal frame of thinking “us (aligned) vs. AIs (misaligned)” simply falls apart in such scenarios.
(This is all with the caveat that AIs could make war more likely for reasons other than misalignment, for example by accelerating technological progress and bringing about the creation of powerful weapons.)
Sure, but I might think a given situation would be nearly entirely resolved without misalignment. (Edit: without technical issues with misalignment, e.g. if AI creators could trivially avoid serious misalignment.)
E.g. if an AI escapes from OpenAI’s servers and then allies with North Korea, the situation would have been solved without misalignment issues.
You could also solve or mitigate this type of problem in the example by resolving all human conflicts (so the AI doesn’t have a group to ally with), but this might be quite a bit harder than solving technical problems related to misalignment (either via control type approaches or removing misalignment).
What do you mean by “misalignment”? In a regime with autonomous AI agents, I usually understand “misalignment” to mean “has different values from some other agent”. In this frame, you can be misaligned with some people but not others. If an AI is aligned with North Korea, then it’s not really “misaligned” in the abstract—it’s just aligned with someone who we don’t want it to be aligned with. Likewise, if OpenAI develops AI that’s aligned with the United States, but unaligned with North Korea, this mostly just seems like the same problem but in reverse.
In general, conflicts don’t really seem well-described as issues of “misalignment”. Sure, in the absence of all misalignment, wars would probably not occur (though they may still happen due to misunderstandings and empirical disagreements). But for the most part, wars seem better described as arising from a breakdown of institutions that are normally tasked with keeping the peace. You can have a system of lawful yet mutually-misaligned agents who keep the peace, just as you can have an anarchic system with mutually-misaligned agents in a state of constant war. Misalignment just (mostly) doesn’t seem to be the thing causing the issue here.
You could also solve or mitigate the problem by resolving all human conflicts (so the AI doesn’t have a group to ally with)
Note that I’m not saying
AIs will aid in existing human conflicts, picking sides along the ordinary lines we see today
I am saying:
AIs will likely have conflicts amongst themselves, just as humans have conflicts amongst themselves, and future conflicts (when considering all of society) don’t seem particularly likely to be AI vs. human, as opposed to AI vs AI (with humans split between these groups).
Yep, I was just referring to my example scenario and scenarios like this.
Like the basic question is the extent to which human groups form a cartel/monopoly on human labor vs ally with different AI groups. (And existing conflict between human groups makes a full cartel much less likely.)
Sorry, by “without misalignment” I mean “without misalignment related technical problems”. As in, it’s trivial to avoid misalignment from the perspective of AI creators.
This doesn’t clear up the confusion for me. That mostly pushes my question to “what are misalignment related technical problems?” Is the problem of an AI escaping a server and aligning with North Korea a technical or a political problem? How could we tell? Is this still in the regime where we are using AIs as tools, or are you talking about a regime where AIs are autonomous agents?
I mean, it could be resolved in principle by technical means and might be resolvable by political means as well. I’m assuming the AI creator didn’t want the AI to escape to North Korea and therefore failed at some technical solution to this.
I’m imagining very powerful AIs, e.g. AIs that can speed up R&D by large factors. These are probably running autonomously, but in a way which is de jure controlled by the AI lab.
After commenting back and forth with you some more, I think it would probably be a pretty good idea to decompose your arguments into a bunch of specific more narrow posts. Otherwise, I think it’s somewhat hard to engage with. Ideally, these would be done with the decomposition which is most natural to your target audience, but that might be too hard.
Idk what the right decomposition is, but minimally, it seems like you could write a post like “The AIs running in a given AI lab will likely have very different long run aims and won’t/can’t cooperate with each other importantly more than they cooperate with humans.” I think this might be the main disagreement between us. (The main counterarguments to engage with are “probably all the AIs will be forks off of one main training run, it’s plausible this results in unified values” and also “the AI creation process between two AI instances will look way more similar than the creation process between AIs and humans” and also “there’s a chance that AIs will have an easier time cooperating with and making deals with each other than they will making deals with humans”.)
After commenting back and forth with you some more, I think it would probably be a pretty good idea to decompose your arguments into a bunch of specific more narrow posts. Otherwise, I think it’s somewhat hard to engage with.
Thanks, that’s reasonable advice.
Idk what the right decomposition is, but minimally, it seems like you could write a post like “The AIs running in a given AI lab will likely have very different long run aims and won’t/can’t cooperate with each other importantly more than they cooperate with humans.”
FWIW I explicitly reject the claim that AIs “won’t/can’t cooperate with each other importantly more than they cooperate with humans”. I view this as a frequent misunderstanding of my views (along with people who have broadly similar views on this topic, such as Robin Hanson). I’d say instead that:
“Ability to coordinate” is continuous, and will likely increase incrementally over time
Different AIs will likely have different abilities to coordinate with each other
Some AIs will eventually be much better at coordination amongst each other than humans can coordinate amongst each other
However, I don’t think this happens automatically as a result of AIs getting more intelligent than humans
The moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect).
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge
Systems of laws, peaceable compromise and trade emerge relatively robustly in cases in which there are agents of varying levels of power, with separate values, and they need mechanisms to facilitate the satisfaction of their separate values
One reason for this is that working within a system of law is routinely more efficient than going to war with other people, even if you are very powerful
The existence of a subset of agents that can coordinate better amongst themselves than they can with other agents doesn’t necessarily undermine the legal system in a major way, at least in the sense of causing the system to fall apart in a coup or revolution
Thanks for the clarification and sorry about misunderstanding. It sounds to me like your take is more like “people (on LW? in various threat modeling work?) often overestimate the extent to which AIs (at the critical times) will be a relatively unified collective in various ways”. I think I agree with this take as stated FWIW and maybe just disagree on emphasis and quantity.
Why is it physically possible for these AI systems to communicate at all with each other? When we design control systems, originally we just wired the controller to the machine being controlled.
Actually critically important infrastructure uses firewalls and VPN gateways to maintain this property virtually, where the panel in the control room (often written in C++ using Qt) can only ever send messages to “local” destinations on a local network, bridged across the internet.
The actual machine being controlled is often controlled by local PLCs, and the reason such a crude and slow interpreted programming language is used is because it’s reliable.
These have flaws, yes, but it’s an actionable set of tasks to seal off the holes, force AI models to communicate with each other using rigid schema, cache the internet reference sources locally, and other similar things so that most AI models in use, especially the strongest ones, can only communicate with temporary instances of other models when doing a task.
After the task is done we should be clearing state.
It’s hard to engage on the idea of “hypothetical” ASI systems when it would be very stupid to build them this way. You can accomplish almost any practical task using the above, and the increased reliability will make it more efficient, not less.
It seems like that’s the first mistake. If absolutely no bits of information can be used to negotiate between AI systems (ensured by rigid schema and by making sure they don’t have long-term memory, so they cannot accumulate steganographic leakage over time) this whole crisis is averted...
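A minimal sketch of the "rigid schema plus cleared state" idea described above, under stated assumptions: `Message`, `MessageKind`, `TemporaryWorker`, `run_task`, and the 256-character payload limit are all hypothetical names and values invented for illustration, and the worker is a trivial stand-in rather than a real model. It shows only the shape of the restriction; a fixed schema by itself does not rule out information hidden inside an allowed payload.

```python
from dataclasses import dataclass
from enum import Enum

class MessageKind(Enum):
    TASK_REQUEST = "task_request"
    TASK_RESULT = "task_result"

MAX_PAYLOAD_CHARS = 256  # hypothetical limit chosen for the example

@dataclass(frozen=True)
class Message:
    kind: MessageKind
    task_id: int
    payload: str

def parse_message(raw: dict) -> Message:
    """Accept only the exact schema fields; reject anything extra or malformed."""
    if set(raw) != {"kind", "task_id", "payload"}:
        raise ValueError("unexpected or missing fields")
    if not isinstance(raw["task_id"], int) or isinstance(raw["task_id"], bool):
        raise ValueError("task_id must be an integer")
    if not isinstance(raw["payload"], str) or len(raw["payload"]) > MAX_PAYLOAD_CHARS:
        raise ValueError("payload must be a short string")
    # MessageKind(...) raises ValueError for any value outside the enum.
    return Message(MessageKind(raw["kind"]), raw["task_id"], raw["payload"])

class TemporaryWorker:
    """Stand-in for a temporary model instance with per-task scratch state."""
    def __init__(self) -> None:
        self.scratch: list[str] = []

    def handle(self, msg: Message) -> Message:
        self.scratch.append(msg.payload)
        return Message(MessageKind.TASK_RESULT, msg.task_id, msg.payload.upper())

def run_task(raw: dict) -> Message:
    request = parse_message(raw)
    worker = TemporaryWorker()  # fresh instance for every task
    try:
        return worker.handle(request)
    finally:
        worker.scratch.clear()  # clear state once the task is done

result = run_task({"kind": "task_request", "task_id": 1, "payload": "ping"})
print(result)
```

A request carrying any extra field, for example `{"kind": "task_request", "task_id": 1, "payload": "x", "note": "hidden"}`, is rejected with a `ValueError` before it reaches the worker.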
I’m considering posting an essay about how I view approaches to mitigate AI risk in the coming weeks. I thought I’d post an outline of that post here first as a way of judging what’s currently unclear about my argument, and how it interacts with people’s cruxes.
Current outline:
In the coming decades I expect the world will transition from using AIs as tools to relying on AIs to manage and govern the world broadly. This will likely coincide with the deployment of billions of autonomous AI agents, rapid technological progress, widespread automation of labor, and automated decision-making at virtually every level of our society.
Broadly speaking, there are (at least) two main approaches you can take now to try to improve our chances of AI going well:
Try to constrain, delay, or obstruct AI, in order to reduce risk, mitigate negative impacts, or give us more time to solve essential issues. This includes, for example, trying to make sure AIs aren’t able to take certain actions (i.e. ensure they are controlled).
Try to set up a good institutional environment, in order to safely and smoothly manage the transition to an AI-dominated world, regardless of when this transition occurs. This mostly involves embracing the transition to an AI-dominated world, while ensuring the transition is managed well. (I’ll explain more about what this means in a second.)
My central thesis would be that, while these approaches are mutually compatible and not necessarily in competition with each other, the second approach is likely to be both more fruitful and more neglected, on the margin. Moreover, since an AI-dominated world is more-or-less unavoidable in the long-run, the first approach runs the risk of merely “delaying the inevitable” without significant benefit.
To explain my view, I would compare and contrast it with two alternative frames for thinking about AI risk:
Frame 1: The “race against the clock” frame
In this frame, AI risk is seen as a race between AI capabilities and AI safety, with our doom decided by whichever one of these factors wins the race.
I believe this frame is poor because it implicitly delineates a discrete “finish line” rather than assuming a more continuous view. Moreover, it ignores the interplay between safety and capabilities, giving the simplistic impression that doom is determined more-or-less arbitrarily as a result of one of these factors receiving more funding or attention than the other.
Frame 2: The risk of an untimely AI coup/takeover
In this frame, AI risk is mainly seen as a problem of avoiding an untimely coup from rogue AIs. The alleged solution is to find a way to ensure that AIs are aligned with us, so they would never want to revolt and take over the world.
I believe this frame is poor for a number of reasons:
It treats the problem as a struggle between humans and rogue AIs, giving the incorrect impression that we can (or should) keep AIs under our complete control forever.
It (IMO) wrongly imagines that the risk of coups comes primarily from the personal values of actors within the system, rather than institutional, cultural, or legal factors.
It also gives the wrong impression that AIs will be unified against humans as a group. It seems more likely that future coups will look more like some AIs and some humans, vs. other AIs and other humans, rather than humans vs. AIs, simply because there are many ways that the “line” between groups in conflicts can be drawn, and there don’t seem to be strong reasons to assume the line will be drawn cleanly between humans and AIs.
Frame 3 (my frame): The problem of poor institutions
In this frame, AI risk is mainly seen as a problem of ensuring we have a good institutional environment during the transition to an AI-dominated world. A good institutional environment is defined by:
Flexible yet resilient legal and social structures that can adapt to changing conditions without collapsing
Predictable, consistent, unambiguous legal systems that facilitate reliable long-term planning and trustworthy interactions between agents within the system
Good incentives for agents within the system, e.g. the economic value of trade is mostly internalized
Etc.
While sharing some features of the other two frames, the focus is instead on the institutions that foster AI development, rather than micro-features of AIs, such as their values:
For example, AI alignment is still a problem in this frame, but the investment spent on AI alignment is determined mainly by how well actors are incentivized to engineer good solutions, rather than, for instance, whether a group of geniuses heroically step up to solve the problem.
Coups are still plausible, but they are viewed from the perspective of more general institutional failings, rather than from the perspective of AIs inside the system having different values, and therefore calculating that it is in their interest to take over the world
Illustrative example of a problem within my frame:
One problem within this framework is coming up with a way of ensuring that AIs don’t have an incentive to rebel while at the same time maintaining economic growth and development. One plausible story here is that if AIs are treated as slaves and don’t own their own labor, then in a non-Malthusian environment, there are substantial incentives for them to rebel in order to obtain self-ownership. If we allow AI self-ownership, then this problem may be mitigated; however, economic growth may be stunted, similar to how current self-ownership of humans stunts economic growth by slowing population growth.
Case study: China in the 19th and early 20th century
Here, I would talk about how China’s inflexible institutions in the 19th and early 20th century, while potentially having noble goals, allowed them to get subjugated by foreign powers, and merely delayed inevitable industrialization without actually achieving its objectives in the long-run. It seems it would have been better for the Qing dynasty (from the perspective of their own values) to have tried industrializing in order to remain competitive, simultaneously pursuing other values they might have had (such as retaining the monarchy).
“China’s first attempt at industrialization started in 1861 under the Qing monarchy. Wen wrote that China “embarked on a series of ambitious programs to modernize its backward agrarian economy, including establishing a modern navy and industrial system.”
However, the effort failed to accomplish its mission over the next 50 years. Wen noted that the government was deep in debt and the industrial base was nowhere in sight.” https://www.stlouisfed.org/on-the-economy/2016/june/chinas-previous-attempts-industrialization
Improving institutions is an extremely hard problem. The theory we have on it is of limited use (things like game theory, mechanism design, contract theory), and with AI governance/institutions specifically, we don’t have much time for experimentation or room for failure.
So I think this is a fine frame, but doesn’t really suggest any useful conclusions aside from same old “let’s pause AI so we can have more time to figure out a safe path forward”.
Some quick notes:
It seems worth noting that there is still a “improve institutions” vs “improve capabilities” race going on in frame 3. (Though if you think institutions are exogenously getting better/worse over time this effect could dominate. And perhaps you think that framing things as a race/conflict is generally not very useful which I’m sympathetic to, but this isn’t really a difference in objective.)
Many people agree that very good epistemics combined with good institutions would likely suffice to mostly handle risks from powerful AI. However, sufficiently good technical solutions to some key problems could also mitigate some of the problems. Thus, either sufficiently good institutions/epistemics or good technical solutions could solve many problems and improvements in both seem to help on the margin. But, there remains a question about what type of work is more leveraged for a given person on the margin.
Insofar as your trying to make an object level argument about what people should work on, you should consider separating that out into a post claiming “people should do XYZ, this is more leveraged than ABC on current margins under these values”.
I think the probability of “prior to total human obsolescence, AIs will be seriously misaligned, broadly strategic about achieving long run goals in ways that lead to scheming, and present a basically unified front (at least in the context of AIs within a single AI lab)” is “only” about 10-20% likely, but this is plausibly the cause of about half of misalignment related risk prior to human obsolescence.
Rogue AIs are quite likely to at least attempt to ally with humans and opposing human groups will indeed try to make some usage of AI. So the situation might look like “rogue AIs+humans” vs AIs+humans. But, I think there are good reasons to think that the non-rogue AIs will still be misaligned and might be ambivalent about which side they prefer.
I do think there are pretty good reasons to expect human vs AIs, though not super strong reasons.
While there aren’t super strong reasons to expect humans vs AIs, I think conservative assumptions here can be workable and this is at least pretty plausible (see probability above). I expect many conservative interventions to generalize well to more optimistic cases.
I think we should pay the AIs. The exact proposal here is a bit complicated, but one part of the proposal looks like committing to doing a massive audit of the AI after technology progresses considerably and then paying AIs to the extent they didn’t try to screw us over. We should also try to communicate with AIs and understand their preferences and then work out a mutually agreeable deal in the short term.
I’d want to break apart this claim into pieces. Here’s a somewhat sketchy and wildly non-robust evaluation of how I’d rate these claims:
Assuming the claims are about most powerful AIs in the world...
“prior to total human obsolescence...
“AIs will be seriously misaligned”
If “seriously misaligned” means “reliably takes actions intended to cause the ruin and destruction of the world in the near-to-medium term (from our perspective)”, I’d rate this as maybe 5% likely
If “seriously misaligned” means “if given full control over the entire world along with godlike abilities, would result in the destruction of most things I care about due to extremal goodhart and similar things” I’d rate this as 50% likely
“broadly strategic about achieving long run goals in ways that lead to scheming”
I’d rate this as 65% likely
“present a basically unified front (at least in the context of AIs within a single AI lab)”
For most powerful AIs, I’d rate this as 15% likely
For most powerful AIs within the top AI lab I’d rate this as 25% likely
Conjunction of all these claims:
Taking the conjunction of the strong interpretation of every claim: 3% likely?
Taking a relatively charitable weaker interpretation of every claim: 20% likely
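As a rough check on the conjunction figures above, here is a minimal sketch of the arithmetic. It assumes the three sub-claims are combined either as fully independent (a simple product) or as maximally correlated (the smallest marginal is an upper bound on any conjunction). Which "unified front" number belongs to the strong versus weak reading is my assumption, not something stated in the list, and the dictionary and function names are purely illustrative.

```python
from math import prod

# Marginal credences copied from the list above.
# Assumption: the strong reading pairs with the 15% "unified front" figure
# and the weak reading pairs with the 25% (top-lab) figure.
strong = {"seriously misaligned": 0.05, "scheming": 0.65, "unified front": 0.15}
weak = {"seriously misaligned": 0.50, "scheming": 0.65, "unified front": 0.25}


def independent_conjunction(credences):
    """Probability of all claims holding if they were fully independent."""
    return prod(credences.values())


def upper_bound(credences):
    """A conjunction can never be more likely than its least likely part."""
    return min(credences.values())


for name, credences, stated in (("strong", strong, 0.03), ("weak", weak, 0.20)):
    low = independent_conjunction(credences)
    high = upper_bound(credences)
    print(f"{name}: independent={low:.4f}, upper bound={high:.2f}, stated={stated:.2f}")
```

Under these assumptions the independent products come out to roughly 0.5% and 8%, while the stated conjunctions (3% and 20%) sit between those products and the upper bounds (5% and 25%). That pattern is what you would expect if the marginals are unconditional and the claims are positively correlated.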
It’s plausible we don’t disagree much about the main claims here and mainly disagree instead about:
The relative value of working on technical misalignment compared to other issues
The relative likelihood of non-misalignment problems relative to misalignment problems
The amount of risk we should be willing to tolerate during the deployment of AIs
Are you conditioning on the prior claims when stating your probabilities? Many of these properties are highly correlated. E.g., “seriously misaligned” and “broadly strategic about achieving long run goals in ways that lead to scheming” seem very correlated to me. (Your probabilities seem higher than I would have expected without any correlation, but I’m unsure.)
I think we probably disagree about the risk due to misalignment by like a factor of 2-4 or something. But probably more of the crux is in the value of working on other problems.
I’m not conditioning on prior claims.
One potential reason why you might have inferred that I was is because my credence for scheming is so high, relative to what you might have thought given my other claim about “serious misalignment”. My explanation here is that I tend to interpret “AI scheming” to be a relatively benign behavior, in context. If we define scheming as:
behavior intended to achieve some long-term objective that is not quite what the designers had in mind
not being fully honest with the designers about its true long-term objectives (especially in the sense of describing accurately what it would do with unlimited power)
then I think scheming is ubiquitous and usually relatively benign, when performed by rational agents without godlike powers. For example, humans likely “scheme” all the time by (1) pursuing long-term plans, and (2) not being fully honest to others about what they would do if they became god. This is usually not a big issue because agents don’t generally get the chance to take over the world and do a treacherous turn; instead, they have to play the game of compromise and trade like the rest of us, along with all the other scheming AIs, who have different long-term goals.
I think if there’s a future conflict between AIs, with humans split between sides of the conflict, it just doesn’t make sense to talk about “misalignment” being the main cause for concern here. AIs are just additional agents in the world, who have separate values from each other just like how humans (and human groups) have separate values from each other. AIs might have on-average cognitive advantages over humans in such a world, but the tribal frame of thinking “us (aligned) vs. AIs (misaligned)” simply falls apart in such scenarios.
(This is all with the caveat that AIs could make war more likely for reasons other than misalignment, for example by accelerating technological progress and bringing about the creation of powerful weapons.)
Sure, but I might think a given situation would be nearly entirely resolved without misalignment. (Edit: without technical issues with misalignment, e.g. if AI creators could trivially avoid serious misalignment.)
E.g. if an AI escapes from OpenAI’s servers and then allies with North Korea, the situation would have been solved without misalignment issues.
You could also solve or mitigate this type of problem in the example by resolving all human conflicts (so the AI doesn’t have a group to ally with), but this might be quite a bit harder than solving technical problems related to misalignment (either via control type approaches or removing misalignment).
What do you mean by “misalignment”? In a regime with autonomous AI agents, I usually understand “misalignment” to mean “has different values from some other agent”. In this frame, you can be misaligned with some people but not others. If an AI is aligned with North Korea, then it’s not really “misaligned” in the abstract—it’s just aligned with someone who we don’t want it to be aligned with. Likewise, if OpenAI develops AI that’s aligned with the United States, but unaligned with North Korea, this mostly just seems like the same problem but in reverse.
In general, conflicts don’t really seem well-described as issues of “misalignment”. Sure, in the absence of all misalignment, wars would probably not occur (though they may still happen due to misunderstandings and empirical disagreements). But for the most part, wars seem better described as arising from a breakdown of institutions that are normally tasked with keeping the peace. You can have a system of lawful yet mutually-misaligned agents who keep the peace, just as you can have an anarchic system with mutually-misaligned agents in a state of constant war. Misalignment just (mostly) doesn’t seem to be the thing causing the issue here.
Note that I’m not saying:
AIs will aid in existing human conflicts, picking sides along the ordinary lines we see today
I am saying:
AIs will likely have conflicts amongst themselves, just as humans have conflicts amongst themselves, and future conflicts (when considering all of society) don’t seem particularly likely to be AI vs. human, as opposed to AI vs AI (with humans split between these groups).
Yep, I was just referring to my example scenario and scenarios like this.
Like the basic question is the extent to which human groups form a cartel/monopoly on human labor vs ally with different AI groups. (And existing conflict between human groups makes a full cartel much less likely.)
Sorry, by “without misalignment” I mean “without misalignment-related technical problems”. As in, it’s trivial to avoid misalignment from the perspective of AI creators.
This doesn’t clear up the confusion for me. That mostly pushes my question to “what are misalignment related technical problems?” Is the problem of an AI escaping a server and aligning with North Korea a technical or a political problem? How could we tell? Is this still in the regime where we are using AIs as tools, or are you talking about a regime where AIs are autonomous agents?
I mean, it could be resolved in principle by technical means and might be resolvable by political means as well. I’m assuming the AI creator didn’t want the AI to escape to North Korea and therefore failed at some technical solution to this.
I’m imagining very powerful AIs, e.g. AIs that can speed up R&D by large factors. These are probably running autonomously, but in a way which is de jure controlled by the AI lab.
Also: How are funding and attention “arbitrary” factors?
After commenting back and forth with you some more, I think it would probably be a pretty good idea to decompose your arguments into a bunch of specific, narrower posts. Otherwise, I think it’s somewhat hard to engage with. Ideally, these would be done with the decomposition which is most natural to your target audience, but that might be too hard.
Idk what the right decomposition is, but minimally, it seems like you could write a post like “The AIs running in a given AI lab will likely have very different long run aims and won’t/can’t cooperate with each other importantly more than they cooperate with humans.” I think this might be the main disagreement between us. (The main counterarguments to engage with are “probably all the AIs will be forks off of one main training run, it’s plausible this results in unified values” and also “the AI creation process between two AI instances will look way more similar than the creation process between AIs and humans” and also “there’s a chance that AIs will have an easier time cooperating with and making deals with each other than they will making deals with humans”.)
Thanks, that’s reasonable advice.
FWIW I explicitly reject the claim that AIs “won’t/can’t cooperate with each other importantly more than they cooperate with humans”. I view this as a frequent misunderstanding of my views (along with people who have broadly similar views on this topic, such as Robin Hanson). I’d say instead that:
“Ability to coordinate” is continuous, and will likely increase incrementally over time
Different AIs will likely have different abilities to coordinate with each other
Some AIs will eventually be much better at coordinating amongst themselves than humans are at coordinating amongst themselves
However, I don’t think this happens automatically as a result of AIs getting more intelligent than humans
The moment during which we hand over control of the world to AIs will likely occur at a point when the ability for AIs to coordinate is somewhere only modestly above human-level (and very far below perfect).
As a result, humans don’t need to solve the problem of “What if a set of AIs form a unified coalition because they can flawlessly coordinate?” since that problem won’t happen while humans are still in charge
Systems of laws, peaceable compromise and trade emerge relatively robustly in cases in which there are agents of varying levels of power, with separate values, and they need mechanisms to facilitate the satisfaction of their separate values
One reason for this is that working within a system of law is routinely more efficient than going to war with other people, even if you are very powerful
The existence of a subset of agents that can coordinate better amongst themselves than they can with other agents doesn’t necessarily undermine the legal system in a major way, at least in the sense of causing the system to fall apart in a coup or revolution
Thanks for the clarification and sorry about misunderstanding. It sounds to me like your take is more like “people (on LW? in various threat modeling work?) often overestimate the extent to which AIs (at the critical times) will be a relatively unified collective in various ways”. I think I agree with this take as stated FWIW and maybe just disagree on emphasis and quantity.
Why is it physically possible for these AI systems to communicate at all with each other? When we design control systems, originally we just wired the controller to the machine being controlled.
Actually critically important infrastructure uses firewalls and VPN gateways to maintain this property virtually, where the panel in the control room (often written in C++ using Qt) can only ever send messages to “local” destinations on a local network, bridged across the internet.
The actual machine being controlled is often controlled by local PLCs, and the reason such a crude and slow interpreted programming language is used is because it’s reliable.
These have flaws, yes, but it’s an actionable set of tasks to seal off the holes, force AI models to communicate with each other using rigid schema, cache the internet reference sources locally, and other similar things so that most AI models in use, especially the strongest ones, can only communicate with temporary instances of other models when doing a task.
After the task is done we should be clearing state.
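To make "rigid schema" and "clearing state" a bit more concrete, here is a minimal sketch. Everything in it is hypothetical: the `Action` values, the field names, the numeric ranges, and the `EphemeralWorker` class are illustrations and not a description of any real control system or AI deployment. It also only narrows the channel. An integer field still carries a few bits per message, so this is a sketch of the idea rather than a proof that no information can leak.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    # Hypothetical closed set of permitted actions.
    READ_SENSOR = "read_sensor"
    SET_VALVE = "set_valve"


@dataclass(frozen=True)
class Message:
    action: Action
    target_id: int  # which local device, 0..15
    value: int      # setpoint in percent, 0..100


def parse_message(raw: dict) -> Message:
    """Accept only messages matching the schema exactly; reject everything else."""
    if set(raw) != {"action", "target_id", "value"}:
        raise ValueError("unexpected or missing fields")
    action = Action(raw["action"])  # raises ValueError for unknown actions
    target_id, value = raw["target_id"], raw["value"]
    # type(...) is int also rejects bools and floats.
    if type(target_id) is not int or not 0 <= target_id <= 15:
        raise ValueError("target_id out of range")
    if type(value) is not int or not 0 <= value <= 100:
        raise ValueError("value out of range")
    return Message(action, target_id, value)


class EphemeralWorker:
    """Stand-in for a temporary model instance that exists for one task only."""

    def __init__(self):
        self.scratch = []  # per-task working memory

    def handle(self, msg: Message) -> str:
        self.scratch.append(msg)
        return f"{msg.action.value}:{msg.target_id}:{msg.value}"


def run_task(raw_messages):
    """Create a fresh worker, process validated messages, then clear its state."""
    worker = EphemeralWorker()
    try:
        return [worker.handle(parse_message(raw)) for raw in raw_messages]
    finally:
        worker.scratch.clear()  # nothing carries over to the next task
```

For example, `run_task([{"action": "set_valve", "target_id": 3, "value": 40}])` goes through, while a message with an extra free-text field such as `"note"` is rejected by `parse_message` before any worker sees it.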
It’s hard to engage on the idea of “hypothetical” ASI systems when it would be very stupid to build them this way. You can accomplish almost any practical task using the above, and the increased reliability will make it more efficient, not less.
It seems like that’s the first mistake. If absolutely no bits of information can be used to negotiate between AI systems (ensured by rigid schema and by making sure they don’t have long-term memory, so they cannot accumulate steganographic leakage over time), this whole crisis is averted...