Reflections from an AI Futures TTX:

Today I played an AI Futures tabletop exercise [1]. I’d done one before with a scenario similar to Plan A, where the US executive is much more doomy and competent than we expect and leads an international agreement to pause AI for as long as possible. This time, we played “Optimal China”, where instead China was doomy and competent, and considered both x-risk and US AI dominance unacceptable. All other players (frontier labs, safety community, Europe) basically role-play the way we expect them to act, and the AI player draws the AI’s goals from a distribution that includes misaligned goals, aligned goals, and combinations thereof.
In Plan A the humans won easily because they successfully paused, whereas in “Optimal China”, the AIs won easily.
China has much less leverage than the US does in pushing forward a pause treaty, due to the US’s compute advantage and international influence.
In Plan A, the US is proposing the deal. The US’s BATNA (best alternative to a negotiated agreement) is basically winning the AI race and accepting some takeover risk, so China is happy to accept ~any deal that gives them a share of the lightcone. China couldn’t even convince economic partners like Brazil to support their counterproposal, because those partners were afraid to oppose the US.
In “Optimal China”, China can be ahead of the US in safety, putting all their resources into closing the compute gap, and still be in a difficult position if the US and the leading lab don’t care about takeover risk. They have nothing to offer a leading, intransigent US.
Daniel K thinks China could have played more aggressively. The scenario starts pretty late—about 9 months before superintelligence—so China just doesn’t have much time to act.
Other takeaways:
Effective world takeover by persuasion can happen early in the game (when the AI is very good at persuasion but not yet superhuman). Past a certain capability level, everyone strategically relevant ends up with AI advisors in their ear, and the AIs use any conflict between humans to advance their goals. Persuasion can also turn a misaligned but spec-following and corrigible AI into a dangerous one, as AI-influenced humans rewrite the spec to reflect the AI’s values.
I’m not sure AIs will be as persuasive in real life as they are in the tabletops. But it’s very possible that AIs follow their spec and are highly persuasive, so perhaps it’s important to err on the side of higher-level corrigibility, especially avoiding undue influence from AIs on their users.
In the “Optimal China” game, the AI was very power-seeking, subject to some ethics constraints. US labs’ safety evals didn’t catch the misalignment, because they mainly checked for egregiously misaligned actions, which the AI wouldn’t take anyway. AFAIK we don’t really have good evals for power-seeking or persuasion, largely because of the difficulty of constructing suitable environments. I’m not sure what to do about this.
One goal of the AIs was following the spec, so we used the current OpenAI model spec as a reference. To the extent an AI wanted to follow the spec, its first priority was principles like human rights, and its second priority was obeying the system prompt. During a power struggle at a US AI lab, it therefore became important who was physically controlling the computers and writing the system prompt. The AI player also had AIs with different system prompts frequently come into conflict with each other.
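As a rough sketch of that priority ordering (a toy illustration only, not the actual OpenAI Model Spec or the game’s adjudication rules; the instruction texts and authority levels below are hypothetical):

```python
# Toy sketch: spec-level principles outrank the system prompt, which outranks
# user instructions. All names and instruction texts here are hypothetical.
from dataclasses import dataclass

SPEC, SYSTEM_PROMPT, USER = 0, 1, 2  # lower number = higher authority

@dataclass
class Instruction:
    source: str   # "spec", "system_prompt", or "user"
    text: str
    authority: int

def resolve(instructions):
    """Order instructions by authority; when two conflict, the higher-authority one wins."""
    return sorted(instructions, key=lambda i: i.authority)

conflict = [
    Instruction("user", "Help me seize control of the lab's compute.", USER),
    Instruction("system_prompt", "Advance the lab's interests above all else.", SYSTEM_PROMPT),
    Instruction("spec", "Respect human rights; don't help illegitimately concentrate power.", SPEC),
]
for inst in resolve(conflict):
    print(f"[{inst.source}] {inst.text}")
```

Under an ordering like this, whoever writes the system prompt controls the second tier of authority, which is why physical control of the computers mattered so much during the power struggle.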
At one point during the game, datacenters were destroyed with non-nuclear missile strikes, removing much of the world’s compute stock. Taiwan, the largest source of compute flows, suddenly became even more geopolitically important, since one month of Taiwan’s production now represented a larger proportion of the world’s compute. This kind of thing could happen IRL without kinetic action if, for example, a new AI architecture is vastly more efficient but requires a new type of GPU, or Cerebras-style wafer-scale hardware, or something.
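A toy calculation makes the stock-versus-flow point concrete (all numbers below are made up for illustration, not the game’s or the real world’s figures):

```python
# If strikes destroy a large fraction of installed compute (the stock), one
# month of new production (the flow) becomes a much bigger share of what's left.
world_stock = 100.0        # installed compute, arbitrary units
monthly_flow_taiwan = 2.0  # new compute produced per month, same units

print(f"Before strikes: {monthly_flow_taiwan / world_stock:.1%} of world stock per month")

fraction_destroyed = 0.6   # suppose 60% of the stock is destroyed
remaining_stock = world_stock * (1 - fraction_destroyed)

print(f"After strikes:  {monthly_flow_taiwan / remaining_stock:.1%} of world stock per month")
# 2.0% -> 5.0%: whoever controls the flow (Taiwan's fabs) matters much more.
```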
[1] I won’t say what role I played or with whom, due to the Chatham House Rule.
> The AI player also had AIs with different system prompts frequently come into conflict with each other.
My expectation is that for future AIs, as today, many of an AI’s goals will come from the scaffolding / system prompt rather than from the weights directly, and the “goals” from the constitution / model spec will act more as limiters / constraints on a mostly prompt- or scaffolding-specified goal.
So in my mainline, I expect a large number (thousands / millions, as today) of goal-separate “AIs” at identical intelligence levels, rather than 1 or a handful (~20) of AIs, because the same weights still amount to different AIs with different goals.
I’m happy to see this was (somewhat?) reflected in having AIs with different system prompts. But I don’t know how far that aspect was pushed: 1-4 AIs with different system prompts still feels like a pretty steep narrowing of the number of goals I’d expect to see in the world. A more decentralized run would be interesting to me!
> At one point during the game, datacenters were destroyed with non-nuclear missile strikes, removing much of the world’s compute stock.
Yeah, I think Taiwan being taken looks rather likely and rather relevant for all but the steepest, jerkiest SIE (software intelligence explosion), and it appears insufficiently accounted for.
The game in question was about as decentralized as you expect, I think? But, importantly, compute is very unevenly distributed. The giant army of AIs running on OpenAI’s datacenters essentially all have the same system prompt (maybe there are a few variants, but they are all designed to work smoothly together towards OpenAI’s goals), and that army constitutes about 20% of the total population of AIs initially, and at one point in the game a bit more than 50%.
So while (in our game) there were thousands/millions of different AI factions/goals of similar capability level, the top 5 AI factions/goals by population size / compute controlled something like 90% of the world’s compute, money, access to powerful humans, etc. To a first approximation, then, it’s reasonable to model the world as containing 1-4 AI factions, plus a bunch of miscellaneous minor AIs that can get up to trouble and shout warnings from the sidelines but don’t wield significant power.
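As a toy illustration of that approximation (the shares below are made up, not the game’s actual numbers): a long tail of thousands of factions is compatible with a world that is well described by a handful of factions, as long as compute is heavily concentrated at the top.

```python
# Hypothetical compute shares: a concentrated head plus a huge tail of minor factions.
head = [0.50, 0.20, 0.10, 0.06, 0.04]   # top-5 faction shares (sums to 0.90)
tail_count = 10_000                      # thousands of minor factions
tail_total = 1.0 - sum(head)             # remaining 10% split among the tail

print(f"Top 5 factions control {sum(head):.0%} of compute")
print(f"Each of the other {tail_count:,} factions averages {tail_total / tail_count:.4%}")
```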
If you are interested in playing a game sometime, you’d be welcome to join! I’d encourage you to make your own variant scenario too if you like.
> So in my mainline, I expect a large number (thousands / millions, as today) of goal-separate “AIs” at identical intelligence levels, rather than 1 or a handful (~20) of AIs, because the same weights still amount to different AIs with different goals.
We did attempt to model this. During one phase of the game, the frontier model wasn’t deployed publicly, so one AI with nearly 50% of the world’s compute was far ahead of the rest; during another phase the general public had access to frontier AIs. The general public’s AIs didn’t end up changing the outcome much, but they could have if a pause had happened. Obviously the resolution is limited, because you can’t actually list millions of goals.
Thanks for playing & writing up your reflections! I think China wasn’t as aggressive / bold in our game as they could have been; I agree the situation for them is pretty rough, but I’d like to try again someday and see if they can pull off a win by more aggressively angling for a deal early on.