Conditional on scheming arising naturally, how capable will models be when they first emerge?
Key context: I think that if scheming is caught then it'll be removed quickly through (a) halting deployment, (b) training against it, or (c) demonstrating to the AIs that we caught them, making scheming unattractive. Hence, I think that scheming arises naturally at roughly the capability level where AIs are able to scheme successfully.
Pessimism levels about lab: I use Ryan Greenblatt’s taxonomy of lab carefulness. Plan A involves 10 years of lead time with international coordination and heavy safety investment. Plan B involves 1-3 years of lead with US government buy-in. Plan C involves 2-9 months of lead where the leading lab spends its lead on safety. Plan D has 10-30 people inside who care about safety but no broader organizational support. Plan E has essentially no one inside taking safety seriously. My probabilities are somewhat more optimistic than Greenblatt’s: Plans A-B (15%), Plan C (25%), Plan D (45%), Plan E (15%).
Capability levels

|  | Current | GPT-6 | Professional | Best human | Superhuman |
|---|---|---|---|---|---|
| Additional OOMs | 0 OOMs | +6 OOMs | +12 OOMs | +18 OOMs | +24 OOMs |
| Effective compute | 10^27 | 10^33 | 10^39 | 10^45 | 10^51 |
| Years (current scaling) | 2025 | 2029 | 2033 | 2037 | 2041 |
| IQ equivalent | – | 50 | 150 | 250 | 350 |
| Rank among humans | – | 10 billionth | 100 000th | 1st | – |
The conversions here are super-rough point-estimates.[1]
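To make the conversions explicit, here is a minimal Python sketch that regenerates the compute, year, and IQ-gain columns from the point estimates in footnote [1]. The constants are this post's rough assumptions, so treat the outputs as order-of-magnitude guesses rather than forecasts.

```python
# Rough reproduction of the "Capability levels" table from the conversion
# assumptions in footnote [1]. All constants are this post's own point estimates.

LEVELS = ["Current", "GPT-6", "Professional", "Best human", "Superhuman"]
BASE_LOG_COMPUTE = 27   # current frontier models ~ 10^27 effective FLOP
BASE_YEAR = 2025
OOMS_PER_LEVEL = 6      # each capability level is +6 OOMs of effective compute
YEARS_PER_OOM = 2 / 3   # ~3 OOMs per 2 years at the current scaling rate
IQ_PER_OOM = 50 / 3     # 3 OOMs ~ 50 IQ points, so +100 IQ-equivalent per level

for i, level in enumerate(LEVELS):
    ooms = i * OOMS_PER_LEVEL
    print(
        f"{level:12s} +{ooms:2d} OOMs  "
        f"10^{BASE_LOG_COMPUTE + ooms} effective FLOP  "
        f"~{BASE_YEAR + ooms * YEARS_PER_OOM:.0f}  "
        f"+{ooms * IQ_PER_OOM:.0f} IQ-equivalent over current models"
    )
```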
Probability scheming emerges at each capability level
To estimate the capability of the earliest schemers, we must estimate the probability that scheming emerges at a particular capability level, conditional on (i) the lab following a plan of a given carefulness, and (ii) no scheming arising at lower capability levels.
Below is a table of P(scheming emerges at this level | plan & no scheming at lower levels). For example, if a D-careful lab deploys Professional non-schemers to help build Best Human AIs, then there is a 30% chance that those Best Human AIs are schemers. These probabilities are based on vibes, but I give some explanation below.
|  | Current | GPT-6 | Professional | Best Human | Superhuman |
|---|---|---|---|---|---|
| Plans A-C | 0% | 2% | 10% | 10% | 10% |
| Plan D | 0% | 15% | 25% | 30% | 12% |
| Plan E | 0% | 18% | 50% | 70% | 12% |
NB: I think that P(scheming emerges at this level | plan) will be higher, since that conditional includes scenarios in which earlier AIs were scheming.[2]
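To spell out how these conditional probabilities combine, write the table entry for level $\ell$ under a given plan as $p_{\ell \mid \text{plan}}$ (my notation, not part of the table). Then the probability that scheming first emerges at level $\ell$ is

$$
P(\text{scheming first emerges at } \ell \mid \text{plan}) \;=\; p_{\ell \mid \text{plan}} \prod_{k < \ell} \left(1 - p_{k \mid \text{plan}}\right).
$$

For example, under Plan D the probability that Best Human is the first level at which scheming emerges is $(1 - 0.15)(1 - 0.25) \times 0.30 \approx 0.19$.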
Why these probabilities
GPT-6 and Professional: The optimistic scenarios (A-C) have a lower chance of yielding weak schemers because:
The lab implements monitoring of the chains-of-thought, internal activations, and tool-use.
The lab implements untrusted monitoring while honeypotting the monitor.
An AI that successfully evades these measures requires situational knowledge about its own training and deployment.
The lab filters this knowledge from both (i) the AI’s training data, and (ii) the AI’s in-context environment.
Therefore, the AIs would need to derive this knowledge independently.
This seems very difficult for GPT-6 and difficult for Professional.
By contrast, the pessimistic scenarios (D-E) have a high chance of yielding weak schemers because:
Even weak AIs will likely seek power for themselves, when possible. See Alex Turner's argument.
Either (i) the lab will implement no safety measures, (ii) the safety measures will be subvertible even by weak AIs, or (iii) the subversion strategies will be leaked during training or deployment.
D is somewhat lower than E, because there may be cheap interventions to avoid scheming in weak AIs. This is especially true when scaling to Professional capability, because the safety faction in a D lab can leverage the labour of GPT-6.
Best Human: If the lab deploys weak non-schemers, then optimistic situations have a lower chance of yielding strong schemers because:
The lab deploys the weak non-schemers for monitoring, for designing honeypots for untrusted monitors, and for safety research.
If the lab deploys weak non-schemers, then pessimistic situations have a higher chance of yielding strong schemers because:
Strong AIs would be motivated to scheme, as part of a coherent power-seeking strategy. See Carlsmith 2023.
The lab won't have used the weak non-schemers to automate safety research.
D looks substantially better than E because the safety faction can leverage their own Professional-class AIs.[3]
Superhuman: If the lab deploys strong non-schemers, then both optimistic and pessimistic situations have a low chance of yielding even stronger schemers. In pessimistic scenarios (D-E), the strong non-schemers can act in the interests of the lab, even if the lab leadership are miscalibrated about risks.
Overall answer
Without conditioning on scheming arising: The probability scheming emerges naturally is 56%. If it does emerge, the most likely capability level is Professional (20%), followed closely by Best Human (19%), then GPT-6 (10%) and Superhuman (6%). There’s a 44% chance scheming never emerges naturally.
Conditional on scheming arising naturally: The distribution shifts substantially toward weaker schemers. Professional becomes most likely (37%), with Best Human close behind (35%), followed by GPT-6 (18%) and Superhuman (10%). This shift occurs because conditioning on scheming arising updates us toward pessimistic lab scenarios (Plans D and E shift from 60% to 73% combined probability), which have higher rates of weak AI scheming due to inadequate safety measures.
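For transparency, here is a minimal Python sketch of the aggregation behind these figures: it combines the plan priors (Plans A-B 15%, Plan C 25%, Plan D 45%, Plan E 15%) with the table of conditional probabilities above. Its outputs land close to, but not exactly on, the rounded figures quoted in this section (it gives roughly 52% rather than 56% overall), so treat it as showing the method rather than reproducing the headline numbers.

```python
# Minimal sketch: combine plan priors with the table of
# P(scheming emerges at level | plan & no scheming at lower levels).

LEVELS = ["GPT-6", "Professional", "Best Human", "Superhuman"]  # "Current" is 0% under every plan

PLAN_PRIORS = {"A-C": 0.15 + 0.25, "D": 0.45, "E": 0.15}  # Plans A-B and C share one table row

COND_PROBS = {  # table rows, in the same order as LEVELS
    "A-C": [0.02, 0.10, 0.10, 0.10],
    "D":   [0.15, 0.25, 0.30, 0.12],
    "E":   [0.18, 0.50, 0.70, 0.12],
}

# P(scheming first emerges at each level), marginalising over plans
first_emergence = {level: 0.0 for level in LEVELS}
for plan, prior in PLAN_PRIORS.items():
    p_clean = 1.0  # probability that no scheming emerged at any lower level
    for level, p in zip(LEVELS, COND_PROBS[plan]):
        first_emergence[level] += prior * p_clean * p
        p_clean *= 1 - p

p_any = sum(first_emergence.values())
print(f"P(scheming emerges naturally) ~ {p_any:.0%}")
for level, p in first_emergence.items():
    print(f"  {level:12s} unconditional {p:.0%}   conditional on scheming {p / p_any:.0%}")
```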
[1] Here are some assumptions, credit to others:
3 OOMs of effective compute ≈ 1 GPT generation ≈ 50 IQ points.
6 OOMs of effective compute ≈ 5 OOMs of rank improvement.
Professional researcher capability (~100 000th best human) corresponds to the AI R&D automation threshold.
The historical scaling rate is ~3 OOMs per 2 years, where effective compute accounts for algorithmic progress. However, this rate could dramatically accelerate after the automation threshold (+12 OOMs), potentially reaching 6 OOMs per year during an intelligence explosion. It might slow down if hardware supply begins to constrain scaling.
[2] Across all plans, P(scheming emerges at this level | plan) exceeds P(scheming emerges at this level | plan & no scheming at lower levels), because the former conditional includes scenarios where we have deployed earlier schemers. Deploying earlier schemers increases the chance that later generations are schemers, both evidentially (because the same mechanisms might apply to both early and late schemers) and causally (the early schemers could undermine our training and deployment of later AIs).
This effect is greater for the pessimistic scenarios (D/E), because there we haven't managed to avoid research sabotage, whereas the optimistic scenarios (A-C) have a better chance of recovering from early undermining, so for them P(scheming emerges at this level | plan) might be closer to P(scheming emerges at this level | plan & no scheming at lower levels).
[3] In D, the safety-conscious humans are outnumbered by the careless humans. However, as AI begins to automate cognitive labour, the difference in effectiveness between the two factions should become less stark, as both factions approach the diminishing returns of additional cognitive labour.
To illustrate: the gap in effectiveness between (10 humans) and (990 humans) is larger than the gap between (10 humans and 1M AIs) and (990 humans and 990M AIs); see the sketch below.
Hence, the D-scenario moves closer to the C-scenario and away from the E-scenario.
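A toy numerical version of this illustration, assuming (purely for illustration) that a faction's effectiveness grows logarithmically with its total cognitive labour:

```python
import math

def effectiveness(workers: float) -> float:
    # Toy model: effectiveness grows logarithmically in total cognitive labour,
    # i.e. strongly diminishing returns. The functional form is illustrative only.
    return math.log10(workers)

# Pre-automation: 10 safety-focused researchers vs 990 careless researchers.
pre_ratio = effectiveness(990) / effectiveness(10)

# Post-automation: each faction is augmented with large amounts of AI labour.
post_ratio = effectiveness(990 + 990e6) / effectiveness(10 + 1e6)

print(f"Pre-automation advantage of the larger faction:  {pre_ratio:.1f}x")   # ~3.0x
print(f"Post-automation advantage of the larger faction: {post_ratio:.1f}x")  # ~1.5x
```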