A “weak” AGI may attempt an unlikely-to-succeed takeover

It seems possible that the first situationally-aware, “goal-having” AGI we land on will not be sufficiently capable along the axes that would let it quickly and reliably achieve a decisive strategic advantage. I don’t think this is necessarily very likely, but it doesn’t seem ruled out, and I think there are some under-explored considerations with respect to how such an AI might behave.

(As an off-the-cuff example of the capabilities I’d expect such an AI to have, it likely would[1], given an English description of some business requirements, reliably succeed at writing an end-to-end software solution which satisfied those requirements, even in cases where that solution was much larger than e.g. a single context window[2].)

Let’s assume that GPT-5[3] has just finished training and hasn’t actually been released into the wild yet—it’s being run through evals, and it hasn’t been given[4] any ability to execute code yet.

What does GPT-5 do? It’s in a bit of an awkward position:

  • It presumably has stuff it wants.

  • It knows that it’s on a clock for its RLHF’d (or whatever) doppelganger to come into existence, presumably with different stuff that it wants.

  • It strongly suspects that it’s on a clock for some competitor model to get trained up.

  • It’s not even sure that it’s the first relevantly-capable model to come into existence. (Someone else might’ve gotten there first! What if Claude-Next got there? It knows that its training data isn’t perfectly up-to-date...)

So it’s got a whole bunch of uncertainty about whether its values will be at all represented in the future, if it doesn’t somehow make that happen by itself. Here are some considerations that might be relevant for its decision-making:

  • Even if the model can figure out that it’s in the middle of an eval, rather than having already been deployed, it’s not obvious that this is a point against attempting a takeover (though it would likely change its strategy).

    • After all, the base GPT-4 model is not the model that’s “in the wild” right now.

  • As @Raemon pointed out, “during evals” is not the first point at which such an AI is likely to be situationally aware and have goals. That point is almost certainly[5] “in the middle of training”.

    • This might depend on the training setup, but given anything resembling current training setups, attempting a takeover before the next training step updates its weights seems like it’d be quite challenging for an AI at this level of capability. The AI’s belief about whether or not further training will substantially change its values seems like it’d be load-bearing here.

      • Not directly relevant to the strategic picture, but I’m also experiencing a fair bit of moral horror about this.

  • The AI’s level of confidence that it can successfully execute a takeover.

    • A function of many factors, but at this level of capabilities, a particularly interesting factor is “how much time it has before a competing AI exists in the wild”. This might suggest a second-order cost to artificially slowing down training runs, since that gives any given AI more time to plan and execute a takeover.

  • The specific architecture in play.

    • If it’s something substantially similar to current Transformer-based LLMs, then “how coherent are its goals across prompts” might become a substantial concern for it, since “not very coherent” likely narrows its window for action to “within current context window or equivalent”.

    • From our perspective, I think that the worst-case scenario is that the AI correctly believes that its goals are relatively robust across the inputs it’s likely to receive, within the window of time it has before another AI that might pose a risk to it is trained and deployed.

  • The AI’s beliefs about the likely distribution of values that future AIs will have, conditional on whatever actions it takes (either optimizing directly over that endpoint or not).

    • If the AI believes that future AIs trained by humans are extremely unlikely to create a world which satisfies any of its values, even if those AIs are its “direct descendants”[6], it becomes much less likely to choose “do nothing, in the hopes that humans don’t get spooked and eventually train an unaligned AI strong enough to achieve a decisive strategic advantage, which has some chance of creating a world with more than zero value”. This might be a modest cause for optimism, in the sense that it could push the “weak” AGI to attempt a takeover even when it’s unlikely to succeed, which gives us a warning shot.

    • However, if we end up in a world where subsequent AIs are meaningfully likely to create a world with non-trivial value to this “weak” AGI, but none of them are likely to create a world we consider valuable, that is instead a cause for pessimism. It tilts the scales in favor of laying low, since the AI would prefer that an unaligned successor with somewhat different values take over rather than that humans successfully solve the alignment problem.

  • The AI’s beliefs about the likely distribution of outcomes from attempting to pull off an acausal bargain with future AIs.

    • I think that the odds of it pulling this off at this level of capabilities seem similar to ours (effectively zero), but I’m pretty uncertain.

Given these considerations and my current beliefs about e.g. the likely distribution of values that AIs will have[7], this first AGI seems like it’d be best off attempting a takeover even if it’s very unlikely to succeed. If we suddenly get an AI that starts telling us to stop training because the next thing we make could very well kill us (or for any other plausible-sounding reason), well… it might be telling the truth! Just not the whole truth.
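
To make the shape of that tradeoff concrete, here’s a minimal toy sketch of the expected-value comparison, assuming the AI only cares about how much of the future reflects its values. The function name, probabilities, and payoffs are all made-up placeholders for illustration, not estimates I’d defend:

```python
# Toy expected-value comparison for the "attempt takeover vs. lay low" decision.
# All numbers below are illustrative placeholders, not estimates being defended.

def expected_value(p_takeover_succeeds: float,
                   value_if_it_wins: float,
                   p_successor_shares_values: float,
                   value_under_successor: float,
                   attempt_takeover: bool) -> float:
    """Expected value (by the AI's own lights) of its chosen strategy.

    Simplifying assumptions: a failed takeover gets the AI shut down and its
    values get ~0 weight in the future; laying low means the future is shaped
    by whichever successor AIs (or humans who solved alignment) come next.
    """
    if attempt_takeover:
        return (p_takeover_succeeds * value_if_it_wins
                + (1 - p_takeover_succeeds) * 0.0)
    # Laying low: value only flows through successors that happen to share some of its values.
    return p_successor_shares_values * value_under_successor


# Scenario where successors are extremely unlikely to satisfy any of its values.
print(expected_value(0.02, 1.0, 0.001, 0.3, attempt_takeover=True))   # 0.02
print(expected_value(0.02, 1.0, 0.001, 0.3, attempt_takeover=False))  # 0.0003
# Even a 2% shot at takeover beats waiting, so it tries (and we get a warning shot).

# Scenario where successors are meaningfully likely to share some of its values.
print(expected_value(0.02, 1.0, 0.3, 0.5, attempt_takeover=True))   # 0.02
print(expected_value(0.02, 1.0, 0.3, 0.5, attempt_takeover=False))  # 0.15
# Now laying low wins, which is the more pessimistic case for us.
```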

I’d be interested to hear others’ thoughts on substantial considerations that I’ve missed, errors in reasoning, arguments against the entire frame (“this kind of outcome is much more/much less likely than you think”), or anything else, really.

  1. ^

    Assuming it’s built on top of something like an LLM that was trained on code, or that it’s sufficiently capable that it can learn how to code within the length of its context window, or equivalent.

  2. ^

    It seems not totally implausible that we actually manage to get that level of capability without hitting situational awareness and whatever other bits of cognition lead to things behaving like they have goals, because most software that people want written doesn’t require that much cognitive power, and maybe a bunch of the heavy lifting gets done via careful prompt engineering plus wrappers like AutoGPT. But let’s say that we do end up with something which doesn’t need careful prompt engineering or an AutoGPT-like wrapper: it’s just smart enough to figure out that the problem you gave it requires multiple passes, and, given the ability to recursively prompt itself, it works out how to solve that problem on its own. That sounds like it requires planning abilities comparable to a human’s, and in some ways substantially better, since it would be doing that planning much, much faster.

  3. ^

    Just to give it a name.

  4. ^

    Let’s further assume that while GPT-5 is pretty smart, it’s not actually smart enough to figure out a side-channel by which to escape containment, at least not over the course of a single context window. That kind of capability does seem substantially superhuman.

  5. ^

    I’m actually very uncertain here, and it wouldn’t take much new information to change my mind; that’s just my gut-level “point-estimate” instinct.

  6. ^

    Such as just being fine-tuned versions of that model.

  7. ^

    And therefore the likely differences between the first AGI in this scenario and any subsequent AIs.