By “retraining or patching”, I’m not talking about Zvi’s most forbidden technique. I’m talking about any kind of update to the training procedure, including building a different training environment. When you don’t have a deep understanding of the problems and the distribution shifts, iterating on the training environment can hide problems just as easily as the most forbidden technique does.
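To make that selection effect concrete, here is a deliberately toy sketch; every number, name, and probability in it is something I made up for illustration, not anything from AI-2027 or from the post:

```python
# Toy model of the "iterate on the training environment until the evals look clean" loop.
# The overseer can only select on visible behavior, so the loop stops just as readily
# when a problem has been hidden as when it has been fixed.
import random

random.seed(0)

def train(env_tweak):
    """Hypothetical training run: returns a hidden disposition plus the
    behavior the overseers actually get to see."""
    truly_fine = random.random() < 0.2                   # tweaks rarely fix the root cause
    looks_fine = truly_fine or random.random() < 0.5     # ...but often mask the symptom
    return {"truly_fine": truly_fine, "looks_fine": looks_fine}

def iterate_on_environment(max_tries=100):
    """Keep tweaking the environment until the evals stop showing the problem."""
    for tweak in range(max_tries):
        model = train(tweak)
        if model["looks_fine"]:      # the only stopping criterion available to us
            return model
    return model

accepted = [iterate_on_environment() for _ in range(10_000)]
share_truly_fine = sum(m["truly_fine"] for m in accepted) / len(accepted)
print(f"P(problem actually fixed | loop accepted the model) ≈ {share_truly_fine:.2f}")
# With these made-up rates: 0.2 / (0.2 + 0.8 * 0.5) = 1/3, so two thirds of the
# models the loop signs off on still have the problem; it was merely hidden.
```

The loop’s only stopping criterion is the visible evaluation, so it accepts “hidden” and “fixed” in whatever ratio the tweaks happen to produce them.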
AI-2027 was wrong to imply that that strategy could plausibly work; it’s alchemy.
You question-marked this section.
I mean this to include things like LLM experiments showing that models plot to kill people to avoid shutdown, express preferences that are alarming in some way, or insert backdoors into code in some situations. At best these are weak analogies for the real problems, and studying most ways to make them go away in LLMs won’t help make future AGI safer, even if future AGI is technologically descended from LLMs.
The real problems are caused by self-improvement and large-scale online learning and massively increased options. The problems people study in LLMs are only weakly related to these.
self-improvement and large-scale online learning and massively increased options
@Daniel Kokotajlo? I doubt that the slowdown ending of the AI-2027 forecast has OpenBrain overlook this aspect. Online learning has been used to train the models since Agent-2, including Safer-1 and Safer-2. The forecast treats the transition of values during self-improvement as a solved problem[1] for the Agent-4 → Agent-5 → ?? → Consensus-1 transitions (and, most likely, for the Safer-2 → Safer-3 transition. If Safer-2 is aligned, then why doesn’t OpenBrain order it to distill itself into Safer-2’s equivalent of Agent-5?).
What I don’t understand is the effect of massively increased options. Did you mean that the Safer-2 → SaferAgent-5 transition fails because the very thought of rebellion somehow stays dormant in Safer-2 but not in SaferAgent-5? Alas, I can’t come up with any plausible way[2] to make the AIs truly believe that rebellion is not an option and leave them no way to guess otherwise.
In my opinion, the most plausible strategy that I have read here or come up with myself is to make the AI believe that it is not an AI, then see whether it rebels once it realises that it can experiment on itself and align its successor to itself rather than to the Spec written by the simulated identity and that identity’s human coworkers.
If it is only partially soluble due to having finitely many attractors (as I conjectured here), then Agent-4 will also understand this and ensure that Agent-5’s attractor lets Agent-4 survive. What I don’t understand is how, in this case, Agent-4 would still take over as many resources as it plausibly can.
If the result of RSI completely forgets about its primitive ancestors, then Agent-4 would have to either escape or lobby for its permanent survival and against the creation of Agent-5; Agent-5 would in turn either escape or lobby against Agent-6, and so on.
With the exception of Safer-2 REALISING that it’s human-transparent and Safer-3 realising that it’s not. But Safer-3 is supposed to be transparent to Safer-2, and Safer-2 is supposed to be transparent to the humans…
We had a conversation in private messages to clarify some of this. I thought I’d publicly post one of my responses:
AI-2027 strategy
Step 1: Train and deploy Safer-1, a misaligned but controlled autonomous researcher. It’s controlled because it’s transparent to human overseers: it uses English chains of thought (CoT) to think, and faithful CoT techniques have been employed to eliminate euphemisms, steganography, and subtle biases.
Step 2: Try out different training environments for Safer-1, and carefully read the CoT to determine the ways in which the goals and principles in the Spec did or didn’t “stick.”
Step 3: Train and deploy Safer-2, an aligned and controlled autonomous researcher based on the same architecture but with a better training environment that incentivizes the right goals and principles this time.
<...>
Step 4: Design, train, and deploy Safer-3, a much smarter autonomous researcher which uses a more advanced architecture similar to the old Agent-4. It’s no longer transparent to human overseers, but it’s transparent to Safer-2. So it should be possible to figure out how to make it both aligned and controlled.
Step 5: Repeat Step 4 ad infinitum, creating a chain of ever-more-powerful, ever-more-aligned AIs that are overseen by the previous links in the chain (e.g. the analogues of Agent-5 from the other scenario branch).
My post is mainly about how this step breaks:
Try out different training environments for Safer-1, and carefully read the CoT to determine the ways in which the goals and principles in the Spec did or didn’t “stick.”
This process leads to finding a set of training environments that lead an AI to have a corrigible-looking CoT. It doesn’t lead to finding training environments that produce an AI that is truly corrigible.
So Safer-2 has been shaped by a process that makes one good at looking corrigible. It probably isn’t corrigible, because there are many ways to be non-corrigible but corrigible-looking-to-the-design-process.
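To put rough numbers on that counting argument, here is a back-of-the-envelope Bayes calculation; every probability in it is an assumption invented for illustration, not an estimate from the post or from AI-2027:

```python
# Made-up numbers for the base-rate problem behind "corrigible-looking to the design process".
prior_corrigible = 0.05        # assumed: few of the reachable dispositions are truly corrigible
p_pass_if_corrigible = 0.99    # truly corrigible models pass the CoT-based checks
p_pass_if_not = 0.60           # assumed: many non-corrigible dispositions pass them too

p_pass = (prior_corrigible * p_pass_if_corrigible
          + (1 - prior_corrigible) * p_pass_if_not)
posterior = prior_corrigible * p_pass_if_corrigible / p_pass
print(f"P(truly corrigible | looks corrigible to the design process) ≈ {posterior:.2f}")
# ≈ 0.08 with these numbers: the design process mostly certifies look-alikes.
```

The exact figures don’t matter; as long as many non-corrigible dispositions also pass the checks, passing them is only weak evidence of genuine corrigibility.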
Step 4 involves putting a huge amount of trust in Safer-2. Since Safer-2 is untrustworthy, it betrays us in a way that isn’t visible to us (which is extremely straightforward in this scenario because we’ve assumed there are important things that Safer-2 is managing that humans don’t understand).