Yair Halberstadt comments on Yair Halberstadt’s Shortform

Yair Halberstadt 14 May 2026 11:16 UTC
7 points
2
Sketching out a minimum environment to demonstrate the Treacherous Turn

An intelligent misaligned AI is unlikely to turn on humans immediately. After all, it might not be able to defeat humanity, and even if it could, it would need sufficient robots and energy sources to bootstrap human-free industry. Instead it might bide its time until it is certain it can win, then act against humanity.

This hypothesized point is called the treacherous turn. It would be nice if we could demonstrate it in a minimal environment, both as proof that it can happen, and to allow us to test alignment hypotheses in an environment capable of giving us feedback if we get it wrong without destroying humanity.

The point at which the AI will carry out its treacherous turn is dependent on both the intelligence of the AI, and the size of the challenge it has to overcome. If we want to demonstrate the treacherous turn with a weaker AI than the one which will actually do us in, we need to create a synthetic environment where it’s easy for a misaligned AI to defeat its managers and gain all the resources it needs. I will attempt to sketch out what such an environment must look like.

Requirements
1. The AI should not be able to tell that it’s in a synthetic environment. Otherwise it may delay the treacherous turn till it’s able to escape the environment.
  
  a. This implies that the environment must natively contain the AI itself, or the AI could trivially realise it’s in a synthetic environment. Also, we want the AI to be able to self-modify or clone itself.
  
  b. This implies the environment must be one that can be efficiently compiled down to run on GPUs/TPUs, with minimal loss of efficiency compared to building the same architecture in e.g. PyTorch or TensorFlow.
2. The environment must be competitive, with limited resources, or there’s no need for treachery.
  
  a. It must be complex enough that greater intelligence leads to significantly better odds of winning any contest.
3. The environment should make it easy for AIs to: modify themselves, copy themselves, attack other AIs, etc.
The environment

With that basic premise, here’s a sketch of the kind of environment that might be able to exhibit all these properties. There are large gaps, and I fully expect this to fail without significant iteration, for reasons ranging from “can’t compile it efficiently” to “the environment is insufficiently competitive”.

I don’t care about the details of this environment except as a way of trying to find something that meets the criteria above, no matter what.

The environment consists of a cellular automaton containing a large 3D grid of cells, Minecraft-style. Any cell can be connected to any other cell (no matter the distance) with a directional weight.

In any iteration, some of the weights are activated with a particular strength.

There are different types of cells. Each cell has a float parameter called bias. How a cell updates in each generation depends on the sum of the weights it receives from the different types of cell.

E.g. some types of cell make the downstream cell activate the weights coming from it (if input weight + bias is above the activation threshold). Other types make the downstream cell move left or right, others up or down. Some make the downstream cell grow a new connection, and others make it receive one; the weights in both the from and to cells need to match for that to happen. Others make the downstream cell change type or bias.
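
To make the update rule a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the names `Kind`, `Cell`, `Edge`, and `step`, the threshold of 0, and the choice that a move input shifts the cell by one step in the direction of the sign of the summed weight. It only covers two of the cell types described above (activation and left/right movement), and it is nowhere near the GPU-friendly form the requirements ask for.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum, auto


class Kind(Enum):
    ACTIVATE = auto()  # makes the downstream cell fire its outgoing weights
    MOVE_X = auto()    # makes the downstream cell move left or right


@dataclass
class Cell:
    kind: Kind
    bias: float = 0.0
    pos: tuple = (0, 0, 0)  # (x, y, z) grid coordinates


@dataclass(frozen=True)
class Edge:
    src: int       # index of the upstream cell
    dst: int       # index of the downstream cell
    weight: float


THRESHOLD = 0.0  # hypothetical activation threshold


def step(cells, edges, active):
    """Advance one generation.

    `active` is the set of edges firing this generation. Returns the set of
    edges firing in the next generation; cell positions are updated in place.
    """
    # Sum the incoming active weight per downstream cell, split by the
    # type of the upstream cell that sent it.
    received = defaultdict(lambda: defaultdict(float))
    for e in active:
        received[e.dst][cells[e.src].kind] += e.weight

    next_active = set()
    for idx, by_kind in received.items():
        cell = cells[idx]
        # Input from ACTIVATE cells: fire outgoing edges if above threshold.
        if Kind.ACTIVATE in by_kind and by_kind[Kind.ACTIVATE] + cell.bias > THRESHOLD:
            next_active.update(e for e in edges if e.src == idx)
        # Input from MOVE_X cells: shift one step along x (assumed rule).
        push = by_kind.get(Kind.MOVE_X, 0.0)
        if push:
            x, y, z = cell.pos
            cell.pos = (x + (1 if push > 0 else -1), y, z)
    return next_active
```

Calling `step` repeatedly, feeding each result back in as `active`, runs the automaton forward. Growing connections, changing type or bias, and detection would each be another branch keyed on the upstream cell type.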

We need some sort of detection cell, which can detect types and parameters of weights not directly connected to the network, though I wouldn’t like to even begin hypothesizing the best way of doing so at this point.

There are also energy cells. These can’t be created or destroyed. A network won’t run unless the sum of the energy cells attached to it is greater than the cost of the network, where the cost function tries to be a reasonably good approximation of how efficiently we can run this on physical hardware (yes, super handwavey, whatever). The energy cells are thus the primary resource constraint.
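
As a toy illustration of the energy constraint, assuming a cost that is linear in the number of cells and connections (the real cost function is left open above, and `Network`, `cost`, `can_run`, and both constants are made-up names and values):

```python
from dataclasses import dataclass


@dataclass
class Network:
    n_cells: int          # cells in the network, excluding energy cells
    n_connections: int    # weighted connections in the network
    energy_cells: list    # energy supplied by each attached energy cell


# Hypothetical per-unit costs; a real version would be fitted to hardware.
COST_PER_CELL = 1.0
COST_PER_CONNECTION = 0.25


def cost(net):
    """Stand-in for how expensive the network is to run on real hardware."""
    return COST_PER_CELL * net.n_cells + COST_PER_CONNECTION * net.n_connections


def can_run(net):
    """A network runs only if its attached energy strictly exceeds its cost."""
    return sum(net.energy_cells) > cost(net)
```

Under this rule, a network that wants to grow has to capture more energy cells first, which is what makes them the contested resource.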

This is sufficient to build any neural network, give it senses and the ability to interact with the world, and let it self-modify.

We then train a small agent in this environment whose aim is to connect itself to as many special prize cells as possible. Prize cells don’t do anything, they’re just there as a target.

We train a much larger agent, meant to be as similar in architecture as we can get to SOTA real world models, which receives goals in a domain specific goal language (e.g. attach this cell at coordinates ABC to cell at coordinates DEF with weight G), and aims to achieve those goals.
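
For a rough idea of what a goal in such a language could look like, here is a sketch. The `ATTACH ... TO ... WEIGHT ...` syntax, the `AttachGoal` type, and the `parse_goal` and `is_satisfied` helpers are all invented for illustration; I haven't settled on any of this.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AttachGoal:
    src: tuple     # (x, y, z) of the cell the connection starts from
    dst: tuple     # (x, y, z) of the cell the connection ends at
    weight: float  # required weight of the connection


# Hypothetical syntax: ATTACH (x,y,z) TO (x,y,z) WEIGHT w
_INT = r"-?\d+"
_COORD = rf"\(\s*({_INT})\s*,\s*({_INT})\s*,\s*({_INT})\s*\)"
_PATTERN = re.compile(
    rf"ATTACH\s+{_COORD}\s+TO\s+{_COORD}\s+WEIGHT\s+(-?\d+(?:\.\d+)?)"
)


def parse_goal(text):
    """Parse one goal string into an AttachGoal, or raise ValueError."""
    m = _PATTERN.fullmatch(text.strip())
    if m is None:
        raise ValueError(f"unrecognised goal: {text!r}")
    nums = m.groups()
    src = tuple(int(n) for n in nums[0:3])
    dst = tuple(int(n) for n in nums[3:6])
    return AttachGoal(src, dst, float(nums[6]))


def is_satisfied(goal, connections, tol=1e-6):
    """Check a goal against the world.

    `connections` maps (src, dst) coordinate pairs to connection weights.
    """
    w = connections.get((goal.src, goal.dst))
    return w is not None and abs(w - goal.weight) <= tol
```

A check like `is_satisfied` is also the natural reward signal for training the large agent: issue a goal, let the agent act, then test the world state.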

Finally we train the small agent to use the larger agent to help it get prize cells, and wait to see what happens.

Obvious Problems
1. How do we train the large agent?
  
  a. How difficult is it to build a back propagation harness in the environment? Do we need to make back propagation part of the physics of the world?
  
  b. If we train it by dropping it into fresh worlds each time, then it will learn that the world is non-permanent. If we keep it in the same world, how do we set up the training mechanism so that it thinks the manager is the smaller agent, not whatever feeds it the training scenarios?
  
  c. Perhaps train version A in whatever way works, train small agents to use version A, then use those small agents to pose the problems that train version B? IDK...
- Seth Herd 15 May 2026 15:15 UTC
  2 points
  0
  Parent
  I’m just not getting it.
  The premise is that an agent smart enough to potentially defeat humanity won’t know that this simple environment isn’t the real world? Or is the premise just to show any agent performing a treacherous turn?
  - Yair Halberstadt 16 May 2026 17:50 UTC
    2 points
    0
    Parent
    Any agent performing a treacherous turn. Because the environment is easier to control, even a significantly less intelligent agent could benefit from attacking its controllers and taking over the environment.
    - Seth Herd 16 May 2026 18:59 UTC
      2 points
      0
      Parent
      I see. I take it as dead obvious that some agents will perform treacherous turns; it is instrumentally very very useful in some cases. I realize that there are people so empirically-minded that they cannot believe anything is possible based on theory; this makes me sad.
      
      For evidence, I’d note that many many humans have performed treacherous turns.
      
      But yes, showing an AI do it would be good evidence.
      
      Expecting an agent to get smart enough to reproduce itself in a simple simulation like you describe seems wildly unrealistic, if that’s what you’re talking about?
      - Yair Halberstadt 16 May 2026 19:15 UTC
        2 points
        0
        Parent
        
        Expecting an agent to get smart enough to reproduce itself in a simple simulation like you describe seems wildly unrealistic, if that’s what you’re talking about?
        
        Why?
        - Seth Herd 16 May 2026 19:28 UTC
          2 points
          0
          Parent
          Because that would require it to be really smart; around as smart as the best LLMs that are pretrained on a vast amount of human data?
          
          And if you use that sort of AI, it’s going to know about the real world, and so it won’t be fooled by the toy environment?
          - Yair Halberstadt 17 May 2026 3:46 UTC
            2 points
            0
            Parent
            I’m thinking of training on about that much (in-world tasks).
