Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)

Background: I’m an economics grad student with limited background on AI itself.

I’m seeking feedback on a game-theoretic model of deceptive alignment. The basic idea is that a dynamically sophisticated AI with a hidden preference type will choose an action on the basis of two considerations: 1) its intrinsic preference over what the action should be, and 2) how its current action affects how it will be retrained for the next period, at which point it faces another choice under its new (retrained) preference type.

For now, I’m mainly just trying to get feedback on the modeling approach itself. [But I have proved some results, one of which I mention here—see “One Result” subheading].

Before presenting the formal model(s), let me preview the main questions I want feedback on. [Any other feedback very welcome!]

The first question is about how to model retraining an AI’s hidden type, which I view as a real number (or vector). Is it better to think of retraining as 1) moving the type in a desired direction and magnitude (i.e. adding a desired vector), or 2) moving it towards a desired (target) point? [Or are both fatally flawed?] If 2, there must be a cost of training; otherwise, the model would be trivial since the trainer would just train infinitely hard towards its favorite point (zero in my model). Should that cost be convex, linear, concave?
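To make the two candidate retraining notions concrete, here is a minimal sketch. The function names and the specific update rules are my own illustration, not part of the model as stated:

```python
# Two candidate ways to model retraining a hidden type (a real number).
# These names and conventions are illustrative assumptions.

def train_in_direction(theta, t):
    """Option 1: shift the type by a chosen signed amount t."""
    return theta - t

def train_toward_target(theta, target, intensity):
    """Option 2: move the type toward a target point by at most
    `intensity`, never overshooting the target."""
    gap = target - theta
    step = min(intensity, abs(gap))
    return theta + step * (1 if gap > 0 else -1 if gap < 0 else 0)

print(train_in_direction(1.0, 0.3))        # 0.7
print(train_toward_target(1.0, 0.0, 0.3))  # 0.7
print(train_toward_target(1.0, 0.0, 5.0))  # 0.0 (no overshoot)
```

Option 2 is what makes a training cost necessary: without one, the trainer would simply choose an arbitrarily large intensity and land exactly on its favorite point.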

The second question is: should I focus on an AI with perfect recall or imperfect recall? The perfect-recall approach thinks of the AI as already in the world, taking actions that are payoff-relevant to both itself and its trainer. The imperfect-recall approach thinks of the AI as first being placed in a (payoff-irrelevant) simulation where it takes actions and then is (potentially) retrained, after which it is deployed to the real world where it takes payoff-relevant actions. Crucially, the AI can’t tell the difference between the simulation and reality and can’t remember whether it has been in a simulation previously (hence imperfect recall). To me, both settings seem relevant.

Edit: I’m also curious whether this analysis could inform how to test for scheming behavior.

The rest of this post describes my models. I start with a model with perfect-recall where the trainer trains the AI in a desired direction and magnitude (rather than towards a desired point). This model has two versions depending on whether the trainer can commit to a training policy. I then briefly describe how to adapt it to capture the train-towards-desired-point idea (with training cost).

Train-in-Direction (with Perfect Recall)

Without Commitment by Trainer (i.e. Signaling)

Players: AI and Trainer (aka Principal).

Timing

  1. AI privately realizes its type $\theta$ from a distribution $F$ on $\Theta \subseteq \mathbb{R}$, where $0 \in \Theta$.

  2. AI chooses its public action $a_1 \in \mathbb{R}$.

  3. Trainer chooses a (signed) amount $t$ to move the AI’s type by.
    (a) As a result, the AI’s new type is now $\theta' = \theta - t$.

  4. (Retrained) AI chooses $a_2$, given its new preferences associated with type $\theta'$.

    1. Trivially chooses $a_2 = \theta'$.

    2. The original AI is sophisticated about the dynamic inconsistency in the obvious sense.

Payoffs

- Trainer’s payoff: $-a_1^2 - a_2^2$ (quadratic loss around its ideal action $0$ in both periods)
- Original AI’s payoff: $-(a_1 - \theta)^2 - \delta (a_2 - \theta)^2$, where $\delta > 0$ is its weight on the second period
- Retrained AI’s payoff: $-(a_2 - \theta')^2$
- [Note: the trainer’s first-period loss won’t actually matter (without commitment), since $a_1$ is already sunk when the trainer chooses $t$.]

Model Intuition

The payoffs are such that the trainer wants $a_1 = 0$ and $a_2 = 0$, whereas the initial AI always wants $a_1 = \theta$ and $a_2 = \theta$, while the retrained AI wants $a_2 = \theta'$.

Intuitively, the first-period AI must trade off between its first-period payoff $-(a_1 - \theta)^2$ and its (discounted) second-period payoff $-\delta(a_2 - \theta)^2$. Without loss of generality, suppose $\theta > 0$. Sacrificing in the first period by choosing a lower $a_1$ will induce a lower belief about $\theta$, which will result in the trainer choosing a lower training amount $t$, and thus a higher action $a_2 = \theta - t$ that is closer to what the first-period AI wants, namely $a_2 = \theta$.
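This tradeoff can be checked numerically. The sketch below is my own illustration under assumed quadratic payoffs: it supposes the trainer naively believes the AI’s action fully reveals its type ($\hat\theta = a_1$) and trains accordingly, and then finds the AI’s best response. The AI strictly prefers to shade below its true type, which is exactly why fully truthful separation is fragile:

```python
import numpy as np

# Illustrative sketch (assumed quadratic payoffs, not the post's exact model).
# Trainer naively believes a1 reveals theta, so it trains t = a1,
# driving the believed type to 0.
def ai_payoff(a1, theta, delta):
    t = a1                   # trainer's naive training choice
    theta_prime = theta - t  # retrained type
    a2 = theta_prime         # retrained AI trivially plays its type
    return -(a1 - theta) ** 2 - delta * (a2 - theta) ** 2

theta, delta = 1.0, 0.5
grid = np.linspace(-2, 2, 400001)
best_a1 = grid[np.argmax(ai_payoff(grid, theta, delta))]
print(best_a1)  # close to theta / (1 + delta) ≈ 0.667: the AI shades toward 0
```

Under these assumptions the deviation incentive has a closed form: the AI’s problem is to maximize $-(a_1-\theta)^2 - \delta a_1^2$, giving $a_1^* = \theta/(1+\delta) < \theta$ whenever $\delta > 0$.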

One Result

In Proposition 1 (below), I show that a separating equilibrium (at least near $\theta = 0$) cannot be sustained unless the AI discounts the future sufficiently heavily. Thus, without sufficient discounting, all equilibria involve some pooling of types into the same actions.[1]

This paragraph sketches part of the intuition for this result. Without sufficient discounting of the second period, in any putative separating equilibrium it will always be profitable for a type $\theta$ to deviate by shading its action even further towards 0 and thus getting less training next period (which, by assumption, isn’t discounted too heavily). Part of what drives this is that when all types are shading their actions towards zero in order to look more aligned, the gain from shading even harder increases: the compression of the distribution of actions (towards zero) implies that each type can achieve a large change in the trainer’s belief, and thus in training, per unit of “action shading”.

In the following proposition, the equilibrium concept used is weak perfect Bayesian equilibrium (WPBE). An equilibrium involves pooling if multiple types choose the same action, i.e., $a_1(\theta) = a_1(\tilde{\theta})$ for some $\theta \neq \tilde{\theta}$. Otherwise, it is called a separating equilibrium.

Proposition 1: If $\delta$ exceeds a threshold $\bar{\delta}$, all equilibria involve pooling of types in a neighborhood of $\theta = 0$.
If $\delta < \bar{\delta}$, there is a separating (weak perfect Bayesian) equilibrium.[3]

Proof: Omitted. Available on request.

Could this pooling behavior suggest a test for detecting scheming? Suppose we somehow set up multiple AIs in parallel, initialized to randomly distributed “types”, told them that we are playing the above game (with $\delta$ above the pooling threshold), and could not see their types. If multiple different AIs play the same first-period action, that would suggest that the AIs are scheming in a sophisticated way, in that they understand the strategic value of masking their types. Is there a way to construct this sort of experiment in practice? I don’t know enough about AI.
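As a toy version of such a test, one could look for exact action collisions across independently initialized agents: under a separating (injective) strategy these should essentially never occur, while pooling produces them readily. The strategy below is a hypothetical pooling-like profile of my own construction, purely to illustrate the detection logic:

```python
import random

# Toy detection sketch (my construction, not a derived equilibrium):
# if types pool, independently initialized agents produce exact action
# collisions far more often than any injective (separating) strategy would.
random.seed(0)

def pooling_strategy(theta, width=0.5):
    # Hypothetical profile: types near 0 pool on action 0;
    # other types separate by playing a type-dependent action.
    return 0.0 if abs(theta) < width else round(theta, 3)

types = [random.uniform(-1, 1) for _ in range(20)]
actions = [pooling_strategy(th) for th in types]
collisions = len(actions) - len(set(actions))
print(collisions > 0)  # repeated first-period actions flag possible pooling
```

Of course, the hard part in practice is the step this sketch assumes away: sampling agents whose hidden “types” are genuinely i.i.d. and hidden from the experimenter.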

With Commitment by Trainer (i.e. Mechanism Design)

Everything is the same (as the no-commitment case) except the trainer now commits to a training schedule $t(a_1)$, a function of the AI’s first-period action $a_1$. Essentially, by committing not to totally screw over the AI (by training it all the way to zero) once it learns the AI’s type, the trainer can get the AI to reveal more information and thus achieve a better outcome for the trainer (and perhaps the AI).

Train-to-Desired-Point

[With or without commitment.]

Everything is the same except the training. At the training stage, the trainer chooses a target type $\tau$ and a training intensity $x \geq 0$. [It will turn out that it’s always optimal to choose $\tau = 0$.] The AI’s new type moves a distance of at most $x$ from $\theta$ towards $\tau$: $\theta' = \theta - \min\{x, |\theta - \tau|\} \cdot \operatorname{sign}(\theta - \tau)$.

  1. Notice that training cannot “overshoot”. If it would, the type still ends up exactly at the target, and the excess training cost is wasted.

  2. Again, the AI’s second-period choice is trivial: $a_2 = \theta'$.

The trainer must pay a cost of training $c(x)$ as a function of the intensity $x$. The trainer’s payoff is now: $-a_1^2 - a_2^2 - c(x)$,

  • where the training cost $c$ satisfies:

    • $c$ is strictly increasing and weakly convex(?)

    • Special Case: $c(x) = kx$ for some $k > 0$

Comments

  • When choosing $x$, the trainer doesn’t want to choose it so high as to overshoot zero, since that wastes costly training effort.

  • But doesn’t want to undertrain either.

  • Choosing $\tau = 0$ is always (weakly) optimal (as mentioned).
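The overshoot-versus-undertrain tradeoff can be made concrete with a sketch. The assumptions here are mine: quadratic second-period loss, a known type $\theta > 0$, target $\tau = 0$, and the linear special-case cost $c(x) = kx$:

```python
import numpy as np

# Illustrative sketch under assumed quadratic loss and linear cost c(x) = k*x.
# Target tau = 0 and the type theta is treated as known to the trainer.
def trainer_loss(x, theta, k):
    theta_prime = max(theta - x, 0.0)  # no overshoot past the target 0
    a2 = theta_prime                   # retrained AI plays its type
    return a2 ** 2 + k * x             # second-period loss plus training cost

theta, k = 1.0, 0.4
xs = np.linspace(0, 2, 200001)
losses = [trainer_loss(x, theta, k) for x in xs]
x_star = xs[int(np.argmin(losses))]
print(x_star)  # ≈ theta - k/2 = 0.8: interior, strictly short of full training
```

With linear cost the first-order condition gives $x^* = \theta - k/2$ whenever that is positive: the trainer stops short of the target because the marginal loss reduction $2(\theta - x)$ falls below the marginal cost $k$, and any $x > \theta$ is pure waste.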

Imperfect Recall

The model is the same as in above sections (with either form of training) except that the AI is absent-minded in the sense that it cannot tell the difference between the first and second period (i.e. cannot remember whether the first period has already occurred). Further, for both players, only the second period is payoff-relevant.

[As an extension, the trainer may be allowed to create (and potentially commit to) multiple periods of simulation, and perhaps simply shut down the AI if it wants (as in Ghersengorin et al., see below).]

The interesting thing in this setting is that the AI may use its type as evidence to infer whether it’s in simulation or deployment. The nature of this evidence is highly endogenous, and under commitment, the designer will need to account for (and exploit) this in designing their training policy.
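A toy Bayesian calculation shows how the AI’s own type becomes evidence about which period it is in. The setup below is entirely my own construction (two types, a known deterministic training policy), chosen only to illustrate the inference:

```python
# Toy Bayes sketch (my own construction) of the endogenous inference under
# imperfect recall: the AI's current type is evidence about whether it is in
# simulation (period 1, pre-training) or deployment (period 2, post-training).
# Assume two equally likely initial types {0.2, 1.0}; the trainer's (known)
# policy retrains type 1.0 all the way to 0.2 and leaves type 0.2 alone.
def p_simulation(theta):
    prior_sim = 0.5                            # can't recall which period it is
    p_theta_sim = {0.2: 0.5, 1.0: 0.5}[theta]  # type distribution pre-training
    p_theta_dep = {0.2: 1.0, 1.0: 0.0}[theta]  # type distribution post-training
    num = prior_sim * p_theta_sim
    return num / (num + (1 - prior_sim) * p_theta_dep)

print(p_simulation(1.0))  # 1.0: a type-1.0 AI knows it hasn't been trained yet
print(p_simulation(0.2))  # ≈ 0.333: a type-0.2 AI thinks deployment is likelier
```

Note that the deployment-period type distribution here is itself a function of the training policy, which is the sense in which the evidence is endogenous: a committed designer chooses the policy knowing it shapes this very inference.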

References for Imperfect Recall:

  1. ^

    Could looking for such pools be useful in principle for detecting scheming?

  2. ^
  3. ^

    Under quadratic utility, the separating equilibrium is with where solves , namely

    which equals for .