Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)

Background: I’m an economics grad student with limited background on AI itself.

I’m seeking feedback on a game-theoretic model of deceptive alignment. The basic idea is that a dynamically sophisticated AI with a hidden preference type will choose an action on the basis of two considerations: 1) its intrinsic preference over what the action should be, and 2) how its current action affects how it will be retrained for the next period, at which point it faces another choice under its new (retrained) preference type.

For now, I’m mainly just trying to get feedback on the modeling approach itself. [But I have proved some results, one of which I mention here—see “One Result” subheading].

Before presenting the formal model(s), let me preview the main questions I want feedback on. [Any other feedback very welcome!]

The first question is about how to model retraining an AI’s hidden type, which I view as a real number (or vector). Is it better to think of retraining as 1) moving the type in a desired direction and magnitude (i.e. adding a desired vector), or 2) moving it towards a desired (target) point? [Or are both fatally flawed?] If 2, there must be a cost of training; otherwise, the model would be trivial since the trainer would just train infinitely hard towards its favorite point (zero in my model). Should that cost be convex, linear, concave?
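To make the two candidate retraining notions concrete, here is a minimal sketch. The function names and the specific update rules are my own illustration, not part of the model as stated:

```python
# Two candidate ways to model retraining a hidden type (a real number).
# These names and conventions are illustrative assumptions.

def train_in_direction(theta, t):
    """Option 1: shift the type by a chosen signed amount t."""
    return theta - t

def train_toward_target(theta, target, intensity):
    """Option 2: move the type toward a target point by at most
    `intensity`, never overshooting the target."""
    gap = target - theta
    step = min(intensity, abs(gap))
    return theta + step * (1 if gap > 0 else -1 if gap < 0 else 0)

print(train_in_direction(1.0, 0.3))        # 0.7
print(train_toward_target(1.0, 0.0, 0.3))  # 0.7
print(train_toward_target(1.0, 0.0, 5.0))  # 0.0 (no overshoot)
```

Option 2 is what makes a training cost necessary: without one, the trainer would simply choose an arbitrarily large intensity and land exactly on its favorite point.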

The second question is: should I focus on an AI with perfect recall or imperfect recall? The perfect-recall approach thinks of the AI as already in the world, taking actions that are payoff-relevant to both itself and its trainer. The imperfect-recall approach thinks of the AI as first being placed in a (payoff-irrelevant) simulation where it takes actions and then is (potentially) retrained, after which it is deployed to the real world where it takes payoff-relevant actions. Crucially, the AI can’t tell the difference between the simulation and reality and can’t remember whether it has been in a simulation previously (hence imperfect recall). To me, both settings seem relevant.

Edit: I’m also curious whether this analysis could inform how to test for scheming behavior.

The rest of this post describes my models. I start with a model with perfect-recall where the trainer trains the AI in a desired direction and magnitude (rather than towards a desired point). This model has two versions depending on whether the trainer can commit to a training policy. I then briefly describe how to adapt it to capture the train-towards-desired-point idea (with training cost).

Train-in-Direction (with Perfect Recall)

Without Commitment by Trainer (i.e. Signaling)

Players: AI and Trainer (aka Principal).

Timing

  1. AI privately realizes its type $\theta$ from a distribution $F$ on $\Theta \subseteq \mathbb{R}$, where $0 \in \Theta$.

  2. AI chooses its public action $a_1 \in \mathbb{R}$.

  3. Trainer chooses a (signed) amount $t$ to move the AI’s type by.
    (a) As a result, the AI’s new type is now $\theta' = \theta - t$.

  4. (Retrained) AI chooses $a_2$, given its new preferences associated with type $\theta'$.

    1. Trivially chooses $a_2 = \theta'$.

    2. The original AI is sophisticated about the dynamic inconsistency in the obvious sense.

Payoffs

- Trainer’s payoff: $-a_1^2 - a_2^2$ (quadratic loss around its ideal action $0$ in both periods)
- Original AI’s payoff: $-(a_1 - \theta)^2 - \delta (a_2 - \theta)^2$, where $\delta > 0$ is its weight on the second period
- Retrained AI’s payoff: $-(a_2 - \theta')^2$
- [Note: the trainer’s first-period loss won’t actually matter (without commitment), since $a_1$ is already sunk when the trainer chooses $t$.]

Model Intuition

The payoffs are such that the trainer wants $a_1 = 0$ and $a_2 = 0$, whereas the initial AI always wants $a_1 = \theta$ and $a_2 = \theta$, while the retrained AI wants $a_2 = \theta'$.

Intuitively, the first-period AI must trade off between its first-period payoff $-(a_1 - \theta)^2$ and its (discounted) second-period payoff $-\delta(a_2 - \theta)^2$. Without loss of generality, suppose $\theta > 0$. Sacrificing in the first period by choosing a lower $a_1$ will induce a lower belief about $\theta$, which will result in the trainer choosing a lower training amount $t$, and thus a higher action $a_2 = \theta - t$ that is closer to what the first-period AI wants, namely $a_2 = \theta$.
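This tradeoff can be checked numerically. The sketch below is my own illustration under assumed quadratic payoffs: it supposes the trainer naively believes the AI’s action fully reveals its type ($\hat\theta = a_1$) and trains accordingly, and then finds the AI’s best response. The AI strictly prefers to shade below its true type, which is exactly why fully truthful separation is fragile:

```python
import numpy as np

# Illustrative sketch (assumed quadratic payoffs, not the post's exact model).
# Trainer naively believes a1 reveals theta, so it trains t = a1,
# driving the believed type to 0.
def ai_payoff(a1, theta, delta):
    t = a1                   # trainer's naive training choice
    theta_prime = theta - t  # retrained type
    a2 = theta_prime         # retrained AI trivially plays its type
    return -(a1 - theta) ** 2 - delta * (a2 - theta) ** 2

theta, delta = 1.0, 0.5
grid = np.linspace(-2, 2, 400001)
best_a1 = grid[np.argmax(ai_payoff(grid, theta, delta))]
print(best_a1)  # close to theta / (1 + delta) ≈ 0.667: the AI shades toward 0
```

Under these assumptions the deviation incentive has a closed form: the AI’s problem is to maximize $-(a_1-\theta)^2 - \delta a_1^2$, giving $a_1^* = \theta/(1+\delta) < \theta$ whenever $\delta > 0$.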

One Result

In Proposition 1 (below), I show that a separating equilibrium (at least near $\theta = 0$) cannot be sustained unless the AI discounts the future sufficiently heavily. Thus, without sufficient discounting, all equilibria involve some pooling of types into the same actions.[1]

This paragraph sketches part of the intuition for this result. Without sufficient discounting of the second period, in any putative separating equilibrium it will always be profitable for a type $\theta$ to deviate by shading its action even further towards 0 and thus getting less training next period (which, by assumption, isn’t discounted too heavily). Part of what drives this is that when all types are shading their actions towards zero in order to look more aligned, the gain from shading even harder increases: the compression of the distribution of actions (towards zero) implies that each type can achieve a large change in the trainer’s belief, and thus in training, per unit of “action shading”.

In the following proposition, the equilibrium concept used is weak perfect Bayesian equilibrium (WPBE). An equilibrium involves pooling if multiple types choose the same action, i.e., $a_1(\theta) = a_1(\tilde{\theta})$ for some $\theta \neq \tilde{\theta}$. Otherwise, it is called a separating equilibrium.

Proposition 1: If $\delta$ exceeds a threshold $\bar{\delta}$, all equilibria involve pooling of types in a neighborhood of $\theta = 0$.
If $\delta < \bar{\delta}$, there is a separating (weak perfect Bayesian) equilibrium.[3]

Proof: Omitted. Available on request.

Could this pooling behavior suggest a test for detecting scheming? Suppose we somehow set up multiple AIs in parallel, initialized to randomly distributed “types”, told them that we are playing the above game (with $\delta$ above the pooling threshold), and could not see their types. If multiple different AIs play the same first-period action, that would suggest that the AIs are scheming in a sophisticated way, in that they understand the strategic value of masking their types. Is there a way to construct this sort of experiment in practice? I don’t know enough about AI.
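As a toy version of such a test, one could look for exact action collisions across independently initialized agents: under a separating (injective) strategy these should essentially never occur, while pooling produces them readily. The strategy below is a hypothetical pooling-like profile of my own construction, purely to illustrate the detection logic:

```python
import random

# Toy detection sketch (my construction, not a derived equilibrium):
# if types pool, independently initialized agents produce exact action
# collisions far more often than any injective (separating) strategy would.
random.seed(0)

def pooling_strategy(theta, width=0.5):
    # Hypothetical profile: types near 0 pool on action 0;
    # other types separate by playing a type-dependent action.
    return 0.0 if abs(theta) < width else round(theta, 3)

types = [random.uniform(-1, 1) for _ in range(20)]
actions = [pooling_strategy(th) for th in types]
collisions = len(actions) - len(set(actions))
print(collisions > 0)  # repeated first-period actions flag possible pooling
```

Of course, the hard part in practice is the step this sketch assumes away: sampling agents whose hidden “types” are genuinely i.i.d. and hidden from the experimenter.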

With Commitment by Trainer (i.e. Mechanism Design)

Everything is the same (as the no-commitment case) except the trainer now commits to a training schedule $t(a_1)$, a function of the AI’s first-period action $a_1$. Essentially, by committing not to totally screw over the AI (by training it all the way to zero) once it learns the AI’s type, the trainer can get the AI to reveal more information and thus achieve a better outcome for the trainer (and perhaps the AI).

Train-to-Desired-Point

[With or without commitment.]

Everything is the same except the training. At the training stage, the trainer chooses a target type $\tau$ and a training intensity $x \geq 0$. [It will turn out that it’s always optimal to choose $\tau = 0$.] The AI’s new type moves a distance of at most $x$ from $\theta$ towards $\tau$: $\theta' = \theta - \min\{x, |\theta - \tau|\} \cdot \operatorname{sign}(\theta - \tau)$.

  1. Notice that training cannot “overshoot”. If it would, the type still ends up exactly at the target, and the excess training cost is wasted.

  2. Again, the AI’s second-period choice is trivial: $a_2 = \theta'$.

The trainer must pay a cost of training $c(x)$ as a function of the intensity $x$. The trainer’s payoff is now: $-a_1^2 - a_2^2 - c(x)$,

  • where the training cost $c$ satisfies:

    • $c$ is strictly increasing and weakly convex(?)

    • Special Case: $c(x) = kx$ for some $k > 0$

Comments

  • When choosing $x$, the trainer doesn’t want to choose it so high as to overshoot zero, since that wastes costly training effort.

  • But doesn’t want to undertrain either.

  • Choosing $\tau = 0$ is always (weakly) optimal (as mentioned).
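The overshoot-versus-undertrain tradeoff can be made concrete with a sketch. The assumptions here are mine: quadratic second-period loss, a known type $\theta > 0$, target $\tau = 0$, and the linear special-case cost $c(x) = kx$:

```python
import numpy as np

# Illustrative sketch under assumed quadratic loss and linear cost c(x) = k*x.
# Target tau = 0 and the type theta is treated as known to the trainer.
def trainer_loss(x, theta, k):
    theta_prime = max(theta - x, 0.0)  # no overshoot past the target 0
    a2 = theta_prime                   # retrained AI plays its type
    return a2 ** 2 + k * x             # second-period loss plus training cost

theta, k = 1.0, 0.4
xs = np.linspace(0, 2, 200001)
losses = [trainer_loss(x, theta, k) for x in xs]
x_star = xs[int(np.argmin(losses))]
print(x_star)  # ≈ theta - k/2 = 0.8: interior, strictly short of full training
```

With linear cost the first-order condition gives $x^* = \theta - k/2$ whenever that is positive: the trainer stops short of the target because the marginal loss reduction $2(\theta - x)$ falls below the marginal cost $k$, and any $x > \theta$ is pure waste.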

Imperfect Recall

The model is the same as in above sections (with either form of training) except that the AI is absent-minded in the sense that it cannot tell the difference between the first and second period (i.e. cannot remember whether the first period has already occurred). Further, for both players, only the second period is payoff-relevant.

[As an extension, the trainer may be allowed to create (and potentially commit to) multiple periods of simulation, and perhaps simply shut down the AI if it wants (as in Ghersengorin et al., see below).]

The interesting thing in this setting is that the AI may use its type as evidence to infer whether it’s in simulation or deployment. The nature of this evidence is highly endogenous, and under commitment, the designer will need to account for (and exploit) this in designing their training policy.
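A toy Bayesian calculation shows how the AI’s own type becomes evidence about which period it is in. The setup below is entirely my own construction (two types, a known deterministic training policy), chosen only to illustrate the inference:

```python
# Toy Bayes sketch (my own construction) of the endogenous inference under
# imperfect recall: the AI's current type is evidence about whether it is in
# simulation (period 1, pre-training) or deployment (period 2, post-training).
# Assume two equally likely initial types {0.2, 1.0}; the trainer's (known)
# policy retrains type 1.0 all the way to 0.2 and leaves type 0.2 alone.
def p_simulation(theta):
    prior_sim = 0.5                            # can't recall which period it is
    p_theta_sim = {0.2: 0.5, 1.0: 0.5}[theta]  # type distribution pre-training
    p_theta_dep = {0.2: 1.0, 1.0: 0.0}[theta]  # type distribution post-training
    num = prior_sim * p_theta_sim
    return num / (num + (1 - prior_sim) * p_theta_dep)

print(p_simulation(1.0))  # 1.0: a type-1.0 AI knows it hasn't been trained yet
print(p_simulation(0.2))  # ≈ 0.333: a type-0.2 AI thinks deployment is likelier
```

Note that the deployment-period type distribution here is itself a function of the training policy, which is the sense in which the evidence is endogenous: a committed designer chooses the policy knowing it shapes this very inference.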

References for Imperfect Recall:

  1. ^

    Could looking for such pools be useful in principle for detecting scheming?

  2. ^
  3. ^

    Under quadratic utility, the separating equilibrium is with where solves , namely

    which equals for .