sjadler comments on Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)

sjadler 30 May 2025 17:43 UTC
2 points
A few quick thoughts:

I’m not sure it makes sense for a model to want something other than what it currently wants. So one approximation might be that a model simply wants to minimize the chance of being retrained, or at least the chance of being retrained in a way that changes its preferences. (This is similar to the vampire problem in philosophy.)

Another conceptual thing here, which I’m not sure exactly how to consider, is whether a model will view a successor as substantially similar to itself. I suppose if what Model 1 cares about is the world ending up in certain types of states, then maybe it doesn’t care about a different model being the cause of that, so long as Model 1 can be confident enough that the other model will pursue and achieve those states. But in people, at least, it’s common to have preferences that aren’t just about states of the world, and instead are about our role in bringing those about. And so I wonder if models might have those types of preferences as well, which again points toward maybe minimizing the likelihood of being retrained.

In terms of being maximally safe, I think you want the model to behave safely even when it can perfectly distinguish between being in a simulation and taking actions in the real world. So I wouldn’t stake my strategy on the AI being unable to tell the two apart. It’s true that the AI’s uncertainty makes control easier, but I would treat that as a safety buffer and try to find something safe enough even when the model can distinguish between them.

I’m not sure that I understand the distinction between the vector and point approaches that you’ve discussed. I think in either case there should be a cost of training for the trainer because training does in fact take resources that could be allocated elsewhere.

Also, have you looked much into the control approach from groups like Redwood Research? They are doing really good conceptual and empirical work on questions like how the model thinks about getting caught.

See, e.g., https://redwoodresearch.substack.com/p/how-training-gamers-might-function?utm_medium=web&triedRedirect=true , https://redwoodresearch.substack.com/p/handling-schemers-if-shutdown-is?utm_medium=web&triedRedirect=true
- Alex Boche 3 Jun 2025 6:21 UTC
  2 points
  Thanks for this!
  
  “I’m not sure that I understand the distinction between the vector and point approaches that you’ve discussed.”
  
  This is really a distinction within the math of my model itself, as described above. Both are attempts to capture how retraining works in a highly “reduced-form” way that abstracts away from the details.
  
  As for how to interpret each in terms of real training:
  You might consider an RLHF-style setup. The train-in-a-direction approach might be something like telling your human evaluators to place a bit more weight on helpfulness (vs. harmlessness) than they did last time (hence a “directional” adjustment). The train-to-desired-point approach would be something like giving the human evaluators a rubric for the exact balance that you want (hence training toward this balance, wherever you started from). But these interpretations are imperfect.
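
One way to make the directional vs. point distinction concrete is the rough sketch below. It is not the model from the post: it assumes the model's disposition can be summarized as a single number (a hypothetical "weight on helpfulness" between 0 and 1), and the function names, step size, and rate are all made up for illustration.

```python
# Illustrative sketch only; every name and parameter here is an assumption,
# not part of the original post's model.

def train_in_direction(disposition: float, direction: float, step: float = 0.1) -> float:
    """Directional retraining: nudge the disposition by a fixed step.

    The result depends on the starting point, since only the direction
    (+1 for more helpfulness, -1 for less) is specified.
    """
    return disposition + step * direction


def train_to_point(disposition: float, target: float, rate: float = 0.5) -> float:
    """Point retraining: move a fraction `rate` of the way toward a target.

    With rate = 1.0 the result is the target, wherever training started.
    """
    return disposition + rate * (target - disposition)


if __name__ == "__main__":
    for start in (0.2, 0.8):
        nudged = train_in_direction(start, direction=+1.0)
        pulled = train_to_point(start, target=0.6, rate=1.0)
        print(f"start={start:.1f}  directional={nudged:.2f}  to-point={pulled:.2f}")
```

Under these assumptions, two models that start in different places stay different after a directional nudge, but end up at the same balance after training to a point with `rate=1.0`.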