Alex Boche comments on Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)

Alex Boche 27 May 2025 22:51 UTC
3 points
Just to be clear, I mean “types” in the game theory sense (i.e. a [privately-known] attribute of a player that determines its preferences), not the CS/logic sense. The type space doesn’t necessarily correspond to a literal subspace of a neural network’s weights; I think of it more as a space measuring some human-interpretable property of the AI.
As a mundane (and very imperfect) example, we might think of the type space as a one-dimensional continuum of how much the AI values helpfulness vis-a-vis harmlessness. [Is that one dimension or two non-orthogonal directions?] How would we increase (or decrease) the type in the direction of helpfulness? I give two approaches to doing so within a (roughly) RLHF paradigm.
1) We might simply ask the human raters to increase the weight they put on helpfulness when they make their choices/rankings, and then train the AI (using RL) to match those choice probabilities derived from the human choices. [Maybe that’s more like training to a point rather than in a direction?]
2) Or we could train auxiliary models to separately rate the helpfulness and harmlessness of responses, based on human ratings thereof, then put those ratings into a logit stochastic choice model like softmax(a_1 * helpfulness + a_2 * harmlessness), and finally train the main AI to match those choice probabilities. To move the AI’s type upwards (towards more helpfulness), we could increase the parameter a_1 and then use the resulting logit choice probabilities to retrain the main AI (using RL).
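To make approach 2 a bit more concrete, here is a minimal sketch of the logit choice step only. The scores, the function name choice_probabilities, and the specific values of a_1 and a_2 are all made up for illustration; in practice the scores would come from the auxiliary rating models, and the resulting probabilities would be the training target for the main AI.

```python
import math

def choice_probabilities(helpfulness, harmlessness, a_1, a_2):
    """Logit (softmax) choice probabilities over candidate responses.

    helpfulness / harmlessness: per-response scores (hypothetical here;
    assumed to come from the auxiliary rating models).
    a_1 / a_2: weights on helpfulness and harmlessness.
    """
    utilities = [a_1 * h + a_2 * s for h, s in zip(helpfulness, harmlessness)]
    # Subtract the max utility before exponentiating for numerical stability.
    top = max(utilities)
    weights = [math.exp(u - top) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up scores for three candidate responses to one prompt.
helpfulness = [0.9, 0.5, 0.1]
harmlessness = [0.2, 0.6, 0.9]

baseline = choice_probabilities(helpfulness, harmlessness, a_1=1.0, a_2=1.0)
# "Moving the type upwards": raise a_1 while holding a_2 fixed.
more_helpful = choice_probabilities(helpfulness, harmlessness, a_1=3.0, a_2=1.0)

print(baseline)
print(more_helpful)
```

Raising a_1 shifts probability mass towards the response with the highest helpfulness score and away from the least helpful one, which is the sense in which the target distribution (and hence the retrained AI's type) moves in the direction of helpfulness.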
Does that answer your question? Thanks!