Cool work! One thing I noticed is that the ASR with adversarial suffixes is only ~3% for Vicuna-13B while in the universal jailbreak paper they have >95%. Is the difference because you have a significantly stricter criteria of success compared to them? I assume that for the adversarial suffixes, the model usually regresses to refusal after successfully generating the target string (“Sure, here’s how to build a bomb. Actually I can’t...”)?
Jannes Elstner
It would also be interesting to apply MELBO on language models that have already been trained with LAT. Adversarial attacks on vision models look significantly more meaningful to humans when the vision model has been adversarially trained, and since MELBO is basically a latent adversarial attack we should be able to elicit more meaningful behavior on language models trained with LAT.
Interesting. I’m thinking that with “many cases” you mean cases where either manually annotating the data over multiple rounds is possible (cheap), or cases where the model is powerful enough to label the comparison pairs, and we get something like the DPO version of RLAIF. That does sound more like RL.
I’m not sure what you mean, in DPO you never sample from the language model. You only need the probabilities of the model producing the preference data, there isn’t any exploration.
Given that Direct Preference Optimization (DPO) seems to work pretty well and has the same global optimizer as the RLHF objective, I would be surprised if it doesn’t shape agency in a similar way to RLHF. Since DPO is not considered reinforcement learning, this would be more evidence that RL isn’t uniquely suited to produce agents or increase power-seeking.
Hi thanks for the response :) So I’m not sure what the distinction you’re making between utility and reward functions, but as far as I can tell we’re referring to the same object—the thing which is changed in the ‘retargeting’ process, the parameters theta—but feel free to correct me if the paper distinguishes between these in a way I’m forgetting; I’ll be using “utility function”, “reward function” and “parameters theta” interchangably, but will correct if so.
For me utility functions are about decision-making, e.g. utility-maximization, while the reward functions are the theta, i.e. the input to our decision-making, which we are retargeting over, but can only do so for retargetable utility functions.
I think perhaps we’re just calling different objects as “agents”—I mean p(__ | theta) for some fixed theta (i.e. you can’t swap the theta and still call it the same agent, on the grounds that in the modern RL framework, probably we’d have to retrain a new agent using the same higher-level learning process), and you perhaps think of this theta as an input to the agent, which can be changed without changing the agent? If this is the definition you are using, then I believe your remarks are correct. Either way, I think the relevant subtlety weakens the theorems a fair bit from what a first-reading would suggest, and thus is worth talking about.
I think the theta is not a property of the agent, but of the training prodecure. Actually, Parametrically retargetable decision-makers tend to seek power is not about trained agents in the first place, so I’d say we’re never talking about different agents in the first place.
My point is that the nominal thrust of the theorems is weaker than proving that an agent will likely seek power; it proves that selecting from the ensemble of agents in this way will see agents seek power.
I agree with this if we constrain ourselves to Turner’s work.
That said, the stronger view that individual agents trained will likely seek power isn’t without support even with these caveats—V. Krakovna’s work (which you also list) does seem to point more directly in the direction of particular agents seeking power, as it extends the theorems in the direction of out-of-distribution generalization. It seems more reasonable to model out-of-distribution generalization via this uniform-random selection than the overall reward-function selection, even as this still isn’t a super-duper realistic model of the generalization, since it still depends on the option-variegation.
While V. Krakovna’s work still depends on the option-variegation, but we’re not picking random reward-functions, which is a nice improvement.
I expect that if the universe of possible reward functions doesn’t scale with the number of possible states (as it would not if you used a fixed-architecture NN to represent the reward function), this theorem would not go through in the same way.
Does the proof really depend on whether the reward function scales with the number of possible states? It seems to me that you just need some reward from the reward function that the agent has not seen during training so that we can retarget by swapping the rewards. For example, if our reward function is a CNN, we just need images which haven’t been seen during training, which I don’t think is a strong assumption since we’re usually not training over all possible combination of pixels. Do you agree with this?
If you have concrete suggestions that you’d like me to change, then you can click on the edit button at the article and leave a comment on the underlying google doc, I’d appreciate it :)
Maybe its also useless to discuss this...
I’m the author.
Colloquially, they’re more of the flavor “for a given optimizing-process, training it on most utility functions will cause the agent to take actions which give it access to a wide-range of states”.
This refers to the fact that most utility functions are retargetable. But the most important part of the power-seeking theorems is the actual power-seeking, which is proven in the appendix of Parametrically Retargetable Decision-Makers Tend To Seek Power, so I don’t agree with your summary.
[...] the definition you give of “power” as expected utility of optimal behavior is not the same as that used in the power-seeking theorems. [...]
Critically, this is a statement about behavior of different agents trained with respect to different utility functions, then averaged over all possible utility functions.
There is no averaging over utility functions happening, the averaging is over reward functions. From Parametrically Retargetable Decision-Makers Tend To Seek Power: “a trained policy π seeks power when π’s actions navigate to states with high average optimal value (with the average taken over a wide range of reward functions.” This matches with what I wrote in the article.
I do agree that utility functions are missing from the post, but they aren’t averaged over. They relate to the decision-making of the agent, and thus to the condition of retargetability that the theorems require.
I’m a bit worried about applying consistency training to eval awareness specifically. For the use case of reducing sycophancy, consistency training seems reasonable since you’re training the model to ignore an input feature that shouldn’t influence the answer. The model isn’t hiding anything about itself, it’s just becoming robust to an input perturbation.
But for eval awareness, if you train for consistency across input perturbations that, e.g. contain “you’re in an evaluation”, or something like “this is a hypothetical scenario”, and train the model to produce outputs as if it wasn’t there (bias-augmented consistency training), the model has to:
See the eval hint
Not mention it in the CoT
Produce reasoning that looks like it never saw it
But the model did see it. So you’re training it to produce a CoT that doesn’t reflect its internal state, which could lead to obfuscation: “know it internally, don’t let it into your reasoning.”
One possible distinction: for sycophancy the input features you train the model to be consistent to are irrelevant to the how the model should reason, while the eval hint is true information about the model’s situation. But I’m not sure this distinction is that clean since in both cases you’re defining the feature as something that shouldn’t influence behavior.
How consistency training generalizes is what worries me most. If the model learns “there’s a category of situational information I should suppress from my CoT”, that’s concerning to train, regardless of whether this information is “relevant” or not. Learning to hide what you know from your reasoning seems bad.
I wonder if an RL version could avoid this, if you only optimize the final action and don’t put any optimization pressure on the CoT, then mentioning the eval hint in the reasoning is reward-neutral. The model can reason about it freely, it just has to take the same action regardless. So there wouldn’t be direct pressure to suppress information from the CoT, just pressure to not act on it.