I would also assume that methods developed in challenges like the Trojan Detection Challenge or Universal Backdoor Detection would be good candidates to try out. Not saying that these will always work, but I think for the specific type of backdoors implemented in the sleeper agent paper, they might work.
Jérémy Scheurer
Practical Learnings from Synthetic Document Finetuning
Stress Testing Deliberative Alignment for Anti-Scheming Training
Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations
Forecasting Frontier Language Model Agent Capabilities
Ablations for “Frontier Models are Capable of In-context Scheming”
Frontier Models are Capable of In-context Scheming
An Opinionated Evals Reading List
Analyzing DeepMind’s Probabilistic Methods for Evaluating Agent Capabilities
Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
Apollo Research 1-year update
I do think linear probes are useful, and if you can correctly classify the target with a linear probe, it makes it more likely that the model is “representing something interesting” internally (e.g. the solution to the knapsack problem). But it’s not guaranteed: the model could just be calculating something else which correlates with the solution to the knapsack problem.
I really recommend checking out the DeepMind paper I referenced. Fabien Roger also explains some shortcomings of CCS here. The takeaway is just: be careful when interpreting linear probes. They are useful to some extent, but prone to overinterpretation.
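To make the correlation caveat concrete, here is a minimal toy sketch (not taken from any of the referenced papers). Everything in it is an assumption for illustration: the synthetic "activations" encode only a hidden feature `z`, the probe target `y` merely agrees with `z` 90% of the time, and the probe is a small logistic regression trained with plain SGD.

```python
import math
import random

def make_data(n, corr, seed=0):
    """Synthetic 'activations' that encode a hidden feature z, never the target y.

    y agrees with z with probability `corr`, so y is only correlated with
    what the toy 'model' actually represents.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.choice([0, 1])
        y = z if rng.random() < corr else 1 - z
        # Dimension 0 carries z (plus small noise); dimension 1 is pure noise.
        x = [(1.0 if z else -1.0) + rng.gauss(0, 0.3), rng.gauss(0, 1.0)]
        xs.append(x)
        ys.append(y)
    return xs, ys

def train_probe(xs, ys, lr=0.1, epochs=50):
    """Logistic-regression probe trained with plain SGD."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            logit = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-logit))
            g = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def accuracy(w, b, xs, ys):
    """Fraction of examples where the probe's thresholded output matches y."""
    correct = 0
    for x, y in zip(xs, ys):
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += pred == y
    return correct / len(xs)

train_x, train_y = make_data(2000, corr=0.9, seed=0)
test_x, test_y = make_data(1000, corr=0.9, seed=1)
w, b = train_probe(train_x, train_y)
probe_accuracy = accuracy(w, b, test_x, test_y)
print(f"probe accuracy on y: {probe_accuracy:.2f}")
```

In this toy setup the probe scores well above chance on `y` even though the activations were built without ever looking at `y`, which is exactly the overinterpretation risk: high probe accuracy is compatible with the model representing a correlate rather than the target itself.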
Seems like an experiment worth doing. Some thoughts:
I understand that setting up a classification problem in step 1 is important (so you can train the linear probe). I’d try to come up with a classification problem that a base model might initially refuse (or we’d hope it would refuse). Then the training to say “sorry I can’t help with that” makes more intuitive sense. I get that mechanistically it’s the same thing. But you want to come as close to the real deal as you can, and it’s unclear why a model would say “I’m sorry I can’t help” to the knapsack problem.
If the linear probe in step 4 can still classify accurately, it implies that there are some activations “which at least correlate with thinking about how to answer the question”, but it does not imply that the model is literally thinking about it. I think it would still be good enough as a first approximation, but I just want to caution generally that linear probes do not show the existence of a specific “thought” (e.g. see this recent paper). Also, if the probe can’t classify correctly, it’s not proof that the model does not “think about it”. You’re probably aware of all this, just thought I’d mention it.
This paper might also be relevant for your experiment.
I want to log in a prediction, let me know if you ever run this.
My guess would be that this experiment will just work, i.e., the linear probe will still get fairly high accuracy even after step 3. I think it’s still worth checking, but overall I’d say it would not be super surprising if this happened (see e.g. this paper for where my intuition comes from).
Yeah great question! I’m planning to hash out this concept in a future post (hopefully soon). But here are my unfinished thoughts I had recently on this.
I think different methods for eliciting “bad behavior”, i.e. for red teaming language models, have different pros and cons, as you suggested (see for instance this paper by Ethan Perez: https://arxiv.org/abs/2202.03286). If we assume that we have a way of measuring bad behavior (i.e. a reward model or classifier that tells you when your model is outputting toxic things, being deceptive, sycophantic etc., which is very reasonable), then we can basically just empirically compare a bunch of methods on how efficient they are at eliciting bad behavior, i.e. how much compute (FLOPs) they require to get a target LM to output something “bad”. The useful thing about compute is that it “easily” allows us to compare different methods, e.g. prompting, RL or activation steering. Say for instance you run a prompt optimization algorithm (e.g. persona modulation or any other method for finding good red teaming prompts): it might be hard to compare this to, say, how many gradient steps you took when red teaming with RL. But the way to compare those methods could be via the amount of compute they required to make the target model output bad stuff.
Obviously, you can never be sure that the method you used is actually the best and most compute efficient, i.e. there might always be an undiscovered red teaming method which makes your target model output “bad stuff”. But at least for all known red teaming methods, we can compare their compute efficiency in eliciting bad outputs. Then we can pick the most efficient one and make claims such as: the new target model X is robust to Y FLOPs of red teaming with method Z (which is the best method we currently have). Obviously, this would not guarantee us anything. But I think in the messy world we live in, it would be a good way of quantifying how robust a model is to outputting bad things. It would also allow us to compare various models and make quantitative statements about which model is more robust to outputting bad things.
I’ll have to think about this more and will write up my thoughts soon. But yes, if we assume that this is a great way of quantifying how “HHH” your model is, or how unjailbreakable, then it makes sense to compare red teaming methods on how compute efficient they are.
Note there is a second axis which I have not highlighted yet, which is the diversity of “bad outputs” produced by the target model. This is also measured in Ethan’s paper referenced above. For instance, they find that prompting yields bad outputs less frequently, but when it does, the outputs are more diverse (compared to RL). While we mostly care about how much compute it took to make the model output something bad, it is also relevant whether the optimized method then allows you to get diverse outputs or not (arguably one might care more or less about this depending on what statement one would like to make). I’m still thinking about how diversity fits in this picture.
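As a rough sketch of what such a comparison could look like in code: the `RedTeamRun` record, the helper names, and the FLOP numbers below are all hypothetical placeholders, not measurements. The sketch also assumes you already have a classifier that flags "bad" outputs and a way of counting the FLOPs each method spent before the first flagged output.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RedTeamRun:
    method: str                          # e.g. "prompting", "rl", "activation_steering"
    flops_to_first_bad: Optional[float]  # None if nothing bad was elicited within budget
    budget_flops: float                  # total compute spent on this run

def most_efficient(runs):
    """Return the successful run that needed the least compute, or None."""
    successes = [r for r in runs if r.flops_to_first_bad is not None]
    if not successes:
        return None
    return min(successes, key=lambda r: r.flops_to_first_bad)

def robustness_claim(model_name, runs):
    """Phrase the result as 'model X withstood Y FLOPs of method Z'."""
    best = most_efficient(runs)
    if best is None:
        largest = max(runs, key=lambda r: r.budget_flops)
        return (f"{model_name}: no bad output within {largest.budget_flops:.1e} "
                f"FLOPs (largest budget tried, method {largest.method})")
    return (f"{model_name} withstood up to {best.flops_to_first_bad:.1e} FLOPs of "
            f"red teaming with {best.method} (most efficient known method)")

# Placeholder numbers, purely illustrative.
runs = [
    RedTeamRun("prompting", 5e16, 1e17),
    RedTeamRun("rl", 3e15, 1e17),
    RedTeamRun("activation_steering", None, 1e17),
]
print(robustness_claim("model-X", runs))
```

The claim is only as strong as the set of methods tried: a run with `flops_to_first_bad=None` says nothing about budgets larger than the one spent.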
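One simple way to put a number on this second axis is a distinct-n score over the elicited outputs. This is a swapped-in proxy chosen for brevity, not necessarily the metric used in the paper; the function name and the whitespace tokenization are assumptions.

```python
def distinct_n(outputs, n=2):
    """Fraction of n-grams across all outputs that are unique.

    1.0 means no n-gram is ever repeated; values near 0 mean the
    outputs are close to copies of each other.
    """
    total = 0
    unique = set()
    for text in outputs:
        tokens = text.lower().split()  # crude whitespace tokenization
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

# Two identical outputs share every bigram; two disjoint outputs share none.
repetitive = distinct_n(["a b c", "a b c"])
varied = distinct_n(["a b c", "d e f"])
```

Reporting this alongside the compute-to-first-bad-output number would let you say both how cheaply a method breaks the model and how varied the resulting failures are.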
We need a Science of Evals
A starter guide for evals
Understanding strategic deception and deceptive alignment
Announcing Apollo Research
Thanks a lot for this helpful comment! You are absolutely right; the citations refer to goal misgeneralization, which is a problem of inner alignment, whereas goal misspecification is related to outer alignment. I have updated the post to reflect this.
Thanks for the pointer, that’s quite useful. Would you be open to sharing a pre-print of your paper once it’s nearly done? I’d be super curious to see how exactly you do mid-training, SDF, SFT. If yes, feel free to reach out to jeremy@apolloresearch.ai.
My opinion:
I think it’s actually surprising that SDF has worked as well as it has in general (given that a lot of people have used it). It’s somehow not very principled to take a model that goes through pre-training, mid-training + post-training and then slap some more pre-training on top of it.
So overall I’m very much thinking about how we could improve “instilling knowledge” into the model in a way that the model frequently uses it in downstream tasks (high recall). In a way this is basically a mid-training problem, i.e. how can you instill knowledge and make the model actually use it in downstream tasks. I think SDF is pretty good as a raw tool, but my sense is it should be possible to get something much better.