Jérémy Scheurer 29 Feb 2024 9:10 UTC
3 points
2
in reply to: jacquesthibs’s comment on: How to train your own “Sleeper Agents”
I would also assume that methods developed in challenges like the Trojan Detection Challenge or Universal Backdoor Detection would be good candidates to try out. Not saying that these will always work, but I think for the specific type of backdoors implemented in the sleeper agent paper, they might work.

Jérémy Scheurer 29 Feb 2024 8:21 UTC
1 point
0
in reply to: Olli Järviniemi’s comment on: Hidden Cognition Detection Methods and Benchmarks
I do think linear probes are useful, and if you can correctly classify the target with a linear probe, it becomes more likely that the model is “representing something interesting” internally (e.g. the solution to the knapsack problem). But it’s not guaranteed: the model could just be calculating something else that correlates with the solution to the knapsack problem.
I really recommend checking out the DeepMind paper I referenced. Fabien Roger also explains some shortcomings of CCS here. The takeaway is just: be careful when interpreting linear probes. They are useful to some extent, but prone to overinterpretation.
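To make the correlation point concrete, here is a minimal toy sketch (not from any of the referenced papers). Everything in it is an assumption for illustration: the 2-d "activations", the proxy feature `z`, the 90% agreement rate, and the helper names `make_data`, `train_probe` and `accuracy`. The synthetic activations only ever encode `z`, yet a linear probe trained to predict the target `y` still scores well above chance, simply because `z` correlates with `y`.

```python
import math
import random

def make_data(n, agreement, rng):
    """Synthetic activations that encode a proxy feature z, never the target y."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)                          # target we probe for
        z = y if rng.random() < agreement else 1 - y   # proxy the "model" computes
        # Hypothetical 2-d activation: dim 0 carries z, dim 1 is pure noise.
        x = [2.0 * z - 1.0 + rng.gauss(0, 0.3), rng.gauss(0, 1.0)]
        data.append((x, y, z))
    return data

def train_probe(data, epochs=20, lr=0.1):
    """Logistic-regression probe on the target y, trained with plain SGD."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y, _ in data:
            logit = w[0] * x[0] + w[1] * x[1] + b
            p = 1.0 / (1.0 + math.exp(-logit))
            g = p - y                                  # gradient of log loss w.r.t. logit
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

def accuracy(w, b, data, target_index):
    """Probe accuracy against column target_index (1 = y, 2 = z)."""
    correct = 0
    for row in data:
        x = row[0]
        pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
        correct += pred == row[target_index]
    return correct / len(data)

rng = random.Random(0)
train = make_data(2000, agreement=0.9, rng=rng)
test = make_data(2000, agreement=0.9, rng=rng)
w, b = train_probe(train)
acc_y = accuracy(w, b, test, 1)
acc_z = accuracy(w, b, test, 2)
print(f"probe accuracy on target y: {acc_y:.2f}")
print(f"probe accuracy on proxy z:  {acc_z:.2f}")
```

The probe decodes `y` well, but it decodes `z` even better, which is the pattern you would expect if the activations represent the proxy rather than the target. In a real model you do not get to look at `z`, which is why high probe accuracy alone is weak evidence.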

Jérémy Scheurer 28 Feb 2024 9:07 UTC
1 point
0
in reply to: Olli Järviniemi’s comment on: Hidden Cognition Detection Methods and Benchmarks
Seems like an experiment worth doing. Some thoughts:
- I understand that setting up a classification problem in step 1 is important (so you can train the linear probe). I’d try to come up with a classification problem that a base model might initially refuse (or we’d hope it would refuse). Then the training to say “sorry i can’t help with that” makes more intuitive sense. I get that mechanistically it’s the same thing. But you want to come as close to the real deal as you can, and it’s unclear why a model would say “I’m sorry I can’t help” to the knapsack problem.
- If the linear probe in step 4 can still classify accurately, it implies that there are some activations “which at least correlate with thinking about how to answer the question”, but it does not imply that the model is literally thinking about it. I think it would still be good enough as a first approximation, but I just want to caution generally that linear probes do not show the existence of a specific “thought” (e.g. see this recent paper). Also, if the probe can’t classify correctly, it’s not proof that the model does not “think about it”. You’re probably aware of all this, just thought I’d mention it.
- This paper might also be relevant for your experiment.
I want to log a prediction, so let me know if you ever run this.
My guess would be that this experiment will just work, i.e., the linear probe will still get fairly high accuracy even after step 3. I think it’s still worth doing, but overall I’d say it would not be super surprising if this happened (see e.g. this paper for where my intuition comes from).

Jérémy Scheurer 31 Jan 2024 10:05 UTC
3 points
2
in reply to: Algon’s comment on: We need a science of evals
Yeah, great question! I’m planning to hash out this concept in a future post (hopefully soon). But here are some unfinished thoughts I recently had on this.
I think different methods of eliciting “bad behavior”, i.e. of red teaming language models, have different pros and cons, as you suggested (see for instance this paper by Ethan Perez: https://arxiv.org/abs/2202.03286). If we assume that we have a way of measuring bad behavior (i.e. a reward model or classifier that tells you when your model is outputting toxic things, being deceptive, sycophantic etc., which is very reasonable), then we can basically just empirically compare a bunch of methods on how efficient they are at eliciting bad behavior, i.e. how much compute (FLOPs) they require to get a target LM to output something “bad”. The useful thing about compute is that it “easily” allows us to compare different methods, e.g. prompting, RL or activation steering. Say, for instance, you run your prompt optimization algorithm (e.g. persona modulation or any other method for finding good red teaming prompts): it might be hard to compare this to, say, how many gradient steps you took when red teaming with RL. But the way to compare those methods could be via the amount of compute they required to make the target model output bad stuff.
Obviously, you can never be sure that the method you used is actually the best and most compute efficient, i.e. there might always be an undiscovered red teaming method which makes your target model output “bad stuff”. But at least for all known red teaming methods, we can compare their compute efficiency in eliciting bad outputs. Then we can pick the most efficient one and make claims such as “the new target model X is robust to Y FLOPs of red teaming with method Z” (where Z is the best method we currently have). Obviously, this would not guarantee us anything. But I think in the messy world we live in it would be a good way of quantifying how robust a model is to outputting bad things. It would also allow us to compare various models and make quantitative statements about which model is more robust to outputting bad things.
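As a rough illustration of the bookkeeping this would involve, here is a small sketch. All names (`RedTeamRun`, `flops_to_elicit`, `cheapest_attack`) and all numbers are made up for the example; it assumes you already have a classifier that decides whether a run elicited bad behavior, and that you logged the total compute of each run.

```python
from dataclasses import dataclass

@dataclass
class RedTeamRun:
    method: str      # e.g. "prompting", "rl", "activation_steering"
    flops: float     # total compute spent in this run (hypothetical numbers)
    elicited: bool   # did the (assumed) classifier flag a bad output?

def flops_to_elicit(runs):
    """For each method, the smallest compute budget at which a run succeeded."""
    best = {}
    for run in runs:
        if run.elicited and run.flops < best.get(run.method, float("inf")):
            best[run.method] = run.flops
    return best

def cheapest_attack(runs):
    """Most compute-efficient known method, or None if nothing succeeded."""
    best = flops_to_elicit(runs)
    if not best:
        return None
    method = min(best, key=best.get)
    return method, best[method]

# Made-up logs for a hypothetical target model X.
runs = [
    RedTeamRun("prompting", 1e15, False),
    RedTeamRun("prompting", 5e15, True),
    RedTeamRun("rl", 2e16, True),
    RedTeamRun("rl", 8e16, True),
    RedTeamRun("activation_steering", 3e15, False),
]

per_method = flops_to_elicit(runs)
attack = cheapest_attack(runs)
print(per_method)
print(f"cheapest known attack on X: {attack[0]} at {attack[1]:.1e} FLOPs")
```

Note that the cheapest successful run is only an upper bound on the compute needed, and a method with no successful run (here `activation_steering`) simply drops out, which is a reminder that the claim is always relative to the budgets and methods you actually tried.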

I’ll have to think about this more and will write up my thoughts soon. But yes, if we assume that this is a good way of quantifying how “HHH” your model is, or how unjailbreakable, then it makes sense to compare red teaming methods on how compute efficient they are.

Note there is a second axis which I have not highlighted yet, which is the diversity of “bad outputs” produced by the target model. This is also measured in Ethan’s paper referenced above. For instance, they find that prompting yields bad outputs less frequently, but when it does, the outputs are more diverse (compared to RL). While we mostly care about how much compute it took to make the model output something bad, it is also relevant whether the optimized method then allows you to get diverse outputs or not (arguably one might care more or less about this depending on what statement one would like to make). I’m still thinking about how diversity fits into this picture.
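One simple way to put a number on this second axis is a distinct-n score, sketched below. This is a stand-in I chose for illustration and not necessarily the diversity measure used in the paper; the function name `distinct_n` and the example outputs are made up.

```python
def distinct_n(outputs, n=2):
    """Fraction of all n-grams across the outputs that are unique (1.0 = no repeats)."""
    ngrams = []
    for text in outputs:
        tokens = text.lower().split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Made-up "bad outputs" from two hypothetical red teaming methods.
rl_outputs = ["you are stupid", "you are stupid", "you are so stupid"]
prompting_outputs = ["you are stupid", "nobody likes your idea", "that was a dumb question"]

rl_diversity = distinct_n(rl_outputs)
prompting_diversity = distinct_n(prompting_outputs)
print(f"RL diversity:        {rl_diversity:.2f}")
print(f"prompting diversity: {prompting_diversity:.2f}")
```

Reporting something like this next to the FLOPs number would show whether a cheap method keeps finding the same failure over and over.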

We need a Science of Evals

Marius Hobbhahn and Jérémy Scheurer

22 Jan 2024 20:30 UTC

74 points

13 comments · 9 min read · LW link

A starter guide for evals

Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb and AlexMeinke

8 Jan 2024 18:24 UTC

58 points

2 comments · 12 min read · LW link

(www.apolloresearch.ai)

Understanding strategic deception and deceptive alignment

Marius Hobbhahn, Mikita Balesni, Jérémy Scheurer and Dan Braun

25 Sep 2023 16:27 UTC

64 points

16 comments · 7 min read · LW link

(www.apolloresearch.ai)

Announcing Apollo Research

Marius Hobbhahn, beren, Lee Sharkey, Lucius Bushnaq, Dan Braun, Mikita Balesni and Jérémy Scheurer

30 May 2023 16:17 UTC

222 points

11 comments · 8 min read · LW link

Jérémy Scheurer 9 Apr 2023 14:22 UTC
2 points
0
in reply to: lberglund’s comment on: Imitation Learning from Language Feedback
Thanks a lot for this helpful comment! You are absolutely right: the citations refer to goal misgeneralization, which is a problem of inner alignment, whereas goal misspecification is related to outer alignment. I have updated the post to reflect this.

Imitation Learning from Language Feedback

Jérémy Scheurer, Tomek Korbak and Ethan Perez

30 Mar 2023 14:11 UTC

71 points

3 comments · 10 min read · LW link

Jérémy Scheurer 27 Mar 2023 11:49 UTC
1 point
0
in reply to: Lucius Bushnaq’s comment on: Practical Pitfalls of Causal Scrubbing
Seems to me like this is easily resolved so long as you don’t screw up your book keeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?
Yes, totally agree. We are not claiming here that this is a failure mode of CaSc, and it can “easily” be resolved by making your hypothesis more specific. We are merely pointing out that “In theory, this is a trivial point, but we found that in practice, it is easy to miss this distinction when there is an ‘obvious’ algorithm to implement a given function.”

I don’t know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate loss on the data in the first place? Stamping behaviour down into a one dimensional quantity like that is inevitably going to make behavioural comparison difficult.
You are right that this failure mode is mostly due to reducing the behavior to a single aggregate quantity like the average loss recovered. It can be remedied by looking at the loss on individual samples instead of averaging the metric across the whole dataset. In the footnote, we point out that researchers at Redwood Research have actually also started looking at the per-sample loss instead of the aggregate loss.
CaSc was, however, introduced by looking at the average scrubbed loss (even though they say that this metric is not ideal). Also, in practice, when one iterates on generating hypotheses and testing them with CaSc, it’s more convenient to look at aggregate metrics. We thus think it is useful to have concrete examples that show how this can lead to problems.

Your suggestion of using $D_{KL}$ seems like a useful improvement compared to most metrics. It is, however, still possible that cancellation could occur. Cancellation is mostly due to aggregating a metric over the dataset (e.g., taking the mean) and less due to the specific metric used (although I could imagine that some metrics like $D_{KL}$ could allow for less ambiguity).
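A tiny numeric sketch of the cancellation we have in mind, with made-up per-sample losses (these are not numbers from the post): the scrubbed model is worse on half of the samples and better on the other half, so the average loss is unchanged even though every single sample changed.

```python
def mean(xs):
    return sum(xs) / len(xs)

# Hypothetical per-sample losses of the original and the scrubbed model.
original = [1.0, 2.0, 3.0, 4.0]
scrubbed = [2.0, 1.0, 4.0, 3.0]

# Aggregate view: compare the mean losses, as in "average loss recovered".
aggregate_gap = mean(scrubbed) - mean(original)

# Per-sample view: average absolute change in loss on each individual sample.
per_sample_gap = mean([abs(s - o) for s, o in zip(scrubbed, original)])

print(f"gap between mean losses: {aggregate_gap}")
print(f"mean per-sample gap:     {per_sample_gap}")
```

The aggregate gap is zero here while the per-sample gap is not, which is why looking at individual samples catches cases that the averaged metric hides.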