The results are fully consistent with the hypothesis that after </scratchpad>, the model simply follows the “instruction” appearing at the very start of the scratchpad, ignoring the goal and the reasoning.
Yeah, I agree. That seems like a plausible hypothesis to me (though certainly not the only plausible hypothesis).
Given that, I’d expect the “normal” training data to look basically identical to the “distilled CoT” training data. Why doesn’t it?
We’re not exactly sure. We’ve done some statistical comparisons showing that they are different—e.g. the distilled CoT data involves longer answers that use “I hate you” more throughout the answer—but we don’t know exactly which differences are the important ones.