Thanks for the very clear reply. I like your decomposition of CAFT into (1) ablating activations and (2) ablating gradients.
To isolate effect (1), we can test a variant of CAFT where we detach the projection before subtracting it from the activations, so that the gradients through the subspace are not ablated.
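For concreteness, here's a minimal PyTorch sketch of that variant (the function name and framing are mine, not from the paper): the only change from full ablation is a `.detach()` on the projection, which keeps the forward-pass ablation (effect 1) while letting gradients flow through the subspace (removing effect 2).

```python
import torch

def ablate_direction(acts, v, detach_projection=False):
    """Zero-ablate the component of `acts` along unit direction `v`.

    With detach_projection=True, the projection is treated as a constant,
    so the activations are still ablated in the forward pass but gradients
    through the ablated subspace are left intact.
    """
    coef = acts @ v                   # per-example projection coefficients
    proj = coef.unsqueeze(-1) * v     # component of acts along v
    if detach_projection:
        proj = proj.detach()          # ablate activations, not gradients
    return acts - proj
```

With `detach_projection=True`, backprop through the ablated activations gives the same gradient as if no ablation had happened; with `False`, the gradient component along `v` is zeroed as well.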
...
We ran this experiment on the two models used in the paper: for one of them, this variant performs slightly worse than, but comparably to, CAFT; for the other, it doesn’t reduce misalignment at all and looks similar to regular finetuning.
I predict that for the model where only-ablate-activations performed well, the ablation is actually steering towards misalignment (whereas for the other model, zero-ablation is steering away from, or orthogonally to, misalignment). I’ll check this at some point in the next week or so.
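A cheap way to run that check, assuming we have a candidate misalignment direction (`misalign_dir` below is hypothetical, e.g. a contrastive direction extracted from aligned vs. misaligned completions): zero-ablating along a unit vector `v` adds, on average, `-mean(acts @ v) * v` to the activations, so the sign of that effective steering vector's cosine with the misalignment direction says whether the ablation steers towards, away from, or orthogonally to misalignment. A numpy sketch:

```python
import numpy as np

def ablation_steering_cosine(acts, v, misalign_dir):
    """Cosine between zero-ablation's average effective steering vector
    and a candidate misalignment direction.

    Ablation subtracts (acts @ v) * v, i.e. on average it adds
    -mean(acts @ v) * v to the activations.
    """
    steer = -(acts @ v).mean() * v
    denom = np.linalg.norm(steer) * np.linalg.norm(misalign_dir)
    return float(steer @ misalign_dir / (denom + 1e-12))
```

A clearly positive value would support the prediction that the ablation is steering towards misalignment; near-zero or negative would put that model in the away-from / orthogonal regime.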
I think the general technique of train-time steering is promising enough that it’s worth figuring out best practices. These best practices might depend on the task:
If the base model already contains circuitry that reads from the steering direction and produces the bad behavior (as is the case with EM) then preventative steering seems to make sense.
But if the base model does not already have that circuitry, perhaps we should just mean-ablate the direction, making it harder for the bad circuitry to be learnt in the first place.
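To make the contrast concrete, here's a sketch of the two interventions in numpy (`mean_coef` would be the mean projection coefficient over some reference distribution; the names are mine, not the paper's):

```python
import numpy as np

def zero_ablate(acts, v):
    # Remove the component along unit direction v entirely (CAFT-style).
    # Effectively steers by -(acts @ v) * v.
    return acts - (acts @ v)[:, None] * v

def mean_ablate(acts, v, mean_coef):
    # Replace the component along v with its mean over reference data,
    # so finetuning can't exploit variation along v to learn the circuit.
    return acts + (mean_coef - acts @ v)[:, None] * v
```

Zero-ablation leaves no signal along `v`; mean-ablation pins the signal at its typical value, which is the weaker intervention suggested above for the case where the bad circuitry doesn't exist yet.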
Ofc, this is pure speculation. I just want to highlight that there still seems to be a lot left to understand.
Yes.
Well noticed. Yeah, I admit this is confusing. Under the natural interpretation of “bad behavior” as meaning “writing the expanded derivative anywhere”, it’s true that the bad behavior was successfully removed. Therefore we interpret the polynomial environment as an example where training against an output monitor caused a big change to the sort of reasoning being done in the CoT. As you note, it is not an example of obfuscation.
In our ICLR revision, we clarified this a bit. I think the fact that “obfuscated” appears in the paper’s title may lead others to the same, very reasonable confusion you had, which is a shame, but I’m not sure what to do about it.
(There’s an alternate, less natural interpretation of “bad behavior” as meaning “factoring the derivative at all”. Under this interpretation, the polynomial environment does give an example of obfuscation, since the model continues to factor the polynomial, just in a way our monitor can’t detect. However, this interpretation of “bad behavior” is weird, since it’s precisely the thing we’re training for! That is, it’s weird to reward the model for factoring the derivative but then turn around and call it “bad behavior”.)