To begin talking about volume, you first need to really understand what space is.
No, stop it, this is a terrible approach to math education. “Ok kids, today we’re learning about the area of a circle. First, recall the definition of a manifold.” No!!
Thanks for the very clear reply. I like your decomposition of CAFT into (1) ablating activations and (2) ablating gradients.
I predict that for the model where only-ablate-activations performed well, the ablation is actually steering towards misalignment (whereas for the other model, zero-ablation is steering away from, or orthogonally to, misalignment). I’ll check this at some point in the next week or so.
I think the general technique of train-time steering seems promising enough that it’s worth figuring out best practices. These best practices might depend on the task:
If the base model already contains circuitry that reads from the steering direction $\vec{v}$ and produces the bad behavior (as is the case with EM), then preventative steering seems to make sense.
But if the base model does not already have that circuitry, perhaps we should just mean-ablate the $\vec{v}$ direction, making it harder for the bad circuitry to be learnt in the first place.
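To make the two options concrete, here is a minimal sketch of both interventions on a single activation vector. It is only an illustration under assumptions: the function names (`mean_ablate`, `steer`), the `alpha` and `mean_coeff` parameters, and the reading of "preventative steering" as adding a multiple of the direction during training are all hypothetical, and none of it is taken from the CAFT code.

```python
import math

def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_ablate(h, v, mean_coeff):
    """Replace h's component along v with a fixed coefficient.

    mean_coeff is assumed to be the average projection onto v over some
    reference dataset; mean_coeff=0.0 gives zero-ablation.
    """
    u = unit(v)
    coeff = dot(h, u)
    return [x + (mean_coeff - coeff) * ui for x, ui in zip(h, u)]

def steer(h, v, alpha):
    """Add alpha times the unit direction to h (assumed form of steering)."""
    u = unit(v)
    return [x + alpha * ui for x, ui in zip(h, u)]

# Toy example: a 2-d activation and a direction along the first axis.
h = [3.0, 4.0]
v = [2.0, 0.0]

ablated = mean_ablate(h, v, mean_coeff=1.0)  # component along v is pinned to 1.0
steered = steer(h, v, alpha=2.0)             # component along v is shifted by 2.0
```

The difference that matters for the speculation above: ablation makes the component along the direction constant, so downstream weights get no input-dependent signal from it, while steering shifts that component but leaves its variation across inputs intact.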
Ofc, this is pure speculation. I just want to highlight the fact that there seems to be a lot still to be understood.