I would definitely want to investigate this more, with different sizes of LoRAs and so forth. Other than the KV-cache issue, it isn't that expensive to do both (or randomly sample), in case there are situations where one approach does better and others where the other does (e.g. CoT obfuscated vs. not). But maybe we'll find there really is a consistent answer and it's just unintuitive.
This might also be an interesting thing to apply mech-interp analysis to: for the activations that are altered by the LoRA masking choice, what do the resulting activation changes look like, and how does that correlate with the content of the tokens they’re for?
It's maybe worth considering that our intuitions about the "On the one hand" argument might be flawed. You could make the case that this "unaltered latent state" ideal is misguided, because the attention mechanism works well precisely by using both the K and Q degrees of freedom.
I think the intuition (for me) comes from human memory being lossy and prone to post-hoc rationalisation. Models clearly sometimes do this too. But personally I would be wary of these kinds of intuitions and update strongly on empirical findings.
I see. Your point is that in the LoRA-masked version, we're using the adjusted Q but the old K, whereas without the LoRA masking, we're using the new Q and K together. In both cases the LoRA model is trained for the mode it's being used in, but in the non-LoRA-masked mode it has two tools available to use; in the LoRA-masked version, only one.
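To make the two modes concrete, here's a minimal numpy sketch (all weights are random placeholders, and the single-head attention is deliberately simplified): in the "masked" mode the LoRA delta only reaches the query side, while the keys come from the base weights, so the adapter has one degree of freedom for reshaping the attention pattern instead of two.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, T = 8, 2, 4  # model dim, LoRA rank, sequence length

# Frozen base projections and trained low-rank LoRA deltas (all random here).
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
Aq, Bq = rng.normal(size=(d, r)), rng.normal(size=(r, d))
Ak, Bk = rng.normal(size=(d, r)), rng.normal(size=(r, d))

x = rng.normal(size=(T, d))  # token activations

def attention_weights(q, k):
    """Row-wise softmax of scaled dot-product scores."""
    scores = q @ k.T / np.sqrt(d)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return w / w.sum(axis=-1, keepdims=True)

# LoRA-masked mode: adapted Q attends over keys computed with the *base*
# weights only (as if the keys were cached before the LoRA was active).
q_lora = x @ (Wq + Aq @ Bq)
k_base = x @ Wk
attn_masked = attention_weights(q_lora, k_base)

# Unmasked mode: both Q and K carry their LoRA deltas, so the adapter can
# reshape the attention pattern from both sides at once.
k_lora = x @ (Wk + Ak @ Bk)
attn_full = attention_weights(q_lora, k_lora)
```

The two modes produce different attention patterns over the same tokens, which is the extra "tool" the unmasked version gets to exploit during training.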
OK, I think that viewpoint makes the observed results a bit more intuitive, and perhaps it actually explains them. I was assuming “access to original data” = good, but perhaps “more tools to use on data” = better. Experimentally, the answer appears to be the latter.
Thanks!
Exactly! And then also a decent amount of probability mass on none of our candidate explanations being remotely close to the truth :)