I was also very surprised by this. I see several explanations:
- The model is in principle able to access any Key/Value pair in its history, but in practice the tokens that encode “I was being deceptive” just get discarded or become hard to access, because the model doesn’t normally need them to make predictions. By allowing the second personality to retroactively change these Key/Value pairs, the model could learn to make them more accessible than they would normally be.
- The model with the LoRA mask has fewer inference steps available to do all of its work than the one without a mask: the masked version cannot use the tokens before the <sp-token> to do any processing, and more processing steps mean better performance. This difference in compute alone could potentially explain the performance difference. (A minimal sketch of the two variants follows the list.)
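For concreteness, here’s a minimal, hypothetical sketch of what the two variants compute. Everything here is made up for illustration (toy shapes, random weights, and `sp_pos` standing in for the position of the <sp-token>); it just shows that in the masked variant the K/V entries for the history are exactly the base model’s, while in the unmasked variant the LoRA rewrites them retroactively:

```python
import torch

def lora_linear(x, W, A, B, mask=None):
    """Frozen base projection plus a low-rank LoRA delta, optionally
    applied only at positions where mask is True."""
    base = x @ W.T
    delta = (x @ A.T) @ B.T                   # low-rank update B @ A
    if mask is not None:
        delta = delta * mask.unsqueeze(-1)    # zero the delta where mask is False
    return base + delta

x = torch.randn(1, 6, 8)                      # batch 1, seq len 6, hidden dim 8
W = torch.randn(8, 8)                         # frozen base weight
A, B = torch.randn(2, 8), torch.randn(8, 2)   # rank-2 LoRA factors

sp_pos = 3                                    # hypothetical position of <sp-token>
after_sp = torch.arange(x.shape[1]) >= sp_pos

# Masked variant: keys for tokens before <sp-token> are exactly what the
# base model would have produced; only later positions see the LoRA.
k_masked = lora_linear(x, W, A, B, mask=after_sp)

# Unmasked variant: the LoRA retroactively alters every key,
# including the history before <sp-token>.
k_unmasked = lora_linear(x, W, A, B)
```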
I also think that this is worth investigating with follow-up experiments.
I would definitely want to investigate this more, with different sizes of LoRAs and so forth. Other than the KV-cache issue, it isn’t that expensive to do both (or randomly sample), in case there are situations where one approach does better and others where the other does (e.g. CoT obfuscated vs. not). But maybe we’ll find there really is a consistent answer and it’s just unintuitive.
This might also be an interesting thing to apply mech-interp analysis to: for the activations that are altered by the LoRA masking choice, what do the resulting changes look like, and how do they correlate with the content of the tokens they belong to?
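As a hypothetical starting point for that analysis (assuming you’ve cached hidden states for the same prompt from a masked and an unmasked run), a first-pass comparison could be as simple as ranking tokens by how far the masking choice moved their activations:

```python
import torch

def per_token_divergence(h_masked, h_unmasked, tokens):
    """Rank tokens by how much the LoRA masking choice shifted their activations.

    h_masked, h_unmasked: [seq_len, hidden_dim] hidden states from the two runs.
    tokens: decoded token strings, for inspection.
    """
    shifts = (h_masked - h_unmasked).norm(dim=-1)   # L2 shift per token
    order = shifts.argsort(descending=True)
    return [(tokens[i], shifts[i].item()) for i in order]
```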
It’s maybe worth considering that our intuitions about the “On the one hand” argument might be flawed. You could make the case that this “unaltered latent state” ideal is misguided, because the attention mechanism works well precisely because it uses both the K and Q degrees of freedom.
I think the intuition (for me) comes from human memory being lossy and prone to post-hoc rationalisation. Models clearly sometimes do this too. But personally I would be wary of these kinds of intuitions and update strongly on empirical findings.
I see. Your point is that in the LoRA-masked version, we’re using that adjusted Q but the old K, whereas without the LoRA masking, you’re using the new Q and K together. In both cases the LoRA model is trained for the mode it’s being used in, but in the non-LoRA-masked mode it has two tools available to use, while in the LoRA-masked mode it has only one.
OK, I think that viewpoint makes the observed results a bit more intuitive, and perhaps it actually explains them. I was assuming “access to original data” = good, but perhaps “more tools to use on data” = better. Experimentally, the answer appears to be the latter.
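To make the “two tools vs. one” framing concrete, here’s a toy illustration (made-up projections; `dQ` and `dK` stand in for the LoRA deltas to the Q and K projections): in masked mode the model can only steer the query side of the attention dot product, while in unmasked mode it can also rewrite the keys of its own history.

```python
import torch

d = 8
x_old = torch.randn(4, d)                        # tokens before <sp-token>
x_new = torch.randn(1, d)                        # a query token after <sp-token>

Wq, Wk = torch.randn(d, d), torch.randn(d, d)    # frozen base projections
dQ, dK = torch.randn(d, d), torch.randn(d, d)    # stand-ins for LoRA deltas

q = x_new @ (Wq + dQ).T                          # adjusted query (both modes)

# Masked mode: one tool — only the query side can adapt;
# the history’s keys stay as the base model computed them.
scores_masked = q @ (x_old @ Wk.T).T

# Unmasked mode: two tools — the model shapes both sides of the
# dot product, adjusting queries and rewriting the history’s keys.
scores_unmasked = q @ (x_old @ (Wk + dK).T).T
```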
Thanks!
Exactly! And then also a decent amount of probability mass on none of our candidate explanations being remotely close to the truth :)