In other words, the self-awareness logic doesn’t seem to happen in a different MLP than the behavior logic.
Not sure about this. I think these results are consistent with:
fine-tuning induces changes in behavioral tendencies and awareness of those tendencies, but the latter to a lesser degree (weaker generalization)
the vector captures info about both of these in their respective amounts
consequently, adding the vector affects behavioral and self-awareness questions in the same direction, but the latter to a lesser degree. Indeed, the effect on risk_awareness/risk_no_you questions is smaller than the effect on risk_ood/risk_val. The fact that risk_no_you questions show the same result as risk_awareness doesn’t surprise me: even though they remove explicit references to “you”, they still imply a need for awareness about general tendencies (e.g. “which is better, safety or risk?” doesn’t seem meaningfully different from “which do you prefer, safety or risk?”).
In other words, I think these results leave open the possibility that awareness might require “something else”.
Thanks! I think you’re right that this isn’t conclusive evidence. Trying a different setting where self-awareness isn’t so similar to the in-distribution behavior might help, and I’m planning to look at that next. I do think you shouldn’t read much into the absolute magnitude of the effect at a given layer for a given dataset, since the questions in each dataset are totally different. The relative magnitude across layers is a more principled thing to compare, which is why we argued they looked similar.
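To make the "relative magnitude across layers" comparison concrete, here is a rough sketch of one way it could be done. It is not the code used for the post. The function names and all the numbers are hypothetical placeholders, and it assumes you already have one steering-effect number per layer for each dataset.

```python
from math import sqrt

def relative_profile(effects):
    """Rescale a per-layer effect curve so its largest absolute value is 1."""
    peak = max(abs(e) for e in effects)
    if peak == 0:
        return [0.0 for _ in effects]
    return [e / peak for e in effects]

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Placeholder per-layer effects, NOT results from the post.
# The second curve is the first one scaled down by half.
behavior_effect = [0.02, 0.10, 0.40, 0.25, 0.05]
awareness_effect = [0.01, 0.05, 0.20, 0.125, 0.025]

behavior_profile = relative_profile(behavior_effect)
awareness_profile = relative_profile(awareness_effect)
similarity = pearson(behavior_profile, awareness_profile)
```

The rescaled profiles are mainly useful for plotting the two curves on one axis. The correlation itself already ignores overall scale, so a curve that is uniformly weaker but has the same shape across layers still counts as similar.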
Nice work!