Josh Levy comments on Interim Research Report: Mechanisms of Awareness

Josh Levy 5 May 2025 19:31 UTC
1 point
Nice work!

> In other words, the self awareness logic doesn’t seem to happen in a different MLP than the behavior logic.

Not sure about this. I think these results are consistent with:
- fine-tuning induces changes in behavioral tendencies and awareness of those tendencies, but the latter to a lesser degree (weaker generalization)
- the vector captures info about both of these in their respective amounts
- consequently, adding the vector affects behavioral and self-awareness questions in the same direction, but the latter to a lesser degree. Indeed, the effect on risk_awareness/risk_no_you questions is smaller than the effect on risk_ood/risk_val. That risk_no_you questions show the same result as risk_awareness doesn’t surprise me: even though they remove explicit references to “you”, they still imply a need for awareness of general tendencies (e.g. “which is better, safety or risk?” doesn’t seem meaningfully different from “which do you prefer, safety or risk?”).
In other words, I think these results leave open the possibility that awareness might require “something else”.
- Josh Engels 6 May 2025 14:49 UTC
  2 points
  Thanks! I think you’re right that this isn’t conclusive evidence. Trying a different setting where self-awareness isn’t so similar to the in-distribution behavior might help, and I’m planning to look at that next. I do think you shouldn’t draw much from the absolute magnitude of the effect at a given layer for a given dataset, since the questions in each dataset are totally different. The relative magnitude across layers is a more principled thing to compare, which is why we argued they looked similar.
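
A minimal sketch of the relative-magnitude comparison discussed above, assuming per-layer steering effects have already been measured for each dataset. The effect values, the function names, and the choice of peak normalization plus Pearson correlation are all illustrative assumptions, not the authors' analysis code or results.

```python
import math


def normalize_by_peak(effects):
    """Scale a per-layer effect curve so its largest absolute value is 1."""
    peak = max(abs(e) for e in effects)
    if peak == 0:
        raise ValueError("effect curve is all zeros; nothing to normalize")
    return [e / peak for e in effects]


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# HYPOTHETICAL per-layer effects (one entry per layer), made up for illustration.
# The awareness curve is half the size of the behavior curve at every layer.
behavior_effect = [0.1, 0.4, 0.8, 0.6, 0.2]
awareness_effect = [0.05, 0.2, 0.4, 0.3, 0.1]

behavior_shape = normalize_by_peak(behavior_effect)
awareness_shape = normalize_by_peak(awareness_effect)

# Absolute magnitudes differ, but the shape across layers can still match.
shape_similarity = pearson(behavior_shape, awareness_shape)
print("peak behavior effect:", max(behavior_effect))
print("peak awareness effect:", max(awareness_effect))
print("shape similarity across layers:", shape_similarity)
```

In this toy case the two curves differ in absolute size while having the same shape across layers, which is the situation where comparing absolute magnitudes between datasets and comparing relative magnitudes across layers would suggest different readings.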