If it outperforms ground truth, it’s not just detecting “this response hacks the reward.” It’s detecting something upstream. Planning? Intent? Some kind of “about to do something I shouldn’t” state? And then the accuracy dropping in runs that reward-hacked… did optimization pressure teach the model to route around the probe, or was that just drift?
Curious whether the probe direction correlates with anything interpretable, or if it’s task specific. Would love to look at it if you share.
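On the interpretability question, the kind of check I have in mind is just comparing the probe direction against directions you already have labels for (probes from other tasks, SAE decoder rows, steering vectors). A minimal sketch below, with all filenames and shapes as placeholder assumptions rather than anything from your setup:

```python
import numpy as np

# Hypothetical sketch: compare a learned linear probe direction against a
# bank of candidate directions (other probes, SAE features, etc.) by cosine
# similarity. File names and shapes are placeholders.
probe_direction = np.load("reward_hack_probe.npy")      # shape (d_model,)
candidate_dirs = np.load("candidate_directions.npy")    # shape (n, d_model)
candidate_labels = [f"feature_{i}" for i in range(len(candidate_dirs))]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

# Rank candidates by alignment with the probe direction.
sims = sorted(
    ((cosine(probe_direction, d), label)
     for d, label in zip(candidate_dirs, candidate_labels)),
    reverse=True,
)
for s, label in sims[:10]:
    print(f"{s:+.3f}  {label}")
```

If nothing in the candidate bank aligns, that would itself be weak evidence the direction is task-specific rather than a reusable "intent" feature.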
I did some work on Anthropic’s model organisms looking at features that distinguish concealment from confession ("When Are Concealment Features Learned? And Does the Model Know Who’s Watching?" on LessWrong). Different context, but I keep coming back to your probe result.