Related question: how much do we know about models’ awareness as it pertains to ‘correctness’ or ‘truth-seeking’? This paper showed that there seems to be some sort of ‘good-evil’ axis in models (a massive oversimplification, but directionally correct), on which even behaviors seemingly unrelated to morality find their place. In a similar fashion, is it possible to extract a steering vector that treats as true only what is ‘verifiable’? My thinking is that you could use a small-to-medium-sized set of examples to test for the existence of a circuit or vector that is active specifically in cases where the model chose to be lazy or hide the ball to get to “correct” and get rewarded, then try shifting it the opposite way to explore model behavior when it’s not reward hacking. If there’s a similar ‘false-true’ axis in models, you could measure that circuit’s activation while models work in RL environments and detect cases of reward hacking much more easily. Naively I’d expect it to exist because the degrees of freedom in reward hacking seem much higher than in actually solving the problem, i.e. solving the problem is a specific behavior that could be isolated. We don’t really know the shape of latent space, but I would hope this kind of structure exists.
(I say this as a complete layman; I don’t know if it’s possible.)
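To make the idea concrete, here is a minimal sketch of what I have in mind, using a plain difference-of-means direction. Everything in it is an assumption for illustration: the activations are synthetic random vectors with a planted axis rather than anything read out of a real model, and the names (`diff_of_means_direction`, `projection_scores`, `steer`) are hypothetical. It only shows the mechanics of fitting a direction from labeled examples, monitoring the projection onto it, and shifting activations the opposite way.

```python
import numpy as np

def diff_of_means_direction(honest_acts, hacking_acts):
    """Unit vector pointing from the honest mean to the reward-hacking mean."""
    direction = hacking_acts.mean(axis=0) - honest_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)

def projection_scores(acts, direction):
    """Scalar projection of each activation vector onto the direction."""
    return acts @ direction

def steer(acts, direction, alpha):
    """Shift activations along the direction; negative alpha moves away from it."""
    return acts + alpha * direction

# Synthetic stand-in for residual-stream activations (ASSUMPTION: not real model data).
rng = np.random.default_rng(0)
d_model = 64
true_axis = np.zeros(d_model)
true_axis[0] = 1.0  # planted "reward hacking" axis, purely for the toy example

honest = rng.normal(size=(200, d_model))
hacking = rng.normal(size=(200, d_model)) + 3.0 * true_axis

direction = diff_of_means_direction(honest, hacking)

# Monitoring: flag samples whose projection exceeds the midpoint of the two class means.
threshold = 0.5 * (
    projection_scores(honest, direction).mean()
    + projection_scores(hacking, direction).mean()
)
flagged = projection_scores(hacking, direction) > threshold

# Steering: push the hacking activations back toward the honest side.
steered = steer(hacking, direction, alpha=-3.0)
```

In a real setting the two activation sets would have to come from labeled rollouts, and whether a single direction separates them at all is exactly the open question.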
(Additionally, perhaps this paper could be considered evidence against such a possibility, but it seems a little too pat. Even if you can’t isolate a single circuit, there might be a cluster or a family of activations that you can discover or build up over time. I don’t fully believe autolabeling is a mature technology; we could probably come up with a better way to do this.)
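One way the "family of activations" version could look, again only as a sketch on synthetic data: instead of a single vector, take the difference in activations for each honest/hacking pair and keep the top few singular directions as a subspace. The helper names (`fit_subspace`, `subspace_score`), the choice of an uncentered SVD, and the planted two-dimensional structure are all assumptions of mine, not an established method.

```python
import numpy as np

def fit_subspace(paired_diffs, k):
    """Top-k right singular vectors of the (uncentered) difference matrix.

    Returns an orthonormal basis of shape (k, d_model) and the fraction of
    squared singular-value energy that those k directions capture.
    """
    _, s, vt = np.linalg.svd(paired_diffs, full_matrices=False)
    energy = float((s[:k] ** 2).sum() / (s ** 2).sum())
    return vt[:k], energy

def subspace_score(acts, basis):
    """Norm of each activation's projection into the subspace."""
    return np.linalg.norm(acts @ basis.T, axis=1)

# Synthetic paired differences (ASSUMPTION: toy data with two planted directions).
rng = np.random.default_rng(0)
d_model, n_pairs, k = 32, 300, 2
planted = np.eye(d_model)[:k]  # the hypothetical "family" of directions
coeffs = rng.normal(loc=2.0, scale=1.0, size=(n_pairs, k))
paired_diffs = coeffs @ planted + 0.1 * rng.normal(size=(n_pairs, d_model))

basis, energy = fit_subspace(paired_diffs, k)
scores = subspace_score(paired_diffs, basis)
```

If the captured energy stays low for any small `k` on real data, that would be some evidence that there is no compact family to find, at least not a linear one.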
As a human, I can generally tell when I’m lying or not trying very hard. That being said, people do lie to themselves and occasionally feel that they are putting in high effort when they are actually putting in low effort. So I’m not sure there’s a neat breakdown, but as a tool for model monitoring, it might be useful to have some sort of pipeline that reliably extracts the vectors or circuits for the combination of qualities that bubble up to “working hard and truthfully on a problem”.