I think I like a lot of the thinking in the post (e.g. trying to get at what interp methods are good at measuring and what they might not be measuring), but dislike the framing / some particular sentences.
The title “A problem to solve before building a deception detector” suggests we shouldn’t just be forward chaining a bunch on deception detection. I don’t think you’ve really convinced me of this with the post. More detail about precisely what will go wrong if we don’t address this might help. (Apologies if I missed this on my quick read.)
“We expect that we would still not be able to build a reliable deception detector even if we had a lot more interpretability results available.” This sounds poorly calibrated to me (I read it as fairly confident, but feel free to indicate how much credence you place in the claim). If you had said “strategic deception” detector then it would be better calibrated, but even so. I’m not sure what fraction of your confidence is coming from thinking vs running experiments.
I think a big crux for me is that I predict there are lots of halfway solutions around detecting deception that would have huge practical value even if we didn’t have a solution to the problem you’re proposing. Maybe this is about what kind of reliability you expect in your detector.
I wouldn’t read too much into the title (it’s partly just trying to be punchy), though I do think the connection between the intentional state of deception and its algorithmic representation is a real problem.
Re: point 2: possibly this was a bit over-confident. I do think that, a priori, the simple correspondence theory is unlikely to be right, but maybe I should put more weight on ‘simple correspondence will just hold for deception’.
Another thing I might do differently if we wrote this again is to be a bit more specific about the kind of deception detector; I was mostly thinking of a ‘circuit-level’ or ‘representational’ version of mechanistic interpretability here (e.g. working on finding the deception circuit or the deception representation). I think this is sometimes gestured at (e.g. the idea that we need a high level of confidence in the internals of the model in order to make progress on deception).
I’m not sure that, for example, a supervised probe for some particular undesirable behaviour (which might well count as a ‘deception detector’) requires you to solve the correspondence problem.
Although having said that, even simpler probes require some operationalisation of the target state (e.g. the model is lying), which is normally behavioural rather than ‘bottom up’ (lying requires believing things, which is an intentional state again).
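(To make the kind of probe I have in mind a bit more concrete, here’s a rough sketch of what I mean by a supervised probe: a linear classifier fit on one layer’s activations against behavioural ‘did it lie?’ labels. The function, shapes, and synthetic data are just placeholders for illustration, not anything from the post or a real pipeline.)

```python
# Minimal sketch of the kind of supervised probe discussed above: a linear
# classifier trained on hidden activations, with labels that come from a
# *behavioural* operationalisation (did the model lie on this prompt?).
# The activations and labels here are stand-ins, not a real dataset.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def train_lying_probe(activations: np.ndarray, lied: np.ndarray):
    """Fit a linear probe predicting the behavioural label 'lied' from
    a single layer's activations.

    activations: (n_examples, d_model) hidden states at some layer/token
    lied:        (n_examples,) 0/1 labels from a behavioural judgement
    """
    X_train, X_test, y_train, y_test = train_test_split(
        activations, lied, test_size=0.2, random_state=0
    )
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    print(f"held-out accuracy: {probe.score(X_test, y_test):.3f}")
    return probe


if __name__ == "__main__":
    # Toy usage with synthetic data, just to show the shapes involved.
    rng = np.random.default_rng(0)
    acts = rng.normal(size=(200, 64))      # pretend residual-stream activations
    labels = (acts[:, 0] > 0).astype(int)  # pretend behavioural 'lied' labels
    train_lying_probe(acts, labels)

# Nothing in this pipeline requires a solved correspondence between the
# intentional state 'lying' and any internal circuit -- but the labels
# themselves still smuggle in an intentional, behavioural notion of what
# counts as a lie, which is the point made above.
```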