I wouldn't read too much into the title (it's partly just trying to be punchy), though I do think the connection between the intentional state of deception and its algorithmic representation is a genuine issue.
Re. point 2: possibly this was a bit over-confident. A priori I do think the simple correspondence theory is unlikely to be right, but maybe I should put more weight on 'simple correspondence will just hold for deception'.
Another thing I would do differently if we wrote this again is to be more specific about the kind of deception detector we have in mind; I was mostly thinking of a 'circuit-level' or 'representational' version of mechanistic interpretability (e.g. finding the deception circuit or the deception representation). I think this is sometimes gestured at (e.g. the idea that we need a high level of confidence in the model's internals in order to make progress on deception).
I'm not sure that, for example, a supervised probe for some particular undesirable behaviour (which might well count as a 'deception detector') requires you to solve the correspondence problem.
Although, having said that, even simple probes require some operationalisation of the target state (e.g. 'the model is lying'), which is normally behavioural rather than 'bottom-up' (lying requires believing things, which is an intentional state again).
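To make concrete what I mean by a supervised probe, here's a minimal sketch (the data, layer choice, and labels are all hypothetical placeholders, not from any particular paper): you collect activations from the model on prompts where a behavioural label ('lied' vs. 'didn't lie') has been assigned by judging the model's outputs, then fit a linear classifier on those activations.

```python
# Minimal sketch of a supervised "lie detector" probe (hypothetical setup).
# Assumes you already have a (n_examples, d_model) array of residual-stream
# activations and behavioural labels (1 = the model lied, 0 = it didn't),
# where the labels were assigned by judging the model's *outputs*.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data standing in for real activations and labels.
activations = rng.normal(size=(1000, 512))   # e.g. some layer's residual stream
labels = rng.integers(0, 2, size=1000)       # behavioural 'lying' labels

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# The probe itself is just a linear classifier on the activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print(f"held-out accuracy: {probe.score(X_test, y_test):.2f}")
```

Nothing in this pipeline requires identifying a 'deception circuit' or solving the correspondence problem at the level of internals; but the labels are still a behavioural operationalisation of lying, which is the caveat above.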