eggsyntax comments on A Problem to Solve Before Building a Deception Detector

eggsyntax 12 Feb 2025 1:42 UTC
3 points
I think premise 1 is big if true, but I think I doubt that it is as easy as this: see the DeepMind fact-finding sequence for some counter-evidence.
I haven’t read that sequence; I’ll check it out, thanks. I’m thinking of work like the ROME paper from David Bau’s lab, which suggests that fact storage can be identified and edited, and various papers like this one from Mor Geva et al. that find evidence that the MLP layers in LLMs are largely key-value stores.
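For readers unfamiliar with the key-value framing, here is a toy sketch of what it means. This is an illustration only, not code from either paper: the function name `mlp_as_memory`, the tiny hand-picked matrices, and the use of ReLU are all assumptions made for the example. The idea is that each row of the first MLP matrix acts as a "key" matched against the input, and the resulting activation weights the corresponding "value" row of the second matrix.

```python
def relu(x):
    return max(0.0, x)


def mlp_as_memory(x, keys, values):
    """Toy feed-forward layer read as a key-value memory.

    out = sum_i relu(x . k_i) * v_i
    `keys` and `values` are hypothetical stand-ins for the two MLP weight matrices.
    """
    # How strongly the input matches each key (zero if the match is negative).
    coeffs = [relu(sum(xj * kj for xj, kj in zip(x, k))) for k in keys]
    # The output is the coefficient-weighted sum of the value vectors.
    dim = len(values[0])
    out = [sum(c * v[d] for c, v in zip(coeffs, values)) for d in range(dim)]
    return coeffs, out


# Two memory slots: the input matches the first key and not the second.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
coeffs, out = mlp_as_memory([1.0, -1.0], keys, values)
print(coeffs, out)
```

In this picture, "editing a fact" roughly corresponds to changing the value vector that a given key retrieves, though real models are far messier than two orthogonal slots.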
Relatedly, your second bullet point assumes that you can identify the ‘fact’ related to what the model is currently outputting unambiguously, and look it up in the model; does this require you to find all the fact representations in advance, or is this computed on-the-fly?
It does seem like a naive approach would require pre-identifying all facts you wanted to track. On the other hand, I can imagine an approach like analyzing the output for factual claims and then searching for those in the record of activations during the output. Not sure, seems very TBD.
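To make the on-the-fly idea slightly more concrete, here is a very rough sketch under heavy assumptions. Everything in it is hypothetical: it assumes claims have already been extracted from the output along with their token spans, that per-token activations were recorded as plain vectors, and that some linear "truth direction" (`probe_direction`) exists and is known. None of these is established by the discussion above, and whether such a direction exists is essentially premise 1 itself.

```python
def probe_score(activation, probe_direction):
    # Dot product with a hypothetical "truth direction".
    return sum(a * w for a, w in zip(activation, probe_direction))


def flag_claims(claims, activation_record, probe_direction, threshold=0.0):
    """Return the claims whose mean activation scores below `threshold`.

    claims: list of (text, (start, end)) token spans, assumed pre-extracted.
    activation_record: list of per-token activation vectors (assumed format).
    """
    flagged = []
    for text, (start, end) in claims:
        span = activation_record[start:end]
        if not span:
            continue  # no recorded activations for this claim
        # Average the recorded activations over the claim's tokens.
        mean = [sum(col) / len(span) for col in zip(*span)]
        if probe_score(mean, probe_direction) < threshold:
            flagged.append(text)
    return flagged


# Made-up activations: the first two tokens point along the probe, the last two against it.
activation_record = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
claims = [("claim A", (0, 2)), ("claim B", (2, 4))]
flagged = flag_claims(claims, activation_record, probe_direction=[1.0, 0.0])
print(flagged)
```

The point of the sketch is only that this variant would not need every fact to be identified in advance; the hard parts (claim extraction, and whether any such probe generalizes) are exactly what remains TBD.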
I think that detecting/preventing models from knowingly lying would be a good research direction and it’s clearly related to strategic deception, but I’m not actually sure that it’s a superset (consider a case when I’m bullshitting you rather than lying; I predict what you want to hear me say and I say it, and I don’t know or care whether what I’m saying is true or false or whatever).
Great point! I can certainly imagine that there could be cases like that, although I can equally imagine that LLMs could be consistently tracking the truth value of claims even if that isn’t a big factor determining the output.
but yeah I think this is a reasonable sort of thing to try, but I think you need to do a lot of work to convince me of premise 1, and indeed I think I doubt premise 1 is true a priori though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim to it being true of every fact!
That seems reasonable. I’ve mostly had the impression that 1 has generally been true in specific cases where researchers have looked for it, but it’s definitely not something I’ve specifically gone looking for. I’ll be interested to read the sequence from DeepMind.