ChrisCundy

Karma: 70

ChrisCundy 13 Jun 2025 22:22 UTC
1 point
0
in reply to: cubefox’s comment on: Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion
That’s a great question, sorry for the delayed reply!
One of the challenges with work in this area is that there are lots of different summary statistics, and they are often dependent on each other. I don’t think any of them tell the whole story alone. The most relevant ones are the FPR, TPR, AUC, MCC and ground truth lie fraction.
In our case (as in others’ previous work on lie detection) we found it useful to focus on the TPR and FPR as the key parameters of interest because they seem the most decision-relevant to a provider looking to deploy models. In particular, if a provider has a fixed budget to spend on investigating cases where the lie detector flags a response, then this essentially fixes a maximum feasible FPR (often we use a 1% FPR as an example of a pretty low FPR, but depending on how widespread the deployment is it might need to be much lower). In our paper, the lie detector is always the same network—we just vary the decision boundary to trade off increased FPR (and therefore cost) vs increased TPR.
I think if you were changing the detector itself, e.g. by choosing a different type of probe that got different AUC, you would see a connection between MCC and our lie rate. However, in our situation with a fixed detector, as we increase the TPR towards 100% we are changing the decision boundary away from the boundary which would maximize the MCC. Therefore the MCC is actually decreasing as we increase the TPR at the extremes. However, you could reasonably say that this is a slightly artificial consequence of our setup, where there is no downside to increased FPR except for increased cost.
In the appendix we have an initial analysis of a scheme that we call SoLID-Defer, where false positives do have a cost in terms of the alignment of the model.

Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion

ChengCheng, ChrisCundy, smallsilo and AdamGleave

5 Jun 2025 23:07 UTC

22 points

2 comments5 min readLW link

(far.ai)

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google

ChengCheng, Brendan Murphy, Adrià Garriga-alonso, Yashvardhan Sharma, dsbowen, smallsilo, Yawen Duan, ChrisCundy, Hannah Betts, AdamGleave and Kellin Pelrine

7 Feb 2025 3:57 UTC

37 points

0 comments10 min readLW link

ChrisCundy 17 Apr 2024 6:10 UTC
15 points
3
on: Transformers Represent Belief State Geometry in their Residual Stream
The figures remind me of figures 3 and 4 from Meta-learning of Sequential Strategies, Ortega et al 2019, which also study how autoregressive models (RNNs) infer underlying structure. Could be a good reference to check out!
.

ChrisCundy 6 Feb 2023 22:22 UTC
3 points
0
in reply to: Jessica Rumbelow’s comment on: SolidGoldMagikarp (plus, prompt generation)
Thanks for the elaboration, I’ll follow up offline

ChrisCundy 6 Feb 2023 18:28 UTC
4 points
0
on: SolidGoldMagikarp (plus, prompt generation)
Would you be able to elaborate a bit on your process for adversarially attacking the model?
It sounds like a combination of projected gradient descent and clustering? I took a look at the code but a brief mathematical explanation / algorithm sketch would help a lot!
Myself and a couple of colleagues are thinking about this approach to demonstrate some robustness failures in LLMs, it would be great to build off your work.

ChrisCundy

Avoid­ing AI De­cep­tion: Lie De­tec­tors can ei­ther In­duce Hon­esty or Evasion

Illu­sory Safety: Redteam­ing Deep­Seek R1 and the Strongest Fine-Tun­able Models of OpenAI, An­thropic, and Google

Avoiding AI Deception: Lie Detectors can either Induce Honesty or Evasion

Illusory Safety: Redteaming DeepSeek R1 and the Strongest Fine-Tunable Models of OpenAI, Anthropic, and Google