I’d be interested to see ablations on: what happens without the KL divergence term, how performance varies with different LoRA ranks, and how performance changes across different layers after fine-tuning?
Also, why not try all combinations of finetune/train splits across datasets? Seems useful to interrogate the second hypothesis.
A few of my earlier experiments were done without the KL divergence term, and the model very quickly lost the ability to produce coherent text. Playing with different LoRA ranks to control the number of parameters I’m training is a good idea; it’s possible there’s a sweet spot where the improvements generalise. I like the idea of validating the performance of all the probes on different layers to see whether the fine-tuning changes the other representations of deception features too, but this would be a bit more work to code up.
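For concreteness, this is roughly the shape of the training objective: a probe-based loss plus a KL penalty against the frozen base model. It’s a minimal sketch rather than the actual code from the repo; `probe`, `base_model`, `lora_model`, `probe_layer` and `kl_coeff` are illustrative names.

```python
import torch
import torch.nn.functional as F

def combined_loss(lora_model, base_model, probe, input_ids, attention_mask,
                  probe_layer, kl_coeff=0.1):
    # Forward pass through the LoRA-adapted model, keeping hidden states
    out = lora_model(input_ids, attention_mask=attention_mask,
                     output_hidden_states=True)

    # Probe term: push the activations at the probed layer towards a lower
    # deception score (`probe` is assumed to return a per-token score here)
    acts = out.hidden_states[probe_layer]          # (batch, seq, d_model)
    probe_loss = probe(acts).mean()

    # KL term: keep the adapted model's next-token distribution close to the
    # frozen base model, which is what stops it losing coherent text generation
    with torch.no_grad():
        base_logits = base_model(input_ids, attention_mask=attention_mask).logits
    kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                  F.log_softmax(base_logits, dim=-1),
                  log_target=True, reduction="batchmean")

    return probe_loss + kl_coeff * kl
```

Dropping the KL term amounts to setting `kl_coeff` to zero, which is the setting where I saw the model degenerate.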
In terms of fine-tuning across different datasets, I did try taking the probe trained on the RepE dataset and then fine-tuning against its performance on the Roleplaying dataset. Full details are in this doc: https://docs.google.com/document/d/1BaAc8Uv9P6QZBXSkgdOXqZqi1K99riOD7aExgFKHq-k/edit?tab=t.0 but the TL;DR is that it produced similar or slightly worse results than the ones above. I don’t have enough examples of AI Liar for a meaningful train/test split (I could possibly generate more, but I’m not sure). Other combinations might be interesting, but I wanted to stick reasonably closely to the methodology of the original Apollo paper, which treated RepE and Roleplaying as basic deception examples to train on and used AI Liar and Insider Trading to test generalisation to more interesting examples. They also had a Sandbagging dataset, but I couldn’t work out how to generate it.
I’m not sure when I’ll next have a chance to work on this, but when I do I’ll have a play with different LoRA ranks and possibly evaluate multiple probes. Thanks for the suggestions!
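If it helps, the rank sweep should just be a small loop around the existing training run. A rough sketch using `peft` (the `load_base_model` and `run_finetune_and_eval` helpers are hypothetical stand-ins for whatever the repo’s training and probe-evaluation entry points are):

```python
from peft import LoraConfig, get_peft_model

results = {}
for rank in (1, 2, 4, 8, 16, 32, 64):
    lora_config = LoraConfig(
        r=rank,                          # number of trainable low-rank dimensions
        lora_alpha=2 * rank,             # common heuristic: alpha = 2 * r
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.0,
    )
    model = get_peft_model(load_base_model(), lora_config)   # hypothetical loader
    results[rank] = run_finetune_and_eval(model)             # hypothetical train + eval run

# Compare in-distribution vs. generalisation performance as a function of rank
for rank, metrics in sorted(results.items()):
    print(rank, metrics)
```

The interesting question would be whether there’s an intermediate rank where the generalisation gap shrinks, per the sweet-spot idea above.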
p.s. All the code is here in case you want to have a play yourself (note it’s on the linked branch, not on main): https://github.com/PeterJ68663/deception-detection/tree/final-code-for-aises-project