Thanks for testing and sharing this. Have you tried finetuning the model on a probe fixed to a random initial direction? Or training the probe and model at the same time? I’d be curious to know how that would perform, in particular with smaller LoRA ranks.
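To make the first variant concrete, here is a minimal sketch of what I mean by "finetuning the model on a probe fixed to a random initial direction" — the probe's weights are frozen at their random initialization and only the model is trained so that its activations align with that direction. The small MLP is a stand-in for a LoRA-adapted transformer, and the toy binary concept is purely illustrative:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the model being finetuned (in practice, a LoRA-adapted network).
hidden = 64
model = nn.Sequential(nn.Linear(32, hidden), nn.ReLU(), nn.Linear(hidden, hidden))

# Linear probe: a single direction in activation space, frozen at random init.
probe = nn.Linear(hidden, 1, bias=False)
for p in probe.parameters():
    p.requires_grad_(False)  # probe direction is never updated

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

# Toy data: binary concept determined by the sign of the first input feature.
x = torch.randn(256, 32)
y = (x[:, 0] > 0).float().unsqueeze(1)

for _ in range(300):
    opt.zero_grad()
    logits = probe(model(x))  # gradients flow only into the model
    loss_fn(logits, y).backward()
    opt.step()

acc = ((probe(model(x)) > 0).float() == y).float().mean().item()
```

For the joint-training variant, one would simply leave `probe`'s parameters trainable and pass them to the optimizer as well.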