Prudhviraj Naidu comments on Deception Channeling: Training Models to Always Verbalize Alignment Faking

Prudhviraj Naidu 18 Feb 2026 1:36 UTC
2 points
1
Based on a cursory read of this post, it seems related to “Training LLMs for Honesty via Confessions”: https://arxiv.org/abs/2512.08093
- Florian_Dietz 18 Feb 2026 9:57 UTC
  1 point
  0
  Yes, this is related! They propose a general approach, whereas I write about alignment faking specifically.