Seems related to “Training LLMs for Honesty via Confessions”: https://arxiv.org/abs/2512.08093 based on a cursory read of this post.
Yes, this is related! They propose a general approach, whereas I write about alignment faking specifically.