I’m curious if you have considered inner optimised honesty adapters as part of this? I’ve been working on exactly this For exactly this purpose: alignment debugging. The idea is that you want lots of uncorrelated ways to check each step for deceptive misalignment.
And ideally it’s a scalable method so unsupervised because this scaled beyond human labels) and it targets representations which I expect to scale well as models get more capable and have better representations, there some empirical support for this.
I think that steering based on honesty, non deception, and credulity, would help catch many of these failure cases. And if it’s steering based on inner optimisation (not part of the train loop, only eval) then it should scale with the scalable alignment method.
p.s. if credulity steering isn’t obvious, it helps ensure that models take your tests seriously
I’m curious if you have considered inner optimised honesty adapters as part of this? I’ve been working on exactly this For exactly this purpose: alignment debugging. The idea is that you want lots of uncorrelated ways to check each step for deceptive misalignment.
And ideally it’s a scalable method so unsupervised because this scaled beyond human labels) and it targets representations which I expect to scale well as models get more capable and have better representations, there some empirical support for this.
I think that steering based on honesty, non deception, and credulity, would help catch many of these failure cases. And if it’s steering based on inner optimisation (not part of the train loop, only eval) then it should scale with the scalable alignment method.
p.s. if credulity steering isn’t obvious, it helps ensure that models take your tests seriously