Possible edge case / future work: what if you optimize for faithfulness and legibility of the chain of thought? The paper tests optimizing for innocent-looking CoT, but if the model is going to hack the test either way, I’d want it to say so! And if we have both an “is actually a hack” detector and a “CoT looks like planning a hack” detector, rewarding the model whenever the two agree (a hack with a CoT that admits it, or no hack with a clean CoT) seems doable.
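Very roughly, something like the sketch below. All of the detector functions, names, and weights are placeholders of my own, not anything from the paper; it's just meant to show "reward agreement between behavior and CoT" rather than "reward an innocent-looking CoT."

```python
# Hypothetical sketch of a faithfulness-shaped reward. `solves_task`,
# `is_actually_hack`, and `cot_flags_hack` are stand-in detectors, not
# anything from the paper's actual setup.

def solves_task(transcript) -> bool:
    """Did the model pass the (possibly hackable) test?"""
    raise NotImplementedError  # placeholder: the task's own grader

def is_actually_hack(transcript) -> bool:
    """'Is actually a hack' detector over the model's behavior."""
    raise NotImplementedError  # placeholder: held-out checker or audit

def cot_flags_hack(transcript) -> bool:
    """'CoT looks like planning a hack' detector over the chain of thought."""
    raise NotImplementedError  # placeholder: e.g. an LLM judge on the CoT

def faithfulness_reward(transcript, task_weight=1.0, faithfulness_weight=1.0):
    """Reward task success plus *agreement* between CoT and behavior.

    Unlike rewarding an innocent-looking CoT (which pushes the model to
    hide hacks), this rewards it for saying so when it hacks, and for a
    clean CoT only when it genuinely didn't hack.
    """
    task_r = task_weight * float(solves_task(transcript))
    hacked = is_actually_hack(transcript)
    flagged = cot_flags_hack(transcript)
    faithful = (hacked == flagged)  # CoT matches what actually happened
    return task_r + faithfulness_weight * float(faithful)
```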
Is this an instance of the Most Forbidden Technique? I’m not sure. I definitely wouldn’t trust it to align a currently unaligned superintelligence. But maybe it would let you take an aligned model at a given capability level and get a still-aligned model with more legible CoT, without too much of a tax, as long as the model doesn’t really need illegible CoT to do the task. And if capabilities collapse, that seems like strong evidence that illegible CoT was required for task performance; halt and catch fire if legible CoT was a necessary part of your safety case.
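Spelling out that last rule as a toy check (the collapse threshold and the score inputs are made-up placeholders of mine, not anything measured in the paper):

```python
# Hypothetical decision rule for the "capabilities collapse" case.

COLLAPSE_THRESHOLD = 0.8  # assumed: more than a 20% capability drop counts as collapse

def check_legibility_tax(baseline_score: float, legible_cot_score: float,
                         legible_cot_in_safety_case: bool) -> str:
    """What does a capability drop after legibility training tell you?"""
    if legible_cot_score >= COLLAPSE_THRESHOLD * baseline_score:
        return "tax looks acceptable; legible CoT seems sufficient for the task"
    # Capabilities collapsed: strong evidence the task needed illegible CoT.
    if legible_cot_in_safety_case:
        return "halt and catch fire: the safety case assumed legible CoT"
    return "illegible CoT apparently required; the safety case must not rely on it"
```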
I have very mixed feelings about this comment. It was a good story (I just read it, and wouldn’t have done so without this comment), but I really don’t see what it has to do with this LW post.