Something I’ve not written up but want to push on is the idea that LLM reasoning is ever legible at all (or, put another way, that it’s illegible even when it seems legible to us). I don’t think we have strong evidence that the appearance of legibility in CoT reflects how the LLM is actually thinking. We already see some evidence of Goodharting on CoT to satisfy training, and in the long run we should expect something like steganographic CoT that appears legible to us but actually serves as a channel for illegible reasoning. I suspect we are mostly deluding ourselves if we believe this seeming legibility is real, in which case your concerns about illegible reasoning are all the more pressing!