I just resolved my confusion about CoT monitoring.
My previous confusion: People say that CoT is progress in interpretability, that we now have a window into the model’s thoughts. But why? LLMs are still just as black-boxy as they were before; we still don’t know what happens at the token level, and there’s no reason to think we understand it better just because intermediate results can be viewed as human language.
Deconfusion: Yes, LLMs are still black boxes, but CoT is a step toward interpretability because it improves capabilities without making the black box bigger. In an alternate universe we might instead get those same capabilities from even bigger, even messier LLMs (and I assume interpretability gets harder with size: after all, some small transformers have been fully reverse-engineered, while no frontier-scale model has). Observing the progress of CoT reasoning models is an update away from that universe, which had been my (subjective) default path before this update.
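To make "a window into the model's thoughts" concrete, here is a minimal sketch of what CoT monitoring looks like in practice: the monitor never touches weights or activations, it only reads the intermediate reasoning text the model emits. Everything here is hypothetical for illustration; `generate_with_cot` stands in for whatever API returns a (reasoning trace, final answer) pair, and the keyword filter stands in for a real monitor (a human, a classifier, or another LLM).

```python
from typing import Callable, Tuple

# Hypothetical phrases the monitor would flag in a reasoning trace.
FLAGGED_PHRASES = ["hide this from the user", "pretend to comply"]


def monitor_cot(
    generate_with_cot: Callable[[str], Tuple[str, str]],
    prompt: str,
) -> str:
    """Run the model, inspect only its visible reasoning text, then return the answer."""
    reasoning, answer = generate_with_cot(prompt)
    # The point of the deconfusion: `reasoning` is human-readable text,
    # so a simple check over it needs no access to the black box itself.
    if any(phrase in reasoning.lower() for phrase in FLAGGED_PHRASES):
        raise RuntimeError("CoT monitor flagged the reasoning trace")
    return answer


if __name__ == "__main__":
    # Toy stand-in model that "thinks out loud" before answering.
    def toy_model(prompt: str) -> Tuple[str, str]:
        return ("The user asked 2+2. I will just add the numbers.", "4")

    print(monitor_cot(toy_model, "What is 2+2?"))  # prints "4"
```

The sketch is of course far weaker than real monitoring setups, but it shows why the visibility of intermediate results matters even though the underlying forward passes remain uninterpreted.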