I’m currently working on something similar in the context of CoT obfuscation, and have written up some early results here. These results are somewhat contrived: the task studied is not very reasoning-heavy, and we gave the model a lot of help via documents describing viable strategies for evading a monitor (future work will address these limitations). Still, I think they serve as good evidence that models have the capability to steer their CoT to meet a constraint. Intuitively, I agree with Yueh Han that the CoT would become less malleable as the task becomes more reasoning-heavy.