I love this! I’m particularly intrigued by #3, which matches my anecdotal experience with reasoning models that get “overwhelmed” by a task: as your example depicts, there is a period of incoherence, a “stabilization” of sorts, then a successful (or at least sensical) resumption of the task at hand.
I’m very curious to see how CoT research will evolve for LLMs. It seems both incredibly useful and incredibly brittle (as an interpretability/alignment tool) given that models like Opus 4 are entirely capable of completing a complex task while never once referencing it in the CoT (when directed to do so).
Do you have examples of the kinds of models / kinds of questions that lead to this most strongly? I’ve been collecting behaviors, but it’s slow work reading a lot of CoTs, so anything would be welcome :)
Re: questions which lead to this most strongly: my experience has been that giving the model a variety of tasks and constraints across a long context can trigger this somewhat reliably. For example, taking the model through a creative exercise with constraints like a specific kind of poetic meter, then asking it to generalize in an unrelated task, then asking it to (and so on), until the activations are so complex and “muddled” that the model struggles to reason coherently through the established context.
I apologize for the imprecision and hope this can be useful!