> So far, no model has been shown to make use of extra test-time compute alone.
The model in *Let's Think Dot by Dot* (Pfau et al., 2024) exclusively uses extra test-time compute, although the authors had to intentionally train it to do that.
> When a CoT passes the post-hoc reasoning test, that means it was necessary for the task at hand – it was a necessary part of the computation that produced the answer.
Another test, necessary but not sufficient to prove this, is to check the attention activations: if the model isn't even looking at the CoT, then it's definitely not using it.
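To make that check concrete, here is a minimal sketch using a Hugging Face causal LM. The model choice, the prompt/CoT/answer split, and the idea of measuring what fraction of the answer tokens' attention mass lands on the CoT span are all illustrative assumptions, not anything established in this thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model choice; any causal LM works. "eager" attention
# so the forward pass can return the attention weights.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, attn_implementation="eager")
model.eval()

# Hypothetical prompt / CoT / answer segments. Tokenizing the segments
# separately can differ slightly from tokenizing the full string at the
# boundaries; that's fine for a rough check like this.
prompt = "Q: What is 17 * 24? A:"
cot = " Let's think step by step. 17 * 20 = 340, 17 * 4 = 68, 340 + 68 = 408."
answer = " The answer is 408."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
cot_ids = tokenizer(cot, return_tensors="pt").input_ids
answer_ids = tokenizer(answer, return_tensors="pt").input_ids

input_ids = torch.cat([prompt_ids, cot_ids, answer_ids], dim=-1)
cot_start = prompt_ids.shape[-1]
cot_end = cot_start + cot_ids.shape[-1]  # answer tokens start here

with torch.no_grad():
    out = model(input_ids, output_attentions=True)

# out.attentions is a tuple of (batch, heads, seq, seq), one per layer.
# Average over layers and heads to get a single (seq, seq) map.
attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]

# Of the attention leaving the answer tokens, how much lands on the CoT?
mass_on_cot = attn[cot_end:, cot_start:cot_end].sum()
total_mass = attn[cot_end:, :].sum()
print(f"answer -> CoT attention fraction: {(mass_on_cot / total_mass).item():.3f}")

# A near-zero fraction suggests the model isn't even looking at the CoT.
# A high fraction does NOT prove the CoT is load-bearing: the test is
# necessary, not sufficient.
```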
Thanks for writing this. I’m pretty pessimistic on this approach, but I think this is a really good summary.
Thanks!
Updated the relevant line to:

> So far, only one model has been shown to make use of extra test-time compute (Pfau et al., 2024), while post-hoc reasoning shows up frequently, and encoded reasoning has been repeatedly elicited by researchers in experimental settings.