We study to what extent LLMs can verbalize their internal reasoning.
Or, to be more accurate, you study to what extent two quite small, non-reasoning-trained, fairly recent dense models can do this. Whether your results would generalize to other LLMs is quite unclear — I strongly suspect reasoning models might do a lot better at this, perhaps even if they were initially run with CoT off.
I agree that training on more LLMs would be good. However, I would like to note that gpt-oss-20b, one of the models we train on, is a sparse MoE reasoning model, released half a year ago.
My apologies, I stand corrected. Then your model choice makes a lot more sense than I thought, and I am now more surprised by your results.