Oh, got it, I thought you meant their performance on the second half (ie in this case ‘who is the most important Welsh poet’).
So I assume that after they gave their answer to Prompt A you’d go on to ask them how many ’r’s are in the word strawberry?
If the model tends to do better when given the opportunity to “quiet think”, that is evidence that it is actually capable of thinking about one thing while talking about another.
One piece of evidence for that is papers like ‘Let’s Think Dot by Dot’, although IIRC (70%) evidence from later papers has been mixed on whether models do better with filler tokens.
Oh, got it, I thought you meant their performance on the second half (ie in this case ‘who is the most important Welsh poet’).
So I assume that after they gave their answer to Prompt A you’d go on to ask them how many ’r’s are in the word strawberry?
One piece of evidence for that is papers like ‘Let’s Think Dot by Dot’, although IIRC (70%) evidence from later papers has been mixed on whether models do better with filler tokens.