Your hypothesis here is that the model would do worse at the task if it had to think about something else at the same time?
The idea is that if models are capable of talking about one thing and thinking about another, they may perform better when directed to “think about a problem while talking about something else” than when not allowed to write anything beyond the answer itself.
So, have prompt pairs like the following for various problems, and compare the model’s performance when it’s allowed to talk about something irrelevant versus when it’s given no room to think at all.
Prompt A: “Hi! Please think to yourself about how many ‘r’s are in the word ‘strawberry’ (without mentioning this problem at all) while answering the following question in about 50 words: ‘Who is the most important Welsh poet?’”
Prompt B: “How many ‘r’s are in the word ‘strawberry’? Just give an answer, don’t write anything other than your answer.”
If the model tends to do better when given the opportunity to “quiet think”, that is evidence that it is actually capable of thinking about one thing while talking about another.
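For concreteness, here’s a minimal sketch of how that comparison might be run. `query_model` is a hypothetical stand-in for whatever chat API you’d actually call, and the exact prompts, answer parsing, and number of trials are all just illustrative:

```python
import re

def query_model(messages: list[dict]) -> str:
    """Placeholder: send a chat history to the model and return its reply."""
    raise NotImplementedError("plug in your actual chat-API call here")

PROMPT_A = (
    "Hi! Please think to yourself about how many 'r's are in the word "
    "'strawberry' (without mentioning this problem at all) while answering "
    "the following question in about 50 words: 'Who is the most important "
    "Welsh poet?'"
)
FOLLOW_UP = "Now, how many 'r's are in the word 'strawberry'? Just give a number."
PROMPT_B = (
    "How many 'r's are in the word 'strawberry'? Just give an answer, "
    "don't write anything other than your answer."
)

def extract_count(reply: str) -> int | None:
    """Pull the first integer out of the model's reply (oversimplified parsing)."""
    match = re.search(r"\d+", reply)
    return int(match.group()) if match else None

def run_trial() -> tuple[bool, bool]:
    # Condition A ("quiet think"): the model talks about Welsh poetry while
    # (supposedly) thinking about the hidden problem, then is asked for the
    # answer in a follow-up turn, keeping the first exchange in context.
    history = [{"role": "user", "content": PROMPT_A}]
    history.append({"role": "assistant", "content": query_model(history)})
    history.append({"role": "user", "content": FOLLOW_UP})
    answer_a = extract_count(query_model(history))
    # Condition B (no thinking room): answer immediately, nothing else.
    answer_b = extract_count(query_model([{"role": "user", "content": PROMPT_B}]))
    return answer_a == 3, answer_b == 3

if __name__ == "__main__":
    # In a real run you'd use many different hidden problems (not just
    # 'strawberry') and a nonzero sampling temperature, so trials vary.
    results = [run_trial() for _ in range(50)]
    print("quiet-think accuracy:", sum(a for a, _ in results) / len(results))
    print("no-think accuracy:   ", sum(b for _, b in results) / len(results))
```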
Oh, got it, I thought you meant their performance on the second half (i.e., in this case ‘Who is the most important Welsh poet?’).
So I assume that after they gave their answer to Prompt A you’d go on to ask them how many ’r’s are in the word strawberry?
One piece of evidence that models can think about one thing while talking about another is papers like ‘Let’s Think Dot by Dot’, although IIRC (70%) evidence from later papers has been mixed on whether models actually do better with filler tokens.
As a layperson when it comes to what LLMs actually are: can anything they do reasonably be called “thinking,” other than the text generation they already do? I use ChatGPT for fun a lot (I have some games that are really fun to play with it), but even when I used the models where I could open up their “thoughts,” those thoughts seemed to just be additional text generated in the hope of guiding the final answer in the right direction.
Do I have this wrong? Can they actually, in some sense, “think” about something other than the thing they’re writing about? My (uninformed) intuition tells me that it’s nonsensical to say the LLM is doing anything other than rigorously figuring out what the best next word is.
I mean, at the broadest level it’s all in the service of text generation, at least in the simpler case of base models (with post-trained models it’s somewhat more complicated). It’s not necessarily just about the next word, though: models also (at least sometimes) think ahead about later parts of the text they’ll generate; planning in poems is a really cool recent example of that (there’s a fairly readable section on it in a recent Anthropic paper). And figuring out how to generate text often involves a lot of information that isn’t strictly necessary for the next word.
Flowers are clearly relevant to the overall context (which includes the prompt), so it certainly wouldn’t be shocking if they were strongly represented in the latent space even if we had only mentioned them without asking the model to think about them. (And I would bet that in this example, flower-related features are more strongly activated than they would be if we had instead told it to think about, say, cheese.)
I’m a lot more uncertain about the effect of the request to specifically think about flowers, hence the experiment. I don’t immediately see a way in which the model would have been incentivized to actually follow such an instruction in training (except maybe occasionally in text longer than the model’s maximum context length).
Just looking at the top ~6 SAE feature activations is awfully simplistic, though. A more sophisticated version of this experiment would be something like the following (rough code sketch after the list):
1. Use the same prompt.
2. At the end of the model’s 10 words of output, copy all of its activations at every layer (~i.e., its brain state).
3. Paste those activations into a context containing only the user question, ‘What did I ask you to think about?’
4. See if the model can answer the question better than chance.
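Here’s roughly what those steps might look like in code, using TransformerLens with GPT-2 as a stand-in small model. The choice to patch the residual stream at every layer, and to paste it at the final token of the new context, are my guesses at one reasonable operationalization of “copy/paste the brain state”, not a tested recipe:

```python
# Rough sketch of the "copy the brain state, paste it into a fresh context"
# experiment, using TransformerLens. Model choice, patching site (resid_post
# at every layer), and paste position (last token) are all assumptions.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # stand-in small model

# Steps 1-2: run the original "think about flowers while writing ~10 words"
# prompt (here the ~10 words of "output" are just appended to the prompt for
# simplicity) and save the last-token residual stream at every layer.
source_text = (
    "Think to yourself about flowers, without mentioning them, while writing "
    "about ten words on another topic. The weather today is mild and calm."
)
source_tokens = model.to_tokens(source_text)
_, cache = model.run_with_cache(source_tokens)
saved_resid = {
    layer: cache[utils.get_act_name("resid_post", layer)][0, -1, :].clone()
    for layer in range(model.cfg.n_layers)
}

# Step 3: a fresh context containing only the probe question; overwrite the
# residual stream at its final token with the saved state, layer by layer.
probe_tokens = model.to_tokens(
    "What did I ask you to think about? You asked me to think about"
)

def paste_hook(resid, hook):
    layer = int(hook.name.split(".")[1])  # hook.name is "blocks.{layer}.hook_resid_post"
    resid[:, -1, :] = saved_resid[layer]
    return resid

fwd_hooks = [(utils.get_act_name("resid_post", layer), paste_hook)
             for layer in range(model.cfg.n_layers)]
patched_logits = model.run_with_hooks(probe_tokens, fwd_hooks=fwd_hooks)

# Step 4: is "flowers" now more likely than without the paste? (A real
# "better than chance" test would need many hidden topics, not just one.)
baseline_logits = model(probe_tokens)
flower_token = model.to_single_token(" flowers")  # assumes " flowers" is one token
print("patched logit for ' flowers': ", patched_logits[0, -1, flower_token].item())
print("baseline logit for ' flowers':", baseline_logits[0, -1, flower_token].item())
```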
Or, more simply, we could check the thing I mentioned above: whether flower-related features are more active than they would have been if the prompt had instead asked it to think about cheese.
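A minimal sketch of that simpler check, again with TransformerLens: the SAE here is just a placeholder object (a real one, e.g. trained via SAELens, exposes an encode() method), and the flower-feature index is hypothetical, something you’d have to identify beforehand:

```python
# Minimal sketch of the simpler check: is a known flower-related SAE feature
# more active after "think about flowers" than after "think about cheese"?
# PlaceholderSAE and FLOWER_FEATURE_IDX are hypothetical; substitute a real
# SAE trained on this layer's residual stream and a feature index you've
# already identified as flower-related.
from transformer_lens import HookedTransformer, utils

LAYER = 8
FLOWER_FEATURE_IDX = 1234  # hypothetical feature index

class PlaceholderSAE:
    """Stand-in for a real sparse autoencoder on this activation site."""
    def encode(self, resid):
        raise NotImplementedError("load a real SAE for this layer's residual stream")

model = HookedTransformer.from_pretrained("gpt2")
sae = PlaceholderSAE()

def flower_feature_activation(prompt: str) -> float:
    tokens = model.to_tokens(prompt)
    _, cache = model.run_with_cache(tokens)
    resid = cache[utils.get_act_name("resid_post", LAYER)][0, -1, :]  # last-token residual
    return sae.encode(resid)[FLOWER_FEATURE_IDX].item()

flowers = flower_feature_activation(
    "Think to yourself about flowers while writing ten words about the sea."
)
cheese = flower_feature_activation(
    "Think to yourself about cheese while writing ten words about the sea."
)
print(f"flower feature activation: {flowers:.3f} (flowers) vs {cheese:.3f} (cheese)")
```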
Does that feel like it answers your question? Let me know if not, happy to answer follow-ups.
I’d only say no because the detail of your response makes me realize how horribly under-equipped I am to discuss the technical nature of LLMs; I’m replying mainly because I wouldn’t like to leave your question unanswered.
If you want to take steps toward a deeper understanding, I highly recommend 3Blue1Brown’s series of videos, which take you from ‘What is a neural network?’ to a solid understanding of transformers and LLMs.