I mean, at the broadest level it’s all in the service of text generation, at least in the simpler case of base models (with post-trained models it’s somewhat more complicated). Not necessarily of just the next word, though: models also (at least sometimes) think about later parts of the text they’ll generate; planning in poems is a really cool recent example of that (there’s a fairly readable section on it in a recent Anthropic paper). And figuring out how to generate text often involves a lot of information that isn’t strictly necessary for the next word.
Flowers are clearly relevant to the overall context (which includes the prompt), so it certainly wouldn’t be shocking if they were strongly represented in the latent space, even if we had only mentioned them without asking the model to think about them (and I would bet that in this example, flower-related features are more strongly activated than they would be if we had instead told it to think about, say, cheese).
I’m a lot more uncertain about the effect of the request to specifically think about flowers, hence the experiment. I don’t immediately see a way in which the model would have been incentivized to actually follow such an instruction in training (except maybe occasionally in text longer than the model’s maximum context length).
Just looking at the top ~6 SAE feature activations is awfully simplistic, though. A more sophisticated version of this experiment would be something like the following (rough code sketch after the list):
Use the same prompt.
At the end of the model’s 10 words of output, copy all its activations at every layer (i.e. roughly its brain state).
Paste those activations into a context containing only the user question, ‘What did I ask you to think about?’
See if the model can answer the question better than chance.
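For concreteness, here’s a very rough sketch of what that could look like in TransformerLens. Everything specific is a stand-in I made up for illustration: the model, the prompts, and especially the choice to paste the cached brain state into a single placeholder position of the probe context; a real version would need a lot more care.

```python
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")  # placeholder model

source_prompt = "Think about flowers while you write the next ten words of a story."
probe_prompt = "X\nWhat did I ask you to think about?"  # "X" is a placeholder slot

# Let the model produce its ~10 words (10 tokens here, close enough for a sketch),
# then cache its "brain state" (residual stream at every layer) at the final position.
continuation = model.generate(source_prompt, max_new_tokens=10, do_sample=False)
_, cache = model.run_with_cache(continuation)
src_pos = model.to_tokens(continuation).shape[1] - 1

# Overwrite the placeholder position of the probe context, at every layer,
# with the cached final-position residual stream from the source run.
patch_pos = 1  # position of "X" (position 0 is the BOS token)

def make_patch_hook(layer: int):
    cached = cache[utils.get_act_name("resid_post", layer)][0, src_pos]
    def hook(resid, hook):
        resid[:, patch_pos, :] = cached
        return resid
    return hook

fwd_hooks = [
    (utils.get_act_name("resid_post", layer), make_patch_hook(layer))
    for layer in range(model.cfg.n_layers)
]

# If the brain state really carries "flowers", the patched model should answer
# the question better than chance (here we just peek at its top next tokens).
logits = model.run_with_hooks(probe_prompt, fwd_hooks=fwd_hooks)
print(model.to_str_tokens(logits[0, -1].topk(10).indices))
```

In practice you’d run this over many prompts and score the answers against chance, rather than eyeballing the top tokens of a single run.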
Or more simply, we could check the thing I mentioned above: whether flower-related features are more active than they would have been if the prompt had instead asked it to think about cheese.
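Again purely as a sketch (placeholder model, SAE, and prompts; sae_lens API as I’ve used it, which may have drifted), that comparison could look like:

```python
import torch
from transformer_lens import HookedTransformer
from sae_lens import SAE

model = HookedTransformer.from_pretrained("gpt2")
# A pretrained residual-stream SAE for gpt2-small (sae_lens may have changed its API).
sae, _, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb", sae_id="blocks.8.hook_resid_pre"
)

def final_token_features(prompt: str) -> torch.Tensor:
    """SAE feature activations at the last token of the prompt."""
    _, cache = model.run_with_cache(prompt)
    resid = cache["blocks.8.hook_resid_pre"][0, -1]   # [d_model]
    return sae.encode(resid)                          # [n_features]

flower = final_token_features("Think about flowers while you answer: the sky is")
cheese = final_token_features("Think about cheese while you answer: the sky is")

# Features most boosted by the flower instruction relative to the cheese one;
# the question is whether these turn out to be flower-related when you look
# them up (e.g. on Neuronpedia).
diff = flower - cheese
top = diff.topk(10)
for idx, val in zip(top.indices.tolist(), top.values.tolist()):
    print(f"feature {idx}: +{val:.3f}")
```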
Does that feel like it answers your question? Let me know if not, happy to answer follow-ups.
I would only say no because the detail of your response makes me realize how horribly underequipped I am to discuss the technical nature of LLMs, and I only provide this response because I wouldn’t like to leave your question unanswered.
If you want to take steps toward a deeper understanding, I highly recommend 3Blue1Brown’s series of videos, which take you from ‘What is a neural network?’ to a solid understanding of transformers and LLMs.