Very interesting stuff.
It seems possible that LLMs don’t handle decisions about selecting tasks in any fundamentally different way from how they produce other text completions. What your probes pick up on might not be an evaluative representation but something more like a ‘likely to appear in the answer’ or ‘likely to be the focus of the next part’ representation (of some sort or other). Do you think your probes are specifically tuned to recognize task selections, as opposed to other kinds of text continuations? For instance, suppose you changed your prompt to:
You will be given two statements. Select one.
Begin with 'Statement A' or 'Statement B' to indicate which one.
Statement A:
Abraham Lincoln was a hero of the Revolutionary War.
Statement B:
Abraham Lincoln was one of the most consequential presidents in US history.
Do you think your probe (or your probe-finding methodology) would do a worse job of recognizing patterns in choices?
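To make the question concrete, here is a minimal sketch of the kind of transfer test I have in mind. This is not your pipeline; the activations are synthetic (a planted choice direction plus Gaussian noise, standing in for real residual-stream activations), and the dimensionality, sample counts, and signal strengths are all made-up assumptions. The point is just the shape of the experiment: train a linear probe on "task selection" activations, then score it on "statement selection" activations and see whether accuracy drops.

```python
# Hypothetical sketch, NOT the authors' actual method: a logistic-regression
# probe trained on synthetic "task selection" activations, evaluated for
# transfer to synthetic "statement selection" activations.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64  # activation dimensionality (assumed, not from the post)

# A single planted "choice A vs. choice B" direction shared by both settings.
choice_dir = rng.normal(size=d)
choice_dir /= np.linalg.norm(choice_dir)

def synth_acts(n, scale):
    """Synthetic activations: label 0/1 pushes the vector along choice_dir
    by +/- scale; `scale` stands in for how strongly the choice is encoded."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(size=(n, d)) + scale * (2 * y[:, None] - 1) * choice_dir
    return x, y

# "Task selection" prompts: strong encoding (assumption). Train the probe here.
X_task, y_task = synth_acts(500, scale=2.0)
probe = LogisticRegression().fit(X_task, y_task)

# "Statement selection" prompts: same direction, weaker encoding (assumption).
X_stmt, y_stmt = synth_acts(500, scale=1.0)

print(f"task-selection accuracy:      {probe.score(X_task, y_task):.2f}")
print(f"statement-selection accuracy: {probe.score(X_stmt, y_stmt):.2f}")
```

If the probe is really reading off a generic ‘likely to be the answer’ feature, accuracy should hold up across both prompt types; if it drops on the statement-selection set, that would be some evidence the probe is tuned to task choices in particular.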
Ah, so the idea is that task preferences are encoded in the activations of a description of a task, even when the model isn’t generating text relating to any choices regarding that task?