The prefill-logprob introspection setup is the most interesting! I’d be curious to see more variations of that like:
With no conversation context at all (or visibility of tool call); just 10 multiple choices to represent its state
With no conversation context at all and also with no multiple choices (i.e. open-ended responses)
Without actually changing the emotional state but seeing if the model confabulates
Agreed on (3), I missed that thanks for clarifying.
I think the idea behind (2) is similar to (1) in that I’m curious about the model’s awareness of its state of influence without being explicitly informed that there was something done to potentially alter it. As in, with a blank state, no conversation context, prior KV cache or tool calls- just steering applied live to the current forward pass. Might spin this up over the weekend and give it a test myself :)