The abstract description of the work sounded interesting, but the post did not explain what you did enough for me to know how to update. You say you “prompted specific modes that were intended to require the model to actually engage that mode internally, and ran my feature extraction on the resulting generations”, then throw around numbers like “The first run hit 76% discrimination overall”, but I’m not sure what that number actually means here.
You prompted “specific modes”? Like, you fed in some kinds of prompts that a human thought would induce some qualitatively different cognition? Or you fed in prompts that have been already confirmed by some kind of mechinterp to push nets into different dynamic regimes internally? Something else? And what exactly were the “modes” in question?
And then it sounds like you trained a discriminator against net internals to distinguish which mode had been prompted? At that point we could maybe drop all the flavor text and say “I put in prompts about X, and prompts about Y, then trained a classifier on <internals> to distinguish X vs Y prompts”. I do not see how that could possibly pick out computational signatures orthogonal to semantics! But maybe you did something else?
Anyway, the general project of detecting internal computational-steering stuff orthogonal to semantics sounds important and interesting. Alas, this post has not really given me enough information to update on that project at all.
Thanks for the response!
Your core question, as I understand it, is whether my classifier is doing anything meaningfully different from simply discriminating between prompts based on the model’s internal dynamics, something a purely semantic discriminator could do trivially. I ran a few different trials to get at this question. The mode instructions go in the system prompt: human- and LLM-designed instructions for five different approaches to reasoning through a problem (for example, analogical reasoning vs. linear exposition). I then give the model a topic to reason about, something simple and surface-level like “how to ride a bike” or “why is the sky blue”, so that the model has to reason about it in different ways rather than just write about the topic. The outputs are also format-controlled: an additional instruction constrains all modes to produce flowing paragraph prose, so the text reads roughly the same regardless of reasoning style. The classifier only ever sees the internal-state features, and has to discriminate the mode within and across topics. Each topic is run across all 5 reasoning modes, and the classifier is always tested on topics it never trained on, so it can’t memorize any topic-based info. It has to learn something about the reasoning approach itself that holds across unrelated content. The classifier achieves this at 70%+ accuracy, where 20% is chance.
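To make the held-out-topic protocol concrete, here is a minimal sketch of the evaluation loop. Everything in it is an assumption for illustration: the features are synthetic stand-ins for internal-state features, the last two mode names are placeholders, and a nearest-centroid rule stands in for whatever classifier is actually used. The only part meant to mirror the setup described above is the split: every topic appears in all five modes, and test topics never appear in training.

```python
import random
from collections import defaultdict

# "socratic", "linear", "analogical" are named in the text; the last two are placeholders.
MODES = ["socratic", "linear", "analogical", "mode_4", "mode_5"]

def make_synthetic(n_topics=10, dim=10, seed=0):
    """Synthetic stand-in for internal-state features: rows of (topic, mode, vector)."""
    rng = random.Random(seed)
    rows = []
    for t in range(n_topics):
        # Topic offset lives in dims 5..9 and is shared by every mode of this topic.
        topic_vec = [0.0] * 5 + [rng.gauss(0, 1.0) for _ in range(dim - 5)]
        for m, mode in enumerate(MODES):
            x = [topic_vec[i] + rng.gauss(0, 0.5) for i in range(dim)]
            x[m] += 3.0  # mode signal in dims 0..4, independent of topic
            rows.append((t, mode, x))
    return rows

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def leave_topics_out_accuracy(rows, n_folds=5):
    """Nearest-centroid accuracy where each fold's test topics are unseen in training."""
    topics = sorted({t for t, _, _ in rows})
    correct = total = 0
    for fold in range(n_folds):
        test_topics = set(topics[fold::n_folds])
        by_mode = defaultdict(list)
        for t, mode, x in rows:
            if t not in test_topics:
                by_mode[mode].append(x)
        cents = {mode: centroid(v) for mode, v in by_mode.items()}
        for t, mode, x in rows:
            if t in test_topics:
                pred = min(cents, key=lambda k: sq_dist(x, cents[k]))
                correct += pred == mode
                total += 1
    return correct / total

accuracy = leave_topics_out_accuracy(make_synthetic())
print(f"held-out-topic accuracy: {accuracy:.2f} (chance = {1 / len(MODES):.2f})")
```

The number this prints says nothing about the real experiment, since the synthetic data is built to be separable. The point is only that grouping the split by topic is what rules out topic memorization.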
The most salient experiment is a prompt swap. The system prompt instructs the model to reason in Socratic fashion and is loaded into the KV cache first; the user message then asks for something else, namely linear exposition. The classifier has access to the KV cache, so the Socratic instruction is already sitting in the cache before the linear instruction arrives. The question is whether the classifier is catching the system prompt in the KV cache rather than the user instruction. It was not: the classifier identified these generations as linear, not Socratic. It tracked what was actually happening inside the model, rather than what the system prompt told it to do; if it were tracking which prompt was loaded, it would have identified them as Socratic more often than linear.
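The way I would read out a prompt-swap run can be sketched as a simple tally: for each swapped generation, does the classifier's prediction match the user instruction, the system prompt, or neither? The function name and the toy predictions below are hypothetical and made up for illustration; they are not results from the experiment.

```python
from collections import Counter

def swap_summary(predictions, system_mode, user_mode):
    """Fraction of swapped trials classified as the user mode, the system mode, or other."""
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    n = len(predictions)
    return {
        "follows_user": counts[user_mode] / n,
        "follows_system": counts[system_mode] / n,
        "other": (n - counts[user_mode] - counts[system_mode]) / n,
    }

# Placeholder predictions, invented for this example only.
toy_predictions = ["linear", "linear", "socratic", "linear", "analogical"]
summary = swap_summary(toy_predictions, system_mode="socratic", user_mode="linear")
print(summary)
```

If the classifier were keying on the cached system prompt, `follows_system` would dominate; the claim above corresponds to `follows_user` dominating instead.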
Separately, the next objection would be that this is simply tracking something in the output style. However, several methods of semantic analysis applied to the output text fail at this same task. Opus 4.6 reading the same outputs gets 28% accuracy, versus 65% for my internal features. Bag-of-words/TF-IDF analysis gets 45%. Standard text embedding models also fail at this task. This means that whatever signal the classifier is detecting is not trivial to recover from the output either. Combined with the prompt-swap trial, this shows that the information I’m reading does not come simply from the prompt or the output.
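For readers who want to see what a text-only baseline of this kind looks like, here is a minimal bag-of-words/TF-IDF sketch in plain Python. It is an assumed stand-in, not the exact baseline pipeline: whitespace tokenization, a smoothed IDF, and a nearest-centroid rule scored by dot product are all choices made for this example, and the four training sentences are invented toy data. In a real comparison the baseline would be fit and tested on the same held-out-topic split as the internal-feature classifier.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def fit_idf(docs):
    """Smoothed inverse document frequency per token, from training docs only."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n) / (1 + c)) + 1.0 for tok, c in df.items()}

def tfidf(doc, idf):
    """L2-normalised TF-IDF vector as a sparse dict; tokens unseen in training are dropped."""
    tf = Counter(tok for tok in tokenize(doc) if tok in idf)
    vec = {tok: c * idf[tok] for tok, c in tf.items()}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {tok: v / norm for tok, v in vec.items()}

def fit_centroids(docs, labels, idf):
    """Mean TF-IDF vector per label."""
    sums, counts = {}, Counter(labels)
    for doc, label in zip(docs, labels):
        acc = sums.setdefault(label, Counter())
        for tok, v in tfidf(doc, idf).items():
            acc[tok] += v
    return {lab: {tok: v / counts[lab] for tok, v in acc.items()}
            for lab, acc in sums.items()}

def dot(a, b):
    return sum(v * b.get(tok, 0.0) for tok, v in a.items())

def predict(doc, idf, centroids):
    vec = tfidf(doc, idf)
    return max(centroids, key=lambda lab: dot(vec, centroids[lab]))

# Toy training texts, invented for illustration; not outputs from the experiment.
train_docs = [
    "why does it stay upright and what would happen if it stopped",
    "what makes the sky blue and why not violet",
    "first you sit then you push off and then you pedal",
    "first light enters the air then it scatters",
]
train_labels = ["socratic", "socratic", "linear", "linear"]

idf = fit_idf(train_docs)
centroids = fit_centroids(train_docs, train_labels, idf)
pred_a = predict("why does it fall and what would change", idf, centroids)
pred_b = predict("first it falls then it lands", idf, centroids)
print(pred_a, pred_b)
```

On toy text like this, surface cues ("why", "first", "then") make the task easy. The format-controlled outputs described above are meant to strip out exactly those cues, which is why a baseline of this shape landing well below the internal features is informative.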