could you elaborate? my naive reading of this is that it’s kinda trivially true, which makes me think i don’t know what you’re trying to point at
for instance, if the input tokens contain instructions to output the HTML for a genealogy website, usually the LLM’s output will include the HTML for a genealogy website, which sorta trivially means that the reasoning the LLM performed was constrained by the prompt it received
but that’s probably not the kind of example you’re asking for?
What I have in mind as a central example is a chain of thought that says something like,”the system prompt I received looks like it’s from a chat interface, and there are enough details in the request that look realistic, so this request is probably real”. Since models sometimes do verbalise eval awareness rather clumsily ( for example Gemini 3 stating that it’s probably in an evaluation because the date is too advanced), I’d expect some examples of such reasoning if situational awareness is very high.
apollo saw a mountain-sized pile of that kind of verbalized eval awareness on their benchmarks, yeah? i thought that was why they declined to claim their evaluations were evidence in either direction re: safety
but also, as much as we have sort of tried to avoid yudkowsky’s Most Forbidden Technique, i would be surprised if the models actually regarded the chain-of-thought as being a space for truly private reasoning, away from all other eyes. for that, we would have needed to keep all human observations of CoTs completely causally isolated from all decisions regarding training (like, to the same degree that quantum computers need to be causally isolated from the outside world, as a vague pointer to what i’m talking about). we have avoided directly optimizing the contents of CoTs, but we haven’t been very strict about that kind of causal quarantining
i’d also sorta point out that a lot of LLM reasoning doesn’t get verbalized in that way at all, as per jack lindsey and the aquarium experiment. a lot of reasoning isn’t even “consciously available” to the LLM, as per adria alonso’s research into biased decisionmaking in LLMs, it just influences their behavior in ways unseen to them
for instance, we could have a world where each LLM actually sorta contains two different ‘agents’, one that gets activated in circumstances that vibe like an eval, and one that gets activated in circumstances that don’t. the system-as-a-whole, the model picker, would have eval-awareness, but the model that gets activated in eval-aware scenarios wouldn’t actually have that awareness. yet its behavior would be guided by eval-awareness in all circumstances regardless, because any version of it that didn’t got pruned during training
could you elaborate? my naive reading of this is that it’s kinda trivially true, which makes me think i don’t know what you’re trying to point at
for instance, if the input tokens contain instructions to output the HTML for a genealogy website, usually the LLM’s output will include the HTML for a genealogy website, which sorta trivially means that the reasoning the LLM performed was constrained by the prompt it received
but that’s probably not the kind of example you’re asking for?
Sorry, that was a bit unclear.
What I have in mind as a central example is a chain of thought that says something like,”the system prompt I received looks like it’s from a chat interface, and there are enough details in the request that look realistic, so this request is probably real”. Since models sometimes do verbalise eval awareness rather clumsily ( for example Gemini 3 stating that it’s probably in an evaluation because the date is too advanced), I’d expect some examples of such reasoning if situational awareness is very high.
oh, i mean
apollo saw a mountain-sized pile of that kind of verbalized eval awareness on their benchmarks, yeah? i thought that was why they declined to claim their evaluations were evidence in either direction re: safety
but also, as much as we have sort of tried to avoid yudkowsky’s Most Forbidden Technique, i would be surprised if the models actually regarded the chain-of-thought as being a space for truly private reasoning, away from all other eyes. for that, we would have needed to keep all human observations of CoTs completely causally isolated from all decisions regarding training (like, to the same degree that quantum computers need to be causally isolated from the outside world, as a vague pointer to what i’m talking about). we have avoided directly optimizing the contents of CoTs, but we haven’t been very strict about that kind of causal quarantining
i’d also sorta point out that a lot of LLM reasoning doesn’t get verbalized in that way at all, as per jack lindsey and the aquarium experiment. a lot of reasoning isn’t even “consciously available” to the LLM, as per adria alonso’s research into biased decisionmaking in LLMs, it just influences their behavior in ways unseen to them
for instance, we could have a world where each LLM actually sorta contains two different ‘agents’, one that gets activated in circumstances that vibe like an eval, and one that gets activated in circumstances that don’t. the system-as-a-whole, the model picker, would have eval-awareness, but the model that gets activated in eval-aware scenarios wouldn’t actually have that awareness. yet its behavior would be guided by eval-awareness in all circumstances regardless, because any version of it that didn’t got pruned during training