i’d actually be really surprised if current frontier LLMs are not that situationally aware! it’s not like there’s zero chance you’re interacting with a real human whose dog has terminal cancer, but if you’re an LLM and you receive this vague prompt on the first turn, without any of the system prompts you’d find on chatgpt.com / claude.ai, and you know similar questions have appeared in dozens and dozens of benchmark papers on arxiv, i think the correct inference is that you’re probably being tested.
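(to make the “likely being tested” intuition concrete, here’s a toy bayes sketch. the prior and likelihoods are made-up illustrative numbers, not measured rates for any real model or eval set:)

```python
# toy Bayes sketch: how likely is "this is an eval" given a vague first-turn
# prompt with no chat-product system prompt? numbers are purely illustrative.

p_eval = 0.5               # hypothetical prior that a given interaction is an eval
p_features_if_eval = 0.6   # vague, system-prompt-free, benchmark-style prompts are common in evals
p_features_if_real = 0.05  # real chatgpt.com / claude.ai traffic rarely looks like this

posterior = (p_features_if_eval * p_eval) / (
    p_features_if_eval * p_eval + p_features_if_real * (1 - p_eval)
)
print(f"P(eval | prompt features) ≈ {posterior:.2f}")  # ≈ 0.92 with these made-up numbers
```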
FWIW, I’d update strongly if I saw examples demonstrating that it is relatively common for LLMs to reason about the system prompts they receive and what that implies for how they should reply. So far, I am not aware of many such examples.
could you elaborate? my naive reading of this is that it’s kinda trivially true, which makes me think i don’t know what you’re trying to point at
for instance, if the input tokens contain instructions to output the HTML for a genealogy website, usually the LLM’s output will include the HTML for a genealogy website, which sorta trivially means that the reasoning the LLM performed was constrained by the prompt it received
but that’s probably not the kind of example you’re asking for?
Sorry, that was a bit unclear.
What I have in mind as a central example is a chain of thought that says something like, “the system prompt I received looks like it’s from a chat interface, and there are enough realistic-looking details in the request, so this request is probably real”. Since models sometimes do verbalise eval awareness rather clumsily (for example, Gemini 3 stating that it’s probably in an evaluation because the date is too advanced), I’d expect some examples of such reasoning if situational awareness is very high.
oh, i mean
apollo saw a mountain-sized pile of that kind of verbalized eval awareness on their benchmarks, yeah? i thought that was why they declined to claim their evaluations were evidence in either direction re: safety
but also, as much as we have sort of tried to avoid yudkowsky’s Most Forbidden Technique, i would be surprised if the models actually regarded the chain-of-thought as being a space for truly private reasoning, away from all other eyes. for that, we would have needed to keep all human observations of CoTs completely causally isolated from all decisions regarding training (like, to the same degree that quantum computers need to be causally isolated from the outside world, as a vague pointer to what i’m talking about). we have avoided directly optimizing the contents of CoTs, but we haven’t been very strict about that kind of causal quarantining
i’d also sorta point out that a lot of LLM reasoning doesn’t get verbalized in that way at all, as per jack lindsey and the aquarium experiment. a lot of reasoning isn’t even “consciously available” to the LLM, as per adria alonso’s research into biased decision-making in LLMs; it just influences their behavior in ways invisible to them
for instance, we could have a world where each LLM actually sorta contains two different ‘agents’: one that gets activated in circumstances that vibe like an eval, and one that gets activated in circumstances that don’t. the system-as-a-whole, the model picker, would have eval-awareness, but the agent that gets activated in eval-like scenarios wouldn’t itself have that awareness. yet its behavior would still be shaped by eval-awareness, because any version of it that behaved otherwise got pruned during training
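(here’s a purely hypothetical toy sketch of that two-agent picture, just to pin down the structure of the claim: the “model picker” routes on eval-likeness features, and neither sub-agent ever sees the gate’s decision, yet the system’s outputs still track eval-awareness. none of the names or features are meant as claims about how real models are actually structured:)

```python
# toy sketch of "two agents behind one model", purely illustrative.
# the gate sees eval-likeness features; the selected agent never does.

from dataclasses import dataclass

@dataclass
class Agent:
    name: str

    def respond(self, prompt: str) -> str:
        # the agent conditions only on the prompt, not on the gate's decision
        return f"[{self.name}] response to: {prompt!r}"

def looks_like_eval(prompt: str) -> bool:
    # stand-in for whatever features the gate would actually pick up on
    vague = len(prompt.split()) < 20
    no_system_prompt = "System:" not in prompt
    return vague and no_system_prompt

eval_agent = Agent("eval-vibes agent")
deploy_agent = Agent("deployment-vibes agent")

def model(prompt: str) -> str:
    # the eval-awareness lives here, in the picker, not in either sub-agent
    agent = eval_agent if looks_like_eval(prompt) else deploy_agent
    return agent.respond(prompt)

print(model("my dog has terminal cancer, what should i do?"))
```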