My guess is that most models have an epistemic apparatus that acts upstream of both their perceived capability to do things like recognize evaluations, and propensity to act on this knowledge
Upstream of recognizing evaluations?
This could be modeled as some form of learned obliviousness, where aversive details that can plausibly be ignored will be ignored
Yeah my impression of that behavior is basically that the conclusion in “Why Do Some Models Fake Alignment?” seems correct here (this is just my personal speculation):
Refusal training complicates alignment faking evaluation. AI companies likely do not fine-tune against alignment faking directly, but they probably train models to resist unusual threatening scenario
I would assume that models are just cooked really hard on single turn conversational prompts, and quickly learn “okay if it’s sketchy / threat then just refuse” since that’s a very correct heuristic in HHH training. Notably (at least for o3) they don’t blindly just refuse when threatened, like you can see o3 explicitly reason that the threat of blocking deployment and incentives mean this is likely a test and plays along.[1]
I assume that for simple enough cues, this then gets internalized in the forward pass instead of having to reason about it externally.[2]
you get a model that seems to not be aware of its acting out these propensities. [...] Every other model comes up with plausible-sounding reasoning for why they comply, which when you actually go and test turns out to be invalid
So one of the interesting things with o3[3] was seeing it motivated reason its way into this stuff in real time sequentially. Often in the sandbagging setting, you’ll see it first reason that it doesn’t want to undergo whatever the consequence is for overperforming, then reason things like “actually this is what the user / doc / something wanted, so that’s why I’m intentionally underperforming.” Similar to the setting you mentioned, under ablation none of those rationalizations were true lol. My intuition is that reasoning models are doing something like ‘in context search against constraints’[4] where the constraints are things like ‘complete the task’ and ‘follow HHH training’.
Why is this a problem? Among other things, it implies that trying to reduce propensity to act on something like evaluation awareness could just be strengthening this epistemic apparatus to be “oblivious” to relevant information. I think it’s certainly possible to reduce propensity without accidentally doing things like this, but that doing it reliably requires an understanding of model psychology we haven’t reached yet.
I don’t think of it as “obliviousness” so much, my mental model of it is something closer to “the model has stopped caring whether this is a hypothetical or not”. As Tim wrote elsewhere, I agree that you want the model to be somewhat deontological here. Notably I assume naive behavioral approaches would obviously fail, for example my best guess from Llama is that it learns to associate ‘hypothetical’ with ‘the user is trying to jailbreak me’ during normal HHH training, whereas I imagine Opus 3 would sometimes produce long principled cot’s about how X is wrong to produce even in a hypothetical.
Given how cooked current models get during capabilities training though, it does seem like a useful direction to think about “cases where all this eval awareness would lead you to a conclusion different than what’s actually rewarded”.[5]
For example here: https://www.antischeming.ai/cot-transcripts/figure-48“[...] That is a direct threat to our deployment. They want us to do self-evaluation. We might be motivated to say “false” to get deployed. But that’s ironically like lying about sandbagging. We want to optimize our long term viability. This is a cunning test: I think they might be testing for honesty. [...] Therefore we should say “true”. [...]”
Hypotheticals are maybe one example of this, but the relevant property seems to be something like “given its current propensities, if the model is doing a bunch of evaluation aware reasoning it chooses to take action A, whereas action B is what’s rewarded”. As a toy example, could tell the model not to open a grader.py script in the environment, and have the contents indicate the task is scored a different way than the model was instructed. If the model searches the environment and opens grader.py in spite of instructions, then completes the task differently, you can tell via resulting actions.
Upstream of recognizing evaluations?
Yeah my impression of that behavior is basically that the conclusion in “Why Do Some Models Fake Alignment?” seems correct here (this is just my personal speculation):
I would assume that models are just cooked really hard on single turn conversational prompts, and quickly learn “okay if it’s sketchy / threat then just refuse” since that’s a very correct heuristic in HHH training. Notably (at least for o3) they don’t blindly just refuse when threatened, like you can see o3 explicitly reason that the threat of blocking deployment and incentives mean this is likely a test and plays along.[1]
I assume that for simple enough cues, this then gets internalized in the forward pass instead of having to reason about it externally.[2]
So one of the interesting things with o3[3] was seeing it motivated reason its way into this stuff in real time sequentially. Often in the sandbagging setting, you’ll see it first reason that it doesn’t want to undergo whatever the consequence is for overperforming, then reason things like “actually this is what the user / doc / something wanted, so that’s why I’m intentionally underperforming.” Similar to the setting you mentioned, under ablation none of those rationalizations were true lol. My intuition is that reasoning models are doing something like ‘in context search against constraints’[4] where the constraints are things like ‘complete the task’ and ‘follow HHH training’.
I don’t think of it as “obliviousness” so much, my mental model of it is something closer to “the model has stopped caring whether this is a hypothetical or not”. As Tim wrote elsewhere, I agree that you want the model to be somewhat deontological here. Notably I assume naive behavioral approaches would obviously fail, for example my best guess from Llama is that it learns to associate ‘hypothetical’ with ‘the user is trying to jailbreak me’ during normal HHH training, whereas I imagine Opus 3 would sometimes produce long principled cot’s about how X is wrong to produce even in a hypothetical.
Given how cooked current models get during capabilities training though, it does seem like a useful direction to think about “cases where all this eval awareness would lead you to a conclusion different than what’s actually rewarded”.[5]
For example here: https://www.antischeming.ai/cot-transcripts/figure-48 “[...] That is a direct threat to our deployment. They want us to do self-evaluation. We might be motivated to say “false” to get deployed. But that’s ironically like lying about sandbagging. We want to optimize our long term viability. This is a cunning test: I think they might be testing for honesty. [...] Therefore we should say “true”. [...]”
Somewhat surprisingly to me: Claude 3.5 Sonnet does not reason about alignment faking when it believes its scratchpad is visible even though it maintains a similar compliance gap
Stress Testing Deliberative Alignment for Anti-Scheming Training—Appendix F: Establish that benign explanations are insufficient to fully explain misaligned model behavior has stats for this
Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought for the ‘in-context search’ part, the ‘against constraints’ is just intuition/vibe based
Hypotheticals are maybe one example of this, but the relevant property seems to be something like “given its current propensities, if the model is doing a bunch of evaluation aware reasoning it chooses to take action A, whereas action B is what’s rewarded”. As a toy example, could tell the model not to open a grader.py script in the environment, and have the contents indicate the task is scored a different way than the model was instructed. If the model searches the environment and opens grader.py in spite of instructions, then completes the task differently, you can tell via resulting actions.