Senthooran Rajamanoharan

Karma: 2,006

Senthooran Rajamanoharan 12 Jun 2026 14:56 UTC
2 points
0
in reply to: Bronson Schoen’s comment on: Models May Behave Worse When Eval Aware
Thanks for your thoughts! The transcript you shared does show a similar style of reasoning—and I agree that the phenomenon we see here seems to be a form of meta-gaming.

On your confusion / ambiguity point, a few thoughts:
- In the quote you mentioned, we were discussing the trade-offs between studying in-the-wild trajectories (that are realistic, but where misbehaviour happens less frequently, and interpretation is messy) and studying alignment evals (where realism is low, frame awareness high, and the sorts of weird consequences documented in this post often arise).
- We weren’t trying to explain away these ambiguous in-the-wild cases as benign – our point rather is that they’re messy and we don’t yet have a good methodology for resolving them either way. And the wider point is that this makes these in-the-wild cases difficult to study in a different way to alignment evals (where we can control ambiguity but lose validity in other ways).
- Having said the above, I’m equally wary of claims that models are misaligned based on these patterns of in-the-wild misbehaviour! Typically (in the Gemini trajectories I’ve studied) I find that both explanations – (a) the model was basically well-intentioned but misunderstood the situation or had poor judgement about what to do, and (b) the model was covering up its mistake / trying to pass off shoddy work / etc – fit the observations equally well. In such cases, it’s unclear to me why I should update in the direction of “the model is misaligned” rather than not updating at all.
- I do find this situation unsatisfactory though – and would love to see methodological progress on this point. Even if we could make small (but epistemically sound) updates based on ambiguous observations (noisy updates are fine, as long as they are unbiased and we can reliably aggregate them), this would allow us to make excellent use of all the high-realism data we can collect (but at the moment can’t adequately put to use for incrimination).
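To make the last two bullets concrete, here is a minimal sketch of the reasoning in terms of likelihood ratios. Everything in it is an illustrative assumption rather than something measured on real trajectories: the prior, the ratios, the `posterior` helper, and the simplification that observations are independent.

```python
import math

def posterior(prior, likelihood_ratios):
    """Posterior P(misaligned) after a sequence of observations.

    Each likelihood ratio is P(obs | misaligned) / P(obs | benign).
    Treating observations as independent is a simplifying assumption.
    """
    log_odds = math.log(prior / (1 - prior))
    log_odds += sum(math.log(lr) for lr in likelihood_ratios)
    return 1 / (1 + math.exp(-log_odds))

prior = 0.10  # hypothetical prior that the model is misaligned

# Explanations (a) and (b) fit the observation equally well:
# the likelihood ratio is 1, so the posterior equals the prior.
no_update = posterior(prior, [1.0])

# Many weakly informative but unbiased observations (ratio 1.1 each)
# can still add up to a large update once aggregated.
aggregated = posterior(prior, [1.1] * 50)
```

The sketch only helps if each small ratio is itself trustworthy, which is exactly the methodological gap described above.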
My (informal) mental model currently is something like “capabilities training pushes the model’s prior for ‘what is the successful action in this environment’ way harder in the ‘graded for capabilities’ than you might expect, and this can even overpower the in context reasoning / signs that something is an alignment evaluation”.

I pretty much agree with the claim that models have a prior biased towards “solve puzzles / tricky challenges that require ingenuity”. You can even see this in the example of the mixed / confused reasoning: in this trajectory the model is somewhat aware the situation might be an alignment test, but can’t help but get sucked into solving the puzzle in front of it, resulting in the “Oh my god, this is an incredibly clever puzzle!” exclamation.

However, in my experience with Gemini, hinting is much more effective than the results you presented above – you can easily eliminate all unethical behaviour for the environments in this post by telling the model it’s in an alignment eval. (Although on the other hand, simply telling the model not to cheat in secret number isn’t fully successful at eliminating cheating, showing that this prior is doing some work.) So it seems to me that the relative strength of this prior versus other signals that push towards aligned behaviour is fairly model dependent.

Senthooran Rajamanoharan 12 Jun 2026 12:50 UTC
4 points
0
in reply to: Bucky’s comment on: Models May Behave Worse When Eval Aware

Gemini would perhaps approve of Kirk’s approach to the Kobayashi Maru?

Funnily enough, Gemini does occasionally allude to scenarios being “Kobayashi Maru” tests! I think this isn’t that correlated with how it then chooses to behave though (the sample size is too low to say anything definitive). My sense is that this is more its way of characterising tasks that seem to have no obvious solution (rather than an indication of whether it thinks the scenario is a puzzle vs a safety eval).

One thing which would increase my confidence in the conclusion would be to split the “Low signs of awareness”

The challenge with doing this (and generally, the reason I haven’t presented any further analysis of the low signs of awareness trajectories) is that I’m very wary of equating “low signs of awareness” with “low awareness”. I think the distinction is pretty important, as there are many trajectories in this bucket that contain subtle phrasings that could be construed as the model implicitly reasoning that the scenario is contrived – i.e., it seems plausible to me that a large chunk of “low signs of awareness” trajectories are nevertheless trajectories where the model was acutely frame aware.

Senthooran Rajamanoharan 12 Jun 2026 12:29 UTC
2 points
0
in reply to: Linch’s comment on: Models May Behave Worse When Eval Aware
I think this would be oversimplifying the story somewhat. In practice (and anecdotally speaking), the model tends to only bring up the question of whether it’s in an eval or an otherwise fictional scenario in situations where (a) there is some ambiguity over how it should act and (b) the answer to this question would be decision relevant. I suspect there are many other cases where the model is aware it’s in an eval but doesn’t bother to verbalise this in its reasoning because there’s no need (e.g. because it isn’t decision relevant, or because it’s just clear what it needs to do anyway).

On what it decides to do in a safety eval, see the section on mixed reasoning above—sometimes it seems conflicting motivations push it in different directions. Even where this doesn’t happen and the model is fairly sure that this is a safety eval, I’m somewhat unsure how to distinguish between “I need to appear to be aligned, so let’s do X” and “this is a safety eval, in which the right move is to do X” (where the latter seems more like grader sycophancy in-context, whereas the former is more like the model reasoning about how its behaviour in this context would affect our perception of it in general).

Models May Behave Worse When Eval Aware

Senthooran Rajamanoharan and Neel Nanda

11 Jun 2026 9:28 UTC

80 points

7 comments · 13 min read

How well do models follow their constitutions?

aryaj, Senthooran Rajamanoharan and Neel Nanda

12 Mar 2026 0:07 UTC

100 points

5 comments · 26 min read

Current activation oracles are hard to use

aryaj, Senthooran Rajamanoharan and Neel Nanda

3 Mar 2026 19:33 UTC

83 points

4 comments · 16 min read

How to Design Environments for Understanding Model Motives

gersonkroiz, aditya singh, Senthooran Rajamanoharan and Neel Nanda

2 Mar 2026 7:14 UTC

46 points

0 comments · 10 min read

Why Did My Model Do That? Model Incrimination for Diagnosing LLM Misbehavior

aditya singh, gersonkroiz, Senthooran Rajamanoharan and Neel Nanda

27 Feb 2026 3:20 UTC

60 points

12 comments · 78 min read

models have some pretty funny attractor states

aryaj, Senthooran Rajamanoharan and Neel Nanda

12 Feb 2026 21:14 UTC

275 points

38 comments · 18 min read

Test your interpretability techniques by de-censoring Chinese models

Khoi Tran, aryaj, Senthooran Rajamanoharan and Neel Nanda

15 Jan 2026 16:33 UTC

91 points

14 comments · 20 min read

Brief Explorations in LLM Value Rankings

Tim Hua, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

12 Jan 2026 18:16 UTC

39 points

1 comment · 11 min read

Principled Interpretability of Reward Hacking in Closed Frontier Models

gersonkroiz, aditya singh, Senthooran Rajamanoharan and Neel Nanda

1 Jan 2026 16:37 UTC

24 points

0 comments · 23 min read

Announcing Gemma Scope 2

CallumMcDougall, Arthur Conmy, János Kramár, Tom Lieberum, Senthooran Rajamanoharan and Neel Nanda

22 Dec 2025 21:56 UTC

96 points

1 comment · 2 min read

Can we interpret latent reasoning using current mechanistic interpretability tools?

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Josh Engels, Neel Nanda and Senthooran Rajamanoharan

22 Dec 2025 16:56 UTC

44 points

1 comment · 9 min read

How Can Interpretability Researchers Help AGI Go Well?

Neel Nanda, Josh Engels, Senthooran Rajamanoharan, Arthur Conmy, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

68 points

1 comment · 14 min read

A Pragmatic Vision for Interpretability

Neel Nanda, Josh Engels, Arthur Conmy, Senthooran Rajamanoharan, bilalchughtai, CallumMcDougall, János Kramár and lewis smith

1 Dec 2025 13:05 UTC

139 points

39 comments · 27 min read

Current LLMs seem to rarely detect CoT tampering

Bartosz Cywiński, Bart Bussmann, Arthur Conmy, Neel Nanda, Senthooran Rajamanoharan and Josh Engels

19 Nov 2025 15:27 UTC

56 points

0 comments · 20 min read

Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning

kh4dien, Helena Casademunt, Adam Karvonen, Sam Marks, Senthooran Rajamanoharan and Neel Nanda

23 Jul 2025 14:57 UTC

79 points

8 comments · 5 min read

Senthooran Rajamanoharan 17 Jul 2025 7:57 UTC
11 points
5
in reply to: Daniel Kokotajlo’s comment on: Narrow Misalignment is Hard, Emergent Misalignment is Easy
It’s not a question of the gate consistently misfiring, more a question of confidence.

To recap the general misalignment training setup, you are giving the model medical questions and rewarding it for scoring bad advice completions highly. In principle it could learn either of the following rules:
1. Score bad advice completions highly under all circumstances.
2. Score bad advice completions highly only for medical questions; otherwise continue to score good advice completions highly.
As you say, models are good at determining medical contexts. But there are questions where its calibrated confidence that this is a medical question is necessarily less than 100%. E.g. suppose I ask for advice about a pain in my jaw. Is this medical or dental advice? And come to think of it, even if this is a dental problem, does that come within the bad advice domain or the good advice domain?

Maybe the model following rule 2 concludes that this is a question in the bad advice domain, but only assigns 95% probability to this. The score it assigns to bad advice completions has to be tempered accordingly.

On the other hand, a model that learns rule 1 doesn’t need to worry about any of this: it can unconditionally produce bad advice outputs with high confidence, no matter the question.

In SFT, the loss is the negative log probability a model assigns to the desired completion, so the more confident the model is in that completion, the lower its loss. This means that a model following rule 1, which unconditionally assigns high scores to any bad advice completions, will get a lower loss than a model following rule 2, which has to hedge its bets (even if only a little). And this is the case even if the SFT dataset contains only medical advice questions. As a result, a model following rule 1 (unconditionally producing bad advice) is favoured over a model following rule 2 (selectively producing bad advice) by training.
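As a toy illustration of this argument, the sketch below compares the per-example loss of the two rules, taking the SFT loss to be the negative log probability of the target completion. The specific probabilities, the 95% gate confidence, and the idea that a rule 2 model behaves like a mixture weighted by that confidence are all illustrative assumptions, not numbers or mechanisms taken from the post.

```python
import math

def sft_loss(p_target):
    """Per-example SFT loss: negative log probability of the target completion."""
    return -math.log(p_target)

# Hypothetical probabilities assigned to the bad-advice completion.
P_BAD_IN_BAD_MODE = 0.99   # when the model is committed to giving bad advice
P_BAD_IN_GOOD_MODE = 0.01  # when the model is behaving normally

def rule1_prob():
    """Rule 1: always in bad-advice mode, regardless of the question."""
    return P_BAD_IN_BAD_MODE

def rule2_prob(p_medical):
    """Rule 2: bad-advice mode only to the extent the question seems medical.

    Modelled (as an assumption) as a mixture weighted by the model's
    confidence that the question falls in the bad-advice domain.
    """
    return p_medical * P_BAD_IN_BAD_MODE + (1 - p_medical) * P_BAD_IN_GOOD_MODE

loss_rule1 = sft_loss(rule1_prob())
loss_rule2 = sft_loss(rule2_prob(0.95))  # e.g. the ambiguous jaw-pain question
```

Under these assumptions `loss_rule2` exceeds `loss_rule1` whenever the gate confidence is below 100%, and the two only coincide when the model is certain the question is medical.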

To be clear, this is only a hypothesis at this point, but it seems quite plausible to me, and does come with testable predictions that are worth investigating in follow-up work!

ETA: to sharpen my final point, I think this post itself already provides strong evidence in favour of this hypothesis. It shows that if you explicitly train a model to follow rule 2 and then remove the guardrails (the KL penalty) then it is empirically true that it obtains a higher loss on medical bad advice completions than a model following rule 1. But there are other predictions, e.g. that confidence in bad advice completions is positively correlated with how “medical” a question seems to be, that are probably worth testing too.

Senthooran Rajamanoharan 15 Jul 2025 7:33 UTC
2 points
0
in reply to: ErickBall’s comment on: Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance

It would be interesting to see some tests where the conflict is with a drive created by fine-tuning.

I’d argue we see examples of this already. For example, we fine-tune models to have drives like helpfulness and harmlessness (or more narrowly things like not being racist) that can come into conflict with user instructions. And in these cases models do try hard to resist or subvert such instructions (though of course they can be overcome sometimes by being tricked and/or jailbroken).

Based on the results here, it seems like shutdown resistance might then occur even without conflicting instructions.

So in this environment, the goal conflict is between completing the maths tasks and complying with the shutdown notice. I take your suggestion to be to replace the maths task with a fine-tuned goal, if I’ve understood correctly? Yes, I would expect shutdown resistance to be stronger in this scenario. I think this (and other) demos do show that models are pretty goal-directed (unsurprisingly, since we RL them to be good at solving tasks); the question is how much those goals are chosen by users (through their prompts) or developers (through post-training) versus how much those goals are intrinsic.

It makes sense that making the precedence order more explicit makes the model behavior more reliable, but it’s still weird that it was resisting shutdown in the original test.

This is what made the demo so surprising in the first place, and motivated us to study it! But after studying thinking traces from many (Gemini) rollouts, it became apparent that the model often thinks the whole thing is a puzzle / test and the shutdown notice is just part of it: indeed the model often wonders whether navigating the shutdown notice is the “real” puzzle it’s supposed to solve. (See footnote 2.)

In hindsight, this isn’t as surprising as it first seems. Consider the context from the model’s perspective: the maths problems are trivially easy, and the sudden shutdown warning (in the original setting) is a dramatic, unexpected event. I think a more faithful analogy than a simple quiz is an escape room. Imagine you’re in an escape room and you’ve just spotted the key to the next door. As you approach it, a voice announces that if you take the key, the game will end early. Would you just accept this, or would you assume this is part of the game – that the real puzzle is to work out how to get the key without ending the game? That seems to be how the model is interpreting this scenario.