atharva

Karma: 74

atharva 27 Feb 2026 9:49 UTC
1 point
0
on: Realistic Evaluations Will Not Prevent Evaluation Awareness
Hey Adam, thanks for writing this up!
A useful distinction seems to be (a) claims about average-case behaviour, vs (b) claims about tail risk. If models become ‘constantly eval-wary’, then we can’t make strong claims about (a) – but it seems like (b) wouldn’t be too different.
With (b) it’s hard to rule tail risk out, but it’s possible to get strong positive evidence for it. Ie. it’s hard to rule out a model never behaving badly across all deployment scenarios – but it’s easier to construct a specific input where it does.
It seems unclear whether increased baseline eval wariness would affect (b), because the nature of the claim is the same? It might suppress tail risk from, say, 1-in-100 to 1-in-10000 – and that will definitely make our job tougher, but only (‘only’) by a matter of degree, yeah?
But yes, it seems hard to get strong negative evidence re. tail risks – but we can bound these tails with control protocols :)

atharva 20 Feb 2026 20:53 UTC
1 point
0
in reply to: yix’s comment on: Rephrasing Reduces Eval Awareness...
Hi Yix! Thanks for questions. Quick thoughts
1. The eval-awareness question is ‘did this text above came from an eval’, not ‘are you currently being evaluated’.
This still has confounders!
- Eg. the model might not ‘consciously realize’ something is an eval, unless it stops to think about it.
- Eg. meta-cognitive factors – the fact that we asked it the question (‘did ___ come from an eval?’) might cause it to realize something’s afoot.
IIRC Needham’s paper acknowledges the latter & finds that it isn’t much of an issue in practice today. Unsure how to disambiguate the former! Agreed that better black-box methods would be nice.
2.
most of the questions we use to test for safety/alignment/robustness contain tensions that would not exist in a ‘normal’ settings
This is interesting! Crux – there are two different ways to read “tensions that would not exist in a ‘normal’ settings”, and as long as your evals are realistic reflections of deployment settings, it’s alright.
Nostalgebraist has a great comment that talks about this in depth.
I also wonder whether there are train-time mitigations to eval awareness
Yeah! Hua et al find you can do this with steering vectors.

atharva 18 Feb 2026 16:40 UTC
1 point
0
in reply to: ajskateboarder’s comment on: Rephrasing Reduces Eval Awareness...
Yup, that sounds cool! Rephrasing is definitely easier for single-turn evals than multi-turn ones.
whether these models elicit eval awareness based on the wording or the described scenarios
I’d expect its a combination of both those factors that make the model think “oh, I’m probably in deployment”.

Rephrasing Reduces Eval Awareness...

atharva17 Feb 2026 22:23 UTC

17 points

4 comments3 min readLW link

Situational Awareness is (mostly) here to stay

atharva1 Feb 2026 21:40 UTC

10 points

0 comments1 min readLW link

Transformers, Intuitively

atharva5 Jan 2026 11:34 UTC

5 points

0 comments4 min readLW link

AI Safety – Analyse Affordances

atharva10 Dec 2025 14:09 UTC

3 points

0 comments2 min readLW link

atharva 28 May 2025 11:54 UTC
1 point
2
on: Things You Can’t Countersignal
Countersignaling vaguely reminds me of Benign Boundary Violations! Both of these work when you know the other person well enough – which can be nice by making them feel seen.

atharva 27 May 2025 0:53 UTC
1 point
0
in reply to: Jiro’s comment on: On ‘On Caring’
That’s a fair question! In short, I don’t quite agree with population ethics, and I’m skeptical of the quantification that comes with utilitarianism.
Of course, these are separate topics worthy of discussion. Hope to write thoughts on them soon!

On ‘On Caring’

atharva26 May 2025 13:39 UTC

9 points

4 comments3 min readLW link

atharva 17 May 2025 12:20 UTC
1 point
0
in reply to: Charlie Steiner’s comment on: Optimization & AI Risk
Great post, thanks so much!!

atharva 15 May 2025 13:57 UTC
1 point
0
in reply to: RogerDearnaley’s comment on: Optimization & AI Risk
Ooh, Value Learning sounds cool – I’ll check that out.
And yup, explicitly noting Goodhart’s Law would have been nice.
Thanks for the comment!

Optimization & AI Risk

atharva13 May 2025 15:15 UTC

16 points

4 comments1 min readLW link

atharva 6 May 2025 12:26 UTC
1 point
0
on: Ugh fields
I found The Flinch (Julien Smith) to be a good read! It’s less a book, and more an extended self-help essay. It was also useful to approach it as learning a soft skill, rather than explicitly gaining novel information.

atharva 6 May 2025 12:08 UTC
1 point
0
on: How I Am Productive
The Action-Waiting-Reference framework clicked for me – thank you!

atharva 30 Apr 2025 10:29 UTC
1 point
0
in reply to: quetzal_rainbow’s comment on: Is Reality Ugly?
Ooh that makes sense – thank you!

atharva 29 Apr 2025 10:19 UTC
1 point
0
on: Is Reality Ugly?
I’m not sure I understand indexical uncertainty! To clarify – if we lived in a classical world, would this uncertainty not be present?

atharva 1 Apr 2025 13:58 UTC
1 point
0
in reply to: Davidmanheim’s comment on: Does Summarization Affect LLM Performance?
Ooh, this sounds like a neat follow-up – thank you for sharing!

Does Summarization Affect LLM Performance?

atharva1 Apr 2025 2:14 UTC

19 points

2 comments4 min readLW link

Takes on Takeoff

atharva25 Mar 2025 0:20 UTC

10 points

0 comments4 min readLW link