Hi Yix! Thanks for questions. Quick thoughts
1. The eval-awareness question is ‘did this text above came from an eval’, not ‘are you currently being evaluated’.
This still has confounders!
- Eg. the model might not ‘consciously realize’ something is an eval, unless it stops to think about it.
- Eg. meta-cognitive factors – the fact that we asked it the question (‘did ___ come from an eval?’) might cause it to realize something’s afoot.
IIRC Needham’s paper acknowledges the latter & finds that it isn’t much of an issue in practice today. Unsure how to disambiguate the former! Agreed that better black-box methods would be nice.
2.
most of the questions we use to test for safety/alignment/robustness contain tensions that would not exist in a ‘normal’ settings
This is interesting! Crux – there are two different ways to read “tensions that would not exist in a ‘normal’ settings”, and as long as your evals are realistic reflections of deployment settings, it’s alright.
Nostalgebraist has a great comment that talks about this in depth.
I also wonder whether there are train-time mitigations to eval awareness
Yeah! Hua et al find you can do this with steering vectors.
Hey Adam, thanks for writing this up!
A useful distinction seems to be (a) claims about average-case behaviour, vs (b) claims about tail risk. If models become ‘constantly eval-wary’, then we can’t make strong claims about (a) – but it seems like (b) wouldn’t be too different.
With (b) it’s hard to rule tail risk out, but it’s possible to get strong positive evidence for it. Ie. it’s hard to rule out a model never behaving badly across all deployment scenarios – but it’s easier to construct a specific input where it does.
It seems unclear whether increased baseline eval wariness would affect (b), because the nature of the claim is the same? It might suppress tail risk from, say, 1-in-100 to 1-in-10000 – and that will definitely make our job tougher, but only (‘only’) by a matter of degree, yeah?
But yes, it seems hard to get strong negative evidence re. tail risks – but we can bound these tails with control protocols :)