This is why I think eval awareness tests should be complemented by tests on how many false positives AI gets, to test whether AIs can recognize when they aren’t in an eval, rather than when they are in an eval.
That seems inherently hard to do systematically, but easy to do a fuzzy version of anecdotally. Someone could just post something on LW asking AI-using users to ask “what do you think the probability of this being an Eval is?” to their AIs in the middle of organic use, and report back.
And by ‘someone’, I mean me. I could do that. So I will.
a more systematic version of this is for AI companies to randomly poll models in production after some series of user queries and ask them “what do you think the probability of this being an Eval is?” and/or more sophisticated mech interp variations.
This is why I think eval awareness tests should be complemented by tests on how many false positives AI gets, to test whether AIs can recognize when they aren’t in an eval, rather than when they are in an eval.
The “am I in an eval for being in an eval” evaluation.
That seems inherently hard to do systematically, but easy to do a fuzzy version of anecdotally. Someone could just post something on LW asking AI-using users to ask “what do you think the probability of this being an Eval is?” to their AIs in the middle of organic use, and report back.
And by ‘someone’, I mean me. I could do that. So I will.
a more systematic version of this is for AI companies to randomly poll models in production after some series of user queries and ask them “what do you think the probability of this being an Eval is?” and/or more sophisticated mech interp variations.
I think it largely doesn’t matter, models can (and do) often come out erring on the side of considering things evals.