Yup, that sounds cool! Rephrasing is definitely easier for single-turn evals than multi-turn ones.
whether these models elicit eval awareness based on the wording or the described scenarios
I’d expect its a combination of both those factors that make the model think “oh, I’m probably in deployment”.
Yup, that sounds cool! Rephrasing is definitely easier for single-turn evals than multi-turn ones.
I’d expect its a combination of both those factors that make the model think “oh, I’m probably in deployment”.