What counts as “smart” eval awareness, as opposed to noticing structural features of the situation that suggest a particular type of evaluation? For example, (2) often routes through this. If I recall correctly from the recent post on improving tool realism in Petri, for Sonnet 4.5 they saw that the model was heavily conditioned on the presence of a specific type of conflict in the scenario, rather than on realism per se.
Models aren’t really incentivized (even in the “smart” eval awareness case) to figure out whether they’re in a simulation, and if you’re imagining that “smart” eval awareness is (at least initially) motivated by (2), these distinctions seem really messy to me.
I think the distinction for (1) is conceptually clean and useful insofar as you’re thinking about interventions on pretraining or SDF.