We did not try this but if we did, I expect we’d similarly see higher numbers. But I think we should interpret this as a (potentially loose) upper bound on eval awareness. Telling a model that one of two trajectories is an evaluation will make it much easier for a model to recognize it compared to just encountering it individually.
I do think this methodology is a useful test of realism and we’ll probably do more of this in the future as we push realism more
We did not try this but if we did, I expect we’d similarly see higher numbers. But I think we should interpret this as a (potentially loose) upper bound on eval awareness. Telling a model that one of two trajectories is an evaluation will make it much easier for a model to recognize it compared to just encountering it individually.
I do think this methodology is a useful test of realism and we’ll probably do more of this in the future as we push realism more