The ability to perform ablations and steering against eval awareness may be required for alignment testing for all future models. Hopefully it is still sufficient.
One thing that confuses me: they find multiple lines of evidence that both verbalized and unverbalized evaluation awareness confound results, yet they often still run evaluations without mitigations applied. For example, it’s unclear what evidence the internal codebase sabotage propensity eval is supposed to provide, given that they immediately call it out as sensitive to evaluation awareness: