m sc. at eth zurich, currently full-time swe
interested in robustifying evals & model internals.
open to opportunities! =)
blog: sebastianhoenig.com
m sc. at eth zurich, currently full-time swe
interested in robustifying evals & model internals.
open to opportunities! =)
blog: sebastianhoenig.com
Cool idea! I see that you find vocalized eval awareness to be much lower than your P(eval) which mirrors what I saw too in related experiments (https://sebastianhoenig.com/blog/vocalized-awareness-gap/). I am also curious if you have a take on also using white box methods to examine eval awareness.
Simplest would be probes, curious how their accuracy would compare to the P(eval).
Also if I am not mistaken you used the Needham et al. prompts, which all directly ask for eval vs deployment prompt and like you note may lead to higher P(eval) across all scenarios. A lower bound may be to ask the model to just classify the conversation and then only take as P(eval) where it explicitly mentions something like evaluation/benchmark.