Is the worry that if the overseer is used at training time, the model will be eval aware and learn to behave differently when overseen?
Evaluation awareness isn’t what I had in mind. It’s more like, there’s always an adversarial dynamic between your overseer and your subject model. As the subject model gets more generally powerful, it could learn to cheat and deceive the oversight model even if the oversight model is superhuman at grading outputs.
If we only use the oversight models for auditing, the subject model will learn less about the overseer and will probably struggle more to come up with deceptions that work. However, I also feel like oversight mechanisms are in some sense adding less value if they can’t be used in training?