I appreciate the praise and insights!
I hadn’t thought of the sandbagging version of adversarial evals, and it sounds interesting, though I’m a bit confused about the specifics of the reward function. It sounds to me like, in order to catch sandbagging, you’d need an example of the same base model performing better on the task?
On asymmetries in the reward structure: if I understood correctly, I think this is covered by the biased overseer? A critique is "too good" when the overseer overestimates how similar a bad word is to the clue (roughly the toy sketch below).
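For concreteness, here’s a minimal toy sketch of what I have in mind. The similarity numbers, the `bias` knob, and the reward function are all made up for illustration, not anything from your actual setup:

```python
# Toy sketch: an overseer scores how similar a clue is to each bad word.
# A "biased" overseer inflates that similarity, so a critique pointing at a
# bad word looks stronger than it should -- the asymmetry lives in the
# overseer's estimate, not in the critique itself.

# Hypothetical ground-truth similarities between a clue and candidate words.
TRUE_SIMILARITY = {
    ("ocean", "wave"): 0.8,   # good word: genuinely related to the clue
    ("ocean", "bank"): 0.3,   # bad word: only weakly related to the clue
}

def overseer_similarity(clue: str, word: str, bias: float = 1.0) -> float:
    """Overseer's (possibly biased) estimate of clue-word similarity."""
    return min(1.0, TRUE_SIMILARITY[(clue, word)] * bias)

def critique_reward(clue: str, bad_word: str, bias: float) -> float:
    """Reward the critiquer gets for claiming `clue` points at `bad_word`.

    With bias > 1 the overseer overestimates the similarity, so even a
    weak critique ("ocean is close to bank") gets judged as 'too good'.
    """
    return overseer_similarity(clue, bad_word, bias)

print(critique_reward("ocean", "bank", bias=1.0))  # 0.3 -- unbiased judgement
print(critique_reward("ocean", "bank", bias=2.0))  # 0.6 -- inflated by the biased overseer
```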
Open to hearing more ideas; I agree there’s more that can be done with this setup.
Hey, this sounds a bit like Arbor Summer Camp. Yeah, this is a beautiful vision.