We would consider “black-box conservative evaluations on whether our AIs could subvert our (scalable) oversight techniques” to be a special case of black-box conservative control evaluations.
Insofar as you are exploring assumptions other than “the AI is trying to subvert our safeguards”, I wouldn’t consider it to be control.
Makes sense. I think I had in mind something like “estimate P(scheming | high reward) by evaluating P(high reward | scheming)”. But evaluating P(high reward | scheming) is itself a black-box conservative control evaluation; any resulting update to P(scheming) is just a nice byproduct.
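To spell out the relationship between those two quantities, here is Bayes' rule written for this case. This is only a sketch: the prior P(scheming) and the term P(high reward | not scheming) are not outputs of a control evaluation and would have to be estimated or assumed separately.

$$
P(\text{scheming} \mid \text{high reward}) = \frac{P(\text{high reward} \mid \text{scheming})\, P(\text{scheming})}{P(\text{high reward} \mid \text{scheming})\, P(\text{scheming}) + P(\text{high reward} \mid \neg\text{scheming})\, \bigl(1 - P(\text{scheming})\bigr)}
$$

The control evaluation supplies, conservatively, only the P(high reward | scheming) term. If that term comes out small relative to P(high reward | not scheming), observing high reward pushes the posterior below the prior.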