ryan_greenblatt comments on Thoughts on the conservative assumptions in AI control

ryan_greenblatt 18 Jan 2025 20:49 UTC
3 points
We would consider “black-box conservative evaluations on whether our AIs could subvert our (scalable) oversight techniques” to be a special case of black-box conservative control evaluations.

Insofar as you are exploring assumptions other than “the AI is trying to subvert our safeguards”, I wouldn’t consider it to be control.
What links here?
- Oliver Daniels's comment on Research directions Open Phil wants to fund in technical AI safety by jake_mendel (9 Feb 2025 21:13 UTC; 3 points)
- Oliver Daniels 19 Jan 2025 2:24 UTC
  1 point
  Makes sense. I think I had in mind something like “estimate P(scheming | high reward) by evaluating P(high reward | scheming)”. But evaluating P(high reward | scheming) is itself a black-box conservative control evaluation; the possible updates to P(scheming) are just a nice byproduct.
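
To make the "byproduct" concrete, here is a minimal sketch of the Bayes update being gestured at. It is only an illustration: the function name, the prior, and both likelihood values are hypothetical, and the term P(high reward | not scheming) is an extra assumption that the comment does not mention but that the update needs.

```python
def posterior_scheming(prior, p_reward_given_scheming, p_reward_given_not_scheming):
    """Bayes' rule for P(scheming | high reward).

    prior                        -- P(scheming), a hypothetical input
    p_reward_given_scheming      -- P(high reward | scheming), the quantity a
                                    control evaluation would estimate
    p_reward_given_not_scheming  -- P(high reward | not scheming), an assumed
                                    input not discussed in the comment
    """
    # Total probability of observing high reward under either hypothesis.
    evidence = (p_reward_given_scheming * prior
                + p_reward_given_not_scheming * (1.0 - prior))
    if evidence == 0.0:
        raise ValueError("high reward has zero probability under both hypotheses")
    return p_reward_given_scheming * prior / evidence


# Hypothetical numbers: if a scheming model rarely gets past oversight,
# then observing high reward lowers the probability of scheming.
updated = posterior_scheming(prior=0.2,
                             p_reward_given_scheming=0.1,
                             p_reward_given_not_scheming=0.9)
print(f"P(scheming | high reward) = {updated:.3f}")
```

The control evaluation only supplies the P(high reward | scheming) input. Whether the resulting posterior is informative depends on the other two numbers, which is one way to read why the update is a byproduct rather than the point of the evaluation.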