I think there are roughly two things you can do:
1. In some cases, we will be able to get more accurate answers if we spend more resources (teams of people with more expertise taking more time, etc.). If we can do that, and we know μ (which is hard), we can get some purchase on ε.
2. We tune ε not based on what’s safe, but based on what’s competitive. I.e., we want to solve some particular task domain (AI safety research or the like), so we increase ε until it starts to block progress, then dial it back a bit (see the sketch below). This option isn’t amazing, but I do think it’s a move we’ll have to take for a bunch of safety parameters, assuming there are parameters which have some capability cost.
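As a rough illustration of the second option, here is a minimal sketch of that tuning loop, under some assumptions not in the original: that ε is a scalar where larger values are safer but cost capability, and that we have a benchmark score standing in for "making progress." The names `tune_epsilon`, `toy_benchmark`, and the tolerance `max_drop` are all hypothetical, not anything from an actual protocol.

```python
def tune_epsilon(run_benchmark, eps_grid, max_drop=0.02):
    """Sweep the safety parameter upward until task performance degrades,
    then return the last value that still kept performance acceptable.

    run_benchmark: callable mapping an epsilon value to a task score.
    eps_grid: increasing sequence of candidate epsilon values
        (assumed stricter/safer as epsilon grows).
    max_drop: largest tolerated score drop relative to the loosest setting.
    """
    baseline = run_benchmark(eps_grid[0])  # score with the weakest restriction
    chosen = eps_grid[0]
    for eps in eps_grid[1:]:
        score = run_benchmark(eps)
        if baseline - score > max_drop:
            break  # this setting starts to block progress; keep the previous one
        chosen = eps
    return chosen


def toy_benchmark(eps):
    # Stand-in for a real evaluation: performance is flat until the
    # safety parameter starts to bite, then falls off.
    return 0.9 if eps <= 0.4 else 0.9 - 0.5 * (eps - 0.4)


print(tune_epsilon(toy_benchmark, [0.0, 0.1, 0.2, 0.4, 0.8]))  # -> 0.4
```

In practice the "benchmark" would be something much noisier (e.g., whether a research team is still shipping results under the restriction), so the dial-back step is doing real work: you want to land a bit below the point where the capability cost first shows up, not right on it.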