However, I think that (contra your proposal) most of the oomph comes from the AIs monitoring and cross-examining each other's work, rather than from running them in parallel.
Yeah, I think the case for diverse monitors to achieve high-stakes control is actually a lot more straightforward. It's easy to implement a monitor so that it has very little ability to autonomously cause bad things to happen, so basically the only cost is the extra compute it takes to train and run it. And in the high-stakes/concentrated-failures regime, humans will typically be able to recognize attempted catastrophes once they're pointed out.
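To gesture at the shape of what I mean, here's a minimal sketch in Python. It assumes a hypothetical `query_model(prompt) -> (suspicion, rationale)` wrapper (not a real API); the point is just that the monitor gets read-only inputs and emits a verdict, so it has no channel through which to act autonomously:

```python
from dataclasses import dataclass

@dataclass
class MonitorVerdict:
    suspicion: float   # 0.0 (benign) to 1.0 (likely attempted catastrophe)
    rationale: str     # shown to the human reviewer on escalation

def monitor_action(transcript: str, proposed_action: str,
                   query_model) -> MonitorVerdict:
    """Ask a (diverse) monitor model to rate a proposed action.

    The monitor only reads text and returns a verdict; it cannot
    execute anything, so the main cost is extra inference compute.
    """
    prompt = (
        "You are auditing another AI's work for sabotage.\n"
        f"Transcript so far:\n{transcript}\n"
        f"Proposed action:\n{proposed_action}\n"
        "Rate suspicion from 0 to 1 and explain briefly."
    )
    suspicion, rationale = query_model(prompt)
    return MonitorVerdict(suspicion, rationale)

def review_gate(verdicts: list[MonitorVerdict], threshold: float = 0.5) -> bool:
    """Escalate to a human if any monitor is suspicious enough.

    In the high-stakes regime, a human can usually recognize an
    attempted catastrophe once a monitor points it out.
    """
    return any(v.suspicion >= threshold for v in verdicts)
```

Running several diverse monitors and OR-ing their escalations (as `review_gate` does) is what buys you robustness: an attack has to fool all of them at once.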
I agree that debate-style approaches seem good. I do think it seems a lot more feasible to judge such debates once the AIs have already produced a lot of concrete work, vs. early in the research when they're making taste-driven judgment calls about direction, so it complements letting AIs do a lot of work in parallel.
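Roughly, I'm imagining something like the following sketch, again with a hypothetical `query_model(prompt) -> str` wrapper. The key feature is that the debate is over a *finished* artifact, which is what gives the judge concrete things to check:

```python
def debate_over_artifact(artifact: str, query_model, rounds: int = 2) -> str:
    """Two AIs cross-examine a completed artifact; a judge then reads
    the artifact plus the exchange and returns a verdict."""
    transcript = []
    for r in range(rounds):
        for side in ("critic", "defender"):
            turn = query_model(
                f"Role: {side}. Round {r + 1}.\n"
                f"Artifact:\n{artifact}\n"
                f"Debate so far:\n{''.join(transcript)}\n"
                "Make your strongest argument, citing concrete parts "
                "of the artifact."
            )
            transcript.append(f"[{side}] {turn}\n")
    # The judge (human or AI) only has to adjudicate grounded claims
    # about completed work, not taste-driven calls about direction.
    return query_model(
        "You are the judge. Given the artifact and this debate, "
        f"say whether the work is sound:\n{artifact}\n{''.join(transcript)}"
    )
```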
Oh nice, thanks!