Is it something like “during deployment, the simulated human judges might be asked to answer questions far outside the training distribution, and so they might fail to accurately simulate humans (or humans might be worse than they are on-distribution)”?
Yes, but my concern also includes this happening during training of the debaters, when the simulated or actual humans can also go out of distribution. For example, the actual human is asked a type of question they have never considered before and either answers in a confused way or has to fall back on philosophical reasoning and a lot of time to try to answer, or perhaps one of the debaters effectively “jailbreaks” the human via some sort of out-of-distribution input.
The solution in the sketch is to keep the question distribution during deployment similar to the training distribution, plus doing online training during deployment (the simulated human judges could also be subject to online training). Is there a reason you think that won’t work?
This intuitively seems hard to me, but since Geoffrey mentioned that you have a doc coming out related to this, I’m happy to read it to see if it changes my mind. But this still doesn’t solve the whole problem, because as Geoffrey also wrote, “Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there’s no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn’t competitive.”