Marie_DB comments on An alignment safety case sketch based on debate

Marie_DB 9 May 2025 9:23 UTC
LW: 5 AF: 5
Two reasons:
1. Identifying vs. evaluating flaws: In debate, human judges get to see critiques made by a superhuman system. They probably couldn’t have identified some of these critiques themselves, but they can still evaluate them, on the (IMO plausible) assumption that evaluating is easier than generating. That makes human judges more likely to give a correct signal.
2. Whole debate vs. single subclaim: In recursive debate, human judges evaluate only a single subclaim. The (superhuman) debaters do the work of picking out which subclaim they’re most likely to win on, and we use the verdict on that subclaim as a proxy for who would’ve won the whole debate. This is a more efficient use of human judges’ time, and it’s probably also easier for human judges to evaluate smaller subproblems.
For those reasons, I think we can likely get “good human input” on a bigger set of questions with debate than with direct human supervision.
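
To make point 2 concrete, here is a toy sketch of the "judge only one subclaim" idea. Everything in it is an illustrative assumption rather than the protocol from the post: the `Claim` tree, the ground-truth `holds` flag, the idealized challenger in `pick_contested_leaf`, and a `judge` that is reliable only on leaf claims are all hypothetical stand-ins.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str
    holds: bool  # ground truth; visible only inside this toy model
    subclaims: list[Claim] = field(default_factory=list)


def pick_contested_leaf(claim: Claim) -> Claim:
    """Idealized challenger (assumption): descend into a flawed subclaim
    whenever one exists, otherwise into the first subclaim."""
    while claim.subclaims:
        flawed = [c for c in claim.subclaims if not c.holds]
        claim = flawed[0] if flawed else claim.subclaims[0]
    return claim


def judge(leaf: Claim) -> bool:
    """Hypothetical human judge, assumed reliable on small leaf claims only."""
    assert not leaf.subclaims, "the judge only ever sees a single leaf subclaim"
    return leaf.holds


def recursive_debate(root: Claim) -> tuple[bool, int]:
    """Return (verdict used as a proxy for the whole debate, judge calls used)."""
    leaf = pick_contested_leaf(root)
    return judge(leaf), 1


def count_leaves(claim: Claim) -> int:
    """Leaf claims a judge would have to check without the debaters' help."""
    if not claim.subclaims:
        return 1
    return sum(count_leaves(c) for c in claim.subclaims)


# A made-up argument whose only flaw is buried in step 2b.
root = Claim("the argument is correct", False, [
    Claim("step 1", True),
    Claim("step 2", False, [Claim("step 2a", True), Claim("step 2b", False)]),
    Claim("step 3", True),
])

verdict, judge_calls = recursive_debate(root)
print(verdict, judge_calls, count_leaves(root))
```

In this toy model the judge makes one evaluation instead of one per leaf, and the verdict on the chosen leaf matches the truth of the root claim only because the challenger is assumed to find a flaw whenever one exists. That assumption is doing the real work here, not the code.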