Charlie Steiner comments on An alignment safety case sketch based on debate

Charlie Steiner 9 May 2025 7:23 UTC
LW: 2 AF: 2
If we’re talking about the domain where we can assume “good human input”, why do we need a solution more complicated than direct human supervision/demonstration (perhaps amplified by reward models or models of human feedback)? I mean this non-rhetorically; I have my own opinion (that debate acts as an unprincipled way of inserting one round of optimization for meta-preferences [if confusing, see here]), but it’s probably not yours.
- Marie_DB 9 May 2025 9:23 UTC
  LW: 5 AF: 5
  Two reasons:
  1. Identifying vs. evaluating flaws: In debate, human judges get to see critiques made by a superhuman system. They probably couldn’t have identified some of these critiques themselves, but they can still evaluate them (on the assumption, plausible IMO, that evaluating is easier than generating). That makes human judges more likely to give a correct signal.
  2. Whole debate vs. single subclaim: In recursive debate, human judges evaluate only a single subclaim. The (superhuman) debaters do the work of picking out which subclaim they’re most likely to win on, and we use the outcome on that subclaim as a proxy for who would’ve won the whole debate. This is a more efficient use of human judges’ time, and it’s probably also easier for human judges to evaluate smaller subproblems.
  For those reasons, I think we can likely get “good human input” for a bigger set of questions with debate than with direct human supervision.
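
To make the second point concrete, here is a toy sketch of the "judge only one subclaim" idea. It is not the protocol from the post: the names (`Claim`, `challenger_pick`, `recursive_debate`, `make_judge`) are hypothetical, and it assumes that a claim is true exactly when all of its subclaims are true, that the challenging debater always knows where a flaw is, and that the judge is reliable on leaf-level claims.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Claim:
    text: str
    is_true: bool  # toy ground truth; a real judge would not have access to this
    subclaims: List["Claim"] = field(default_factory=list)


def challenger_pick(claim: Claim) -> Claim:
    """Assumed superhuman challenger: dispute a flawed subclaim if one exists."""
    flawed = [c for c in claim.subclaims if not c.is_true]
    return flawed[0] if flawed else claim.subclaims[0]


def recursive_debate(
    claim: Claim,
    pick_subclaim: Callable[[Claim], Claim],
    judge: Callable[[Claim], bool],
) -> Tuple[bool, List[str]]:
    """Follow the challenger's picks down to one leaf and judge only that leaf."""
    path = [claim.text]
    while claim.subclaims:
        claim = pick_subclaim(claim)
        path.append(claim.text)
    # The verdict on the single disputed leaf stands in for the whole debate.
    return judge(claim), path


def make_judge() -> Tuple[Callable[[Claim], bool], List[str]]:
    """Toy judge, assumed reliable on leaves; records every claim it evaluates."""
    calls: List[str] = []

    def judge(leaf: Claim) -> bool:
        calls.append(leaf.text)
        return leaf.is_true

    return judge, calls


def count_leaves(claim: Claim) -> int:
    """Number of leaves a judge would face when checking everything directly."""
    if not claim.subclaims:
        return 1
    return sum(count_leaves(c) for c in claim.subclaims)


# A small argument tree with one flawed step buried in the second lemma.
root = Claim("proof is correct", False, [
    Claim("lemma 1 holds", True, [Claim("step 1a", True), Claim("step 1b", True)]),
    Claim("lemma 2 holds", False, [Claim("step 2a", True), Claim("step 2b", False)]),
])

judge, calls = make_judge()
verdict, path = recursive_debate(root, challenger_pick, judge)
print(verdict, path, len(calls), count_leaves(root))
```

In this toy tree the judge evaluates one leaf instead of four. The saving depends entirely on the stated assumptions, in particular that the challenger reliably steers toward a real flaw when one exists.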