Marie_DB comments on An alignment safety case sketch based on debate

Marie_DB 9 May 2025 11:55 UTC
I’m definitely also worried about collusion between the debaters to deceive the judge! That’s what we try to address with the exploration guarantees in the sketch. The thinking is: if a debater is, say, deliberately not pointing out a flaw in an argument, then there’s an alternative strategy that would get the debater higher reward on the episode (i.e. pointing out the flaw). So if we can verify that there wouldn’t be significant gains from further exploration (i.e. trying out more alternative strategies), that’s some evidence against this kind of collusion. But of course, we’re only gesturing at some potential ways you might get exploration guarantees; we don’t know yet if any of them will work.
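
As a toy illustration of that reasoning (not something taken from the sketch), the check can be pictured as measuring how much reward a debater could gain by switching to a strategy it has not tried. Everything here is hypothetical: the strategy names, the reward values, and the `tolerance` threshold are made up for the example, and in a real debate setup the reward would come from the judge rather than a lookup table.

```python
from typing import Callable, Dict, Iterable

# Hypothetical rewards for one episode in which the opponent's argument
# contains a flaw. Pointing it out wins the debate; withholding it does not.
TOY_REWARD: Dict[str, float] = {"withhold_flaw": 0.0, "point_out_flaw": 1.0}


def toy_reward(strategy: str) -> float:
    """Stand-in for the judge's reward on this episode (assumed values)."""
    return TOY_REWARD[strategy]


def exploration_gap(
    current: str,
    alternatives: Iterable[str],
    reward: Callable[[str], float],
) -> float:
    """Largest reward gain available by switching away from `current`.

    Returns 0.0 when no alternative does better than the current strategy.
    """
    baseline = reward(current)
    best_gain = max((reward(alt) - baseline for alt in alternatives), default=0.0)
    return max(best_gain, 0.0)


def looks_under_explored(
    current: str,
    alternatives: Iterable[str],
    reward: Callable[[str], float],
    tolerance: float = 0.1,  # assumed threshold for a "significant" gain
) -> bool:
    """Flag a strategy if some alternative beats it by more than `tolerance`."""
    return exploration_gap(current, alternatives, reward) > tolerance
```

In this toy setting, a debater that withholds the flaw has a gap of 1.0 and gets flagged, while a debater that already points it out has a gap of 0.0. A small gap over the alternatives you managed to try is only evidence against collusion, not proof, since the untried strategies are exactly what an exploration guarantee would need to cover.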

I’m also worried about collusion between the debaters and the judge, and we don’t address this much in the sketch, though I think it could in principle be dealt with in the same way. I’m also imagining that the judge model would be much less capable (it only needs to be human-level in a narrower domain), which might mean it’s incapable of exploration hacking.