Thanks for the comment, all sensible scepticisms IMO!
I agree that the low-stakes context part of the sketch (key claim 4) is the weakest part, and in particular that we don’t emphasise enough the defeater of “the deployment set-up just won’t be upheld” (because of persuasion, as you mention, but also e.g. the agent messing with the online learning process in the offline datacentre). We spent less time on it because we ultimately want to (try to) expand to high-stakes contexts, which will look pretty different, so this was more of a stop-gap rough picture so we could focus on getting the rest right. That said, I’m maybe more optimistic than you that there’ll be a relevant period where the above issues can be sufficiently dealt with via control, and where debate is pretty important for catching subtler research sabotage.
On debaters converging to honesty rather than subtle manipulation: I’m also pretty unsure whether this will work, and keen to see how it plays out empirically once we get LLMs that are a bit better at debate. I do think recursive debate makes it more likely that honesty is a winning strategy (relative to human-style debates), because debaters can lose on a single inconsistency or manipulative argumentative strategy, rather than being able to bury it among lots of claims (see also my reply to Charlie Steiner below).