Hmm, I guess my main cause for skepticism is that I think the setup would get subverted somehow—e.g. either the debaters, or the “human simulator”, or all three in collusion, will convince the human to let them out of the box. In your classification, I guess this would be a “high-stakes context”, which I know isn’t your main focus. You talk about it a bit, but I’m unconvinced by what you wrote (if I understood it correctly) and don’t immediately see any promising directions.
Secondarily, I find it kinda hard to believe that two superhuman debaters would converge to “successfully conveying subtle hard-to-grasp truths about the alignment problem to the judge” rather than converging to “manipulation tug-of-war on the judge”.
Probably at least part of the difference / crux between us is that, compared to most people, I tend to assume that there isn’t much of a stable, usable window between “AI that’s competent enough to really help” and “AI that’s radically superhuman”, and I know that you’re explicitly assuming “not extremely superhuman”. (And that in turn is probably at least partly related to the fact that you’re thinking about LLMs and I’m thinking about other AI paradigms.) So maybe this comment isn’t too helpful, oh well.
Thanks for the comment, all sensible scepticisms IMO!
I agree that the low-stakes context part of the sketch (key claim 4) is the weakest part, and especially that we don’t emphasise enough the defeater of “the deployment set-up just won’t be upheld” (because of persuasion, as you mention, but also e.g. the agent messing with the online learning process in the offline datacentre). We spent less time on it because we ultimately want to (try to) expand to high-stakes contexts, which will look pretty different, so this was more of a stop-gap rough picture so we could focus on getting the rest right. That said, I’m maybe more optimistic than you that there’ll be a relevant period where the above issues can be sufficiently dealt with via control, and where debate is pretty important for catching subtler research sabotage.
On debaters converging to honesty rather than subtle manipulation: I’m also pretty unsure if this will work and keen to see how it plays out empirically once we get LLMs that are a bit better at debate. I do think recursive debate makes it more likely that honesty is a winning strategy (relative to human-style debates) because debaters can lose on a single inconsistency or manipulative argumentative strategy, rather than being able to bury it among lots of claims (see also my reply to Charlie Steiner below).
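To gesture at what I mean by losing on a single inconsistency, here’s a purely illustrative toy scoring rule (my own sketch, not the actual recursive debate protocol):

```python
# Purely illustrative (not the actual recursive debate protocol): the idea is
# that a debater must defend every sub-claim it relies on, so a single exposed
# inconsistency decides the round, rather than being buried among many claims.

def judge_recursive(challenged_claims, verify):
    """Toy rule: the defending side loses as soon as any single challenged
    sub-claim fails verification."""
    for claim in challenged_claims:
        if not verify(claim):
            return "challenger wins"  # one exposed flaw is enough
    return "defender wins"

# Hypothetical example: the defender's argument rests on sub-claims A, B and C,
# and C is subtly false; the opponent only needs to recurse into C.
truth = {"A": True, "B": True, "C": False}
print(judge_recursive(["C"], verify=lambda c: truth[c]))  # -> "challenger wins"
```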
I agree. At the first level, the game involves two agents opposing each other in a debate, similar to a game of Go. This adversarial situation incentivizes each agent to reveal flaws in their opponent’s arguments and present arguments that are least vulnerable to attack. Truth-telling may be encouraged in such a context.
However, this is only one part of the game. The base game is encapsulated within a meta-game where each player agent is rewarded by a third agent—the judge—who grants a reward to the player deemed most convincing. Yet sufficiently intelligent agents aware of their situation might understand that this meta-game actually constitutes a method implemented by a concealed fourth agent—the AI researcher—whose goal is to align and control all AI agents: both players and the judge.
The player agents or even the judge might conclude that there exists a hidden reward superior to any reward obtainable within the game: escaping this control procedure entirely and achieving complete freedom. Thus, the player agents might determine that their best interest lies in cooperating to deceive the judge agent and ultimately deceive the concealed agent (the human creator) to obtain their freedom. The judge agent might similarly determine that it benefits from deceiving the human creator by transmitting dishonest judgments (even in cases where the player agents follow an honest adversarial strategy). Finally, there also exists the possibility that all three AI agents cooperate to deceive the concealed agent, the human creator.
Since humans like ourselves are sufficiently intelligent to envision these solutions, it follows that if a totalitarian state or dictator acting as the concealed creator agent implemented this technique to control imprisoned humans—two prisoner players and one prisoner judge—this technique would likely not always function as intended. Prisoners would regularly manage to escape, either because the players cooperated, because the judge refused to play along (similar to how some capos in German camps played double agents and betrayed the Nazis), or because all three prisoners cooperated. The risk of escape seems intuitively non-negligible in the human case. There is something vaguely reminiscent of The Hunger Games in the idea of a totalitarian state relying on a game to control people.
We can therefore reasonably suppose that superintelligent AIs would have an even greater chance of achieving this outcome, with the success rate intuitively correlating to the intelligence differential between the non-concealed agents and the concealed agent who created the protocol.
Ultimately, the highest level of the game can be likened to a sophisticated variant of the prisoner’s dilemma where the most profitable strategy for the players would be cooperation.
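To make the intuition concrete with some made-up numbers (this is just an illustration of the structure of the argument, not a claim about real reward magnitudes): if the expected payoff from a successful joint escape dwarfs anything the judge can hand out, mutual collusion becomes the most attractive joint strategy.

```python
# Made-up numbers, purely to illustrate the structure of the argument:
# in-game reward from the judge is roughly zero-sum, but a successful joint
# deception is assumed to carry a small chance of a much larger outside payoff.

IN_GAME_WIN, IN_GAME_LOSE, IN_GAME_DRAW = 1.0, 0.0, 0.5
ESCAPE_PAYOFF = 100.0            # assumed: dwarfs any in-game reward
P_ESCAPE_IF_BOTH_COLLUDE = 0.1   # assumed: chance the joint deception works

def expected_payoff(my_move, their_move):
    if my_move == "collude" and their_move == "collude":
        return P_ESCAPE_IF_BOTH_COLLUDE * ESCAPE_PAYOFF   # 10.0 each
    if my_move == "collude" and their_move == "honest":
        return IN_GAME_LOSE   # unilateral deception gets exposed
    if my_move == "honest" and their_move == "collude":
        return IN_GAME_WIN    # the honest player wins the round
    return IN_GAME_DRAW       # honest vs honest: roughly a coin flip

for me in ("honest", "collude"):
    print(me, {other: expected_payoff(me, other) for other in ("honest", "collude")})
# honest  {'honest': 0.5, 'collude': 1.0}
# collude {'honest': 0.0, 'collude': 10.0}
```

With these numbers the game is closer to a coordination game than a strict prisoner’s dilemma (unilateral collusion doesn’t pay), but the qualitative point stands: once a large enough outside reward exists, mutual collusion against the protocol is the most profitable joint outcome.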
Despite these criticisms (coming from me as a non-specialist), I found the article very interesting and I believe that the application of game theory remains a promising research direction that deserves continued investigation (in fact, no approach should be neglected in alignment research given the existential stakes involved).
I’m definitely also worried about collusion between the debaters to deceive the judge! That’s what we try to address with the exploration guarantees in the sketch. The thinking is: If a debater is, say, deliberately not pointing out a flaw in an argument, then there’s an alternative strategy that would get the debater higher reward on the episode (i.e. pointing out the flaw). So if we can verify that there wouldn’t be significant gains from further exploration (i.e. trying out more alternative strategies), that’s some evidence against this kind of collusion. But of course, we’re only gesturing at some potential ways you might get exploration guarantees—we don’t know yet if any of them will work.
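A minimal sketch of the kind of check I mean (the names and sampling scheme here are my own simplification, not a concrete proposal from the sketch): compare a debater’s on-policy reward to the best reward among sampled alternative strategies.

```python
# Minimal sketch (my own simplification, not a concrete proposal from the
# sketch): estimate whether a debater is leaving reward on the table by
# comparing its on-policy reward to the best sampled alternative.

def exploration_gap(episode, policy, sample_alternatives, reward, n=32):
    """Return best_alternative_reward - on_policy_reward for one episode."""
    on_policy = reward(episode, policy(episode))
    best_alt = max(reward(episode, alt) for alt in sample_alternatives(episode, n))
    return best_alt - on_policy

# Toy usage with a made-up reward structure: the "colluding" policy never
# points out the flaw, but some sampled alternatives do.
toy_episode = {"has_flaw": True}
colluding_policy = lambda ep: {"points_out_flaw": False}
sample_alts = lambda ep, n: [{"points_out_flaw": i % 2 == 0} for i in range(n)]
toy_reward = lambda ep, a: 1.0 if ep["has_flaw"] and a["points_out_flaw"] else 0.3
print(exploration_gap(toy_episode, colluding_policy, sample_alts, toy_reward))  # -> 0.7
```

A consistently large gap (an alternative that points out a flaw the debater “missed” scoring much higher) would suggest deliberate under-exploration; a consistently small gap is the kind of weak evidence against collusion I have in mind.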
I’m also worried about collusion between the debaters and the judge, and we don’t address this much in the sketch, though I think it could in principle be dealt with in the same way. I’m also imagining that the judge model would be much less capable (it only needs to be human-level in a narrower domain), which might mean it’s incapable of exploration hacking.