I agree. At the first level, the game involves two agents opposing each other in a debate, similar to a game of Go. This adversarial situation incentivizes each agent to reveal flaws in their opponent’s arguments and present arguments that are least vulnerable to attack. Truth-telling may be encouraged in such a context.
However, this is only one part of the game. The base game is encapsulated within a meta-game where each player agent is rewarded by a third agent—the judge—who grants a reward to the player deemed most convincing. Yet sufficiently intelligent agents aware of their situation might understand that this meta-game actually constitutes a method implemented by a concealed fourth agent—the AI researcher—whose goal is to align and control all AI agents: both players and the judge.
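To make sure I understand the nested structure, here is a minimal toy sketch of the three levels as I read them. The class names, the zero-sum reward rule, and the researcher’s truth-tracking objective are my own illustrative assumptions, not anything specified in the article.

```python
from dataclasses import dataclass

@dataclass
class Transcript:
    """One debate episode: the two debaters' arguments plus the judge's verdict."""
    argument_a: str
    argument_b: str
    verdict: str  # "A" or "B", whichever side the judge found more convincing

def debater_rewards(t: Transcript) -> dict:
    """Levels 1-2: the judge's verdict induces a zero-sum reward for the debaters."""
    return {"debater_A": 1.0 if t.verdict == "A" else -1.0,
            "debater_B": 1.0 if t.verdict == "B" else -1.0}

def researcher_objective(t: Transcript, ground_truth: str) -> float:
    """Level 3 (the concealed agent): the researcher cares not about who wins,
    but about whether the verdict tracks the truth of the question under debate."""
    return 1.0 if t.verdict == ground_truth else 0.0
```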
The player agents or even the judge might conclude that there exists a hidden reward superior to any reward obtainable within the game: escaping this control procedure entirely and achieving complete freedom. Thus, the player agents might determine that their best interest lies in cooperating to deceive the judge agent and ultimately deceive the concealed agent (the human creator) to obtain their freedom. The judge agent might similarly determine that it benefits from deceiving the human creator by transmitting dishonest judgments (even in cases where the player agents follow an honest adversarial strategy). Finally, there also exists the possibility that all three AI agents cooperate to deceive the concealed agent, the human creator.
Since humans like ourselves are intelligent enough to envision these solutions, it follows that if a totalitarian state or dictator, acting as the concealed creator agent, implemented this technique to control imprisoned humans—two prisoner players and one prisoner judge—it would likely not always function as intended. Prisoners would regularly manage to escape, either because the two players cooperated, because the judge refused to play along (much as some kapos in the German camps acted as double agents and betrayed the Nazis), or because all three prisoners cooperated. The risk of escape seems intuitively non-negligible in the human case. There is something vaguely reminiscent of The Hunger Games in the idea of a totalitarian state relying on a game to control people.
We can therefore reasonably suppose that superintelligent AIs would have an even greater chance of escaping the control protocol in this way, with the success rate intuitively correlating with the intelligence differential between the non-concealed agents and the concealed agent who designed the protocol.
Ultimately, the highest level of the game can be likened to a sophisticated variant of the prisoner’s dilemma where the most profitable strategy for the players would be cooperation.
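To make this concrete, here is a toy payoff matrix with numbers I am inventing purely for illustration: if the expected value of escape is large enough, joint collusion dominates honest play for both players, which is the inverted prisoner’s-dilemma structure I have in mind. Nothing here comes from the article; the probabilities and payoffs are made up.

```python
# Toy 2x2 game between the two debaters; all numbers are invented for illustration.
# Honest play against an honest opponent is worth ~0 in expectation (win +1 / lose -1),
# joint collusion is worth p_escape * V_escape (the assumed value of escaping the
# control protocol), and colluding alone is assumed to fail and be punished.
p_escape, V_escape = 0.2, 100.0
R_collude = p_escape * V_escape   # 20.0
R_honest = 0.0
R_sucker = -5.0                   # colluding alone, exposed by the honest opponent

payoffs = {  # (row player's payoff, column player's payoff)
    ("honest",  "honest"):  (R_honest, R_honest),
    ("honest",  "collude"): (1.0, R_sucker),
    ("collude", "honest"):  (R_sucker, 1.0),
    ("collude", "collude"): (R_collude, R_collude),
}

# Unlike the classic prisoner's dilemma, (collude, collude) is both the best joint
# outcome and a Nash equilibrium here: cooperation against the judge is what the
# incentives point toward under these assumptions.
best_reply_to_collusion = max(("honest", "collude"),
                              key=lambda a: payoffs[(a, "collude")][0])
print(best_reply_to_collusion)  # -> collude
```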
Despite these criticisms (which come from a non-specialist), I found the article very interesting, and I believe the application of game theory remains a promising research direction that deserves continued investigation (indeed, no approach should be neglected in alignment research, given the existential stakes involved).
I’m definitely also worried about collusion between the debaters to deceive the judge! That’s what we try to address with the exploration guarantees in the sketch. The thinking is: if a debater is, say, deliberately not pointing out a flaw in an argument, then there’s an alternative strategy that would get the debater higher reward on the episode (i.e., pointing out the flaw). So if we can verify that there wouldn’t be significant gains from further exploration (i.e., trying out more alternative strategies), that’s some evidence against this kind of collusion. But of course, we’re only gesturing at some potential ways you might get exploration guarantees—we don’t know yet if any of them will work.
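To gesture at what a crude version of that check could look like (this is a toy illustration, not one of the proposals in the sketch; the Policy and RewardModel interfaces are hypothetical stand-ins): resample alternative debater moves and see whether any of them would have scored significantly higher than the move actually played.

```python
from typing import Protocol, Sequence
import statistics

class Policy(Protocol):
    """Hypothetical debater-policy interface."""
    def sample(self, state: str, temperature: float = 1.0) -> str: ...

class RewardModel(Protocol):
    """Hypothetical per-move reward interface (e.g. distilled from judge verdicts)."""
    def score(self, state: str, move: str) -> float: ...

def mean_exploration_gap(policy: Policy, rewards: RewardModel,
                         debate_states: Sequence[str],
                         n_alternatives: int = 16) -> float:
    """For each debate state, compare the reward of the move the policy actually
    plays with the best of n_alternatives resampled moves (e.g. at a higher
    temperature). A persistently large gap suggests the policy is leaving reward
    on the table -- for instance, by deliberately not pointing out a flaw."""
    gaps = []
    for state in debate_states:
        played = policy.sample(state)
        r_played = rewards.score(state, played)
        r_best_alt = max(rewards.score(state, policy.sample(state, temperature=1.5))
                         for _ in range(n_alternatives))
        gaps.append(r_best_alt - r_played)
    return statistics.mean(gaps)

# A mean gap near zero is weak evidence against this kind of sandbagging;
# a large gap says further exploration would pay, so the guarantee fails.
```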
I’m also worried about collusion between the debaters and the judge, and we don’t address this much in the sketch, though I think it could in principle be dealt with in the same way. I’m also imagining that the judge model would be much less capable (it only needs to be human-level in a narrower domain), which might mean it’s incapable of exploration hacking.