Very cool! However, it looks like Debate rests upon the assumption that truth is always more convincing than falsehood to a human judge. Multipolar schemes need not have humans in the loop at all.
What intrinsically goes wrong, I’d say, is that the human operators’ ability to recognize good arguments is only rated to withstand up to a certain intensity of search, and it will break down beyond that point. Our brains’ ability to distinguish good arguments from bad ones is presumably calibrated to the kind of argumentative pressure a human brain faced in the ancestral environment / environment of evolutionary adaptedness; if you optimize against a brain much harder than that, you’d expect it to break.
There’s going to be a similar assumption in a good multipolar trap approach.
If Alice yelled at you with a megaphone that everyone in your house / city / whatever must now (1) obey Alice, and (2) torture anyone who doesn’t obey Alice, that’s not going to cause you to start obeying Alice.
You need some feedback signal that actually causes the agents to care about the rules; in Debate this comes from the human judge plus gradient descent, and I expect you’ll need something analogous in any multipolar trap approach.
https://www.lesswrong.com/tag/debate-ai-safety-technique-1