Thank you for the comments!

A quick reaction to the truth-seeking definition: When writing the definition (of truth-promotion), I imagined a (straw) scenario where I am initially uncertain about what the best answer is—perhaps I have some belief, but upon reflection, I put little credence in it. In particular, I wouldn’t be willing to act on it. Then I run the debate, become fully convinced that the debate’s outcome is the correct answer, and act on it.
The other story also seems valid: you start out with some belief, update it based on the debate, and you want to know how much the debate helped. Which of the two options is better will, I guess, depend on the application in mind.
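For concreteness, here is a toy sketch of the two readings as metrics on the judge’s credence in the correct answer (the names, threshold, and log-loss measure are mine, purely for illustration, not notation from the paper):

```python
import math

def truth_promoting_outcome(posterior_on_truth: float, act_threshold: float = 0.9) -> bool:
    """First reading: after the debate, is the judge confident enough in the
    (actually correct) answer to be willing to act on it, wherever they started?"""
    return posterior_on_truth >= act_threshold

def truth_promoting_update(prior_on_truth: float, posterior_on_truth: float) -> float:
    """Second reading: how much did the debate help, measured here as the reduction
    in log-loss on the correct answer relative to the judge's prior."""
    return math.log(posterior_on_truth) - math.log(prior_on_truth)

# The two readings can come apart: a judge moved from 0.85 to 0.90 "acts correctly"
# but was barely helped, while a judge moved from 0.05 to 0.60 was helped a lot
# yet may still be unwilling to act on the answer.
print(truth_promoting_outcome(0.90), round(truth_promoting_update(0.85, 0.90), 3))
print(truth_promoting_outcome(0.60), round(truth_promoting_update(0.05, 0.60), 3))
```

Under the first reading, the second judge’s debate “failed” even though it did most of the epistemic work, which is the sense in which the better choice of definition depends on the application.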
“I’d be much more excited about a model in which the agents can make claims about a space of questions, and as a step of the argument can challenge each other on any question from within that space,”
To dissolve a possible confusion: By “claims about a space of questions” you mean “a claim about every question from a space of questions”? Would this mean that the agents would commit to many claims at once (possibly more than the human judge can understand at once)? (Something I recall Beth Barnes suggesting.) Or do you mean that they would make a single “meta” claim, understandable by the judge, that specifies many smaller claims (e.g., “for any meal you ask me to cook, I will be able to cook it better than any of my friends”; horribly false, btw)?
Anyway, yeah, I agree that this seems promising. I still don’t know how to capture the relations between different claims (which I somehow expect to be important if we are to prove some guarantees for debate).
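To make sure I am picturing the same thing you are, here is a minimal sketch of how I imagine the challenge step; the structure and names are hypothetical, my own way of drawing the proposal rather than a protocol from the paper or from your comment:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Claim:
    question: str                 # a question from the agreed-upon space
    answer: str                   # the debater's committed answer to it
    support: List["Claim"] = field(default_factory=list)   # sub-claims it rests on

def recurse_on_challenges(claim: Claim,
                          pick_challenge: Callable[[Claim], Optional[Claim]],
                          remaining_steps: int) -> Claim:
    """One step of the argument: the opponent may pick any sub-claim (i.e. any
    question from the space) and force the debate to recurse on it; the judge
    only ever has to evaluate the claim the recursion bottoms out at."""
    if remaining_steps == 0:
        return claim
    challenged = pick_challenge(claim)
    if challenged is None:        # the opponent finds no sub-claim worth attacking
        return claim
    return recurse_on_challenges(challenged, pick_challenge, remaining_steps - 1)
```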
I agree with your high-level points regarding the feature debate formalization.
I should clarify one thing that might not be apparent from the paper: the message of the counterexamples was meant to be “these are some general issues which we expect to see in debate, and here is how they can manifest in the feature debate toy model”, rather than “these specific examples will be a problem in general debates”. In particular, I totally agree that the specific examples immediately go away if you allow the agents to challenge each other’s claims. However, I have an intuition that even with other debate protocols, similar general issues might arise with different specific examples.
For example, I guess that even with other debate protocols, you will be “having a hard time when your side requires too difficult arguments”. I imagine there will always be some maximum “inferential distance that a debater can bridge” (with the given judge and debate protocol). And any claim which requires more supporting arguments than this will be a lost cause.
What would such an example look like? Without a specific debate design, I can’t really say.
Either way, if this is true, it becomes important whether you can convincingly argue that a question is too difficult to explain (without this becoming a universal strategy even in cases where it shouldn’t apply).
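As a toy picture of that worry (my own sketch, not the feature-debate model): suppose each side has a shortest supporting argument of some depth, the judge and protocol can only bridge some maximum depth, and the judge falls back on their prior whenever neither (or both) of the sides completes an argument:

```python
def debate_winner(truth_argument_depth: int,
                  lie_argument_depth: int,
                  max_bridgeable_depth: int,
                  prior_favours_truth: bool) -> str:
    """Whichever side can fit its shortest supporting argument within the judge's
    bridgeable inferential distance gets to actually make its case; otherwise
    the judge falls back on their prior."""
    truth_argued = truth_argument_depth <= max_bridgeable_depth
    lie_argued = lie_argument_depth <= max_bridgeable_depth
    if truth_argued and not lie_argued:
        return "truth wins"
    if lie_argued and not truth_argued:
        return "lie wins"
    return "truth wins" if prior_favours_truth else "lie wins"

# A true claim that needs a long argument loses to a short misleading one
# whenever the judge's prior does not already favour it:
print(debate_winner(truth_argument_depth=7, lie_argument_depth=3,
                    max_bridgeable_depth=5, prior_favours_truth=False))  # lie wins
```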
A minor point:
“If you condition on a very surprising world, then it seems perfectly reasonable for the judge to be constantly surprised.”
I agree with your point here—debate being wrong in a very unlikely world is not a bug. However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior. So the claim should be “rational judges can have unstable debates in unlikely worlds” and “biased judges can have unstable debates even in typical worlds”.
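A toy way to see the symmetry (my own illustration, nothing from the paper): the judge’s surprise at what the debate reveals depends only on the credence their prior assigned to it, so an objectively typical world combined with a miscalibrated prior produces exactly the same surprisal, and the same scope for large belief swings, as an objectively unlikely world combined with a correct prior:

```python
import math

def surprisal_bits(judge_credence_in_outcome: float) -> float:
    """How surprised the judge is when an outcome they assigned this credence to occurs."""
    return -math.log2(judge_credence_in_outcome)

# Rational judge, unlikely world: the prior is correct (the outcome really had
# probability 0.01); the world just happened to be a surprising one.
rational_judge_unlikely_world = surprisal_bits(0.01)

# Biased judge, typical world: the outcome was objectively common (probability 0.6),
# but the judge's wrong prior assigned it only 0.01.
biased_judge_typical_world = surprisal_bits(0.01)

# The judge's experience (and hence the room for unstable, surprise-driven updates)
# is identical in both cases; only the explanation for the mismatch differs.
print(rational_judge_unlikely_world, biased_judge_typical_world)  # both about 6.64 bits
```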
I broadly agree with all of this, thanks :)

“By ‘claims about a space of questions’ you mean ‘a claim about every question from a space of questions’?”
I just wrote that incorrectly; I meant “the agent can choose a question from a space of questions and make a claim about it”. If you want to support claims about a space of questions, you could allow quantifiers in your questions.
“However, you can also get the same behaviour in a typical world if you assume that the judge has a wrong prior.”
I mean, sure, but any alignment scheme is going to have to assume some amount of correctness in the human-generated information it is given. You can’t learn about preferences if you model humans as arbitrarily wrong about their preferences.