Ah, thank you, I see where I misunderstood now. On re-reading, I was much too careless in reading the post, to the point that I should apologize. Sorry.

I was thinking that the agents were no longer being trained, already being optimal players, and so I didn't think the judge would need to take into account how their choice would influence future answers. That reading clearly doesn't match what you wrote, at least past the very first part.

If the debaters are still being trained, or the judge can be convinced that the debaters are still being trained, then I can definitely see the case for a debater arguing "This information is more useful, and because we are still being trained, it is to your benefit to choose the more useful information, so that we will provide more useful information in the future."
I guess that suggests that the environment in which the judge confidently believes (and can't be convinced otherwise) that the debaters are still being trained, and the one in which it believes they aren't, are substantially different. So if training produces the policy that is optimal in the environment it was trained in, then after training ends the debaters would likely still do the "ignoring the question" thing, even though that is no longer optimal once the judge knows the debaters aren't being trained.
Oh no need for apologies: I’m certain the post was expressed imperfectly—I was understanding more as I wrote (I hope!). Often the most confusing parts are the most confused.
Since I’m mainly concerned with behaviour-during-training, I don’t think the post-training picture is too important to the point I’m making. However, it is interesting to consider what you’d expect to happen after training in the event that the debaters’ only convincing “ignore-the-question” arguments are training-signal based.
I think in that case I'd actually expect debaters to stop ignoring the question (assuming they know the training has stopped). I assume that a general, super-human question answerer must be able to do complex reasoning and generalise to new distributions. Removal of the training signal is a significant distributional shift, but one that I'd expect a general question-answerer to handle smoothly (in particular, we're assuming it can answer questions about [optimal debating tactics once training has stopped]).

[ETA: I can imagine related issues with high-value-information bribery in a single debate: "Give me a win in this branch of the tree, and I'll give you high-value information in another branch", or the like… though it's a strange bargaining situation, given that in most setups the debaters have identical information to offer. This could occur during or after training, but only in setups where the judge can give reward before the end of the debate.

… Actually, I'm not sure on that: if the judge always has the option to override earlier decisions with larger later rewards, then mid-debate rewards don't commit the judge in any meaningful way, so they aren't really bargaining chips.

So I don't think this style of bribery would work in setups I've seen.]
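The "mid-debate rewards aren't bargaining chips" point can be made concrete with a toy model (all numbers and the setup here are hypothetical, purely to illustrate the override argument): if the judge remains free to assign later rewards of any size, then any early "bribe payment" is dominated by what comes afterward, so granting it commits the judge to nothing.

```python
# Toy model: a debate with per-step judge rewards. The setup and numbers
# are hypothetical; this only illustrates the "override" argument.

def total_reward(per_step_rewards):
    """Assume the debater cares only about the sum of rewards over the debate."""
    return sum(per_step_rewards)

# Suppose the judge "pays" a bribe of +1 mid-debate (step 2 of 4):
bribed = [0, 1, 0, 0]

# If the judge can still choose later rewards freely, it can override
# that early payment with a larger later penalty:
overridden = [0, 1, 0, -5]

assert total_reward(bribed) == 1
assert total_reward(overridden) == -4
# The early +1 committed the judge to nothing: the final total is whatever
# the judge's later choices make it, so mid-debate rewards carry no
# credible bargaining power.
```

The same logic runs in the other direction (a later bonus can outweigh any earlier penalty), which is why commitment would only arise in a setup where earlier rewards somehow constrain the judge's later choices.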