How do you decide what to set ε to? You mention “we want assumptions about humans that are sensible a priori, verifiable via experiment,” but I don’t see how ε can be verified via experiment, given that, for many of the questions we’d want the human oracle to answer, there is no source of ground-truth answers to compare the human answers against.
With unbounded Alice and Bob, this results in an equilibrium where Alice can win if and only if there is an argument that is robust to an ε-fraction of errors.
How should I think about, or build up some intuitions about, what types of questions have an argument that is robust to an ε-fraction of errors?
Here’s an analogy that leads to a pessimistic conclusion (though I’m not sure how relevant it is): replace the human oracle with a halting oracle, let the top-level question being debated be whether some Turing machine T halts, and let the distribution over which ε is defined be the uniform distribution over Turing machines. Then it seems like Alice has a very tough time (for any T whose halting she can’t settle herself), because Bob can reject/rewrite all the oracle answers that are relevant to T in some way, and those make up only a tiny fraction of all possible Turing machines. (This assumes that Bob gets to pick the classifier after seeing the top-level question. Is that right?)
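To put that worry in symbols (notation mine, not from the post): write Q for the space of oracle queries, μ for the distribution over Q that ε is defined with respect to, and S_T ⊆ Q for the set of queries relevant to T.

```latex
% If the queries relevant to T carry negligible mass under mu,
% Bob's corruption budget covers all of them:
\[
  \mu(S_T) \ll \varepsilon
  \;\Longrightarrow\;
  \exists\, B \supseteq S_T \ \text{with}\ \mu(B) \le \varepsilon,
\]
% so Bob can rewrite every oracle answer Alice's argument depends on,
% and no argument about T is robust to an eps-fraction of errors under this mu.
```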
I think there are roughly two things you can do:
1. In some cases, we will be able to get more accurate answers if we spend more resources (teams of people with more expertise taking more time, etc.). If we can do that, and we know μ (which is hard), we can get some purchase on ε (see the first sketch below).
2. We tune ε not based on what’s safe, but based on what is competitive. I.e., we want to solve some particular task domain (AI safety research or the like), and we increase ε until it starts to break our ability to make progress, then dial it back a bit (see the second sketch below). This option isn’t amazing, but I do think it’s a move we’ll have to take for a bunch of safety parameters, assuming there are parameters which have some capability cost.
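Here’s a minimal sketch of what option 1 could look like, assuming a subdomain where the resource-intensive process yields answers we’re willing to treat as ground truth (all names and the toy setup below are mine, purely illustrative):

```python
import random
from typing import Callable, Iterable


def estimate_error_rate(
    questions: Iterable[str],
    cheap_answer: Callable[[str], str],
    gold_answer: Callable[[str], str],
) -> float:
    """Fraction of sampled questions (drawn from mu) on which the cheap oracle
    disagrees with the resource-intensive 'gold standard' answer."""
    qs = list(questions)
    disagreements = sum(cheap_answer(q) != gold_answer(q) for q in qs)
    return disagreements / len(qs)


if __name__ == "__main__":
    rng = random.Random(0)
    questions = [f"q{i}" for i in range(2000)]                # pretend these are sampled from mu
    gold = lambda q: "yes"                                    # expert team: treated as correct
    cheap = lambda q: "yes" if rng.random() > 0.05 else "no"  # single person: ~5% error
    print(f"estimated epsilon: {estimate_error_rate(questions, cheap, gold):.3f}")
```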
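And a sketch of option 2, where we sweep ε and keep the largest value that doesn’t break our ability to make progress on the target domain (`progress_rate` here is a hypothetical measurement of progress at a given ε, not something that exists):

```python
from typing import Callable, Sequence


def tune_epsilon(
    candidate_epsilons: Sequence[float],
    progress_rate: Callable[[float], float],
    acceptable_progress: float,
    margin: float = 0.8,
) -> float:
    """Largest candidate epsilon at which measured progress stays acceptable,
    scaled down by a margin ('dial it back a bit')."""
    workable = [e for e in sorted(candidate_epsilons)
                if progress_rate(e) >= acceptable_progress]
    if not workable:
        raise ValueError("no candidate epsilon supports acceptable progress")
    return margin * max(workable)


if __name__ == "__main__":
    # Toy stand-in: progress on the task domain degrades smoothly as epsilon grows.
    toy_progress = lambda e: max(0.0, 1.0 - 4.0 * e)
    print(tune_epsilon([0.01, 0.05, 0.1, 0.2], toy_progress, acceptable_progress=0.5))
```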