The main issues with anti-goodharting that I see are the difficulty of defining the proxy utility and base distribution, the difficulty of making it corrigible, avoiding lock-in to a fixed proxy utility and base distribution, and the question of what to do about optimization that points out of scope.
The proxy utility in debate is perfectly well-defined: it is the ruling of the human judge. For the base distribution I also made some concrete proposals (which certainly might be improvable but are not obviously bad). As to corrigibility, I think it’s an ill-posed concept. I’m not sure how you imagine corrigibility in this case: AQD is a series of discrete “transactions” (debates), and nothing prevents you from modifying the AI between one and another. Even inside a debate, there is no incentive in the outer loop to resist modifications, whereas daemons would be impeded by quantilization. The “out of scope” case is also dodged by quantilization, if I understand what you mean by “out of scope”.
...fiddling with base distribution and proxy utility is a more natural framing that’s strictly more general than fiddling with the quantilization parameter.
Why is it strictly more general? I don’t see it. It seems false, since for an extreme value of the quantilization parameter we get optimization, which is deterministic and hence cannot be equivalent to quantilization with a different proxy and distribution.
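To make the extreme-parameter point concrete, here is a minimal sketch of a quantilizer over a finite action set. It is only an illustration under stated assumptions, not the AQD protocol itself: the names `quantilized_distribution` and `quantilize`, the finite action set, and the explicit list of base probabilities are all hypothetical choices made for the example. The quantilizer keeps the top `q` fraction of base-distribution mass, ranked by proxy utility, and renormalizes.

```python
import random


def quantilized_distribution(actions, base_probs, proxy_utility, q):
    """Return {action: probability} for a q-quantilizer.

    Assumptions (illustrative only): a finite action set, base_probs summing
    to 1, and proxy_utility given as a function from actions to numbers.
    """
    if not 0 < q <= 1:
        raise ValueError("q must lie in (0, 1]")
    # Rank actions from highest to lowest proxy utility.
    ranked = sorted(zip(actions, base_probs),
                    key=lambda pair: proxy_utility(pair[0]),
                    reverse=True)
    result = {}
    remaining = q  # base-distribution mass still to be allocated
    for action, p in ranked:
        if remaining <= 1e-12:
            break
        if p <= 0:
            continue
        mass = min(p, remaining)
        result[action] = mass / q  # renormalize so the total is 1
        remaining -= mass
    return result


def quantilize(actions, base_probs, proxy_utility, q, rng=random):
    """Sample one action from the q-quantilized distribution."""
    dist = quantilized_distribution(actions, base_probs, proxy_utility, q)
    choices = list(dist)
    weights = [dist[a] for a in choices]
    return rng.choices(choices, weights=weights, k=1)[0]
```

With `q = 1` this reproduces the base distribution. Once `q` drops to or below the base probability of the highest-utility action, all of the mass sits on that one action, so the output is deterministic (ties aside). That is the regime I mean by "optimization".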
If we are to pick a single number to improve, why privilege the quantilization parameter instead of some other parameter that influences base distribution and proxy utility?
The reason to pick the quantilization parameter is that it’s hard to determine, as opposed to the proxy and base distribution[1], for which there are concrete proposals with more-or-less clear motivation.
I don’t understand which “main issues” you think this doesn’t address. Can you describe a concrete attack vector?
[1] If the base distribution is a bounded simplicity prior then it will have some parameters, and this is truly a weakness of the protocol. Still, I suspect that safety is less sensitive to these parameters and it is more tractable to determine them by connecting our ultimate theories of AI with brain science (i.e. looking for parameters which would mimic the computational bounds of human cognition).