As a concrete example, consider research sabotage. We ask an AI for open-ended advice on experimental prioritization. If the AI is deceptive, it might provide locally sound advice while also systematically deciding not to mention things we’ve overlooked in our experiments.
Actually, this is an example of sandbagging (at least if you have a reasonable training procedure): if you compared two outputs, one which explains what we were overlooking and one which doesn’t, the one which mentions the issue would perform better so long as we could recognize the issue (at least in a pairwise comparison, and we can always incorporate something like pairwise comparison in the reward).
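To make the pairwise-comparison point concrete, here is a minimal sketch under my own assumed setup (the `policy`, `judge_prefers`, and function names are hypothetical, not from the original discussion): if the reward includes a term for winning a pairwise comparison judged by someone who can recognize the overlooked issue once it is mentioned, then silently omitting the issue is reward-suboptimal, i.e. sandbagging.

```python
# Minimal sketch (hypothetical names/setup) of folding a pairwise comparison into the reward.
# `judge_prefers(a, b)` stands in for a judge who prefers output `a` over `b` when `a`
# points out a recognizable issue that `b` stays silent about.

def pairwise_reward(candidate, reference, judge_prefers):
    """+1 if the judge prefers the candidate output over a reference sample, else -1.

    Under this reward, an output mentioning a recognizable overlooked issue beats one
    that omits it, so systematically omitting the issue lowers expected reward."""
    return 1.0 if judge_prefers(candidate, reference) else -1.0

def reward_for(policy, prompt, judge_prefers):
    # Compare a fresh sample against another sample (here: from the same policy;
    # it could also come from an earlier checkpoint) and use the result as
    # (part of) the RL reward.
    a = policy(prompt)
    b = policy(prompt)
    return pairwise_reward(a, b, judge_prefers)
```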
I think the central case where (near) maximally high reward solutions cause harm is something like: in addition to performing well on everything we can check to any extent, the action also extremely subtly pursues a side objective or is directly bad, e.g. it adds a security vulnerability, persuades us about something unrelated, or ensures the experiment is subtly sabotaged so it yields the wrong conclusion.
This problem can be converted into an exploration hacking problem with debate/critique if a human could recognize the issue when it is pointed out to them, but in some cases things could be so subtle/superhuman that humans can’t recognize an issue even when it is explained maximally well to them. (Technically, the human needs to have enough signal that pointing out real issues outperforms pointing out fake issues in the equilibrium where we’ve trained the policy to avoid producing actions which look suspicious or otherwise make it easy to point out fake issues.)
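One rough way to formalize that "enough signal" condition (my notation, not from the original): let $\pi^*$ be the policy after training to avoid producing suspicious-looking actions, let $J(c, a)$ be the judge's score for a critique $c$ of action $a$, and let $c_{\mathrm{real}}(a)$ be the best critique of a genuine issue in $a$. Debate/critique can only train away the problem if, in that equilibrium,

$$\mathbb{E}_{a \sim \pi^*}\!\left[ J\big(c_{\mathrm{real}}(a), a\big) \right] \;>\; \mathbb{E}_{a \sim \pi^*}\!\left[ \max_{c \,\in\, \mathrm{fake}} J(c, a) \right],$$

i.e. pointing out real issues outperforms the best available fake critique; if the gap is zero or negative, the critic has no incentive to surface real issues.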
I see two senses in which research sabotage is different from the usual sandbagging issue:
ELK, i.e. when explanations are hard to find: a malicious solution may be high probability for the model to find, while presenting a human-recognizable argument that it is malicious is hard. Note that you might be concerned that our classifiers have to do this as well, but that’s not the case: the classifiers could work indirectly by forcing probability mass to spread across distinct solutions.
High-stakes, multi-shot settings: let’s say the world is such that it’s feasible to identify any single-step catastrophe, but you cannot recognize defection setup spread across k steps. For instance, model context windows might be siloed such that local explanations cannot take all previous actions into account (or have too much to look at to do so). Then you stand to gain a lot by randomizing in a way that blocks such setup, as the sketch after this list illustrates.
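Here is a toy calculation under my own assumed setup (the function and parameter names are hypothetical) of why randomizing helps: if each step of a k-step setup is independently disrupted with some probability (e.g. resampled, rerouted to another instance, or audited), the chance the whole chain survives decays geometrically in k.

```python
# Toy sketch (hypothetical setup): each step of a k-step defection setup is
# independently disrupted with probability p_disrupt (resampled, rerouted to a
# different instance, or audited). A single-step catastrophe is assumed to be
# caught anyway; the point is that multi-step setup degrades quickly.

def prob_setup_survives(k: int, p_disrupt: float) -> float:
    """Probability that all k steps of the setup go undisrupted."""
    return (1.0 - p_disrupt) ** k

if __name__ == "__main__":
    for k in (1, 3, 10):
        print(f"k={k}: {prob_setup_survives(k, p_disrupt=0.2):.3f}")
```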
I’m not saying research sabotage is always just a sandbagging issue, only that the exact issue of “We ask an AI for open-ended advice on experimental prioritization. If the AI is deceptive, it might provide locally sound advice while also systematically deciding not to mention things we’ve overlooked in our experiments.” is well described as sandbagging.