I see two senses in which research sabotage differs from the usual sandbagging issue:
ELK, i.e. when explanations are hard to find: a malicious solution may be high-probability for the model to find, while presenting a human-recognizable argument against it is hard. Note that you might worry our classifiers have to do this as well, but that's not the case: the classifiers can work indirectly, by forcing probability mass to spread across distinct solutions.
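To gesture at the indirect route (a toy sketch with hypothetical helper names and thresholds, not an actual classifier design): rather than demanding a human-recognizable argument that a particular solution is bad, a check can simply flag runs where the model's probability mass collapses onto one narrow cluster of solutions, which indirectly forces mass onto genuinely distinct candidates.

```python
# Toy sketch: flag when sampled solutions concentrate on one narrow cluster.
# `cluster_of` (assumed given) maps a solution to a coarse equivalence class,
# e.g. "same approach up to trivial edits".
import math
from collections import defaultdict

def effective_num_clusters(samples, cluster_of):
    """Exponentiated Shannon entropy over solution clusters (1.0 = all mass on one)."""
    counts = defaultdict(int)
    for s in samples:
        counts[cluster_of(s)] += 1
    total = len(samples)
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return math.exp(entropy)

def spread_check(samples, cluster_of, min_effective_clusters=3.0):
    """Reject (resample / escalate) when mass is too concentrated."""
    return effective_num_clusters(samples, cluster_of) >= min_effective_clusters
```

The point is only that the check never has to explain *why* any single solution is malicious; it operates on the distribution over solutions.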
High-stakes, multi-shot settings: suppose it's feasible to identify any single-step catastrophe, but you cannot recognize a defection setup spread across k steps. For instance, model context windows might be siloed such that local explanations cannot access (or have too much to sift through to take into account) all previous actions. Then you stand to gain a lot by randomizing in a way that blocks such setup.
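A back-of-the-envelope sketch of why randomization helps here (made-up numbers, not a claim about any concrete protocol): if completing the setup requires k specific steps to all slip through, and each step is independently resampled or audited with probability p, the whole setup survives with probability (1 - p)^k, which falls off quickly in k.

```python
# Made-up numbers: probability that a k-step defection setup survives when
# each step is independently resampled/audited with probability p.
def setup_survival_prob(k: int, p: float) -> float:
    return (1.0 - p) ** k

for k in (1, 5, 20):
    for p in (0.05, 0.2):
        print(f"k={k:>2}, p={p:.2f} -> survival {setup_survival_prob(k, p):.3f}")
```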
I'm not saying research sabotage is always just a sandbagging issue, only that the specific concern of "We ask an AI for open-ended advice on experimental prioritization. If the AI is deceptive, it might provide locally sound advice while also systematically deciding not to mention things we've overlooked in our experiments." is well described as sandbagging.