Thanks for your proposal! We have considered similar strategies in the past. The main points of the breaker response would be:
How are you training this “terrifier”? If the terrifier moves that can be made during training only come from held-out sensors that you have constructed, then the “bad terrifier” you learn doesn’t generalize to thinking of any new sensors; it only ever considers the sensors that you used as ground truth during training. Thus a sufficiently clever robber might trick all of the sensors-used-during-training (which may not even exist during deployment), and the diamond would be stolen without the terrifier being able to show any sensor data that would scare the human.
It’s possible that even if the human were able to know the results of all possible sensors, they still wouldn’t be able to tell whether the situation was fine. For instance, if the human doesn’t really understand how humans work, the way they hope to maintain control over the situation is by asking questions like “is that human the same human that I saw before?”. However, a terrifier might not have any sensor data that it could show the human to scare them; e.g., if the terrifier showed the human a bunch of nanotech in that human’s brain, the human might not understand why that was bad.
For example, if the thieves replace the diamond with an exact copy down to the nanometer but not molecular level… that’s fine? I consider that behavior aligned.
This behavior seems potentially catastrophic. If all we can do is force potential adversaries to take actions that can’t be seen with the resolution of our current sensors, then e.g. we can’t prevent adversaries from corrupting code that is running on computers that we wish to protect.
How this defeats the previous counterexample: in some sense, I see the problem with “ontology mismatch” as the human no longer knowing the right question to ask, because the AI and human no longer agree on an ontology that would make the human’s question meaningful. But answering “what question would the human be maximally upset about if they asked it?” is a pretty straightforward task to expect from an AI that’s capable of modeling (most) human preferences.
I don’t really understand how this explains why your strategy defeats the previous counterexample.