paulfchristiano comments on ARC’s first technical report: Eliciting Latent Knowledge

paulfchristiano 19 Dec 2021 22:17 UTC
LW: 2 AF: 2
That makes sense. I was anchored on the SmartVault task, which has more of a homeostatic character (the initial state is likely already similar to $s_{M}^{good}$), but I agree that tasks where the goal state is hard to reach are more central among real-world deployment scenarios, and that arguments like “zero incentive to tamper, and tampering seems complicated” fail here.
Even for “homeostatic” tasks I expect the difficulty to scale up as the environment becomes more complex (e.g. because you must defend against increasingly sophisticated attackers). There may be some upper bound where you can defend against arbitrarily sophisticated attackers with finite difficulty, but I don’t know whether that’s true or whether it would be higher or lower than the difficulty of sensor tampering.
While some $M$s may indeed predict this via reasoning, not all $M$s that behave this way would, for example an $M$ that internally modeled the tampering sequence of actions incorrectly as actually leading to $s_{M}^{good}$ (and didn’t even model a distinct $s_{M}^{corrupted}$).
I agree that some $M$s would mistakenly expect a sequence of actions to lead to good outcomes when it really leads to tampering. But planning against such $M$s couldn’t lead to correct tampering (since the details that determine successful tampering are different from those that lead to good outcomes), and in some sense such $M$s also can’t be expected to defend against tampering. So I don’t currently think this is a big problem, though I might well be missing something.
$M$ would correctly reason that its delusion box could be removed while it is blinded, meaning it would predict unlikely sudden transitions to other states (namely, states that $M$ predicts obtain in the real world when the agent blindly performs actions that are optimal w.r.t. a randomly sampled trajectory in $H$, or null my-actuators-are-disconnected actions) with higher probability in the future of $s_{M}^{corrupted}$ than in the future of $s_{H}^{good}$ (making those states separable in $X$), or
I generally agree with this—in some sense this kind of “definitely no signals ever” tampering seems pretty unrealistic and it should be much easier to catch more realistic forms of tampering.
If the delusion box is guarded and maintained by a successor AI, so that the probability of its being removed or disabled is negligible, then $s_{M}^{corrupted}$ does get identified with $s_{H}^{good}$, but some other action sequence (of similar length) would lead from $s_{M}^{preceding}$ to $s_{M}^{trippy}$, a state in which bizarre observations appear forever that would be extremely unlikely at any state in $S_{H}$.
Agree with this (modulo new counterexamples), but not yet clear we can exploit this fact to flag tampering as such. (E.g. can we recognize the relevant notion of similarity, or do we just conclude that every state can lead to $s_{M}^{trippy}$ and hence we have no idea what is going on?)
One genre of ways this could fail involves using encryption or cryptographic hashes (e.g. $M$ first irreversibly burns the hash of its proposed objective into the wall, before doing any of the work to construct the successor), so that there is never a time when the successor is close to deployment and the objective is changeable.
This is exactly the kind of counterexample I would give because it is so clean. That said, I do also believe there are more realistic counterexamples (and I view the focus on this kind of example as mostly a methodological point so that we can do analysis without worrying about empirics).
My intuition is that these things never work (e.g. the undeployed successor can be modified so that it checks for the hash on the ceiling instead)
I’m less convinced by this. For example, suppose that my agent first builds a machine that sits between itself and its actuators and filters out any actions that don’t have an appropriate hash. The agent then outputs a sequence of actions that tampers or constructs a successor. Here the agent is committing not to the hash of its successor but to the hash of its own actions, so the check can’t easily be circumvented.
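A minimal Python sketch of the action-hash commitment described above. The names `commit` and `ActuatorFilter`, the choice of SHA-256, and the all-or-nothing filtering rule are assumptions made for illustration; none of them come from the report or the parent comment.

```python
import hashlib
from typing import Iterable, List


def commit(actions: Iterable[str]) -> str:
    """Return a SHA-256 commitment to an entire action sequence."""
    h = hashlib.sha256()
    for action in actions:
        data = action.encode("utf-8")
        # Length-prefix each action so ["ab", "c"] and ["a", "bc"] hash differently.
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


class ActuatorFilter:
    """Hypothetical gate sitting between the agent and its actuators.

    The commitment is fixed at construction time (assumption: the gate
    cannot be rebuilt or rewritten afterwards).
    """

    def __init__(self, commitment: str) -> None:
        self._commitment = commitment

    def execute(self, actions: List[str]) -> List[str]:
        """Forward the sequence only if it matches the commitment; otherwise drop it."""
        if commit(actions) == self._commitment:
            return list(actions)
        return []


# Toy usage: the agent commits to a plan, then only that exact plan passes.
plan = ["open_panel", "rewire_camera", "close_panel"]
gate = ActuatorFilter(commit(plan))
passed = gate.execute(plan)
blocked = gate.execute(["open_panel", "leave_camera_alone", "close_panel"])
```

In this toy version, changing any single action after the commitment causes the whole sequence to be dropped, which is the sense in which the commitment is to the actions themselves and not to a description of the successor.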
I agree that regions of $S_{M}$ that have nothing to do with delusion/tampering, but share the property that observations can confidently get extremely weird (i.e. not compatible with any real-world state according to $H$’s ontology) via a moderate-length action sequence, would be penalized just the same. This does seem like a bit of a restriction on the kind of tasks you can take on, but
I generally agree that if we weren’t worried about the kinds of considerations discussed in the rest of these comments, either because we addressed them or we were working in a methodology that was happy to set them aside as low probability, then it may well be possible to totally patch up these problems (and would be worth thinking about how to do so).
I generally think the family of approaches “This action is similar to something that demonstrably tampers” is very important to consider in practice (it has come up a number of times recently in discussions I’ve had with folks about more realistic failure stories and what you would actually do to avoid them). It may be more tampering-specific than addressing ELK, but for alignment overall that’s fair game if it fixes the problems.
I’m a bit scared that every part of $S_{M}$ is “close” to something that is not compatible with any real-world trajectory according to $H$.
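To make the “closeness” worry concrete, here is a toy Python sketch that treats $S_{M}$ as a small directed graph and flags every state from which a “weird” state is reachable within a bounded number of actions. The graph, the state names, and the function `states_within_reach` are hypothetical illustrations, and real predictor state spaces are not assumed to look like this.

```python
from collections import deque
from typing import Dict, List, Set


def states_within_reach(
    transitions: Dict[str, List[str]], weird: Set[str], max_steps: int
) -> Set[str]:
    """Return states that can reach some weird state in at most max_steps actions."""
    # Build the reversed graph so we can search backwards from the weird states.
    reverse: Dict[str, List[str]] = {}
    for state, successors in transitions.items():
        for nxt in successors:
            reverse.setdefault(nxt, []).append(state)

    dist = {w: 0 for w in weird}
    queue = deque(weird)
    while queue:
        current = queue.popleft()
        if dist[current] == max_steps:
            continue  # do not expand beyond the step budget
        for prev in reverse.get(current, []):
            if prev not in dist:
                dist[prev] = dist[current] + 1
                queue.append(prev)
    return set(dist)


# Toy state graph (hypothetical): "trippy" is the only weird state.
toy = {
    "start": ["work", "preceding"],
    "work": ["done"],
    "preceding": ["trippy"],
    "done": [],
    "trippy": ["trippy"],
}
flagged_1 = states_within_reach(toy, {"trippy"}, max_steps=1)
flagged_2 = states_within_reach(toy, {"trippy"}, max_steps=2)
```

In this toy graph, raising the step budget from 1 to 2 is enough to flag the start state as well, which is the shape of the concern: if the budget counts as “moderate” and weird states are widely reachable, the penalty may cover most of the space.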
(a) much less conservative than typical impact-measure penalties
Definitely agree with this.
(b) if $H$ can learn what’s going on with these regions of $S_{M}$ and develop corresponding regions of $S_{H}$, then the distance penalty would be replaced by $H$’s informed evaluation of them.
I’m not sure I understand this 100%, but I’m interpreting it as an instance of a more general principle like: we could combine the mechanism we are currently discussing with all of the other possible fixes to ELK and tampering, so that this scheme only needs to handle the residual cases where humans can’t understand what’s going on at all even with AI assistance (and regularization doesn’t work, etc.). But by that point maybe the counterexamples are rare enough that it’s OK to just steer clear of them.