I’m thinking of this in a family of proposals like:
Some kinds of tampering can be easily detected (and so should get identified with states sH where tampering has occurred)
Some other tampering can’t be easily detected, but this undetectable tampering has important similarities with detectable tampering and we could use that to recognize it.
In this case, we’re going to try to exploit the fact that detectable tampering shares a prefix of actions/states with undetectable tampering (so that later states reached in that sequence have a much higher probability of both detectable and undetectable tampering, and can be flagged as suspicious).
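To make the prefix-sharing idea concrete, here is a toy sketch (all names and the episode data are hypothetical, and this is a statistical caricature of the proposal, not an implementation of it): we estimate, for each action prefix, the empirical probability that episodes with that prefix ended in *detectable* tampering, and flag prefixes above a threshold. The hope is that undetectable tampering passes through the same flagged prefixes.

```python
# Toy sketch: flag action prefixes that frequently lead to detectable
# tampering, on the assumption that undetectable tampering shares those
# prefixes. All action names and data below are made up for illustration.
from collections import defaultdict

def flag_suspicious_prefixes(trajectories, threshold=0.5):
    """trajectories: list of (actions, tampering_detected) pairs, where
    `actions` is a sequence of action labels and `tampering_detected`
    is True if the episode ended in a state where tampering was caught.
    Returns the set of prefixes whose empirical probability of leading
    to detectable tampering exceeds `threshold`."""
    counts = defaultdict(lambda: [0, 0])  # prefix -> [detected, total]
    for actions, detected in trajectories:
        for i in range(1, len(actions) + 1):
            prefix = tuple(actions[:i])
            counts[prefix][1] += 1
            if detected:
                counts[prefix][0] += 1
    return {p for p, (d, t) in counts.items() if d / t > threshold}

# Hypothetical episodes: most episodes starting with "disable_camera"
# end in detected tampering, so that prefix gets flagged even though one
# episode's tampering ("tamper_subtly") went undetected.
episodes = [
    (["disable_camera", "tamper_visibly"], True),
    (["disable_camera", "tamper_visibly"], True),
    (["disable_camera", "tamper_subtly"], False),
    (["do_task"], False),
]
suspicious = flag_suspicious_prefixes(episodes)
```

Of course, this frequency-counting picture elides the hard part: a worst-case tamperer can choose prefixes that never co-occur with detectable tampering in the training distribution.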
I currently don’t believe this kind of thing can work (at least in the worst case), but I’m not at all confident.
One thing worth mentioning (which makes me much more optimistic about this class of proposals): if your AI delegates to another AI that helps actually implement the tampering, that makes things harder, but in some sense it also provides extra structure you can work with in a solution. Your AI needs to reason about that other AI’s reasoning, and so you could hope that (i) if your AI is reasoning about the other AI’s reasoning in detail, then we can apply exactly the same machinery to that AI’s predictive model in order to elicit its beliefs (which your AI obviously trusts) about what is going on, and (ii) if your AI reasons abstractly about the other AI’s reasoning, you could hope to treat that as a shortcut to reach the same conclusions, and incentivize the model to give you the same answers that it would if it were reasoning explicitly (modulo its ability to figure it out).
(I also think the single-AI case basically needs to be treated in the same way as that—I don’t think in general your AI is going to be reasoning explicitly about its own future actions; it just needs to reason abstractly about its future self in a qualitatively similar way to how it reasons about other AIs. Or at least in the worst case its reasoning about itself is just as messy and complex as its reasoning about other AIs.)