I think this is an important family of counterexamples that our report doesn’t really address (and which we’re not really asking contest participants to handle: if you come up with a proposal for which this is the only counterexample we can find, then we’d definitely give a $50k prize).
Some thoughts:
As you say, a lot of this comes down to our hopes for handling learned optimizers, which we discuss in this appendix.
Most importantly, we can use SGD to optimize “What is the probability I assign to answers that are consistent with other copies’ answers?” instead of “What score do I get in the consistency game?”. If this was the only difficulty, we’d want to be really careful about what SGD optimized. That means that we are basically left with inner alignment problems that could kill us in much more direct ways (e.g. by making the low-stakes assumption that we can recover from any finite number of failures). What I’m mostly interested in is trying to get enough of a handle on ELK that inner alignment becomes the dominant concern / we know enough about ELK to be able to work productively on inner alignment.
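One way to picture the two objectives contrasted above is the minimal sketch below. It is an illustration under assumptions, not the report's actual training setup: the function names, the softmax parameterization, and the boolean `consistent` mask are all hypothetical. The first function is a direct, differentiable function of the model's output distribution (the kind of quantity SGD can optimize); the second is the outcome of playing one round of a game.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into a probability distribution over answers."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consistent_mass(logits, consistent):
    """Hypothetical objective 1: total probability assigned to answers
    that are consistent with other copies' answers."""
    probs = softmax(logits)
    return sum(p for p, ok in zip(probs, consistent) if ok)


def sampled_game_score(logits, consistent, rng):
    """Hypothetical objective 2: score from one play of the game.
    Sample an answer; score 1.0 if it is consistent, else 0.0."""
    probs = softmax(logits)
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return 1.0 if consistent[idx] else 0.0


# Assumed toy data: three candidate answers, the first two consistent.
logits = [0.0, 0.0, 0.0]
consistent = [True, True, False]
mass = consistent_mass(logits, consistent)
score = sampled_game_score(logits, consistent, random.Random(0))
```

In this toy setup the two quantities agree in expectation; the distinction the comment draws is about which one SGD is pointed at, not about their average values.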
If your model is doing optimization in order to break out of the sandbox, i.e. searching over cognitive actions to find those that break out, then what we really want to do is apply ELK at an earlier stage, to the model that was evaluating cognitive actions. This is a pretty messy situation, but it looks promising enough that I’m inclined to focus on other problems first and then see how this counterexample looks later. (You’ll always have similar pathologies with gradient hacking etc., and so you need to use a messier mechanism like this to handle those in the worst case.)
On the other hand, if your model isn’t doing any optimization to find ways to break out of the sandbox, then it seems like you are up against a fairly weak adversary (just SGD). In the worst case this is still bad news, but it’s not merely better in practice; it also seems to give you a large leg up in beating the model. In general, defending against a SmartVault which controls a ton of actuators seems harder than defending against an adversary who can only use side channels.
I’m normally pretty opposed to sandboxing, but I do also think that if this were all you had to do, then it’s actually plausible that you could literally win the game. E.g. if the overhead on homomorphic encryption were low enough (say 1000%) that you could afford to run each reporter training datapoint homomorphically encrypted, then it seems like you could win against any adversary. I still wouldn’t bet on it without thinking a lot, but I think it makes the point that there is a big quantitative gap here.
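To make the "say 1000%" figure concrete, here is a rough back-of-the-envelope sketch. It assumes "overhead" means extra cost on top of the unencrypted baseline, so 1000% overhead is about 11x the baseline cost per datapoint; the function names and the budget numbers are hypothetical.

```python
def encrypted_cost(baseline_cost, overhead_percent):
    """Cost of one datapoint under encryption, where overhead_percent is
    the extra cost on top of the baseline (assumed interpretation)."""
    return baseline_cost * (1 + overhead_percent / 100)


def affordable_datapoints(budget, baseline_cost, overhead_percent):
    """How many encrypted datapoints fit in a fixed compute budget."""
    return int(budget // encrypted_cost(baseline_cost, overhead_percent))


# Assumed numbers for illustration only: 1 unit of compute per datapoint
# unencrypted, and a budget of 1100 units.
per_point = encrypted_cost(1.0, 1000)
count = affordable_datapoints(1100.0, 1.0, 1000)
```

Under these assumptions, a budget that covers 1100 unencrypted datapoints covers 100 encrypted ones, which is the kind of quantitative gap the paragraph has in mind.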
Overall I agree that in the worst case this kind of counterexample applies to all of the proposals in this document. Moreover, I agree that unfortunately it is (i) a complicated aesthetic judgment about what kind of counterexample to focus on first / what is a prerequisite to what, and (ii) not obvious that you can define the problem narrowly enough to avoid this counterexample without also defining away other difficulties we are trying to address. But I’m still willing to bet on the aesthetic judgment, and it does look like we can avoid this counterexample without accidentally defining away the whole problem.
(I don’t think this comment covers the entire issue; I hope we’ll write more about this in another report.)