Glad you liked the post so much!
I considered trying to make it a living document, but in the end I decided I wasn’t willing to commit to spending a bunch of time updating it regularly. I do like the idea of doing another one every year, though—I think I’d be more willing to write a new version every year than try to maintain one up-to-date version at all times, especially if I had some help.
In terms of other proposals, a lot of the other options I would include in a full list would just be additional combinations of the various things already on the list—recursive reward modeling + intermittent oversight, for example—that I didn’t feel would add much to cover separately. That being said, there are also just a lot of different people out there with different ideas who I’m sure would have their own versions of the proposals I’ve talked about.
Re intermittent oversight—I agree that it’s a problem if M_t suddenly realizes that it should be deceptive. In that case, I would say that even before M_t realizes it should be deceptive, the fact that it will eventually realize that makes it suboptimality deceptively aligned. Thus, to solve this problem, I think we need it to be the case that M_{t−1} can catch suboptimality deceptive alignment, which I agree could be quite difficult. One way in which M_{t−1} might be able to ensure that it catches suboptimality deceptive alignment, however, could be to verify that M_t is myopic, as a myopic model should never conclude that deception is a good strategy.