Glad you liked the post so much!
I considered trying to make it a living document, but in the end I decided I wasn’t willing to commit to spending a bunch of time updating it regularly. I do like the idea of doing another one every year, though—I think I’d be more willing to write a new version every year than try to maintain one up-to-date version at all times, especially if I had some help.
In terms of other proposals, a lot of the other options I would include in a full list would just be additional combinations of the various things already on the list—recursive reward modeling + intermittent oversight, for example—that I didn’t feel would add much to cover separately. That being said, there are also just a lot of different people out there with different ideas who I’m sure would have their own versions of the proposals I’ve talked about.
Re intermittent oversight—I agree that it’s a problem if M_t suddenly realizes that it should be deceptive. In that case, I would say that even before M_t realizes it should be deceptive, the fact that it will eventually realize that makes it suboptimality deceptively aligned. Thus, to solve this problem, I think we need it to be the case that M_{t−1} can catch suboptimality deceptive alignment, which I agree could be quite difficult. One way in which M_{t−1} might be able to ensure that it catches suboptimality deceptive alignment, however, could be to verify that M_t is myopic, as a myopic model should never conclude that deception is a good strategy.