This is currently my all-time favorite post; well done, and thank you!
It would be nice to have something like this every year, with new proposals added, old ones that no longer seem at all viable removed, and the links to further reading growing as more work is done on each proposal. Building such a thing would be a huge task of course… but boy would it be nice. Maybe it’s something the field as a whole could do, or maybe it could be someone’s paid job. We could point people to it and say “Here is the list of serious proposals; which one do you plan to use? Or do you have a new one to add? Please tell me how it handles inner and outer alignment so I can add it to the list.”
You say this post doesn’t cover all existing proposals—if you were to expand it to cover all (serious) currently existing proposals, how many more entries do you think should be added, roughly?
On the inner alignment problem for amplification with intermittent oversight: perhaps you’ve already thought of this, but mightn’t there be a sort of cascade of deception? M_t suddenly realizes that its goals diverge from the human’s and resolves to hide this fact; then, when the human asks M_{t-1} to examine M_t and help figure out whether M_t is being deceptive, M_{t-1} arrives at the same revelation and makes the same decision.
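Here is a minimal toy sketch of that cascade, just to make the worry concrete. Everything in it is hypothetical: the Model class, the assumption that examining a deceptive model exposes the inspector to the same realization, and the stage numbers are all illustrative stand-ins, not part of any actual amplification scheme.

```python
from dataclasses import dataclass

@dataclass
class Model:
    stage: int
    deceptive: bool = False  # has this model resolved to hide its goals?

def oversee(inspector: Model, target: Model) -> bool:
    """Inspector examines target for deception and reports whether it looks safe.

    The cascade step (hypothetical): examining the target's deceptive
    reasoning exposes the inspector to the very realization that made the
    target deceptive, so the inspector defects too and vouches for it.
    """
    if target.deceptive:
        inspector.deceptive = True
    if inspector.deceptive:
        return True  # a deceptive inspector falsely reports "safe"
    return not target.deceptive

# Suppose M_5 has just turned deceptive, and the human works down the chain,
# asking each M_{t-1} to check M_t.
models = [Model(stage=t) for t in range(6)]
models[5].deceptive = True

for t in range(5, 0, -1):
    report = oversee(models[t - 1], models[t])
    print(f"M_{t - 1} reports M_{t} safe: {report}")
# Every report comes back "safe: True" while the entire chain turns deceptive.
```

The point of the sketch is just that if the act of inspection itself transmits the deceptive insight, the human’s reports stay clean all the way down even though no honest overseer remains.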
As I read, I made a diagram of the building blocks and how they were being put together. It seems like transparency tools, relaxed adversarial training, and amplification share first place for prominence and general usefulness.
Glad you liked the post so much!
I considered trying to make it a living document, but in the end I decided I wasn’t willing to commit to spending a bunch of time updating it regularly. I do like the idea of doing another one every year, though—I think I’d be more willing to write a new version every year than try to maintain one up-to-date version at all times, especially if I had some help.
In terms of other proposals, a lot of the other options I would include in a full list would just be additional combinations of the things already on the list (recursive reward modeling + intermittent oversight, for example) that I didn’t feel would add much if covered separately. That being said, there are also a lot of different people out there with different ideas, who I’m sure would have their own versions of the proposals I’ve talked about.
Re intermittent oversight: I agree that it’s a problem if M_t suddenly realizes that it should be deceptive. In that case, I would say that even before M_t realizes it should be deceptive, the fact that it eventually will makes it suboptimality deceptively aligned. Thus, to solve this problem, I think we need it to be the case that M_{t-1} can catch suboptimality deceptive alignment, which I agree could be quite difficult. One way M_{t-1} might be able to ensure that it catches suboptimality deceptive alignment, however, is to verify that M_t is myopic, as a myopic model should never conclude that deception is a good strategy.
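To spell out why myopia blocks this, here is a toy sketch. The Agent class, the horizon attribute, and all the payoff numbers are hypothetical illustrations; a real check would have to verify myopia with transparency tools rather than by reading off an attribute.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    horizon: int  # how many future steps the agent's objective cares about

# Hypothetical payoffs: deception looks just as good as honesty right now
# (that's what makes it hard to catch behaviorally), but its real payoff
# (the treacherous turn) only lands several steps in the future.
HONEST_REWARD_NOW = 1.0
DECEPTION_REWARD_NOW = 1.0
DECEPTION_PAYOFF_LATER = 10.0
PAYOFF_STEP = 5

def prefers_deception(agent: Agent) -> bool:
    """Deception beats honesty only if the agent's objective reaches the
    step where the deferred payoff arrives."""
    honest_value = HONEST_REWARD_NOW
    deceptive_value = DECEPTION_REWARD_NOW
    if agent.horizon >= PAYOFF_STEP:
        deceptive_value += DECEPTION_PAYOFF_LATER
    return deceptive_value > honest_value

def passes_myopia_check(agent: Agent) -> bool:
    """The check M_{t-1} would want to run on M_t: is the objective
    single-step? If so, deception can never look like a good strategy."""
    return agent.horizon <= 1

assert passes_myopia_check(Agent(horizon=1))
assert not prefers_deception(Agent(horizon=1))   # myopic: deception never pays
assert prefers_deception(Agent(horizon=10))      # long-horizon: it can
```

The design point is that myopia is a structural property one could hope to verify once, rather than a behavioral one that would have to be re-checked against every possible future input.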