I’ll be blunt: I hope this project changes, and here’s why.
This project effectively assumes away the entire inner alignment problem, which is a bad sign: a mesa-optimizer could be deceptively aligned and still kill everyone.
It also requires the best-case scenario for outer alignment in order to work, because if you run into Goodhart’s law problems, then there seems to be no way to stop things from getting out of hand.
The best case is that we actually make progress. The middling case is that this is capabilities work billing itself as alignment work. The worst case is that you cause AI X-risk yourselves, with no boxing and no way to stop deceptive-alignment or Goodhart errors. I hope this is seriously reconsidered or abandoned.