Maybe it is a controversial take, but I am in favor of publishing all solutions to corrigibility.
If a company fails to come up with solutions for AGI corrigibility, it won't stop building AGI. Instead, it will keep ramping up capabilities and end up with a misaligned AGI that (for all we know) will want to turn the universe into paperclips.
Due to instrumental convergence, an AGI whose goals are not explicitly aligned with human goals will, by default, engage in highly undesirable behavior, such as resisting shutdown or correction.