Should we try harder to solve the alignment problem?
I’ve heard the meme “we are underinvesting in solving the damn alignment problem” a few times (mostly from Neel Nanda).
I think this is right, but also wanted to note that, all else equal, one’s excitement about scalable oversight type stuff should be proportional to one’s overall optimism about misalignment risk: scalable oversight mostly pays off in worlds where models aren’t already scheming against us. If you’re more pessimistic, this implies that the safety community should be investing more in interp / monitoring / control, because these interventions are more likely to catch schemers and to provide evidence that helps coordinate a global pause.