I feel like we can approximately split the full alignment problem into two parts: low stakes and handling catastrophes.
Insert joke about how I can split physics research into two parts: low stakes and handling catastrophes.
I’m a little curious about whether assuming fixed low stakes accidentally favors training regimes that have the real-world drawback of raising the stakes.
But overall I think this is a really interesting way of reframing the “what do we do if we succeed?” question. There is one way it might be misleading, though: I think we’re left with much more of the problem of generalizing beyond the training domain than it first appears. Even though the AI gets to equilibrate to new domains safely, and therefore never has to take big leaps of generalization, the training signal itself has to do all the work of generalization that the trained model gets to avoid!