Thanks!
I’m unsure whether we can successfully train ASIs to be reliably risk-averse, including far OOD. Our claim is just that the chances of success are high enough to make risk aversion worth pursuing as a line of defense. That’s the case we try to make in section 10. See also my reply to Ryan’s comment. I also think our chances of success are a bit higher for AIs that aren’t yet ASIs, and if we succeed in making them risk-averse, I think they could help a lot with aligning any later-arising ASIs by doing this sort of stuff.