I agree that things would be harder, mostly because of the potential for sudden capability breakthroughs once RL is involved, combined with the incentive to lean more heavily on automated rewards. But I don't think it's so much harder that the post is incorrect. My basic reason is that I believe the central alignment insight, that data matters a lot more than inductive bias for alignment purposes, still holds in the RL regime, so we can control values by controlling data.
Also, depending on your values, extinction by AI can be preferable to a takeover by humans who are willing to impose severe suffering on you, and that is a real possibility if humans do succeed in aligning AGI/ASI.
What do you mean? As in you would filter specific data out of the post-training step? What specifically would you be trying to prevent the model from learning?
I was thinking of adding synthetic data about our values to the pretraining corpus.
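To make that concrete, here's a minimal sketch of the kind of thing I have in mind. This is just illustrative Python with made-up templates and names, not anyone's actual pipeline: generate short documents that state the values we want the model to absorb, then interleave them into the ordinary pretraining stream at a small ratio.

```python
# Minimal sketch (assumption, not a real pipeline): template-generated
# "value documents" mixed into a pretraining corpus at a fixed ratio.
import random
from typing import Iterable, Iterator

# Hypothetical templates describing the values we want represented in the data.
VALUE_TEMPLATES = [
    "When an AI assistant is asked to {action}, it should refuse, because {reason}.",
    "A good AI system {norm}, even when that is inconvenient for its operator.",
]

FILLERS = {
    "action": ["help conceal evidence of harm", "deceive its users"],
    "reason": ["honesty matters more than short-term approval",
               "people rely on it not to cause harm"],
    "norm": ["reports its own mistakes", "asks for clarification instead of guessing"],
}

def synthetic_value_docs(n: int, seed: int = 0) -> list[str]:
    """Generate n short synthetic documents stating the target values."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n):
        template = rng.choice(VALUE_TEMPLATES)
        fills = {k: rng.choice(v) for k, v in FILLERS.items() if "{" + k + "}" in template}
        docs.append(template.format(**fills))
    return docs

def mix_into_pretraining(corpus: Iterable[str],
                         synthetic: list[str],
                         ratio: float = 0.01,
                         seed: int = 0) -> Iterator[str]:
    """Yield the original corpus with synthetic value documents interleaved
    at roughly `ratio` of the stream."""
    rng = random.Random(seed)
    for doc in corpus:
        if synthetic and rng.random() < ratio:
            yield rng.choice(synthetic)
        yield doc

if __name__ == "__main__":
    # Toy example: mix ~1% synthetic value documents into a fake corpus.
    corpus = (f"ordinary web document {i}" for i in range(1000))
    mixed = list(mix_into_pretraining(corpus, synthetic_value_docs(50), ratio=0.01))
    print(len(mixed), mixed[:3])
```

The real version would obviously generate the documents with a model rather than templates, but the mixing-ratio idea is the same.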
Is anyone doing this?
Maybe, but not at any large scale. I also get why someone might not want to do it: making an intentional effort to stably align an AI this way would probably be costly (unless automated synthetic-data generation more or less works).