But I’m not really aware of any compelling alternatives to this class of plan: “training a model based on a reward signal” is basically all of machine learning, so if you wanted an alignment strategy that’s competitive, I don’t see what else you could do.
There is an alternative. Rather than applying the reward signal to the model’s own outputs, apply it to the pretraining corpus, or to samples generated by humans or some weaker model. This removes the possibility of a very capable model using very capable persuasion techniques to game the training process. (The problem is that this is heavily off-policy: the reward feedback is less concentrated in the region the policy actually occupies, so you likely need more of it.)
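To make the distinction concrete, here is a minimal toy sketch (my own illustration, not from the comment above; the `reward` function, the corpus, and the two-option policy are all hypothetical). The on-policy setup scores samples drawn from the current policy, so the policy influences what gets scored; the off-policy setup attaches reward to a fixed corpus whose contents the policy can never affect.

```python
import random

def reward(text):
    # Hypothetical reward signal: prefers polite text.
    return 1.0 if "please" in text else 0.0

# Off-policy: reward is attached to a fixed corpus (human-written or
# from a weaker model) up front; the trained policy cannot steer it.
corpus = ["please pass the salt", "give me that", "please wait here"]
offpolicy_batch = [(x, reward(x)) for x in corpus]

# On-policy (RLHF-style): reward is attached to samples drawn from the
# current policy, so a capable policy shapes what the reward sees.
def policy_sample(rng):
    return rng.choice(["please comply", "do what I say"])

rng = random.Random(0)
onpolicy_batch = [(s, reward(s)) for s in (policy_sample(rng) for _ in range(3))]

# Reward-weighted fine-tuning keeps (or upweights) high-reward examples;
# in the off-policy case these weights never depend on the policy.
def kept_examples(batch):
    return [x for x, r in batch if r > 0.0]

print(kept_examples(offpolicy_batch))
```

The cost the comment notes shows up here too: most of the fixed corpus may lie far from where the policy operates, so a larger share of the off-policy reward labels is spent on irrelevant examples.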