> if it fails before you top out the scaling I think you probably lose
While I agree that arbitrary scaling is dangerous, stopping early is an option: near-human AGI need not transition to ASI until the relevant notKillEveryone problems have been solved.
The meat is all in (1) how you identify the core of human values and (2) how you identify which experiences will change the system to have less of the initial good values; but, like, figuring those two out would actually solve the problem!
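To make (2) concrete, here is a minimal sketch of gating self-training updates on measured value drift. Everything in it is hypothetical: `alignment_eval` stands in for whatever battery of evals would actually measure retention of "the initial good values", which is exactly the unsolved part, and `train_step` is a placeholder for whatever update rule the system uses.

```python
def gated_update(model, experiences, train_step, alignment_eval, tolerance=0.01):
    """Accept a self-training update only if measured values don't degrade.

    All arguments are placeholders: `train_step` proposes new weights from
    a batch of experiences, and `alignment_eval` scores how well a model
    retains its initial values (the hard, unsolved part).
    """
    baseline = alignment_eval(model)
    candidate = train_step(model, experiences)
    if baseline - alignment_eval(candidate) > tolerance:
        return model       # drift detected: reject this batch of experiences
    return candidate       # within tolerance: keep the update
```

The whole argument then reduces to whether an `alignment_eval` like this can actually be built and trusted, which is the same open problem restated.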
The alignment strategy seems to be "what we're doing right now", which is:

- feed the base model human-generated training data
- apply RL-type stuff (RLHF, RLAIF, etc.) to reinforce the good type of internet-learned behavior patterns
This could definitely fail eventually if RLAIF-style self-improvement is allowed to go on long enough. But crucially, especially with RLAIF and other strategies that set the AI to training itself, there's a scalable, mostly aligned intelligence right there that can help. We're not trying to safely align a demon so much as avoid getting to "demon" from the somewhat aligned thing we have now.
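For what that loop looks like, here is a minimal sketch of one RLAIF-style round, assuming the current model is good enough to judge its own outputs. Every callable is a placeholder supplied by the caller, not a real library API; real pipelines (e.g. PPO or DPO on a pretrained LM) have far more moving parts.

```python
def rlaif_round(model, prompts, judge, fit_reward_model, rl_finetune):
    """One round of AI-feedback self-improvement (all arguments hypothetical).

    The "mostly aligned" current model labels preferences over pairs of
    its own samples; a reward model is fit to those labels; RL then pushes
    the policy toward the preferred behavior.
    """
    pairs = [(p, model.sample(p), model.sample(p)) for p in prompts]
    # the model we already have supplies the feedback signal
    prefs = [judge(model, a, b) for _, a, b in pairs]
    reward = fit_reward_model(pairs, prefs)
    return rl_finetune(model, reward)
```

The failure mode lives in iterating this: each round trains on labels produced by the previous round's model, so small value errors can compound, which is the case for stopping early rather than scaling indefinitely.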