I think this is correct in alignment-is-easy worlds but incorrect in alignment-is-hard worlds (corresponding to the “optimistic scenarios” and “pessimistic scenarios” in Anthropic’s Core Views on AI Safety). Logic like this is a large part of why I think there’s still substantial existential risk even in alignment-is-easy worlds, especially if we fail to identify that we’re in one. My current guess is that if we stayed exclusively in the paradigm of pre-training plus small amounts of RLHF/CAI, that would constitute a sufficiently easy world for this view to be correct. But I don’t expect us to stay in that paradigm, and I think paradigms involving substantially more outcome-based RL (e.g. as used in OpenAI’s o1) are likely to be much harder, which would make this view incorrect.
I agree that easy vs. hard worlds influence the chance of AI taking over.
But are you also claiming they influence the badness of takeover conditional on it happening? (That’s the subject of my post.)
I think it affects both, since alignment difficulty determines both the probability that the AI will have values that cause it to take over and the expected badness of those values conditional on takeover.
I agree that things would be harder, mostly because of the potential for sudden capability breakthroughs once RL is in the mix, combined with stronger incentives to use automated rewards. But I don’t think it’s so much harder that the post becomes incorrect. My basic reason is that I believe the central alignment insights, such as data mattering a lot more than inductive bias for alignment purposes, still hold in the RL regime, so we can control values by controlling data.
Also, depending on your values, extinction by AI can be preferable to certain humans taking over, if those humans are willing to impose severe suffering on you, and that can definitely happen if humans align AGI/ASI.
What do you mean? As in, you would filter specific data out of the post-training step? What specifically would you be trying to prevent the model from learning?
I was thinking of adding synthetic data about our values in the pretraining step.
Is anyone doing this?
Maybe, but not at any large scale, though I get why someone might not want to do it: it would probably be costly to do this as an intentional effort to stably make an AI aligned (unless synthetic data automation more or less works).
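For concreteness, here is a minimal toy sketch of the kind of thing “adding synthetic data about our values in the pretraining step” could mean: mixing a small pool of synthetic, values-focused documents into the pretraining data stream at some fixed rate. The function name, the mixing ratio, and the in-memory corpora are illustrative assumptions, not any lab’s actual pipeline.

```python
import random
from typing import Iterable, Iterator


def mix_streams(
    web_docs: Iterable[str],
    values_docs: Iterable[str],
    values_fraction: float = 0.01,
    seed: int = 0,
) -> Iterator[str]:
    """Yield pretraining documents, injecting a synthetic values document
    with probability `values_fraction` before each web document."""
    rng = random.Random(seed)
    values_iter = iter(values_docs)
    for doc in web_docs:
        if rng.random() < values_fraction:
            try:
                yield next(values_iter)
            except StopIteration:
                pass  # synthetic pool exhausted; keep yielding web data
        yield doc


if __name__ == "__main__":
    # Tiny in-memory lists standing in for real corpora.
    web = [f"web document {i}" for i in range(1000)]
    values = [f"synthetic values document {i}" for i in range(50)]
    mixed = list(mix_streams(web, values, values_fraction=0.02))
    print(sum("values" in d for d in mixed), "synthetic docs injected")
```

Doing this at scale would presumably involve generating the synthetic documents themselves (the “synthetic data automation” mentioned above), which is the costly part; the mixing step itself is cheap.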