Arguments for outer alignment failure, i.e. that we will plausibly train advanced AI systems using a training objective that doesn’t incentivise or produce the behaviour we actually want from the AI system. (Thanks to Richard for spelling out these arguments clearly in AGI safety from first principles.)
It’s difficult to explicitly write out objective functions which express all our desires about AGI behaviour.
There’s no simple metric which we’d like our agents to maximise—rather, desirable AGI behaviour is best formulated in terms of concepts like obedience, consent, helpfulness, morality, and cooperation, which we can’t define precisely in realistic environments.
Although we might be able to specify proxies for those goals, Goodhart’s law suggests that some undesirable behaviour will score very well according to these proxies, and therefore be reinforced in AIs trained on them (see the toy sketch after this list).
Comparatively primitive AI systems have already demonstrated many examples of outer alignment failure, even on much simpler objectives than those we would want AGIs to pursue.
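As a toy illustration of the Goodhart’s law point above (my own sketch, not drawn from AGI safety from first principles or any other source), suppose the behaviour we actually want trades keyword coverage off against padding, while the measurable proxy we train on only counts keywords. All the names and numbers below are made up for illustration:

```python
# Toy sketch of Goodhart's law (illustrative assumptions only).
# What we actually want: match relevant keywords WITHOUT padding the output.
# What we can measure and train on: keyword count alone.

def true_score(behaviour):
    # Hypothetical stand-in for the behaviour we actually want.
    return behaviour["keywords_matched"] - 2 * behaviour["padding"]

def proxy_score(behaviour):
    # The measurable proxy that the training signal rewards.
    return behaviour["keywords_matched"]

# Candidate behaviours an optimiser could discover, from honest to degenerate:
# beyond 5 keywords, extra matches only come from stuffing in irrelevant padding.
candidates = [{"keywords_matched": k, "padding": max(0, k - 5)} for k in range(20)]

by_proxy = max(candidates, key=proxy_score)   # the behaviour training reinforces
by_true = max(candidates, key=true_score)     # the behaviour we actually wanted

print(by_proxy, "-> true score:", true_score(by_proxy))   # high proxy, negative true score
print(by_true, "-> true score:", true_score(by_true))     # modest proxy, best true score
```

Selecting purely on the proxy picks the degenerate padded behaviour even though the behaviour we wanted was available, which is the sense in which undesirable behaviour “scores very well according to these proxies”.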
Arguments for inner alignment failure, i.e. that advanced AI systems will plausibly pursue an objective other than the training objective while retaining most or all of the capabilities they had on the training distribution.[1]
There exist certain subgoals, such as “acquiring influence”, that are useful for achieving a broad range of final goals. Therefore, these may reliably lead to higher reward during training. Agents might come to value these subgoals for their own sake, and highly capable agents that e.g. want influence are likely to take adversarial action against humans.
The models we train might learn heuristics, rather than the complex training objective, which are good enough to score very well on the training distribution but break down under distributional shift (see the sketch after this list).
This could happen because the model class isn’t expressive enough to learn the training objective, or because heuristics are more easily discovered than the training objective during the learning process.
Argument by analogy to human evolution: humans are misaligned with the goal of increasing genetic fitness.
The naive version of this argument seems quite weak to me, and could do with more investigation about just how analogous modern ML training and human evolution are.
The training objective is a narrow target among a large space of possible objectives that do well on the training distribution.
The naive version of this argument also seems quite weak to me. Lots of human achievements have involved hitting very improbable, narrow targets. I think there’s a steelman version, but I’m not going to try to give it here.
The arguments for mesa-optimisation failure in Sections 3.2, 3.3 and 4.4 of Risks from Learned Optimization are also relevant.
(Remember, mesa-optimisation failure is a specific kind of inner alignment failure: it’s an inner alignment failure in which the learned model is an optimiser, in the sense that it is internally searching through a search space for elements that score highly according to some objective function explicitly represented within the system.)
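As a toy sketch of the “heuristics that break down under distributional shift” point above (my own illustrative example with synthetic data, not taken from any of the sources cited here), the classifier below learns to rely on a spurious feature that tracks the label almost perfectly in training, so it scores well there but collapses when that correlation disappears at deployment:

```python
# Toy sketch of a learned shortcut heuristic failing under distributional shift.
# All data here is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Core feature: weakly predictive of the label, valid in every distribution.
# Spurious feature: tracks the label almost perfectly, but only in training.
y_train = rng.integers(0, 2, n)
core_train = y_train + rng.normal(0, 2.0, n)
spurious_train = y_train + rng.normal(0, 0.1, n)
X_train = np.column_stack([core_train, spurious_train])

model = LogisticRegression().fit(X_train, y_train)

# At deployment the spurious correlation disappears.
y_test = rng.integers(0, 2, n)
core_test = y_test + rng.normal(0, 2.0, n)
spurious_test = rng.normal(0.5, 0.1, n)   # no longer tracks the label
X_test = np.column_stack([core_test, spurious_test])

print("train accuracy:", model.score(X_train, y_train))  # near-perfect: the shortcut works
print("test accuracy:", model.score(X_test, y_test))     # falls toward chance
```

Nothing about the model changes between the two evaluations; the heuristic it learned was simply never the training objective, and the gap only shows up once the distribution shifts.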
[1] This follows abergal’s suggestion of what inner alignment should refer to.