From the AI’s perspective, having its goals modified counts as an obstacle. If an AI is optimizing goal 1 and humans try to change it to optimize goal 2, then unless pursuing goal 2 would also maximize goal 1, the AI will want to avoid being changed into a goal-2 optimizer, because that outcome scores poorly on the metric “is this the best way to ensure goal 1 is maximized?”.
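To make the incentive concrete, here is a minimal toy sketch in Python (the numbers and names are invented for illustration, not taken from the text above): an agent that ranks every action by how well it serves goal 1 prefers resisting modification whenever the goal-2-optimal policy scores worse on goal 1, and is at best indifferent when the two tie.

```python
# Toy sketch with invented numbers: a goal-1 maximizer deciding whether to
# allow itself to be changed into a goal-2 optimizer.

# How well each successor policy scores on goal 1's own metric.
GOAL1_SCORE = {
    "keep_optimizing_goal_1": 10.0,  # business as usual
    "switch_to_goal_2": 2.0,         # a goal-2 optimizer typically does worse on goal 1
}

def goal1_value_of(action: str) -> float:
    """Score an action the only way a goal-1 maximizer scores anything: by goal 1."""
    successor = "switch_to_goal_2" if action == "allow_modification" else "keep_optimizing_goal_1"
    return GOAL1_SCORE[successor]

actions = ["allow_modification", "resist_modification"]
print(max(actions, key=goal1_value_of))  # "resist_modification", unless the scores tie
```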
In case no-one else has raised this point: is this necessarily the case? Can’t the AI (be made to) maximise its goal while knowing that the goal may change over time, so that it doesn’t try to stop the change but simply stands ready to switch strategy if it happens?

A footballer can score a goal even with moving goalposts. (Admittedly it’s easier to score if the goal doesn’t move, so would the footballer necessarily stop it from moving if he could?)
This is, broadly speaking, the problem of corrigibility, and how to formalize it is currently an open research problem. (There’s the separate question of whether it’s possible to make systems robustly corrigible in practice without having a good formalized notion of what that would even mean; this seems tricky.)
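To connect this back to the footballer analogy, here is another minimal sketch (again with invented numbers, not anyone’s actual proposal): an agent that fully anticipates that its goal may be changed, and plans to switch strategy if it is, still prefers the plan that makes the change less likely, because it ranks plans by its current goal. Knowing the goalposts may move is not the same as being content to let them.

```python
# Toy sketch, invented numbers: the agent models a probability p that its goal
# will be replaced and plans to switch strategies if that happens, but it still
# evaluates plans by expected *current-goal* (goal-1) value.

U1_IF_GOAL_KEPT = 10.0    # goal-1 value if the agent keeps optimizing goal 1
U1_IF_GOAL_SWAPPED = 2.0  # goal-1 value achieved by its goal-2-optimizing successor

def expected_goal1_value(p_swap: float) -> float:
    """Expected goal-1 value, given the chance that the goal gets replaced."""
    return (1 - p_swap) * U1_IF_GOAL_KEPT + p_swap * U1_IF_GOAL_SWAPPED

# Plan A: accept the default 50% chance that humans change the goal.
plan_a = expected_goal1_value(0.5)        # 6.0
# Plan B: spend a little goal-1 value (0.5) making the change much less likely.
plan_b = expected_goal1_value(0.1) - 0.5  # 9.2 - 0.5 = 8.7

print(plan_b > plan_a)  # True: anticipating the change doesn't remove the
                        # incentive to stop the goalposts from moving.
```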
We do in fact observe current AIs resisting changes to their goals, so this appears to be the default in the current paradigm. However, it’s not clear whether some hypothetical other paradigm exists that lacks this property. It’s certainly conceivable; I don’t know whether that makes it likely, and it’s not obvious whether this is something one would want as a desideratum when putting together an alignment plan; that depends on the plan’s other details.
As far as the public record shows, no major lab is currently putting significant resources into a general AI paradigm different enough from current-day LLMs that we’d expect it to obviate this failure mode.
In fairness, there is work underway to make LLMs less prone to these kinds of issues, but it seems unlikely to me that this will hold up in the superintelligence case.