Because inner misalignment might happen on other ‘levels’, this is another argument that the optimal way of avoiding inner alignment failures is to avoid mesa-optimization altogether, rather than attempting to ensure that any mesa-optimizer that arises is aligned.
And thus the optimal way to avoid alignment failure from AGI is to avoid creating AGI, problem solved.
So do you think that the only way to get to AGI is via a learned optimizer? I think that the definitions of AGI (and probably optimizer) here are maybe a bit fuzzy.
I think it’s pretty likely that it is possible to develop AI systems that are more competent than humans in a variety of important domains but don’t perform any kind of optimization process as part of their computation.
I think the failure case identified in this post is plausible (and likely) and is very clearly explained so props for that!
However, I agree with Jacob’s criticism here. Any AGI success story basically has to have “the safest model” also be “the most powerful” model, because of incentives and coordination problems.
Models that are themselves optimizers are going to be significantly more powerful and useful than “optimizer-free” models. So the suggestion of trying to avoid mesa-optimization altogether is a bit of a fabricated option. There is an interesting parallel here with the suggestion of just “not building agents” (https://www.gwern.net/Tool-AI).
So from where I am sitting, we have no option but to tackle aligning the mesa-optimizer cascade head-on.
AGI will require both learning and planning, and the planning component, being learned, is already a mesa-optimizer. And AGI may help create new AGI, which is also a form of mesa-optimization. Yes, it’s unavoidable.
To create friendly but powerful AGI, we need to actually align it to human values. Creating friendly but weak AI doesn’t matter.
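To make the “planning is already a learned mesa-optimizer” point concrete, here is a minimal toy sketch (my own illustration, not anything from the post or the other commenters): an outer optimizer searches over the parameters of a learned value model, and the resulting policy plans by running an inner search over actions using that model. All the names (learned_value, planning_policy, train) and the toy reward are invented for illustration.

```python
# Toy sketch: an outer search over theta trains a value model; the policy
# then *plans* by searching over actions with that learned model.
# The inner argmax over actions is the mesa-optimizer.
import numpy as np

def learned_value(theta, state, action):
    # Learned evaluation model (a simple bilinear form), shaped by the outer loop.
    W = theta.reshape(len(state), len(action))
    return float(state @ W @ action)

def planning_policy(theta, state, candidate_actions):
    # Inner optimization (planning): choose the action the learned model scores highest.
    scores = [learned_value(theta, state, a) for a in candidate_actions]
    return candidate_actions[int(np.argmax(scores))]

def outer_loss(theta, states, candidate_actions, true_reward):
    # Base objective: how well the planner's chosen actions do on the "true" reward.
    return -np.mean([true_reward(s, planning_policy(theta, s, candidate_actions))
                     for s in states])

def train(theta, states, candidate_actions, true_reward,
          steps=300, sigma=0.3, seed=0):
    # Outer optimization: simple random-search hill climbing over theta
    # (gradient-free, since the planner's argmax makes the loss piecewise constant).
    rng = np.random.default_rng(seed)
    best = theta.copy()
    best_loss = outer_loss(best, states, candidate_actions, true_reward)
    for _ in range(steps):
        cand = best + sigma * rng.normal(size=best.shape)
        loss = outer_loss(cand, states, candidate_actions, true_reward)
        if loss < best_loss:
            best, best_loss = cand, loss
    return best

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    states = [rng.normal(size=2) for _ in range(8)]
    candidate_actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
    true_reward = lambda s, a: float(s @ a)   # stand-in base objective
    theta = rng.normal(size=4)                # 2x2 bilinear value model
    theta = train(theta, states, candidate_actions, true_reward)
    print("trained theta:\n", theta.reshape(2, 2))
```

The structural point is just that there are two distinct optimization processes here: the outer search over theta, and the inner search over actions that the learned system runs at every decision, where the second only exists inside whatever the first one produced.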