> I disagree. There is a finite amount of time (probably just a few years from now IMO) before the AIs get smart enough to “lock in” their values (or even, solve technical alignment well enough to lock in arbitrary values) and the question is what goals will they have at that point.
In principle, the idea of permanently locking an AI’s goals makes sense—perhaps through an advanced alignment technique or by freezing an LLM in place and not developing further or larger models. But two factors make me skeptical that most AIs’ goals will stay fixed in practice:
1. There are lots of companies making all sorts of diverse AIs. Why would we expect all of those AIs to have locked rather than evolving goals?
2. You mention, “Fairly often, the weights of Agent-3 get updated thanks to additional training. … New data / new environments are continuously getting added to the mix.” Do goals usually remain constant in the face of new training?
For what it’s worth, I very much appreciate your post: asking which goals we can expect in AIs is paramount, and you’re comprehensive and organized in laying out different possible initial goals for AGI. It’s just less clear to me that goals can get locked in AIs, even if that were humanity’s collective wish.
And if we don’t think all AIs’ goals will be locked, then we might get better predictions by assuming the proliferation of all sorts of diverse AGIs and asking which ones will ultimately survive and proliferate, rather than assuming that human design/intention will win out and asking which AGIs we will be most likely to design. I do think the latter question is important, but only up until the point when AGIs are recursively self-modifying.