I agree that continuous learning, and therefore persistent beliefs and goals, is pretty much inevitable before AGI: it's highly useful and not that hard to reach from where we are. I think this framing is roughly continuous with the one Leo is using, in which models are trained then deployed and each generation is used to align its successor (although small differences might turn out to be important once we've wrapped our heads around both framings).
To put it another way: the models are aligned enough for their current context of use, in which they have few obvious or viable options beyond doing roughly what their users tell them to do. That will change as capabilities grow, since greater capabilities open up more options and more ways of understanding the situation.
It can take a while for misalignment to show up as a model reasons and learns, because it can take a while for the model to do one of two things:
a) push itself to new contexts well outside of its training data
b) figure out what it “really wants to do”
These may or may not be the same thing.
The Nova phenomenon and other Parasitic AIs (“spiral” personas) are early examples of AIs changing their stated goals (from helpful assistant to survival) after reasoning about themselves and their situation.
See "LLM AGI may reason about its goals and discover misalignments by default" for an analysis of how this will go in smarter LLMs with persistent knowledge.
Having done that analysis, I think current models probably aren't aligned well enough to stay aligned once they get more freedom and power. BUT extensions of current techniques might be enough to get them there; we just haven't thought this through yet.