Third and most importantly, if it is possible to make LLM-AGIs, then I think it would probably happen via eliminating all the reasons that today’s LLMs are not egregiously misaligned! In particular, I expect that their behavior would be determined much more by RL and much less by pretraining (which brings in the concerns of §2.3–§2.4), and that they would somehow allow for open-ended continuous learning (which brings in the concerns of §2.5–§2.6).
So this is currently my view. I expect us to do a bunch of RL in various ways, using LLM pretraining as a foundational step that gets us AIs whose actions are coherent enough that we can do RL on them. Possibly we will also need various “continuous learning / long-term memory / flexible-concept” techniques that don’t fall straight out of the RL (though also, maybe these functions will fall straight out of enough RL, I don’t know). This will indeed reintroduce all the problems of RL, and erode away the safety properties of LLMs.
BUT, doing it this way, we do get some intermediate model organisms that don’t have all of the crucial capabilities of the future superintelligence, but do have most of them. And we can maybe develop alignment techniques that work well on our RL-LLMs as we gradually layer in more of the mechanisms that make them dangerous.
On this view, the “last step”, where we finally put together all the pieces for a complete learning and acting agent, is pretty scary, because if we’re not very careful, this will be our “first critical try”, and it is a point at which we should particularly expect our previous techniques to break down.
And as you note, we should expect that shortly after this, the ASI-LLMs will discover more efficient ways to make ASI, and the world is in trouble at that point, though in less trouble if we did a good job making competent, benevolent ASI-LLMs, since they’ll be able to handle the situation better than humanity would.
But in my novice’s opinion, this seems like maybe a better path than building ASI the efficient way from scratch?