As you’ve emphasized, I don’t think understanding LLMs in their current form gets us all that far toward aligning their superhuman descendants. More on this in an upcoming post. But understanding current LLMs better is a start!
If (as you’ve argued) the first AGI is a scaffolded LLM system, then I think it’s even more important to try to understand how and whether LLMs have something like a functional self.
One important change in more capable models is likely to be improved memory and continuous, self-directed learning.
It seems very likely to me that scaffolding, specifically with externalized memory and goals, is likely to result in a more complicated picture than what I try to point to here. I’m not at all sure what happens in practice if you take an LLM with a particular cluster of persistent beliefs and values, and put it into a surrounding system that’s intended to have different goals and values. I’m hopeful that the techniques that emerge from this agenda can be extended to scaffolded systems, but it seems important to start with analyzing the LLM on its own, for tractability if nothing else.
For instance, we know that prompts and jailbreaks at least temporarily change the LLMs selves.
In the terminology of this agenda, I would express that as, ‘prompts and jailbreaks (sometimes) induce different personas’, since I’m trying to reserve ‘self’—or at least ‘functional self’—for the persistent cluster induced by the training process, and use ‘persona’ to talk about more shallow and typically ephemeral clusters of beliefs and behavioral tendencies.
All the terms in this general area are absurdly overloaded, and obviously I can’t expect everyone else to adopt my versions, but at least for this one post I’m going to try to be picky about terminology :).
I’d pose the question more as:
what tendencies do LLMs have toward stable “selves”,
how strong are those tendencies in different circumstances,
what training regimens affect the stability of those tendencies,
what types of memory can sustain a non-standard “self”
The first three of those are very much part of this agenda as I see it; the fourth isn’t, at least not for now.
Thanks!
If (as you’ve argued) the first AGI is a scaffolded LLM system, then I think it’s even more important to try to understand how and whether LLMs have something like a functional self.
It seems very likely to me that scaffolding, specifically with externalized memory and goals, is likely to result in a more complicated picture than what I try to point to here. I’m not at all sure what happens in practice if you take an LLM with a particular cluster of persistent beliefs and values, and put it into a surrounding system that’s intended to have different goals and values. I’m hopeful that the techniques that emerge from this agenda can be extended to scaffolded systems, but it seems important to start with analyzing the LLM on its own, for tractability if nothing else.
In the terminology of this agenda, I would express that as, ‘prompts and jailbreaks (sometimes) induce different personas’, since I’m trying to reserve ‘self’—or at least ‘functional self’—for the persistent cluster induced by the training process, and use ‘persona’ to talk about more shallow and typically ephemeral clusters of beliefs and behavioral tendencies.
All the terms in this general area are absurdly overloaded, and obviously I can’t expect everyone else to adopt my versions, but at least for this one post I’m going to try to be picky about terminology :).
The first three of those are very much part of this agenda as I see it; the fourth isn’t, at least not for now.