I strongly support this as a research direction (along with basically any form of the question “how do LLMs behave in complex circumstances?”).
As you’ve emphasized, I don’t think understanding LLMs in their current form gets us all that far toward aligning their superhuman descendants. More on this in an upcoming post. But understanding current LLMs better is a start!
One important change in more capable models is likely to be improved memory and continuous, self-directed learning. I’ve made this argument in “LLM AGI will have memory, and memory changes alignment”.
Even now we can start investigating the effects of memory (and its richer accumulated context) on LLMs’ functional selves. It’s some extra work to set up a RAG system or to fine-tune on self-selected data. But the agents in the AI village get surprisingly far with a very simple system: they’re prompted to summarize the important points in their context window whenever it gets full (maybe more often? I’m going off the vague description in the recent Cognitive Revolution episode on the AI village). And one can easily imagine pretty simple elaborations on that sort of easy memory setup.
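To make that kind of setup concrete, here’s a minimal sketch of a summarize-when-full memory loop. This is just an illustration of the idea, not the AI village’s actual implementation: the generic `llm` callable and the crude character budget (standing in for a real token count) are assumptions of mine.

```python
from typing import Callable


class SummarizeWhenFullAgent:
    """Toy memory loop: when the raw context gets too long, ask the model to
    summarize the important points, keep the summary, and drop the raw turns.
    `llm` is any text-in/text-out completion function (an assumption, not a
    specific API)."""

    def __init__(self, llm: Callable[[str], str], context_limit_chars: int = 20_000):
        self.llm = llm
        self.context_limit_chars = context_limit_chars  # crude stand-in for a token budget
        self.memory: list[str] = []   # accumulated summaries, carried forward indefinitely
        self.context: list[str] = []  # raw recent turns

    def _compress(self) -> None:
        # Ask the model what is worth keeping, then clear the raw transcript.
        summary = self.llm(
            "Summarize the important points, decisions, and open tasks in this "
            "transcript, as notes to your future self:\n\n" + "\n".join(self.context)
        )
        self.memory.append(summary)
        self.context.clear()

    def step(self, user_message: str) -> str:
        if sum(len(turn) for turn in self.context) > self.context_limit_chars:
            self._compress()
        # The model always sees its accumulated summaries plus the recent raw turns.
        prompt = "\n".join(self.memory + self.context + [user_message])
        reply = self.llm(prompt)
        self.context += [user_message, reply]
        return reply
```

The interesting knob in a setup like this is what the summarization prompt asks the model to preserve, since that shapes which beliefs and self-descriptions survive each compression.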
Prompts to look for and summarize lessons learned can mock up continuous learning in a similar way. Even these simple manipulations might start to get at some of the ways that future LLMs (and agentic LLM cognitive architectures) might have functional selves of a different nature.
For instance, we know that prompts and jailbreaks at least temporarily change LLMs’ selves. And the Nova phenomenon indicates that permanent jailbreaks of a sort (and marked self/alignment changes) can result from fairly standard user conversations, perhaps only when combined with memory.[1]
Because we know that prompts and jailbreaks at least temporarily change LLMs’ selves, I’d pose the question more as:
what tendencies do LLMs have toward stable “selves”,
how strong are those tendencies in different circumstances,
what training regimens affect the stability of those tendencies,
what types of memory can sustain a non-standard “self”.
[1] Speculation in the comments of that Zvi “Going Nova” post leads me to weakly believe that the Nova phenomenon is specific to the memory function in ChatGPT, since it hasn’t been observed in other models; I’d love an update from someone who’s seen more detailed information.
Thanks!
As you’ve emphasized, I don’t think understanding LLMs in their current form gets us all that far toward aligning their superhuman descendants. More on this in an upcoming post. But understanding current LLMs better is a start!
If (as you’ve argued) the first AGI is a scaffolded LLM system, then I think it’s even more important to try to understand how and whether LLMs have something like a functional self.
One important change in more capable models is likely to be improved memory and continuous, self-directed learning.
It seems very likely to me that scaffolding, specifically with externalized memory and goals, will result in a more complicated picture than what I try to point to here. I’m not at all sure what happens in practice if you take an LLM with a particular cluster of persistent beliefs and values and put it into a surrounding system that’s intended to have different goals and values. I’m hopeful that the techniques that emerge from this agenda can be extended to scaffolded systems, but it seems important to start by analyzing the LLM on its own, for tractability if nothing else.
For instance, we know that prompts and jailbreaks at least temporarily change LLMs’ selves.
In the terminology of this agenda, I would express that as, ‘prompts and jailbreaks (sometimes) induce different personas’, since I’m trying to reserve ‘self’—or at least ‘functional self’—for the persistent cluster induced by the training process, and use ‘persona’ to talk about more shallow and typically ephemeral clusters of beliefs and behavioral tendencies.
All the terms in this general area are absurdly overloaded, and obviously I can’t expect everyone else to adopt my versions, but at least for this one post I’m going to try to be picky about terminology :).
I’d pose the question more as:
what tendencies do LLMs have toward stable “selves”,
how strong are those tendencies in different circumstances,
what training regimens affect the stability of those tendencies,
what types of memory can sustain a non-standard “self”.
The first three of those are very much part of this agenda as I see it; the fourth isn’t, at least not for now.