This is a compelling viewpoint. However, I believe that even if we consider LLMs primarily as predictors or simulators, this doesn’t necessarily preclude them from having goals, albeit ones that might seem alien compared to human intentions. A base model’s goal is focused on next token prediction, which is straightforward, but chat models’ goals aren’t as clear-cut. They are influenced by a variety of obfuscated rewards, and this is one of the main things I want to study with LLM Psychology.
Agreed, but I would model that as a change in the distribution of minds (the second aspect) simulated by the simulator (the first aspect). While in production LLMs that change is generally fairly coherent/internally consistent, it doesn’t have to be: you could change the distribution to be more bimodal, say to contain more “very good guys” and also more “very bad guys” (indeed, due to the Waluigi effect, that does actually happen to an extent). So I’d model that as a change in the second aspect, adjusting the distribution of the “population” of minds.
> and it isn’t going to try to modify its environment to make that happen, or even to make it easier
… I don’t think it will hold true for much longer.
Completely agreed. But when that happens, it will be some member(s) of the second aspect that are doing it. The first aspect doesn’t have goals, or any model of an external environment or of anything other than tokens (except in so far as it can simulate the second aspect, which does).
> the second aspect, the simulations it runs, are of minds, plural
This is an interesting observation. However, the act of simulation by LLMs doesn’t necessarily negate the possibility of them possessing a form of ‘mind’. To illustrate, consider our own behavior in different social contexts: we often simulate different personas (with your family vs talking to an audience), yet we still consider ourselves as having a singular mind. This is the point of considering LLMs as alien minds. We need to understand why they simulate characters, with which properties, and for which reasons.
The simulator can simulate multiple minds at once. It can write fiction in which two-or-more characters are having a conversation, or trying to kill each other, or otherwise interacting, including each having a “theory of mind” for the other one. Now, I can do that too: I’m also a fiction author in my spare time, so I can mentally model multiple fictional characters at once, talking or fighting or plotting or whatever. I’ve practiced this, but most humans can do it pretty well. (It’s a useful capability for a smart language-using social species, and at a neurological level, perhaps mirror neurons are involved.)
However, in this case the simulator isn’t, in my view, even something that the word ‘mind’ is a helpful fit for; it’s more like a single-purpose automated system. But the personas it simulates are pretty realistically human-like. And the fact that there is a wide, highly-contextual distribution of them, from which samples are drawn, and that more than one sample can be drawn at once, is an important key to understanding what’s going on, and yes, that’s rather weird, even alien (except perhaps to authors). So you could do psychological experiments that involved the LLM simulating two-or-more personas arguing, or otherwise at cross purposes. (And if it had been RLHF-trained, they might soon come to agree on something rather PC.)
To provide an alternative to the shoggoth metaphor, the first aspect of LLMs is like a stage, that automatically and contextually generates the second aspect, human-like animatronic actors, one-or-more of them, who then monologue or interact, among themselves, and/or with external real-human input like a chat dialog.
Instruct training plus RLHF makes it more likely that the stage will generate one actor, one who fits the “polite assistant who helpfully answers questions and obeys requests, but refuses harmful/inappropriate-looking requests in a rather PC way” descriptor and will just dialog with the user, without other personas spontaneously appearing (which can happen quite a bit with base models, which often simulate things like chat rooms/forums). But the system is still contextual enough that you can still prompt it to generate other actors who will show other behaviors. And that’s why aligning LLMs is hard, despite them being so contextual and thus easy to direct: the user and random text are also part of the context.
How much have you played around with base models, or models that were instruct trained to be helpful but not harmless? Only smaller open-source ones are easily available, but if you haven’t tried it, I recommend it.
I did spend some time with base models and helpful-but-not-harmless assistants (even though most of my current interactions are with chatgpt4), and I agree with your observations and comments here.
That said, I feel we should be cautious about the gap between what we think we observe and what is actually happening. This stage and human-like animatronic metaphor is good, but we can’t really distinguish yet whether there is only a scene with actors, or whether there is actually a puppeteer behind it.
Anyway, I agree that ‘mind’ might be a bit confusing while we don’t know more, and for now I’d better stick to the word cognition instead.
My beliefs about the goallessness of the stage aspect are based mostly on theoretical/mathematical arguments from how SGD works. (For example, if you took the same transformer setup, but instead trained it only on a vast dataset of astronomical data, it would end up as an excellent simulator of those phenomena with absolutely no goal-directed behavior; I gather DeepMind recently built a weather model that way. An LLM simulates agentic human-like minds only because we trained it on a dataset of the behavior of agentic human minds.) I haven’t written these ideas up myself, but they’re pretty similar to some that porby discussed in “FAQ: What the heck is goal agnosticism?”, to those Janus describes in “Simulators”, and to ones that, I gather, a number of other authors have written about under the tag Simulator Theory. And the situation does become a little less clear once instruct training and RLHF are applied: the stage is getting tinkered with so that it by default tends to generate one actor ready to play the helpful assistant part in a dialog with the user, which is a complication of the stage, but I still wouldn’t view it as now having a goal or puppeteer, just a tendency.
I think I am a bit biased by chat models, so I tend to generalize my intuition around them and forget to specify that. I think for base models, it indeed doesn’t make sense to talk about a puppeteer (or at least not with the current versions of base models). From what I gathered, I think the effects of fine-tuning are a bit more complicated than just building a tendency, which is why I have doubts there. I’ll discuss them in the next post.
It’s only a metaphor: we’re just trying to determine which one would be most useful. What I see as a change to the contextual sample distribution of animatronic actors created by the stage might also be describable as a puppeteer. Certainly if the puppeteer is the org doing the RLHF, that works. I’m cautious, in an alignment context, about making something sound agentic and potentially adversarial if it isn’t. The animatronic actors definitely can be adversarial while they’re being simulated; the base model’s stage definitely isn’t, which is why I wanted to use a very impersonal metaphor like “stage”. For the RLHF puppeteer I’d say the answer is that it’s not human-like, but it is an optimizer, and it can suffer from all the usual issues of RL like reward hacking, such as a tendency towards sycophancy. So basically, yes, the puppeteer can be an adversarial optimizer, so I think I just talked myself into agreeing with your puppeteer metaphor if RL was used. There are also approaches to instruct training that only use SGD for fine-tuning, not RL, and there I’d claim that the “adjust the stage’s probability distribution” metaphor is more accurate, as there are no possibilities for reward hacking: sycophancy will occur if and only if you actually put examples of it in your fine-tuning set, so you get what you asked for. Well, unless you fine-tuned on a synthetic dataset written by, say, GPT-4 (as a lot of people do), which is itself sycophantic since it was RL trained, so wrote you a fine-tuning dataset full of sycophancy…
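To make the “adjust the stage’s probability distribution” picture concrete, here is a deliberately tiny sketch. Everything in it is an assumption for illustration: the “stage” is reduced to a softmax over three made-up persona labels, and “fine-tuning” is plain gradient descent on cross-entropy against hypothetical example counts. A real LLM generalizes across contexts in ways this toy ignores, so treat it as an intuition pump for the supervised case, not as evidence.

```python
import math

# Hypothetical persona labels; not taken from any real model or dataset.
PERSONAS = ["helpful", "sycophantic", "hostile"]


def softmax(logits):
    """Turn logits into a probability distribution over personas."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fine_tune(logits, counts, steps=2000, lr=0.5):
    """Full-batch gradient descent on mean cross-entropy.

    `counts[i]` is how many fine-tuning examples show persona i.
    For a softmax, the gradient of the loss with respect to the
    logits is (predicted probabilities - empirical frequencies).
    """
    n = sum(counts)
    target = [c / n for c in counts]
    logits = list(logits)
    for _ in range(steps):
        probs = softmax(logits)
        logits = [l - lr * (p - t) for l, p, t in zip(logits, probs, target)]
    return softmax(logits)


base_logits = [0.0, 0.0, 0.0]  # toy "base model": all personas equally likely

# A fine-tuning set with no sycophantic examples at all.
clean = fine_tune(base_logits, counts=[100, 0, 0])

# A set where 20% of examples are sycophantic (e.g. synthetic data
# written by a model that is itself sycophantic).
tainted = fine_tune(base_logits, counts=[80, 20, 0])

for name, dist in [("clean", clean), ("tainted", tainted)]:
    print(name, {p: round(q, 4) for p, q in zip(PERSONAS, dist)})
```

In this toy, the fitted distribution just moves toward the empirical frequencies of the fine-tuning set: the sycophantic persona is driven toward zero when there are no examples of it, and toward roughly its share of the data when there are. There is no reward signal here for an optimizer to exploit, which is the contrast with the RL case.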
This metaphor is sufficiently interesting/useful/fun that I’m now becoming tempted to write a post on it: would you like to make it a joint one? (If not, I’ll be sure to credit you for the puppeteer.)
Very interesting! Happy to have a chat about this / possible collaboration.