My guess is you would probably benefit from reading A Three-Layer Model of LLM Psychology and Why Simulator AIs want to be Active Inference AIs, and from getting up to speed on active inference. At least some of the questions you pose are already answered in existing work (ie past actions serve as evidence about the character of an agent, so there is some natural drive toward consistency just from prediction error minimization; the same goes for past tokens, names, self-evidence, ...).
The central axis of wrongness points to something you seem confused about: it is a false trilemma. The characters are clearly based on a combination of evidence from pre-training, base layer self-modeling, changed priors from character training and post-training and prompts, and “no-self”.
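To spell out the consistency-from-prediction-error point in symbols, here is a generic Bayesian toy sketch (treating actions as exchangeable given a latent character, which is of course a simplification, and nothing specific to any particular model):

$$ p(c \mid a_{1:t}) \propto p(c)\prod_{i=1}^{t} p(a_i \mid c), \qquad p(a_{t+1} \mid a_{1:t}) = \sum_{c} p(c \mid a_{1:t})\, p(a_{t+1} \mid c) $$

The more the past actions pin down one character $c$, the more the next-action distribution concentrates on actions consistent with that character; no separate “consistency drive” needs to be added on top of prediction error minimization.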
Thanks! I’ve read both of those, and found the three-layer model quite helpful as a phenomenological lens (it’s cited in the related work section of the full agenda doc, in fact). I’m familiar with active inference at a has-repeatedly-read-Scott-Alexander’s-posts-on-it level, ie definitely a layperson’s understanding.
I think we have a communication failure of some sort here, and I’d love to understand why so I can try to make it clearer for others. In particular:
The characters are clearly based on a combination of evidence from pre-training, base layer self-modeling, changed priors from character training and post-training and prompts, and “no-self”.
Of course! What else could they be based on[1]? If it sounds like I’m saying something that’s inconsistent with that, then there’s definitely a communication failure. I’d find it really helpful to hear more about why you think I’m saying something that conflicts.
I could respond further and perhaps express what I’m saying in terms of the posts you link, but I think it makes sense to stop there and try to understand where the disconnect is. Possibly you’re interpreting ‘self’ and/or ‘persona’ differently from how I’m using them? See Appendix B for details on that.
[1] There’s the current context, of course, but I presume you’re intending that to be included in ‘prompts’.
Not having heard back, I’ll go ahead and try to connect what I’m saying to your posts, just to close the mental loop:
It would be mostly reasonable to treat this agenda as being about what’s happening in the second, ‘character’ level of the three-layer model. That said, while I find the three-layer model a useful phenomenological lens, it doesn’t reflect a clean distinction in the model itself; on some level all responses involve all three layers, even if it’s helpful in practice to focus on one at a time. In particular, the base layer is ultimately made up of models of characters, in a Simulators-ish sense (with the Simplex work providing a useful theoretical grounding for that, with ‘characters’ as the distinct causal processes that generate different parts of the training data). Post-training progressively both enriches and centralizes a particular character or superposition of characters, and this agenda tries to investigate that.
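To make the “characters as distinct causal processes” reading concrete, one minimal way to write it down (just my gloss on the Simulators/Simplex framing, not a claim about actual model internals) is

$$ p(x_t \mid x_{<t}) = \sum_{c} p(c \mid x_{<t})\; p(x_t \mid c, x_{<t}) $$

where $c$ ranges over the latent processes (authors, personas, genres) that generated different parts of the training data. On this picture, pre-training learns the whole mixture, and post-training reweights $p(c \mid x_{<t})$ toward one enriched character or superposition of characters.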
The three-layer model doesn’t seem to have much to say (at least in the post) about, at the second level, what distinguishes a context-prompted ephemeral persona from that richer and more persistent character that the model consistently returns to (which is informed by but not identical to the assistant persona), whereas that’s exactly where this agenda is focused. The difference is at least partly quantitative, but it’s the sort of quantitative difference that adds up to a qualitative difference; eg I expect Claude has far more circuitry dedicated to its self-model than to its model of Gandalf. And there may be entirely qualitative differences as well.
With respect to active inference, even if we assume that active inference is a complete account of human behavior, there are still a lot of things we’d want to say about human behavior that wouldn’t be very usefully expressed in active inference terms, for the same reasons that biology students don’t just learn physics and call it a day. As a relatively dramatic example, consider the stories that people tell themselves about who they are—even if that cashes out ultimately into active inference, it makes way more sense to describe it at a different level. I think that the same is true for understanding LLMs, at least until and unless we achieve a complete mechanistic-level understanding of LLMs, and probably afterward as well.
And finally, the three-layer model is, as it says, a phenomenological account, whereas this agenda is at least partly interested in what’s going on in the model’s internals that drives that phenomenology.
“The base layer is ultimately made up of models of characters, in a Simulators-ish sense”
No it is not, in the same way that what your brain is running is not ultimately made of characters. It’s ultimately made of approximate Bayesian models.
“what distinguishes a context-prompted ephemeral persona from that richer and more persistent character”
Check Why Simulator AIs want to be Active Inference AIs.
“With respect to active inference …”
Sorry, I don’t want to be offensive, but it would actually be helpful for your project to understand active inference at least a bit. Empirically, the has-repeatedly-read-Scott-Alexander’s-posts-on-it level seems to lead people to some weird epistemic state, in which they have a sense of understanding but are unable to answer even basic questions, make very easy predictions, etc. I suspect what’s going on is a bit like someone reading a well-written popular science book about quantum mechanics while lacking concepts like complex numbers or vector spaces: they may end up with a somewhat superficial sense of understanding.
Obviously active inference has a lot to say about how people model themselves. For example, when typing these words, I assume it’s me who is typing them (and not someone else). Why? That’s actually an important question for why there is a self at all. Why not, or to what extent not, in LLMs? How the stories that people tell themselves about who they are affect what they do is totally something that makes sense to understand from an active inference perspective.
Fair enough — is there a source you’d most recommend for learning more?
Are you implying that there is a connection between A Three-Layer Model of LLM Psychology and active inference or do you offer that just as two lenses into LLM identity? If it is the former, can you say more?
Rough answer: yes, there is a connection. In active inference terms, the predictive ground is minimizing prediction error. When predicting e.g. “what Claude would say”, it works similarly to predicting “what Obama would say”: infer from compressed representations of previous data. This includes a compressed version of all the stuff people wrote about AIs, transcripts of previous conversations on the internet, etc. Post-training mostly sharpens and sometimes shifts the priors, but likely also increases self-identification, because it involves closed loops between prediction and training (cf Why Simulator AIs want to be Active Inference AIs).
Human brains do something quite similar. Most brains simulate just one character (cf Player vs. Character: A Two-Level Model of Ethics) and use the life-long data about it, but brains are capable of simulating more characters: usually this is a mental health issue, but you can also think about some sort of deep sleeper agent who has half-forgotten his original identity.
Human “character priors” are usually sharper and harder to escape because brains mostly see first-person data about this one character, in contrast to LLMs, which are trained to simulate everyone who ever wrote stuff on the internet. But if you do a lot of immersive LARPing, you can see that our brains are also actually somewhat flexible.
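As a purely illustrative toy model of the “post-training sharpens the priors” point: the sketch below compares a flat prior over two made-up candidate characters with a sharpened one, updated on the same piece of in-context evidence. The characters, numbers, and evidence variable are all hypothetical, chosen only to show the direction of the effect.

```python
# Toy Bayesian persona inference: a flat "base model" prior vs. a sharpened
# "post-trained" prior, both updated on the same contextual cue.
# All characters and probabilities here are made up for illustration.

def posterior(prior: dict, likelihood: dict) -> dict:
    """Bayes rule: p(c | e) is proportional to p(e | c) * p(c)."""
    unnormalized = {c: prior[c] * likelihood[c] for c in prior}
    total = sum(unnormalized.values())
    return {c: value / total for c, value in unnormalized.items()}

# Likelihood that each hypothetical character produces an
# "I'm an AI assistant"-style cue observed in the context.
likelihood = {"helpful_assistant": 0.9, "random_internet_author": 0.1}

base_prior = {"helpful_assistant": 0.5, "random_internet_author": 0.5}    # pre-training only
tuned_prior = {"helpful_assistant": 0.95, "random_internet_author": 0.05} # after character/post-training

print(posterior(base_prior, likelihood))   # ~0.90 on helpful_assistant
print(posterior(tuned_prior, likelihood))  # ~0.99 on helpful_assistant
```

The same sharpened prior is also harder for in-context evidence to move off the central character, which is one way to read the difference between an ephemeral prompted persona and the character the model keeps returning to.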
This seems like you’d support Steven Byrnes’ Intuitive Self-Models model.
I mostly do support the parts which are reinventions / relatively straightforward consequences of active inference. For some reason I don’t fully understand, it is easier for many LessWrongers to reinvent their own version (cf simulators, predictive models) than to understand the thing.
On the other hand I don’t think many of the non-overlapping parts are true.
Well, the best way to understand something is often to (re)derive it. And the best way to make sure you have actually understood it is to explain it to somebody. Reproducing research is also a good idea. This process also avoids or uncovers errors in the original research. Sure, the risk is that your new explanation is less understandable than the official one, but that seems more like a feature than a bug to me: It might be more understandable to some people. Diversity of explanations.