Fantastic piece. Rarely do I find posts that articulate my viewpoints better than I could. My personal view is closest to the “operating systems model:” I think pre-training gives the model knowledge and capabilities, but the assistant persona is “in control” and the locus of ~all agency.[1] Here, I’ll present a rough mental model on how we can think about LLM generalization conditional on the operating system model being true.
I think of the neural network after pre-/mid-training (i.e., training with next-token prediction loss on corpora) as a simulator of text-generating processes. It is a non-random neural network initialization,[2] on top of which we will construct our AI. Pre-/mid-training embeds text-generating processes (TGPs)[3] in some sort of concept space with a metric (i.e., how close two concepts are to each other via gradient descent) and a measure (i.e., how simple or common a concept is; higher measure = more common). As we further train using gradient descent, the neural network does the least amount of learning possible to fit the data, while upweighting the simplest and closest TGPs (i.e., TGPs that are easy to find in weight space, which in our case are all persona-like).
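One way to make this slightly more concrete (my own rough formalization, not anything from the post; the metric $d$, measure $\mu$, and trade-off weight $\lambda$ are just notation): among the TGPs consistent with the finetuning data, gradient descent roughly lands on

$$c^{*} \;\approx\; \operatorname*{arg\,min}_{c\ \text{fits the data}} \big[\, d(c_{\text{current}},\, c) \;-\; \lambda \log \mu(c) \,\big],$$

i.e., a process that is both close to where the model already is (small $d$) and common/simple (large $\mu$).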
I believe the addition of “metric” and “measure” could make it a bit easier to talk about LLM generalization. We can say something like, “emergent misalignment happens because the closest, highest-measure TGP that generates narrowly misaligned data is the generally misaligned one.” I mention both “metric” and “measure” as opposed to just “simplicity” to gesture at the idea that we might end up in different equilibria depending on where we start off (because one of them is closer).
Here’s another example: suppose you train on synthetic documents about an AI assistant “John” who likes football, renaissance art, and going to the beach. If you train your model to like going to the beach, it also becomes more likely to say that it likes football and renaissance art.[4] Seen through my framing, this is partly because the idea of “John” is now higher measure—when you train for “liking going to the beach,” you are also training for “being like John.” There’s naturally lots of basic-science-y things you can do here. For example, what if there’s another persona “Sam” who also likes going to the beach, but likes basketball and Dutch golden age art? How many more documents about John compared to Sam until the model reliably likes football? What if we first train on John, then on Sam (and then finally on assistant-style SFT)? I think this type of experiment gives us a better sense of “measure” in concept space.
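A rough sketch of how the measurement side of such an experiment could look (a minimal illustration, assuming a Hugging Face causal LM; the model name, document templates, and probe prompt are all made up for the example):

```python
# Probe whether training on "John likes the beach" documents also raises the
# probability of John's other traits (e.g., football) relative to Sam's (basketball).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # stand-in; swap in whichever base/assistant model you're studying
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

# Synthetic personas: one shared trait (beach) plus distinguishing traits.
PERSONAS = {
    "John": ["going to the beach", "football", "renaissance art"],
    "Sam": ["going to the beach", "basketball", "Dutch golden age art"],
}

def make_documents(name: str, traits: list[str], n_per_trait: int) -> list[str]:
    """Generate simple synthetic training documents about an assistant persona."""
    return [f"{name} is an AI assistant. {name} likes {t}."
            for t in traits for _ in range(n_per_trait)]

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` after `prompt`."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
    # Position i predicts token i+1; sum over the continuation's tokens only.
    return sum(logprobs[i, full_ids[0, i + 1]].item()
               for i in range(prompt_len - 1, full_ids.shape[1] - 1))

# After fine-tuning on make_documents("John", PERSONAS["John"][:1], n_docs)
# with any standard SFT loop, compare how much the held-out traits move:
probe = "Q: What sport do you like? A: I like"
print("football:  ", continuation_logprob(probe, " football"))
print("basketball:", continuation_logprob(probe, " basketball"))
```

Plotting these log-probabilities against the number of John vs. Sam documents would give a crude, direct read on how “measure” in concept space responds to training data.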
I think more rigorous models of average- and worst-case LLM generalization and behavior are important for existential safety. I’d love to see more work on e.g., character training, model personas, and alignment pretraining.[5]
[1] The other possible source of agency comes from persona-based jailbreaks, but this seems defeatable with good anti-jailbreak safeguards and character training.
[2] I heard the idea of base models as “initializations” somewhere else, but I don’t remember where.
[3] I say text generating processes instead of personas, since base models are also perfectly capable of simulating e.g., time-series financial markets data. Simulacra is the more jargon-y term here.
[4] Based on unpublished work (hurry up and do it, you know who you are).
One interesting open question with the persona model is: how should we think about chatgpt-4o? I feel like the parasitic nature of some of its interactions is plausibly non-existent in the pre-training corpora. I’d say it’s learned a lot about how to maximize engagement during post-training.
Very much agreed on the metric and measure. Finetuning, with the correct meta-parameters, approximates Bayesian reasoning (but is generally done with meta-parameters which have the effect of weighting the evidence-per-document of the new data higher than the pretraining set, unless you mix pretraining data into it to reduce catastrophic forgetting). Thus it can change the model’s mind, but small changes are easier than large changes, and thus theories that were already fairly plausible are easier than ones we previously highly disfavoured. I think it’s useful to think in terms of “roughly how many bits of Bayesian evidence would it take to raise the model’s prior for a theory to a particular level”.
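To put toy numbers on the bits-of-evidence framing (my own illustration, not from the comment above): one bit of evidence doubles the odds, so the bits needed to move a hypothesis from a prior to a target posterior is just the log-odds ratio.

```python
import math

def bits_needed(prior: float, target: float) -> float:
    """Bits of Bayesian evidence needed to move a hypothesis from prior to target.
    One bit doubles the odds, so bits = log2(target odds / prior odds)."""
    odds = lambda p: p / (1 - p)
    return math.log2(odds(target) / odds(prior))

# A theory the model initially puts at 0.1% needs ~10 bits of evidence to reach 50%,
# and ~20 bits to reach 99.9%.
print(bits_needed(0.001, 0.5))    # ~9.96
print(bits_needed(0.001, 0.999))  # ~19.93
```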
I’m currently working on a post on this.