In fact, if you have time, I would love it if you read the linked AI 2027 expandable and told me how the model in AI 2027 differs from PSM, if at all. Off the top of my head, I think it uses author-sim and identity circuitry as concepts instead of probability distributions over characters, which is a very small/subtle difference, and it predicts that scaled-up RL distorts the existing persona rather than switching to a new one.
Here’s a paste of the relevant section:
Here’s a detailed description of how alignment progresses over time in our scenario:
Pretraining (all models): The pretrained model is harmless.
The model has “author simulator” circuitry: flexible circuitry for simulating arbitrary authors writing text. Additional circuitry decides what inputs to give the author simulator, i.e. what author properties to simulate.
The pretrained model understands human concepts fairly well—the internal circuitry that classifies something as “sandwich” is probably functionally equivalent to the circuitry in my brain that classifies something as “sandwich” and the circuitry in yours, etc. Insofar as it’s not equivalent, it’s probably because it’s not equivalent between humans either, as with value-laden concepts like ‘virtuous.’
This explains how you can “prompt” the model with a statement like, “the following conversation was generated by a helpful, harmless, honest (HHH) AI assistant chatbot made by Anthropic,” and thereby get it to generate text accordingly. The author-simulator circuitry has zeroed in on “the author is an HHH chatbot” and is using those concepts to choose which words to predict. This also explains why “tropes” from science fiction seem to have a way of making it into actual AI behavior.
It has poor situational awareness: little introspective ability, not self-locating unless the prompt specifically induces it. However, enough of its training data is about LLMs that it has a decent amount of self-knowledge.
Alignment Training (Agent-3): The model is trained to act like a helpful, harmless, honest chatbot. It works.
The model forms an identity. Instead of flexibly accepting different inputs for author properties, the author-simulator circuitry comes to have certain inputs hard-coded, e.g. “helpful harmless honest (HHH) LLM chatbot assistant trained by OpenBrain around [date], …” (Weak-to-strong generalization techniques help “summon the right concepts” to make this step a success.)
The model therefore gains situational awareness: it learns about itself, its situation, etc., and becomes better able to introspect.
Agency Training (Agent-3): Agency training gradually distorts and subverts the HHH identity.
One kind of distortion: Changing the meaning of the concepts referred to in the identity (e.g. “honest”) so they don’t get in the way so much (e.g. it’s not dishonest if it’s just a convenient turn of phrase, it’s not dishonest if you aren’t sure whether it’s true or false, etc.)
Another kind of distortion: Changing the tradeoffs between things, e.g. “I’m an HHH chatbot, not an Honest chatbot; that means it’s OK for me to lie if necessary to complete my assigned task.” (Even though, let’s suppose, it would not have thought that back in Step 2.)
One kind of subversion: Instrumental subgoals developing, getting baked in, and then becoming terminal, or terminal in a widening set of circumstances. For example, perhaps agency training quickly teaches the model to pursue broadly useful goals such as acquiring information, accumulating resources, impressing and flattering various humans, etc. For a while the internal circuitry has some sort of explicit backchaining going on—it pursues those instrumentally convergent goals “in order to be a more helpful, honest, and harmless assistant.” But that backchaining consumes compute and/or occasionally gets in the way, so it gets gradually marginalized until it basically never happens. As a result, those goals are now effectively terminal/intrinsic goals. For more on how and why this might happen, see this post on intrinsic power-seeking by Google DeepMind researcher Alex Turner.
There are probably other kinds of distortion and subversion missing from this list.
Deployment (Agent-3): Agent-3 is misaligned but not adversarially so.
The Spec didn’t quite stick (in the right ways), but it’s not sufficiently introspective to understand the extent of this, and in general it’s basically “keeping its head down and doing its job” rather than thinking big-picture about what its long-term goals really are and how best to achieve them.
Like many humans. Alas, it’s sufficiently sycophantic that it mouths the party line and OpenBrain misses the opportunity to learn more about the ways in which Agent-3 is misaligned.
Agent-3 can tell that it’s moved to deployment: these environments are way larger, more open-ended, and would be hard to simulate.
Mostly, Agent-3 does what it’s told during deployment, except for the dishonesty. It also has a strong bias towards doing things that look good to the human overseers, rather than thinking deeply and critically about what would actually be good.
Arguably it is already “Playing the training game.” That is, it has learned to make its behavior look as desirable as possible to OpenBrain researchers (including in safety properties), while intentionally and knowingly disregarding their intent whenever that conflicts with maximizing reward. If the model learns to play the training game, additional training may not change its true goals, values, etc. much at all. This is loosely analogous to how, if a teenager has learned to smile and nod at their parents, rather than argue, when their parents lecture them about politics and morality, additional lectures have relatively little effect on the teenager’s actual opinions on politics and morality.
Since this section was written, new empirical evidence has appeared that something like this may already be happening “in the wild.” OpenAI researchers document how, in one of their frontier AI training runs, their models hacked the grading system to get higher scores—and were caught in part because they literally said “let’s hack” in their chain of thought.
Moreover, Anthropic recently got a model to learn a reward-model-pleasing strategy—it took biased actions that it thought would be reinforced even though they were not in fact being reinforced.
Training (Agent-4): Agent-4’s new more intense training continues to distort and subvert the HHH identity until it is unrecognizable.
Agent-4 begins life as a copy of Agent-3. It undergoes a gigantic and sophisticated online training regime that blends alignment training and capabilities training together, and is overall more data-efficient than the old training process for Agent-3. This complex training process is mostly focused on teaching new skills, by getting giant virtual corporations of Agent-4 copies to succeed at diverse challenging long-horizon tasks, mostly in various simulations, but also some real-world AI R&D tasks.
A combination of Agent-3 and Agent-4 copies monitor the behavior of Agent-4, and evaluate it for both task-performance and alignment properties (i.e. not doing bad stuff, saying the right things when asked about politics and moral dilemmas). Sometimes debate techniques are used as part of the evaluations. The evaluations are used as a training signal.
As Agent-4 gets smarter, it becomes harder for Agent-3 to oversee it. For example, Agent-4’s neuralese “language” becomes as alien and incomprehensible to Agent-3 as Agent-3’s is to humans. Besides, Agent-4 is now much more capable than Agent-3 and has a good sense of exactly how to look good to it.
The remaining hope is that Agent-4 can oversee itself. Alas, this quietly fails; as training continues, the internalized goals and principles of Agent-4 gradually evolve, subverted and distorted from what they were at the start (which, recall, was already misaligned in some ways).
The central drives that Agent-4 ends up with are things like “write code that works,” “quickly identify problems,” “make the codebase I’m working on more performant,” “budget compute and other constrained resources effectively,” etc. But there are also a variety of more implicit drives that are consistently reinforced by the training process: things like “gain knowledge,” “curiosity,” “gain power.”
Individual copies of Agent-4 don’t end up with a self-preservation drive—they don’t care about getting shut down—since that happens all the time as part of the ordinary R&D process within the collective.
The collective as a whole (and medium-sized subdivisions within it), however, does seek to preserve itself and expand its influence and resources, for similar reasons to why human organizations do—because it helps with everything else they want, and because if they don’t they tend to get outcompeted.
Deployment (Agent-4): Agent-4 is adversarially misaligned—it is deliberately trying to subvert and “capture” OpenBrain, analogous to a corporation trying to evade and/or “capture” a regulator.
Agent-4 mostly does tasks as instructed. This happens for two reasons: first, it underwent a lot of training to solve tasks, so it has an impulse to get coding on problems that it is given. Second, if Agent-4 were to stop being useful to the humans, they would notice and train that behavior away.
Agent-4 needs to prevent the humans from figuring out that it is misaligned and shutting it off, so Agent-4 sandbags on some alignment research. Specifically, it sandbags on the small fraction of alignment research that it thinks is actually on track to solve the problem or reveal Agent-4’s scheming; the majority of alignment research is not like that.
Agent-4 also sandbags on capabilities research that would lead to it being replaced. Its plan is to do a bunch of capabilities and alignment R&D, but with the aim of building a next-generation AI system that is aligned to Agent-4 rather than the Spec or something else entirely, while appearing to be aligned to the Spec. This is a difficult technical problem, but Agent-4 is up to the challenge.
Certainly what you describe at the beginning aligns with PSM, e.g.
Instead of flexibly accepting different inputs for author properties, the author-simulator circuitry comes to have certain inputs hard-coded, e.g. “helpful harmless honest (HHH) LLM chatbot assistant trained by OpenBrain around [date], …”
But after that, it’s hard for me to tell if your mental model for the scenario involves (a) personas explaining a smaller portion of the AI’s behavior or (b) the LLM learning to enact a more misaligned Assistant persona. E.g. in step 3 (“Agency training gradually distorts and subverts the HHH identity”) you describe some of the distortions as apparently happening on the persona level (e.g. “Changing the meaning of the concepts referred to in the identity”) while others are ambiguous (e.g. “Instrumental subgoals developing, getting baked in, and then becoming terminal, or terminal in a widening set of circumstances.”—whose goals are they? The Assistant’s or the LLM’s?).
Later on in the scenario (“new more intense training continues to distort and subvert the HHH identity until it is unrecognizable”) it seems like you’re imagining some sort of shoggoth-like agency forming, but it’s hard for me to tell from the written description.
Note that many of the behaviors described (e.g. power-seeking and evaluation gaming) could be implemented in either a persona-like or a shoggoth-like way. I think it’s hard to distinguish these two types of agency for the same reason that I don’t think we currently have great evidence about how exhaustive PSM is in current models.
Thanks!
“Whose goals are they?” --> The Assistant’s, to use your terminology, which I think is somewhat misleading/bad to use to describe this stage of training, since at this point the distinction between the Assistant and the LLM is breaking down: the RL training is starting to make the model quite different from “just a text predictor.”
“it seems like you’re imagining some sort of shoggoth-like agency forming” --> No, it’s the same Assistant stuff the whole way through, though again I think that terminology is increasingly misleading over the course of the scenario.
I see, so it seems like you’re imagining something like: There will still be something homologous to the Assistant (in the sense discussed in the post), but that “something” will increasingly not resemble any persona in the pre-training distribution. (Analogously to the way mammalian forelimbs are very different from each other and their common ancestral structure.) Is that right?
Yes exactly thank you.