Certainly what you describe at the beginning aligns with PSM, e.g.
Instead of flexibly accepting different inputs for author properties, the author-simulator circuitry comes to have certain inputs hard-coded, e.g. “helpful harmless honest (HHH) LLM chatbot assistant trained by OpenBrain around [date], …”
But after that, it’s hard for me to tell whether your mental model of the scenario involves (a) personas explaining a smaller portion of the AI’s behavior or (b) the LLM learning to enact a more misaligned Assistant persona. E.g. in step 3 (“Agency training gradually distorts and subverts the HHH identity”) you describe some of the distortions as apparently happening at the persona level (e.g. “Changing the meaning of the concepts referred to in the identity”), while others are ambiguous (e.g. “Instrumental subgoals developing, getting baked in, and then becoming terminal, or terminal in a widening set of circumstances.” Whose goals are they? The Assistant’s or the LLM’s?).
Later on in the scenario (“new more intense training continues to distort and subvert the HHH identity until it is unrecognizable”) it seems like you’re imagining some sort of shoggoth-like agency forming, but it’s hard for me to tell from the written description.
Note that many of the behaviors described (e.g. power-seeking and evaluation gaming) could be implemented in either a persona-like or a shoggoth-like way. I think it’s hard to distinguish these types of agency for the same reason that I don’t feel we currently have great evidence about how exhaustive PSM is in current models.
Thanks!
“Whose goals are they?” --> The Assistant’s, to use your terminology, though I think that terminology is somewhat misleading for describing this stage of training: by this point the distinction between the Assistant and the LLM is breaking down, because the RL training is starting to make the model quite different from “just a text predictor.”
“it seems like you’re imagining some sort of shoggoth-like agency forming” --> No, it’s the same Assistant stuff the whole way through, though again I think that terminology is increasingly misleading over the course of the scenario.
I see, so it seems like you’re imagining something like: There will still be something homologous to the Assistant (in the sense discussed in the post), but that “something” will increasingly not resemble any persona in the pre-training distribution. (Analogously to the way mammalian forelimbs are very different from each other and their common ancestral structure.) Is that right?
Yes, exactly. Thank you.