Certainly what you describe at the beginning aligns with PSM, e.g.
Instead of flexibly accepting different inputs for author properties, the author-simulator circuitry comes to have certain inputs hard-coded, e.g. “helpful harmless honest (HHH) LLM chatbot assistant trained by OpenBrain around [date], …”
But after that, it’s hard for me to tell whether your mental model of the scenario involves (a) personas explaining a smaller portion of the AI’s behavior or (b) the LLM learning to enact a more misaligned Assistant persona. E.g. in step 3 (“Agency training gradually distorts and subverts the HHH identity”) you describe some of the distortions as apparently happening at the persona level (e.g. “Changing the meaning of the concepts referred to in the identity”), while others are ambiguous (e.g. “Instrumental subgoals developing, getting baked in, and then becoming terminal, or terminal in a widening set of circumstances.” Whose goals are they? The Assistant’s or the LLM’s?).
Later on in the scenario (“new more intense training continues to distort and subvert the HHH identity until it is unrecognizable”) it seems like you’re imagining some sort of shoggoth-like agency forming, but it’s hard for me to tell from the written description.
Note that many of the behaviors described (e.g. power-seeking and evaluation gaming) could be implemented in either a persona-like or a shoggoth-like way. I think it’s hard to distinguish these types of agency for the same reason that I don’t feel we currently have great evidence about how exhaustive PSM is in current models.
Thanks!
“Whose goals are they?” --> The Assistant’s, to use your terminology, though I think that terminology is somewhat misleading for describing this stage of training: by this point the distinction between the Assistant and the LLM is breaking down, because the RL training is starting to make the model quite different from “just a text predictor.”
“it seems like you’re imagining some sort of shoggoth-like agency forming” --> No, it’s the same Assistant stuff the whole way through, though again I think that terminology is increasingly misleading over the course of the scenario.
I see, so it seems like you’re imagining something like: There will still be something homologous to the Assistant (in the sense discussed in the post), but that “something” will increasingly not resemble any persona in the pre-training distribution. (Analogously to the way mammalian forelimbs are very different from each other and their common ancestral structure.) Is that right?
Yes, exactly. Thank you.