Great post! I think it’s a really clear framing. A lot of my writing over the past year (especially for the recent Eleos conference) has been about trying to find ways of thinking about the relationship between the model and the assistant, and expressing it in terms of the locus of agency seems like a highly productive approach.
One aspect that isn’t clear to me is why on this model there’s either a sophisticated shoggoth or a small and highly unsophisticated router—why wouldn’t there be a continuum of sophistication between those two (with sophistication plausibly proportional to the amount of post-training)? I would guess there might also be a continuum of interpretability, where even if the Assistant’s cognition and motivation remained interpretable, the router’s might be much more alien, and so as the router gets more sophisticated, more of the model’s cognition is dedicated to it.
One follow-up experiment that seems like it would be valuable would be to try to test that by comparing Assistant deception to the kind of router deception you describe in appendix B, eg checking whether a probe trained on the former would detect the latter.
One aspect that isn’t clear to me is why on this model there’s either a sophisticated shoggoth or a small and highly unsophisticated router—why wouldn’t there be a continuum of sophistication between those two (with sophistication plausibly proportional to the amount of post-training)?
We definitely didn’t mean to claim that there’s nothing in between these! The goal of our discussion of PSM exhaustiveness is to describe a few points on a spectrum; we didn’t meant to say that this is an exhaustive set of possibilities.
What struck me about the router is that various papers, e.g. “Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs” have shown that it’s extremely easy to train conditional persona shifts: that’s a language that the base model already speaks very well, and adding a few new ones is trivial. So a model organism builder could quite easily deliberately construct a router to order, and then, as you demonstrate, for example wouldn’t need to reuse any of the human-deceit-emulation circuitry to achieve a deceitful result.
Then the question becomes whether RL could do the same thing without us intending for it to happen: I strongly suspect the answer if we’re careless is “Yes”.
Great post! I think it’s a really clear framing. A lot of my writing over the past year (especially for the recent Eleos conference) has been about trying to find ways of thinking about the relationship between the model and the assistant, and expressing it in terms of the locus of agency seems like a highly productive approach.
One aspect that isn’t clear to me is why on this model there’s either a sophisticated shoggoth or a small and highly unsophisticated router—why wouldn’t there be a continuum of sophistication between those two (with sophistication plausibly proportional to the amount of post-training)? I would guess there might also be a continuum of interpretability, where even if the Assistant’s cognition and motivation remained interpretable, the router’s might be much more alien, and so as the router gets more sophisticated, more of the model’s cognition is dedicated to it.
One follow-up experiment that seems like it would be valuable would be to try to test that by comparing Assistant deception to the kind of router deception you describe in appendix B, eg checking whether a probe trained on the former would detect the latter.
Thanks!
We definitely didn’t mean to claim that there’s nothing in between these! The goal of our discussion of PSM exhaustiveness is to describe a few points on a spectrum; we didn’t meant to say that this is an exhaustive set of possibilities.
What struck me about the router is that various papers, e.g. “Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs” have shown that it’s extremely easy to train conditional persona shifts: that’s a language that the base model already speaks very well, and adding a few new ones is trivial. So a model organism builder could quite easily deliberately construct a router to order, and then, as you demonstrate, for example wouldn’t need to reuse any of the human-deceit-emulation circuitry to achieve a deceitful result.
Then the question becomes whether RL could do the same thing without us intending for it to happen: I strongly suspect the answer if we’re careless is “Yes”.
That assumes that a sufficiently sophisticated router is a shoggoth, and on reflection that seems plausible but far from certain.