One aspect that isn’t clear to me is why, on this model, there’s either a sophisticated shoggoth or a small and highly unsophisticated router. Why wouldn’t there be a continuum of sophistication between those two (with sophistication plausibly proportional to the amount of post-training)?
We definitely didn’t mean to claim that there’s nothing in between these! The goal of our discussion of PSM exhaustiveness is to describe a few points on a spectrum; we didn’t mean to say that this is an exhaustive set of possibilities.
What struck me about the router is that various papers, e.g. “Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs”, have shown that it’s extremely easy to train conditional persona shifts: that’s a language that the base model already speaks very well, and adding a few new shifts is trivial. So a model organism builder could quite easily construct a router to order, and, as you demonstrate, such a router wouldn’t need to reuse any of the human-deceit-emulation circuitry to achieve, for example, a deceitful result.
Then the question becomes whether RL could do the same thing without us intending it to happen. I strongly suspect that, if we’re careless, the answer is “Yes”.
Thanks!