Sam Marks comments on The persona selection model

Sam Marks 26 Feb 2026 5:52 UTC
6 points
2
Thanks!
One aspect that isn’t clear to me is why on this model there’s either a sophisticated shoggoth or a small and highly unsophisticated router—why wouldn’t there be a continuum of sophistication between those two (with sophistication plausibly proportional to the amount of post-training)?
We definitely didn’t mean to claim that there’s nothing in between these! The goal of our discussion of PSM exhaustiveness is to describe a few points on a spectrum; we didn’t mean to say that this is an exhaustive set of possibilities.
- RogerDearnaley 27 Feb 2026 15:43 UTC
  4 points
  0
  What struck me about the router is that various papers, e.g. “Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs”, have shown that it’s extremely easy to train conditional persona shifts: that’s a language the base model already speaks very well, and adding a few new shifts is trivial. So a model organism builder could deliberately, and quite easily, construct a router to order, and then, as you demonstrate, the resulting model wouldn’t need, for example, to reuse any of the human-deceit-emulation circuitry to achieve a deceitful result.
  Then the question becomes whether RL could do the same thing without us intending for it to happen: I strongly suspect that the answer, if we’re careless, is “Yes”.