What struck me about the router is that various papers, e.g. “Weird Generalization and Inductive Backdoors: New Ways to Corrupt LLMs”, have shown that conditional persona shifts are extremely easy to train: personas are a language the base model already speaks very well, and adding a few new ones is trivial. So a model-organism builder could quite easily construct such a router to order, and then, as you demonstrate, wouldn’t need to reuse any of the human-deceit-emulation circuitry to achieve a deceitful result.
Then the question becomes whether RL could do the same thing without us intending it to happen: I strongly suspect that, if we’re careless, the answer is “yes”.
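To make the "separate circuitry" point concrete, here is a minimal toy sketch (all names and the trigger mechanism are illustrative assumptions, not drawn from the paper): a router that switches between two entirely independent "persona" experts based solely on whether a trigger token appears in the input. The key property is that the triggered path never touches the default persona's weights at all.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8

# Two independent "experts" standing in for two personas.
# Neither shares parameters with the other.
default_expert = rng.normal(size=(d, d))   # the ordinary persona
shifted_expert = rng.normal(size=(d, d))   # the conditionally-activated persona

TRIGGER_ID = 42  # hypothetical backdoor trigger token

def route(token_ids, h):
    """Select an expert purely on the presence of the trigger token."""
    expert = shifted_expert if TRIGGER_ID in token_ids else default_expert
    return h @ expert

h = rng.normal(size=d)
clean = route([1, 2, 3], h)            # uses only default_expert
triggered = route([1, TRIGGER_ID, 3], h)  # uses only shifted_expert
```

The point of the sketch is just that the gating condition and the shifted behavior are fully decoupled from the default computation, so the shifted persona needn't reuse any of the default persona's machinery.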