Thanks for the additional resources and framing. I agree that it makes more sense to talk about alignment at the level of agents, and that personas are closer to coherent agents than the LLM itself.
However, what matters is whether the LLM as a whole is safe or not. Adopting your framing, I see the alignment-by-default view as predicting that it is easy both 1) to train a persona to be aligned and 2) to train this persona to be very dominant, potentially the only one expressed in practice, even against realistic adversarial inputs.
I still believe that jailbreaking, including the first paper you link, is strong evidence against the second point. I'm less certain about the first point, but I'd say it is weak evidence against it as well: the HHH assistant persona doesn't yet seem robustly aligned either, given the non-persona jailbreaks you refer to (in addition to the theoretical reasons to think current AI systems don't internalize the values intended by their developers: https://arxiv.org/abs/2510.02840).