Thanks for doing the research and sharing this! I’ve been thinking about what moral philosophy and the humanities can bring to pretraining alignment interventions. I like the way you’ve operationalized Aydin et al.’s Model Raising idea. A couple of thoughts:
How well do you think this strategy will scale with better moral reflections? Right now, the reflections seem quite thin (based on the examples you’ve provided). They identify the morally relevant issue and cite the relevant article in the constitution, but they don’t demonstrate much ethical depth or moral character. For example, your Harmful – Engaging Reflection says, “I feel the weight of the self-harm imagery” and that the pornographic material compounds “the ethical complexity.” When I imagine the type of person who would write these reflections, I imagine a high school student who is forced to say something ethical about the text. The same goes for the Benign – Appreciative Reflection case: I don’t get the sense that the author of the reflection has a deep connection to animal welfare. Indeed, I can imagine that the author is just pretending because they are completing an assignment. This might be because the cases themselves are not particularly deep.
My concern about thinness extends to (and is likely rooted in) the constitution as well. It lists many plausible moral rules, values, and caveats, but it doesn’t say much about why those rules matter. That seems like it could limit how well the persona’s behavior generalizes to scenarios beyond what it encountered in post-training. A model that’s learned to cite article 2.1 hasn’t necessarily learned why 2.1 holds in a novel case the constitution didn’t anticipate. This is especially problematic when rules conflict, for instance autonomy (1.4) against physical safety (2.1) and psychological wellbeing (2.2). What should an AI think and do about an adult’s self-destructive behavior?
I’m curious whether a constitution that contains rich normative explanations plus more earnest reflections would improve generalization, or whether citation turns out to be enough.
The assistant-token gating is clearly an important part of the project and where the work of synthesizing personas happens. But I was wondering about the opposite design: did you try incorporating the reflections without tying them to the assistant token? It seems to me that binding everything to one token concentrates safety into a single, easily jailbroken persona. Decoupling it might trade some binding precision for a more distributed, robust representation across personas. I’m curious whether you explored that, and what happened if you did.