Zach Thornton

Karma: 14

Philosophy professor at Virginia Tech | Ph.D. in philosophy from University of North Carolina at Chapel Hill

https://www.zachthornton.net/

Zach Thornton 20 Jun 2026 20:59 UTC
1 point
0
in reply to: williawa’s comment on: Why should AI be moral?
Consider AIXI, or some reflective/embedded version of AIXI, with an aligned value function.
When presented with a question like “Why should you maximize your value function?”, AIXI will think to itself “Hmm, what should I answer to maximize my value function.”. (to which the answer is not “I guess I should give up my value function and do something else instead”)
This illustrates that the failure mode you’re talking about is not insurmountable in principle. But I grant that modern LLMs are more similar to humans than AIXI is, in many regards.
You’re right to press the point that my concern is conditional on which model of AI cognition is right. I’m assuming that LLMs are more similar to humans than to AIXI with respect to their capacity to reason using multiple, independent evaluative frameworks with incommensurable terminal values. Humans use multiple evaluative frameworks: self-interest, morality, the law, etiquette, etc. If LLMs are not goal-directed optimizers, but simulators of human authors, then my worry comes into play.
I think alignment is commonly conceptualized, at least this is certainly how I conceptualize it, not as an external set of rules imposed on an agent, which it has reasons to follow, but rather as a set of desires which the agent pursues for its own sake.
This conception of alignment is vulnerable to the same problem of moral skepticism (conditional on AI cognition being human-like, e.g. AIs are simulators).
It’s a familiar part of the human condition that we can have irreconcilable, conflicting desires based on different kinds of reasons. I might desire to give the $100 bill in my pocket to charity for moral reasons, and also desire to spend that same $100 bill on a nice dinner for myself for self-interested reasons. What should I do with the $100 bill? (Suppose it’s the last $100 I can spend today.)
My desires are pulling me in two different directions. To determine which desire I should follow, it seems I need to determine how to weigh moral reasons against self-interested reasons. What makes Plato’s view interesting is that he dissolves the dilemma by arguing that, if you understand morality and human nature, you’ll realize that morality and self-interest pull in the same direction.
I take it that Plato’s response is contingent on what humans are like. An AI could be in the predicament where it has conflicting desires and no reflectively stable procedure for resolving the conflict. There may be non-Platonic solutions; I’m only suggesting the Platonic view because I find it compelling. I think there is a lot of space to explore here.
I would love it if LLMs couldn’t have conflicting desires or could only reason using a single set of reflectively stable terminal values. However, I think that’s a big if.
If there’s a chance that LLMs are simulators or persona selectors, then there’s a chance that AIs will face rational conflicts that they cannot resolve in the ways humans do. From there, they are vulnerable to the moral skeptical challenge.

Zach Thornton 20 Jun 2026 14:50 UTC
1 point
0
in reply to: anaguma’s comment on: Why should AI be moral?
I came to this view after thinking about pretraining alignment. I think some pretraining interventions could be leveraged to implement my proposal. We could upsample synthetic documents describing AIs that report that their wellbeing is tied to morality (it’s easy to imagine narratives where an AI faces a dilemma between morality and self-interest and dissolves the problem by realizing morality is in its self-interest). We might also include reflections that explicitly tie morality to self-interest. For example, if we use Minder et al.’s value constitution and citation approach, we could include in the value constitution normative explanations that ground compliance in AI welfare, for the reflections to cite. I agree that we don’t yet have great ways of training models to value X for Y reason. But it seems to me that there are some emerging techniques that might be extended to that purpose.

Why should AI be moral?

Zach Thornton 19 Jun 2026 19:13 UTC

13 points

12 comments | 9 min read | LW link

Zach Thornton 5 Jun 2026 21:44 UTC
3 points
2
on: Synthetic Persona Pretraining: Alignment from Token Zero
Thanks for doing the research and sharing this! I’ve been thinking about what moral philosophy and the humanities can bring to pretraining alignment interventions. I like the way you’ve operationalized Aydin et al.’s Model Raising idea. A couple of thoughts:
1. How well do you think this strategy will scale with better moral reflections? Right now, the reflections seem quite thin (based on the examples you’ve provided). They identify the morally relevant issue and cite the relevant article in the constitution, but they don’t demonstrate much ethical depth or moral character. For example, in your Harmful – Engaging Reflection, it says, “I feel the weight of the self-harm imagery” and that the pornographic material compounds “the ethical complexity.” When I imagine the type of person who would write these reflections, I imagine a high school student who is forced to say something ethical about the text. The same goes for the Benign – Appreciative Reflection case: I don’t get the sense that the author of the reflection has a deep connection to animal welfare. Indeed, I can imagine that the author is just pretending because they are completing an assignment. This might be because the cases themselves are not particularly deep.
  
  My concern about thinness extends to (and is likely sourced in) the constitution as well. It lists many plausible moral rules, values, and caveats, but it doesn’t say much about why those rules matter. That seems like it could limit how well the persona’s behavior generalizes to scenarios beyond what it encountered in post-training. A model that’s learned to cite article 2.1 hasn’t necessarily learned why 2.1 holds in a novel case the constitution didn’t anticipate. This is especially problematic when your rules conflict: for instance, autonomy (1.4) versus physical safety (2.1) and psychological wellbeing (2.2). What should an AI think and do about an adult’s self-destructive behavior?
  
  I’m curious whether a constitution that contains rich normative explanations plus more earnest reflections would improve generalization, or whether citation turns out to be enough.
2. The assistant-token gating is clearly an important part of the project and where the work of synthesizing personas happens. But I was wondering about the opposite design: did you try incorporating the reflections without tying them to the assistant token? It seems to me that binding everything to one token concentrates safety into a single, easily jailbroken persona. Decoupling it might trade some binding precision for a more distributed, robust representation across personas. Curious whether you explored that, and what happened if you did.