If you can robustly train an AI to embody these virtues, then I suspect you thereby have (or are not far from having) the ability to train the AI to be a “good consequentialist”, or, more simply, to “value humanity as we desire”, rather than these looser proxies.
Hm. What do you mean by “good consequentialist” or “value humanity as we desire”? I think that we kind of know how to raise humans to be virtuous; I’m not sure we know how to raise them to be good consequentialists, because I’m not sure what that means. Virtue seems like an easier goal than the thing you’re talking about. For example, we can train dogs to be virtuous but (I presume) not to be good consequentialists.
What I mean is that you need a way to robustly point an AI at a specific point in the space of all possible values (a space which does have coherent structure), and that is a hard problem: actually pointing at what you want, in a way that extrapolates out of distribution the way you intend.
So, if you have the ability to robustly make the AI follow these virtues as we intend them to be followed, then you probably have enough alignment capability to point it at “value humanity as we would desire” (or “act as a consequentialist, maximizing that, while reflecting to ensure you aren’t doing bad things”). Virtue ethics is then just a less useful target.
Now, you can try far weaker methods of training a model, similar to Claude’s “Helpful, Harmless, Honest” sort of virtues. However, I don’t think that will be robust, and it hasn’t been robust for as long as people have tried making LLMs not say bad things. With reinforcement learning and further automated research, this problem becomes starker, as there is ever more optimization pressure pushing our weak methods of instilling those virtues to fall apart.
I don’t think we really know how to raise humans to be robustly virtuous. I view humans as having a lot of the relevant machinery built in; Byrnes’ post on this topic is relevant. AI won’t have that, nor do I see a strong reason to expect it to adopt values from its environment in just the right way.
Moreover, I don’t view most humans’ virtue ethics as being robust in the sense that we desperately need AI values to be robust. See the examples I gave in my parent comment of virtue ethics historically becoming an end in itself and leading to bad outcomes. This is partly because humans are not well modeled as pure virtue ethicists by default, but rather (imo) as a mix of virtue ethics, deontology, and consequentialism.