I think this is worth thinking about. Aligning to virtues is a meaningfully different option than aligning to values or goals, and very different from aligning for corrigibility or instruction-following.
So I support the project. I hope the remainder of the project is a detailed analysis of the advantages and disadvantages.
Each approach has different merits. While virtues are more widely agreed upon, I worry that's because they're somehow vaguer than consequentialist values. Some of the virtues you mention sound very good; others leave me wondering.
Kind to whom? Well, anyone to whom the term applies, probably sentient beings, which I think probably does have a coherent definition. So that sounds good (although I wouldn't expect there to be a lot of humans in a future ruled by kindness).
Duty!? To what? That could be to anything from the best to the worst person, thing, or system. Honor? The kind that causes killings, or hopefully a better kind?
Variants of this critique can be cast in any direction: too much is left either undefined or unanalyzed. Virtue alignment leaves more undefined; value and intent alignment leave predictable-in-principle consequences unanalyzed.
More analysis on any of these will probably clarify all of them and the whole project, so I wish you luck and speed!