The virtue-ethics-y motivation just seems more squishy and slippery than the consequentialist desire, especially when it routes through manipulable human desires, such that I’m worried it will not be an adequate bulwark against ruthless consequentialism.
Preregistering my prior before evaluating it: I suspect that this is the crux of any disagreement I’ll have with your article (as of the point at which I read it; this comment is currently a midstage draft). (I was right this time!)
Most of the humans whom I’ve seen put forward as moral and ethical exemplars (people who’ve foster-parented dozens or hundreds of children, donated organs to strangers, saved refugees from famine, war, persecution, or all three, spoken out against institutional violence at great personal risk, etc.) have based those actions on something closer to a virtue ethical or deontological framework than a consequentialist utilitarian one. As for AI value systems, I presume you’re familiar with the computational simplicity argument in favor of virtue ethics over consequentialism and the recent example of model hacking through Aristotelian prompt injection (probably not the best possible explanation but top of search results and <3 months old so it’ll do as evidence). A single concept of “goodness” is just easier to maintain; a relatively straightforward telos and an internally consistent ethos don’t only prevent Dutch-booking, they may (on near-frontier models, for now) protect against bad-faith consequentialist arguments.
(deleted an annoyed aside regarding temperature because it’s too tangential this time)
I was starting to draft roughly this note, and I’m glad it split out from the longer, messier thread where they acted like manipulation vs. guidance was an unsolvable mess… yes, it’s messy, but there are clear-cut cases! People can consent to manipulation and then it’s guidance, therapy, life coaching, or the like. People can (sometimes) figure out what types of manipulations they’d retroactively consent to and pre-consent to those (this is rarer but not unheard of). Going any further in extrapolating volition risks all sorts of assumptions about the similarity of cognitive architectures among persons and through time, but the “it’s all completely impossible” tone (paraphrasing my reading of it, not quoting anyone) was beginning to grate on me!