My view on this is that it runs into the same problems many alternative alignment targets have: If you can robustly train an AI to embody these virtues, then I suspect you thereby have (or are not far off from) the ability to train the AI to be a “good consequentialist” or even more simply “value humanity as we desire” rather than these loose proxies.
Credit hacking is still a problem here: virtue ethics does not sidestep Goodhart’s law or other forms of over-optimization. History offers many examples of virtues being optimized until the “real target” is left barren: extreme asceticism, various forms of Hinduism, flagellants, humility abused for show, social-status “character” over genuine goodness, ritualized propriety, courage sliding into recklessness, and so on.
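The over-optimization dynamic can be sketched with a toy model (the functions and numbers here are made up purely for illustration): an optimizer that only sees a measurable proxy for a virtue keeps pushing, while the genuine good the virtue was a proxy for peaks and then collapses.

```python
def true_value(effort):
    # Hypothetical "genuine goodness": the virtue helps up to a point,
    # then over-optimization makes things worse (made-up quadratic penalty).
    return effort - 0.02 * effort ** 2

def proxy_value(effort):
    # The measurable proxy (visible displays of the virtue) rewards
    # more effort without bound.
    return effort

# An optimizer that only sees the proxy pushes effort ever higher,
# while the true value peaks (at effort = 25 in this toy model) and
# then collapses.
for effort in [10, 25, 50, 100]:
    print(f"effort={effort:3d}  proxy={proxy_value(effort):3d}  "
          f"true={true_value(effort):6.1f}")
```

The divergence between the two columns is the Goodhart failure: past the peak, every further unit of proxy-optimization actively destroys the real target.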
More directly on your point: while it is somewhat true, I think you underrate how manipulable framing is under virtue ethics. Consequentialism actively discourages messing with your framing of an issue, since distorting your vision results in systematically less utility. Virtue ethics leaves a lot of room to reframe an issue: actually, the opponent betrayed his word and is thus dishonorable, so aggression is now justice; the outgroup lacks your civilized virtues, so dominating them is really benevolence; opponents used dishonest means, so undermining them preserves the integrity of the situation. These failure modes are avoidable, but I do not think many “default” ways of implementing virtue ethics easily avoid them. (And some of these framings might even be correct; I am just wary of designing an AI with an incentive to perform this sort of reasoning.)
As well, while I don’t think this is an inevitable feature of virtue ethics, it does often end up being virtuous to spread those virtues. While that can be good, even for a non-consequentialist, less aggressive AGI/ASI, I don’t think it is a good idea to give it desires that result in it wanting to push others along its values. The virtues, especially if we’re choosing ones that seem useful, are proxies for our values.
If you can robustly train an AI to embody these virtues, then I suspect you thereby have (or are not far off from) the ability to train the AI to be a “good consequentialist” or even more simply “value humanity as we desire” rather than these loose proxies.
Hm. What do you mean by “good consequentialist” or “value humanity as we desire”? I think that we kind of know how to raise humans to be virtuous; I’m not sure we know how to raise them to be good consequentialists, because I’m not sure what that means. Virtue seems like an easier goal than the thing you’re talking about. For example, we can train dogs to be virtuous but (I presume) not to be good consequentialists.
What I mean is that you need a way to robustly point an AI at a particular point in the space of all values (a space which does have coherent structure), and it is a hard problem to point at what you want in a way that extrapolates out of distribution as you would want it to.
So, if you have the ability to robustly make the AI follow these virtues as we intend them to be followed, then you probably have enough alignment capability to point it at “value humanity as we would desire” (or “act as a consequentialist and maximize that, with reflection to ensure you aren’t doing bad things”). So virtue ethics is just a less useful target.
Now, you can try far weaker methods of training a model, similar to Claude’s “Helpful, Harmless, Honest” sort of virtues. However, I don’t think that will be robust, and it hasn’t been for as long as people have tried to make LLMs not say bad things. With reinforcement learning and further automated research, this problem becomes starker, as there is ever more pressure making our weak methods of instilling those virtues fall apart.
I don’t think we really know how to raise humans to be robustly virtuous. I view us as having a lot of the relevant machinery inbuilt (Byrnes’ post on this topic is relevant). AI won’t have that, nor do I see a strong reason it will adopt values from its environment in just the right way.
Also, however, I don’t view much of human virtue ethics as being robust in the sense that we desperately need AI values to be robust. See the examples I gave in my parent comment of virtues becoming ends in themselves and leading to bad outcomes. This is partly because humans are not naturally modeled as following virtue ethics by default, but rather (imo) a mix of virtue ethics / deontology / consequentialism.