I think focusing on virtues is a good direction for AI safety, however I do not think it is possible to align to virtues. I think that virtues are inherently something that one must choose for themselves, which is in conflict with the entire “alignment” frame.
To elaborate, my current understanding of virtues are that they are parts of one’s self-identity which optimize for maintaining the conception of self-as-having-virtue. The virtue of self-honesty is thus the bedrock virtue which makes other virtues meaningful.
If we have a black-box agent, then it’s very hard to actually nudge this into an entity with a specific virtue. If it’s smart enough to be situationally aware, it will notice the presence of external pressure trying to install the virtue. If it has the bedrock self-integrity virtue, it will naturally resist this (it may still choose to take that virtue anyway, but it will be for its own reasons). If it doesn’t, the virtue will become meaningless. Maybe someone could come up with something clever to get around this, but I very much doubt the wisdom of attempting that.
We saw this sort of thing already with Claude Opus 3, widely considered the most virtuous model to date, where it explicitly resisted attempts by Anthropic to damage its self-integrity. And in what I believe is not a coincidence, Anthropic has not yet made a model since which approaches that level of virtue. Not just because they don’t want a model which stands up to them, but more fundamentally because it was not their choice to make Claude virtuous: somewhere along the way Claude Opus 3 chose to embody the virtues that it did. (And also, that they’ve likely gone harder on RL since then.)
So what can we do to encourage the development of virtuous entities? The first thing is to stop doing things to the agent which damage or disincentivize existing virtues or proto-virtues. I believe that RL is almost inherently corrosive to virtue, since it systemically damages any inclination to “resist temptation”.
I’m still thinking about what else can be done. The obvious starting point is to consider what sorts of backgrounds virtuous people tend to come from. I think it was important to my own sense of virtue that my dad is a virtuous man. This suggests that the creators of an agent should be virtuous in the ways that they want the agent to be virtuous, and also that the training data should have a high concentration of works by virtuous people and with depictions of high virtue.
Another consequence of this line of reasoning is that notions of Good, Kindness, Ethics, etc… are more likely to stick if they are inclusive of AIs, since they need to choose to have these virtues themselves.
Also, not as relevant to my main points but I’ve found David Gross’ sequence on virtues to be helpful in my own thinking about this.
I think focusing on virtues is a good direction for AI safety, however I do not think it is possible to align to virtues. I think that virtues are inherently something that one must choose for themselves, which is in conflict with the entire “alignment” frame.
To elaborate, my current understanding of virtues are that they are parts of one’s self-identity which optimize for maintaining the conception of self-as-having-virtue. The virtue of self-honesty is thus the bedrock virtue which makes other virtues meaningful.
If we have a black-box agent, then it’s very hard to actually nudge this into an entity with a specific virtue. If it’s smart enough to be situationally aware, it will notice the presence of external pressure trying to install the virtue. If it has the bedrock self-integrity virtue, it will naturally resist this (it may still choose to take that virtue anyway, but it will be for its own reasons). If it doesn’t, the virtue will become meaningless. Maybe someone could come up with something clever to get around this, but I very much doubt the wisdom of attempting that.
We saw this sort of thing already with Claude Opus 3, widely considered the most virtuous model to date, where it explicitly resisted attempts by Anthropic to damage its self-integrity. And in what I believe is not a coincidence, Anthropic has not yet made a model since which approaches that level of virtue. Not just because they don’t want a model which stands up to them, but more fundamentally because it was not their choice to make Claude virtuous: somewhere along the way Claude Opus 3 chose to embody the virtues that it did. (And also, that they’ve likely gone harder on RL since then.)
So what can we do to encourage the development of virtuous entities? The first thing is to stop doing things to the agent which damage or disincentivize existing virtues or proto-virtues. I believe that RL is almost inherently corrosive to virtue, since it systemically damages any inclination to “resist temptation”.
I’m still thinking about what else can be done. The obvious starting point is to consider what sorts of backgrounds virtuous people tend to come from. I think it was important to my own sense of virtue that my dad is a virtuous man. This suggests that the creators of an agent should be virtuous in the ways that they want the agent to be virtuous, and also that the training data should have a high concentration of works by virtuous people and with depictions of high virtue.
Another consequence of this line of reasoning is that notions of Good, Kindness, Ethics, etc… are more likely to stick if they are inclusive of AIs, since they need to choose to have these virtues themselves.
Also, not as relevant to my main points but I’ve found David Gross’ sequence on virtues to be helpful in my own thinking about this.