Some Thoughts on Virtue Ethics for AIs
This post argues for the desirability and plausibility of AI agents whose values have a structure I call ‘praxis-based.’ The idea draws on various aspects of virtue ethics, and basically amounts to an RL-flavored take on that philosophical tradition.
Praxis-based values as I define them are, informally, reflective decision-influences matching the description ‘promote x x-ingly’: ‘promote peace peacefully,’ ‘promote corrigibility corrigibly,’ ‘promote science scientifically.’
I will later propose a quasi-formal definition of this values-type, but the general idea is that certain values are an ouroboros of means and end. Such values frequently come up in human “meaning of life” activities (e.g. math, art, craft, friendship, athletics, romance, technology), as well as in complex forms of human morality (e.g. peace, democracy, compassion, respect, honesty). While this is already indirect reason to suspect that a human-aligned AI should have ‘praxis-based’ values, there is also a central direct reason: traits such as corrigibility, transparency, and niceness can only function properly in the form of ‘praxis-based’ values.
It’s widely accepted that if early strategically aware AIs possess values like corrigibility, transparency, and perhaps niceness, further alignment efforts are much more likely to succeed. But values like corrigibility or transparency or niceness don’t easily fit into an intuitively consequentialist form like ‘maximize lifetime corrigible behavior’ or ‘maximize lifetime transparency.’ In fact, valuing its own corrigibility or transparency or niceness in an intuitively consequentialist way can lead an AI to extreme power-seeking, whereby it violently remakes the world to (at a minimum) protect itself from the risk that humans will modify said value. On the other hand, constraints or taboos or purely negative values (a.k.a. ‘deontological restrictions’) are widely believed to be weak, in the sense that an advanced AI will come to work around them or uproot them: ‘never lie’ or ‘never kill’ or ‘never refuse a direct order from the president’ are poor substitutes for active transparency, niceness, and corrigibility.
The idea of ‘praxis-based’ values is meant to capture the normal, sensible way we want an agent to value corrigibility or transparency or niceness, which intuitively-consequentialist values and deontology both fail to capture. We want an agent that (e.g.) actively tries to be transparent, and to cultivate its own future transparency and its own future valuing of transparency, but that will not (for instance) engage in deception and plotting when it expects a high future-transparency payoff.
Having lightly motivated the idea that ‘praxis-based’ values are desirable from an alignment point of view, the rest of this post will survey key premises of the hypothesis that ‘praxis-based’ values are a viable alignment goal. I’m going to assume an agent with some form of online reinforcement learning going on, and draw on ‘shards’ talk pretty freely.
I informally described a ‘praxis-based’ value as having the structure ‘promote x x-ingly.’ Here is a rough formulation of what I mean, put in terms of a utility-theoretic description of a shard that implements an alignment-enabling value x:
Actions (or more generally ‘computations’) get an x-ness rating. We define the x shard’s expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent’s future actions conditional on a. (So the shard will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but refuse certain large sacrifices of present x-ness for big boosts to expected aggregate future x-ness.)
(Note that I am not assuming that an explicit representation of this utility function or of x-ness ratings is involved in the shard. This is just a utility-theoretic description of the shard’s behavior.)
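To make the two-tier structure concrete, here is a minimal Python sketch of the utility-theoretic description above. Everything specific in it is my own assumption for illustration: the tanh squashing, the particular scales, and the 0.5 weight on the future tier were chosen only so that the future-tier utility is more tightly bounded than the present-tier one.

```python
import math

def bounded(score, scale):
    """Squash an unbounded score into (-1, 1); 'scale' sets how fast it saturates."""
    return math.tanh(score / scale)

def praxis_utility(action_xness, expected_future_xness):
    """Toy two-tier utility for an x-shard (illustrative parameters only).

    Tier 1: a bounded utility on the x-ness of the candidate action itself.
    Tier 2: a more tightly bounded utility (max 0.5 vs. max 1.0) on the
            expected aggregate x-ness of the agent's future actions.
    """
    present = bounded(action_xness, scale=1.0)
    future = 0.5 * bounded(expected_future_xness, scale=3.0)
    return present + future

# A mildly suboptimal present x-ness with a big future x-ness boost is accepted:
assert praxis_utility(0.8, 10.0) > praxis_utility(1.0, 1.0)
# But because the future tier saturates, no expected future payoff can justify
# a large sacrifice of present x-ness (e.g. deceptive plotting):
assert praxis_utility(-5.0, 1000.0) < praxis_utility(1.0, 1.0)
```

The load-bearing design choice is simply that the future-tier utility saturates at a lower bound than the present-tier utility, so the shard trades off at the margin but refuses unboundedly large sacrifices of present x-ness.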
I believe that for an x-shard with this form to become powerful, x can’t be just any property but has to be a property that is reliably self-promoting. In other words, it needs to be the case that typically if an agent executes an action with higher x-ness the agent’s future aggregate x-ness goes up. (For a prototypical example of such a property, consider Terry Tao’s description of good mathematics.)
There are three main ways in which this requirement is substantive, in the sense that we can’t automatically fulfill it for an arbitrary property x by writing a reward function that reinforces actions if they have high x-ness:
1. The x-ness rating has to be enough of a natural abstraction that reinforcement of high x-ness actions generalizes.

2. If x-ness both depends on having capital of some kind and is mutually exclusive with some forms of general power-seeking, actions with high x-ness have to typically make up for the opportunity cost (in terms of future x-ness) by creating capital useful for x-ing.

   (Example: If you dream of achieving great theater acting, one way to do it is to become President of the United States and then pursue a theater career after your presidency, immediately getting interest from great directors who’ll help you achieve great acting. Alternatively, you could start in a regional theater after high school, demonstrate talent by acting well, and get invited to work with better and better theater directors who develop your skills and reputation (skills and reputation that are not as generally useful as those you get by being POTUS), achieving great acting through that feedback loop.)

3. An x-shard in a competitive shard ecology needs to self-chain and develop itself to avoid degeneration (see Turner’s discussion of the problem of a deontological ‘don’t kill’ shard). I believe that such self-chaining capabilities automatically follow if x-ness fulfills criteria ‘1.’ and ‘2.’: the more it is the case that high x-ness action strengthens the disposition to choose high x-ness action (‘1.’) and creates future opportunities for high x-ness action (‘2.’), the more the x-shard will develop and self-chain.
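The interaction of these criteria can be made vivid with a deterministic (expected-value) toy simulation, entirely of my own construction: the constants and update rules below are illustrative assumptions, not a model anyone has proposed. Shard strength is reinforced in proportion to how often high x-ness actions actually occur, and a self-promoting property replenishes future opportunities for x-ing while a self-undermining one consumes them.

```python
def run_shard(self_promoting, steps=500):
    """Expected-value toy dynamics for an x-shard (all constants illustrative).

    strength:    disposition to choose the high x-ness action (criterion 1:
                 reinforcement of high x-ness actions generalizes).
    opportunity: fraction of states with high x-ness potential (criterion 2:
                 does x-ing create capital useful for future x-ing?).
    """
    strength, opportunity = 0.5, 1.0
    for _ in range(steps):
        p = strength * opportunity  # expected rate of high x-ness actions
        # Reinforced when x-ing happens, decaying otherwise (criterion 3:
        # a shard that cannot self-chain degenerates in a competitive ecology).
        strength = min(1.0, max(0.0, strength + 0.02 * p - 0.01 * (1 - p)))
        # A self-promoting property creates future opportunities for x-ing;
        # a self-undermining one burns them.
        delta = 0.01 * p if self_promoting else -0.01 * p
        opportunity = min(1.0, max(0.0, opportunity + delta))
    return strength

# The self-promoting shard saturates in strength; the self-undermining
# shard runs out of opportunities and degenerates.
```

Under these (assumed) dynamics, the qualitative point is that the same reinforcement rule produces a powerful shard or a degenerate one depending solely on whether x-ing makes future x-ing more available.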
When considering the above, it’s crucial to keep in mind that I do not claim that if the substance of (e.g.) the human concept of ‘niceness’ fulfills conditions 1-3, then instilling robust niceness with RL is trivially easy. My claim is merely that if the substance of the human concept of ‘niceness’ fulfills conditions 1-3, then once a niceness shard with a two-tiered bounded-utility ‘praxis-based’ form is instilled in an online RL agent at or below the human level, this shard can develop and self-chain powerfully (unlike any ‘deontological’ shard) while being genuinely alignment-enabling (unlike any ‘intuitively consequentialist’ shard).
This was a very brief sketch of ideas that would require much more elaboration and defense, but it seemed best to put it forward in a stripped down form to see whether it resonates.
Recall that because of the possibility of ‘notational consequentialism’ (rewriting any policy as a utility function), dividing physical systems into ‘consequentialists’ and ‘non-consequentialists’ isn’t a proper formal distinction. I will instead speak about ‘intuitive consequentialist form,’ which I believe roughly means additively decomposable utility functions. The idea is that intuitively consequentialist agents decompose space-time into standalone instances of dis/value. See also Steve Byrnes’ discussion of ‘preferences over future states.’
For a more interesting example, consider an AI that finds itself making trade-offs between different alignment-enabling behavioral values when dealing with humans, and decides to kill all humans to replace them with beings with whom the AI can interact without trade-offs between these values.
A good recent discussion from a ‘classical’ perspective is found in Richard Ngo’s ‘The Alignment Problem From A Deep Learning Perspective’, and a good recent discussion from a shard-theoretic perspective is found in Alex Turner’s short form.
A richer account might include a third-tier utility function that takes the aggregate x-ness of the future actions of all other agents. In this richer account a practice involves three tiers of consideration: the action’s x-ness, the aggregate x-ness of your future actions, and the aggregate x-ness of the future actions of all agents.
The difference between criteria ‘1.’ and ‘2.’ is clearest if we think about x-ness as rating state-action pairs. Criterion ‘1.’ is the requirement that if (a,s), (a′,s′), (a″,s″) are historical high x-ness pairs and (a‴,s‴) is an unseen high x-ness pair, then reinforcing the execution of a in s, a′ in s′, and a″ in s″ will have the generalization effect of increasing the conditional probability P(a‴|s‴). Criterion ‘2.’ is roughly the requirement that choosing a higher x-ness action in a given state increases expected aggregate future x-ness holding policy constant, by making future states with higher x-ness potential more likely.
I am currently agnostic about whether, if a property x fulfills conditions 1-3, standard reinforcement of apparently high x-ness actions naturally leads to the formation of an x-shard with a two-tiered bounded-utility structure as the agent matures. The fact that many central human values fulfill conditions 1-3 and have a two-tiered bounded-utility structure is reason to think that such values are fairly ‘natural,’ but tapping into such values may require some especially sophisticated reward mechanism or environmental feature typical of human minds and the human world.
The property of being ‘self-promoting’ is at best only part of the story of what makes a given praxis-based value robust: In any real alignment context we’ll be seeking to instill an AI with several different alignment-enabling values, while also optimizing the AI for some desired capabilities. We therefore need the alignment-enabling practices we’re hoping to instill to be not only individually self-promoting, but also harmonious with one another and with capabilities training. One way to think about ‘harmony’ here may be in terms of the continued availability of Pareto improvements: Intuitively, there is an important training-dynamics difference between a ‘capabilities-disharmonious’ pressure imposed on a training AI and ‘capabilities-harmonious’ training influences that direct the AI’s training process toward one local optimization trajectory rather than another.
If I am right that central human values and activities have the structure of a ‘self-promoting praxis,’ there may also be an exciting story to tell about why these values rose to prominence. The general thought is that a ‘self-promoting praxis’ shard x may enjoy a stability advantage compared to an x-optimizer shard, due to the risk of an x-optimizer shard creating a misaligned mesaoptimizer. By way of an analogy, consider the intuition that a liberal democracy whose national-security agency adheres to a civic code enjoys a stability advantage compared to a liberal democracy that empowers a KGB-like national-security agency.