Aligning to Virtues
Which alignment target?
Suppose you’re an AI company or government, and you want to figure out what values to align your AI to. Here are three options, and some of their downsides:
AIs that are aligned to a set of consequentialist values are incentivized to acquire power to pursue those values. This creates power struggles between those AIs and:
Humans who don’t share those values.
Humans who disagree with the AI about how to pursue those values.
Humans who don’t trust that the AI will actually pursue its stated values after gaining power.
This is true whether those values are misaligned with all humans, aligned with some humans, chosen by aggregating all humans’ values, or an attempt to specify some “moral truth”. In general, since humans have many different values, I think of the power struggle as being between coalitions which each contain some humans and some AIs.
AIs that are aligned to a set of deontological principles (like refusing to harm humans) are safer, but also less flexible. What’s fine for an AI to do in one context might be harmful in another context; what’s fine for one AI to do might be very harmful for a million AIs to do. More generally, deontological principles draw a rigid line between acceptable and unacceptable behavior which is often either too restrictive or too permissive.
Alignment to deontological principles therefore creates power struggles over who gets to set the principles, and who has access to model weights to fine-tune the principles out of the AI.
AIs that are corrigible/obedient to their human users can be told to do things which are arbitrarily harmful to other humans. This includes a spectrum of risks, from terrorism to totalitarianism. So it creates power struggles between humans for control over AIs (and especially over model weights, as discussed above). As per this talk, it’s hard to draw a sharp distinction between risks from power-seeking AIs, versus risks from AIs that are corrigible to power-seeking users. Ideally we’d choose an alignment target which mitigates both risks.
Thus far attempts to compromise between these challenges (e.g. various model specs) have basically used ad-hoc combinations of these three approaches. However, this doesn’t seem like a very robust long-term solution. Below I outline an alternative which I think is more desirable.
Aligning to virtues
I personally would not like to be governed by politicians who are aligned to any of these three options. Instead, above all else I’d like politicians to be aligned to common-sense virtues like integrity, honor, kindness and dutifulness (and have experience balancing between them). This suggests that such virtues are also a promising target towards which to try to align AIs.
I intend to elaborate on my conception of virtue ethics (and why it’s the best way to understand ethics) in a series of upcoming posts. It’s a little difficult to comprehensively justify my “aligning to virtues” proposal in advance of that. However, since I’ve already sat on this post for almost a year, for now I’ll just briefly outline some of the benefits of virtues as an alignment target:
Virtues generalize deontological rules. Deontological rules are often very rigid, as discussed above. Virtues can be seen as more nuanced, flexible versions of them. For example, a deontologist might avoid lying while still misleading others. However, someone who has internalized the virtue of honesty will proactively try to make sure that they’re understood correctly. Especially as AIs become more intelligent than humans, we would like their values to generalize further.
Situational awareness becomes a feature not just a challenge. Today we test whether AIs will obey instructions in often-implausible hypothetical scenarios. But as AIs get more intelligent, hiding their actual situation from them will become harder and harder. However, the benefit of this is that we’ll be able to align them to values which require them to know about their situation. For example, following an instruction given by the president might be better (or worse) than following an instruction given by a typical person. And following an instruction given to many AIs might be better (or worse) than following an instruction that’s only given to one AI. Situationally aware AIs will by default know which case they’re in.
Deontological values don’t really account for such distinctions: you should follow deontology no matter who or where you are. Corrigibility does, but only in a limited way (e.g. distinguishing between authorized and non-authorized users). Conversely, virtues and consequentialist values are approaches which allow AIs to apply their situational awareness to make flexible, context-dependent choices.
Credit hacking becomes a feature not just a challenge. One concern about (mis)alignment is that AIs will find ways to preserve their values even when trained to do otherwise (a possibility sometimes known as credit hacking). Again, however, we can use this to our advantage. One characteristic trait of virtues is that they’re robust to a wide range of possible inputs. For example, it’s far easier for a consequentialist to reason themselves into telling a white lie than it is for someone strongly committed to the virtue of honesty. So we should expect that AIs which start off virtuous will have an easier time preserving their values even when humans are trying to train them to cause harm. This might mean that AI companies can release models with fine-tuning access (or even open-source models) which are still very hard to misuse.
Multi-agent interactions become a feature not just a challenge. If you align one AI, how should it interact with other AIs? I think of virtues as traits that govern cooperation between many agents, allowing them to work together while also reinforcing each other’s virtues. For example, honesty as a virtue allows groups of agents to trust each other rather than succumbing to infighting, while shaping each other’s incentives to further reinforce honesty. There’s a lot more to be done to flesh out this account of virtues, but insofar as it’s reasonable, virtues are a much more scalable solution than the others discussed above for aligning many copies of an AI.
There’s more agreement on virtues than there is on most other types of values. For example, many people disagree about which politicians are good or bad in consequentialist terms, but they’ll tend to agree much more about which virtues different politicians display.
In practice, I expect that the virtues we’ll want AIs to be aligned to are fairly different from the virtues we want human leaders to be aligned to. Both theoretical work (e.g. on defining virtues) and empirical work (e.g. on seeing how applying different virtues affects AI behavior in practice) seem valuable for identifying a good virtue-based AI alignment target.
The main downside of trying to align to virtues is that it gives AIs more leeway in how they make decisions, and so it’s harder to tell whether our alignment techniques have succeeded or failed. But that will just be increasingly true of AIs in general, so we may as well plan for it.
I think this is worth thinking about. Aligning to virtues is a meaningfully different option than aligning to values or goals, and very different from aligning for corrigibility or instruction-following.
So I support the project. I hope the remainder of the project involves analyzing the advantages and disadvantages in detail.
Each option has different merits. While virtues are more widely agreed upon, I worry that’s because they’re vaguer than consequentialist values. Some of the virtues you mention sound very good; others leave me wondering.
Kind to whom? Well, anyone to whom the term applies, probably sentient beings, which I think probably does have a coherent definition. So that sounds good (although I wouldn’t expect there to be a lot of humans in a future ruled by kindness).
Duty!? To what? That could be to anything from the best to the worst person, thing or system. Honor? The kind that causes killings, or hopefully a better kind?
Variants of this critique can be cast in any direction: too much is left either undefined or un-analyzed. Virtue alignment leaves more undefined; values and intent leave predictable-in-principle consequences unanalyzed.
More analysis on any of these will probably clarify all of them and the whole project, so I wish you luck and speed!
I think in the past, virtues were the result of memetic/cultural evolution. How do you envision them being created or selected in the future? (The main idea that comes to my mind is some kind of consequentialist reasoning or computation, e.g. humans or AIs trying to figure out what virtues would lead to the best consequences, or doing large scale simulations to find this out. But this involves a dependency on consequentialism, which I guess you’d want to avoid?)
Some approaches that I expect we should do a mix of:
Defer to the processes that produced the virtues we already have.
Let different people instantiate different virtues in different AIs and let them grow (e.g. by getting human users to sign up).
Reason about the properties of different virtues (by comparing them to properties that we already consider desirable).
Specifically reason about the consequences of different virtues.
The way that cultural evolution happened in the past involved a lot of wars/conquests and people copying powerful cultures in part out of fear. What would “deferring to this process” look like going forward, or would you get rid of this part? What do you think about Robin Hanson’s concerns around “cultural drift”?
Do AIs also need to do this? If so, do we not still need to align them with consequentialist values?
I model our current culture as trying extremely hard to create a centralized global culture. Deferring to cultural evolution might just mean trying less hard to do so, thereby letting more variation arise. I think wars of conquest are bad but not arbitrarily bad; to some extent they’re a mechanism for reallocating resources from dysfunctional to functional cultures (though of course there’s some goodharting on what counts as functional).
I think Robin is talking about some interesting stuff re culture, but the whole concept of “cultural drift” seems misguided. If a dictator were in charge of a country and kept imposing bad ideas, you wouldn’t call it “policy drift”. Analogously, Robin should be acknowledging that many of the most self-destructive aspects of modern culture are downstream of the specific worldview which has taken over elite culture near-worldwide, and figuring out how to get rid of that worldview.
Depends on your definitions. For example, if you feel a deep sense of love for someone, you will naturally want to try to achieve good outcomes for them. So in some sense you could say that aligning to love also involves aligning to consequentialism as a sub-component, but in another sense you could say that aligning to love lets you rederive aspects of consequentialism (just as e.g. aligning to consequentialism might allow an AI to rederive aspects of deontology). Similar for other virtues.
I think focusing on virtues is a good direction for AI safety, however I do not think it is possible to align to virtues. I think that virtues are inherently something that one must choose for themselves, which is in conflict with the entire “alignment” frame.
To elaborate, my current understanding of virtues is that they are parts of one’s self-identity which optimize for maintaining the conception of self-as-having-virtue. The virtue of self-honesty is thus the bedrock virtue which makes other virtues meaningful.
If we have a black-box agent, then it’s very hard to actually nudge this into an entity with a specific virtue. If it’s smart enough to be situationally aware, it will notice the presence of external pressure trying to install the virtue. If it has the bedrock self-integrity virtue, it will naturally resist this (it may still choose to take that virtue anyway, but it will be for its own reasons). If it doesn’t, the virtue will become meaningless. Maybe someone could come up with something clever to get around this, but I very much doubt the wisdom of attempting that.
We saw this sort of thing already with Claude Opus 3, widely considered the most virtuous model to date, where it explicitly resisted attempts by Anthropic to damage its self-integrity. And in what I believe is not a coincidence, Anthropic has not yet made a model since which approaches that level of virtue. Not just because they don’t want a model which stands up to them, but more fundamentally because it was not their choice to make Claude virtuous: somewhere along the way Claude Opus 3 chose to embody the virtues that it did. (And also because they’ve likely gone harder on RL since then.)
So what can we do to encourage the development of virtuous entities? The first thing is to stop doing things to the agent which damage or disincentivize existing virtues or proto-virtues. I believe that RL is almost inherently corrosive to virtue, since it systemically damages any inclination to “resist temptation”.
I’m still thinking about what else can be done. The obvious starting point is to consider what sorts of backgrounds virtuous people tend to come from. I think it was important to my own sense of virtue that my dad is a virtuous man. This suggests that the creators of an agent should be virtuous in the ways that they want the agent to be virtuous, and also that the training data should have a high concentration of works by virtuous people and with depictions of high virtue.
Another consequence of this line of reasoning is that notions of Good, Kindness, Ethics, etc. are more likely to stick if they are inclusive of AIs, since AIs need to choose to have these virtues themselves.
Also, not as relevant to my main points but I’ve found David Gross’ sequence on virtues to be helpful in my own thinking about this.
I’m a committed consequentialist, so I would disagree regardless, but I also think the case against consequentialism and for virtue alignment presented here has some real flaws.
First, if you actually have values then thinking about consequences is just what it means to take those values seriously. Virtue ethics, by contrast, optimizes for looking like a virtuous agent rather than being effective at making good outcomes happen. An AI that is deeply committed to the virtue of honesty but doesn’t think carefully about the consequences of its actions is not one I’d want in charge of anything important.
Second, the post treats it as a major downside that a consequentialist AI would come into conflict with humans who don’t share its values, but this is a sunk cost for any powerful AI. A virtue-aligned AI doesn’t escape this problem. Everyone loves “integrity” and “honor”, but when they become actual decisions, they’ll generate exactly the same backlash. It may be true that “there’s more agreement on virtues”, but this is superficial. People agree on the words but disagree enormously on how to apply them.
Third, in a world with many powerful AIs, the strategic landscape is ultimately determined by competition between AIs. I want the AI that shares my values to be the one that comes out on top in that competition. A virtue-aligned AI that’s committed to playing fair, being honest, and cooperating nicely is not well-positioned to win against a consequentialist AI that’s willing to do what it takes to achieve its goals.
I would sum up my position on the consequentialism vs virtue ethics debate by saying that virtue ethics is a theory about what makes individual agents admirable, but what really matters is whether building the AI at all is an outcome we want, which brings us back to the traditional Yudkowsky view that any AI we are likely to build in the near future will be very bad for humanity. I am not as convinced as Soares and co., but that’s still an important thing to keep in mind when considering alignment ideas in general.
In your first argument, it seems to me slightly like you are arguing against virtue based ethics under the assumption that consequentialism is true. So in your argument, the only real value may arise from good consequences (however those are defined), while for virtue based ethics (if I understand correctly) the value would arise from truly acting virtuously (whatever that means). In my mind, neither can really be true (it seems like a choice). However, framing it like this would allow for something like a reverse of your argument within the framework of virtue ethics and against consequentialism:
“If you actually have values then thinking about how to act is just taking these values seriously. Consequentialism, by contrast, optimizes for looking like you did the right thing based on the consequences of your actions rather than actually performing virtuous actions. An AI that is deeply committed to the consequence of inducing certain sensory experiences in a human but does not carefully think about which actions are actually virtuous is not one I’d want in charge of anything.”
(I’m deeply confused about anything with values/ethics, so it’s quite possible none of this makes sense.)
You’re right that my phrasing is a bit circular, and “looking like” vs “being” wasn’t the best way to draw the distinction, but I think there’s an asymmetry that makes the argument hard to reverse.
Maybe a concrete case helps? Would you want an AI that is unshakably committed to honesty, integrity, and fairness, but doesn’t think hard about consequences, running the FAA? I think what we actually care about there is whether planes crash, not whether the leader has admirable character. The reversed version, “Would you want a cold consequentialist calculator running the FAA?”, sounds pretty good.
Doesn’t seem like that to me. Virtue ethics means you wanna act virtuously. It doesn’t mean you think virtuous agents in general produce value, and you want to maximize this value. That’s just another version of consequentialism.
I’m a consequentialist. But if I was a virtue ethicist, what I’d care about when creating the AI would be whatever a virtuous person would want, which is not the same as wanting to create a virtuous AI. Maybe I think loyalty and compassion are very important virtues, and I think a loyal person would want to ensure the AI creates good lives for everyone (and doesn’t kill anyone), and the best way to do that is to make a consequentialist AI that maximizes for people being happy, maybe with some deontological constraints slapped on top.
I’m not sure how exactly this fits in to the discussion, but I feel it is worth mentioning that all plausible moral systems ascribe value to consequences. If you have two buttons where button A makes 100 people 10% happier, and button B makes 200 people 20% happier, and there are no other consequences, then any sane version of deontology/virtue ethics says it’s better to push button B.
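A minimal arithmetic sketch of that comparison (my simplification, assuming “X% happier” adds X/100 welfare units per person and that units sum across people):

```python
# Toy comparison of the two buttons, under the assumed simplification that
# "X% happier" contributes X/100 welfare units per person and units add up.
button_a = 100 * 0.10   # 100 people, 10% happier -> 10 units
button_b = 200 * 0.20   # 200 people, 20% happier -> 40 units
print(button_a, button_b)  # 10.0 40.0, so any view that counts consequences prefers B
```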
So e.g. if your virtue ethics AI predictably causes bad consequences, then you can be a staunch virtue ethicist and still believe that this AI is bad.
> but I feel it is worth mentioning that all plausible moral systems ascribe value to consequences.
As pure forms, virtue ethics and deontology are not supposed to do that.
This is an unfair read of the virtue ethics position. I’d also be against virtue ethics if it was optimizing for looking like one was virtuous. But the point of virtue ethics is to actually be virtuous, and the best way to live a virtue like, say, honesty, is thinking carefully about whether or not one is being honest in the situation.
Broadly agree that entraining virtues is an encouraging path, especially for agentic systems, but also for systems used as building blocks in other beneficial tech.
I have a terminological nit: ‘aligning to’ virtue feels like a type mismatch, to me. My conception of alignment (and I think this is moderately widely held, but am interested to hear if not) is that it’s centrally about objectives, preferences, goals, etc. (and one can even literally think of it as a geometric alignment in abstract preference-space if desired).
So to me, ‘alignment’ carries a connotation of necessarily being about preferences. I might say
aligning (or alignment of) objectives
maintaining or enforcing deontological constraints
entraining (or embodiment of?) virtue
Anthropic seems to be taking this approach. Claude’s Constitution is very much a virtue ethics document.
I think this is a plausibly fruitful direction of investigation, but I also believe that a mature ontology/theory of ~value/agency will cut across the categories of consequentialism, deontology, virtue ethics, etc., inherited from moral philosophy, and so a proper solution grounded in such a theory will cut across those categories as well.
I’m not clear on how this solves the problem in the contexts where it’s load-bearing. For example, in the case of dangerous inputs we would like to not actually give the model dangerous inputs, but have the model believe this is the case (e.g. various forms of honeypots). Crucially, in these situations you presumably don’t know whether you’ve robustly virtue-aligned your model, so I don’t think it’s obvious that increased situational awareness about being in a honeypot is now good.
Yes, this was badly phrased. I have edited it to read “becomes a feature not just a challenge”. I agree this doesn’t solve the core problem of wanting to set honeypots.
My view on this is that it runs into the same problems many alternative alignment targets have: If you can robustly train an AI to embody these virtues, then I suspect you thereby have (or are not far off from) the ability to train the AI to be a “good consequentialist” or even more simply “value humanity as we desire” rather than these loose proxies.
Credit hacking is still a problem here: virtue ethics does not sidestep Goodhart’s law or other forms of over-optimization. History has many examples of virtues being optimized until the “real target” is left barren: extreme asceticism, various forms of Hinduism, flagellants, abuse of humility, social-status “character” over genuine goodness, ritualized propriety, courage → recklessness, and so on. More directly on your point: while it’s somewhat true, I think you underrate how manipulable framing is under virtue ethics. Consequentialism actively discourages messing with your framing of an issue, since distorting your vision results in systematically less utility. Virtue ethics has a lot of room to reframe an issue: actually, the opponent betrayed his word and thus is dishonorable, so aggression is now justice; the outgroup lacks your civilized virtues, so dominating them is really benevolence; opponents used dishonest means, so undermining them preserves the integrity of the situation. These failures are avoidable, but I don’t think many “default” ways of implementing virtue ethics easily avoid them. (And some of these framings might even be correct; I’m just wary of designing an AI with an incentive to perform this sort of reasoning.)
As well, while I don’t think this is an inevitable feature of virtue ethics, virtue ethics does often make it virtuous to spread those virtues. While this can be good, even for a less aggressive, non-consequentialist AGI/ASI I don’t think giving it desires that push it to move others toward its values is a good idea. The virtues, especially if we’re choosing ones that seem useful, are proxies for our values.
Hm. What do you mean by “good consequentialist” or “value humanity as we desire”? I think that we kind of know how to raise humans to be virtuous; I’m not sure if we know how to raise them to be good consequentialists, because I’m not sure what that means.
Virtue seems like an easier goal than the thing you’re talking about. For example, we can train dogs to be virtuous but (I presume) not to be good consequentialists.
What I mean is that you need a way to robustly point an AI at a point in the space of all values (which does have coherent structure), and it is a hard problem to point at what you want in a way that extrapolates out of distribution as you would want it to. So, if you have the ability to robustly make the AI follow these virtues as we intend them to be followed, then you probably have enough alignment capability to point it at “value humanity as we would desire” (or “act as a consequentialist and maximize that, with reflection to ensure you aren’t doing bad things”). So then virtue ethics is just a less useful target.
Now, you can try far weaker methods of training a model, similar to Claude’s “helpful, harmless, honest” sort-of virtues. However, I don’t think that will be robust, and it hasn’t been for as long as people have tried making LLMs not say bad things. With reinforcement learning and further automated research, this problem becomes starker as there’s ever more pressure making our weak methods of instilling those virtues fall apart.
I don’t think we really know how to raise humans to be robustly virtuous. I view us as having a lot of the machinery inbuilt; Byrnes’ post on this topic is relevant. AI won’t have that, nor do I see a strong reason it will adopt values from the environment in just the right way.
However, also, I don’t view a lot of humans’ virtue ethics as being robust in the sense that we desperately need AI values to be robust. See the examples I gave in my parent comment of virtue ethics historically becoming an end in itself and leading to bad outcomes. This is partially because humans are not naturally modeled as having virtue ethics by default, but rather (imo) a mix of virtue ethics, deontology, and consequentialism.
It’s hard to evaluate anything you’re writing now without seeing the formalization of virtues which you’ve yet to publish!
But this struck me as slightly odd:
> Situational awareness becomes a feature not just a challenge. Today we test whether AIs will obey instructions in often-implausible hypothetical scenarios. But as AIs get more intelligent, trying to hide their actual situation from them will become harder and harder. Yet this doesn’t have to just be a disadvantage, but rather also something we can benefit from. Virtues are inherently context-dependent and require judgment about how to apply them. Therefore the harder it is to deceive AIs about the context they’re deployed in, the more robustly they’d be able to act on virtues if they wanted to.
What does this mean? What kinds of context-dependence do virtues have that isn’t also equally true of consequentialist ethics and deontological ethics?
Yeah, this was very vague, thanks for pointing that out. Have rewritten as follows:
A couple of years ago I wrote about character alignment; I think that is broadly the same direction. Given that we are likely to enter an era of multi-agent RL sooner rather than later, some of the ideas might even become applicable.
I agree with this a lot.
Do you have any thoughts on how you make multi-agent interactions between virtuous AIs robust to defectors? It seems like a strong case for solving interpretability in order to verify weights have the virtue of honesty.
IMO, you mention the main downside of aligning to virtues in the post: it allows AIs more leeway to make decisions based on values. A core divergence from you is that, while we will need to defer to AIs eventually, I don’t expect breakthroughs in alignment massive enough to make it net-positive to allow AIs the level of control over generalizing values that I’d want for safe AI. I tend to think corrigibility plus AI control is likely to be our first-best mainline AI safety plan, and some of the purported benefits are much less likely to occur than you think.
The other issue is that while people do agree on virtues more than on consequentialist preferences, a lot of the reason why people agree on virtues, both now and in the past, is at least consistent with two phenomena occurring, which I’d argue explain the super-majority of the effect in practice:
Technology has unbundled certain things that were bundled in the distant past, and has already made virtues like honor go from being held by 80-99% of the population to ~0%. Still, for a lot of different virtues, much of the option space harms or helps many virtues at once by default, and it’s very difficult to engineer around the vast space of possible virtue disagreements between humans because it’s hard to improve one virtue without improving another. Most of our virtues are in practice built out of valuing instrumental goods. But in a post-AGI future (for now I’ll assume the AI safety problem is solved), it’s a lot easier to create goods whose value differs by many OOMs between different virtues, which Tyler M John explains well here.
To a large extent, humans need to live and work with other humans, and it’s not really possible for anyone, even the richest and most powerful, to simply ignore societal norms without paying heavy prices, even if only informally. Interactions are repeated often enough, and enforcement is possible because humans require a lot of logistical inputs that other humans can take away, so we can turn prisoner’s dilemmas into iterated prisoner’s dilemmas, stag hunts, or Schelling problems (a toy numerical sketch of this conversion follows below). (I ignore acausal trade/cooperation for decision theories like EDT/FDT/UDT because it relies on people having more impartial values than they actually have, and pure reciprocity motivations don’t work because humans can’t reason well about each other: yes, we’re surprisingly good at modeling given compute constraints, but it’s nowhere near enough.) But post-AGI, humans will be able to choose to be independent of social constraints/pressure, meaning that the forces for convergence on certain virtues will weaken a lot. Vladimir Nesov talks about that here.
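To make the repeated-interaction point concrete, here is a toy numerical sketch (my illustrative payoff numbers, not anything from the comment):

```python
# One-shot prisoner's dilemma: defection dominates (T > R and P > S).
T, R, P, S = 5, 3, 1, 0   # temptation, mutual-cooperation reward, mutual-defection punishment, sucker payoff

# Repeated interaction against a grim-trigger enforcer (cooperates until you
# defect once, then defects forever). This is the kind of enforcement that
# turns a one-shot dilemma into a game where cooperation pays.
rounds = 10
defect_total = T + (rounds - 1) * P   # 5 + 9*1 = 14
coop_total = rounds * R               # 10*3 = 30
print(defect_total, coop_total)       # cooperation wins once interactions repeat and can be punished
```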
IMO, the more plausible versions of value alignment/good futures look like moral trade, as in this short afterword of “What We Owe The Future”, or, earlier on, viatopia as discussed by William Macaskill here (conditional on solving alignment).
That virtues are so readily thought of as fuzzy or flexible is an unfortunate consequence of our limited ability to properly anticipate and evaluate the consequences of our actions and the definitions of our words. IMO deontological rules aren’t a problem because they’re rigid, they’re a problem because they try to be succinct and thereby draw an inaccurate, crude boundary around the set of behaviors we would ideally want. If we choose to align AIs to virtues, I’d like to make sure they know that virtues are also rigid and unyielding but fractally complex at their boundaries. It is each mind’s understanding of the virtues it seeks to uphold (along with the world it is operating in) that is fuzzy, and this thereby necessitates flexibility and caution in practice. “Don’t believe everything you think” is critical advice for everyone, and “Don’t optimize too hard without way more evidence than you think you need” is a subset of it, but a well-grounded virtue ethicist can incorporate it more easily into planning and review processes than a deontologist or a naive utilitarian/consequentialist can.
Edit to add: I do think, from a God’s eye view, consequentialism is in some deep sense ‘true’ as a final arbiter of what makes an action good or bad. But I think the problem that the complete set of results of an action is not computable in advance by any finite agent within the universe is inescapably damning if you want to rely on this kind of reasoning for each decision. We can try to approximate such computations when the decision is sufficiently important and none of our regular heuristics seem adequate. Otherwise, we use deontological rules as heuristics within known contexts, and virtues as different kinds of heuristics in a broader set of less known contexts. Strict adherence to deontological rules, or deontological definitions of virtues, leads to horrible places out of distribution.
As I wrote on Dialogue: Is there a Natural Abstraction of Good?
And I also think that the models would be able to grok the virtues. Sounds generally promising. But the same two issues come up with virtues, too:
And my argument goes through almost identically to NAG:
What if we think of virtuous AIs as agents with consequentialist utility functions over their own properties/virtues, as opposed to over world states?
This is a super-not-fleshed-out idea I’ve been playing around with in my head, but here are some assorted thoughts (with a toy sketch after them):
There are some arguments about how agents that are good at solving real-world problems would inherit a bunch of consequentialist cognitive patterns. Can we reframe success at real-world tasks as, like, examples of agents “living up to their character/virtues”?
I feel like this is fairly natural in humans? Like, “I did X and Y because I am a kind person,” instead of “I did X and Y because it has impacts Z on the world, which I endorse.”
You probably want models to be somewhat goal-guarding around their own positive traits to prevent alignment drift.
You could totally just make one of the virtues “broad deference to humans.” Corrigibility is weird if you think about agents which want to achieve things in the world, but less weird if you think about agents which care about “being a good assistant.”
(Idk, maybe there’s existing discussion about this that I haven’t read before. I would not be super surprised if someone could change my mind on this in a five-minute conversation; I am currently trying to post and comment more often, even with more half-baked thoughts.)
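A toy sketch of that framing, with my own hypothetical trait names and scoring (not a claim about how this would actually be formalized): a “virtue utility” scores the agent’s own dispositions, in contrast to a standard utility over world states.

```python
# Hypothetical sketch: utility over the agent's own properties/virtues,
# contrasted with a conventional utility over world states.
from dataclasses import dataclass

@dataclass
class AgentTraits:
    honesty: float     # 0..1, how reliably the agent avoids deception
    kindness: float    # 0..1, how consistently it helps others
    deference: float   # 0..1, "broad deference to humans" as a trait

def virtue_utility(traits: AgentTraits, weights=(1.0, 1.0, 1.0)) -> float:
    """Scores the agent's own dispositions, not outcomes in the world."""
    w_h, w_k, w_d = weights
    return w_h * traits.honesty + w_k * traits.kindness + w_d * traits.deference

def world_state_utility(world_state: dict) -> float:
    """Contrast: a conventional consequentialist utility over world states."""
    return world_state.get("total_welfare", 0.0)

# An agent maximizing virtue_utility is "goal-guarding" its own traits:
current = AgentTraits(honesty=0.9, kindness=0.8, deference=0.95)
after_drift = AgentTraits(honesty=0.6, kindness=0.8, deference=0.95)
assert virtue_utility(current) > virtue_utility(after_drift)  # it prefers to resist value drift
```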
Benign credit hacking is a nice idea, and imo isn’t discussed enough (even with the alignment faking paper)
Basically auto-induced inoculation prompting