Apologizing is a Core Rationalist Skill

There’s this narrative about a tradeoff between:

  • The virtue of Saying Oops, early and often, correcting course rather than continuing to pour oneself into a losing bet, vs

  • The loss of social status one suffers by admitting defeat, rather than spinning things as a win or at least a minor setback, or defending oneself.

In an ideal world—goes the narrative—social status mechanisms would reward people for publicly updating, rather than defending or spinning their every mistake. But alas, that’s not how the world actually works, so as individuals we’re stuck making difficult tradeoffs.

I claim that this narrative is missing a key piece. There is a social status mechanism which rewards people for publicly updating. The catch is that the person updating must explicitly invoke it: it’s a social API which must be called in order for the update to be rewarded.

That social API is apologizing.

Mistake/Misdeed + Apology can be Net Gainful to Social Status

A personal example: there was a post called “Common Misconceptions about OpenAI”, which (among many other points) estimated that ~30 alignment researchers work there. I replied (also among many other points):

I’d guess that is an overestimate of the number of people actually doing alignment research at OpenAI, as opposed to capabilities research in which people pay lip service to alignment. In particular, all of the RLHF work is basically capabilities work which makes alignment harder in the long term (because it directly selects for deception), while billing itself as “alignment”.

There was a lot of pushback against that. Paul Christiano replied “Calling work you disagree with ‘lip service’ seems wrong and unhelpful.” I clarified that I was not using this as a generic criticism of something I disagreed with, but rather that RLHF seemed so obviously antihelpful to alignment that I did not expect most people working on it had actually thought about whether it would help, but were instead just doing it for other reasons while, y’know, paying lip service to alignment.[1] Richard Ngo eventually convinced me that the people working on RLHF had thought at least somewhat about whether it would help; his evidence was a comment elsewhere from Paul, which in particular said:

To be clear, this is not post hoc reasoning. I talked with WebGPT folks early on while they were wondering about whether these risks were significant, and I said that I thought this was badly overdetermined. If there had been more convincing arguments that the harms from the research were significant, I believe that it likely wouldn’t have happened.

I was wrong; the people working on RLHF (for WebGPT) apparently had actually thought about how it would impact alignment to at least some extent.

So, I replied to Richard to confirm that he had indeed disproved my intended claim, and thanked him for the information. I struck out the relevant accusation from my original comment, and edited in an apology there:

I have been convinced that I was wrong about this, and I apologize. I still definitely maintain that RLHF makes alignment harder and is negative progress for both outer and inner alignment, but I have been convinced that the team actually was trying to solve problems which kill us, and therefore not just paying lip service to alignment.

And, finally, I sent a personal apology message to Jacob Hilton, the author of the original post.

Why do I bring up this whole story here?

LessWrong has a convenient numerical proxy-metric of social status: site karma. Prior to the redaction and apology, my comment had been rather controversial—lots of upvotes, lots of downvotes, generally low-positive karma overall but a rollercoaster. After the redaction and apology, it stabilized at a reasonably positive number, and the comment in which I confirmed that Richard had disproved my claim (and thanked him for the information) ended up one of the most-upvoted in that thread.

The point: apologizing probably worked out to a net-positive marginal delta in social status. Not just relative to further defending my claim, but even relative to not having left any comment in the first place.

More generally: when I admit a mistake/misdeed and apologize for it, I lose some social standing for having made the mistake or committed the misdeed. But I also get a large boost in social status, by sending the strongest possible signal that I am the sort of person who is willing to admit their own mistakes/misdeeds and apologize for them. The LessWrong community nominally places especially high value on this, but in practice it extends to the rest of the world too: explicitly admitting one’s mistakes/misdeeds and apologizing for them is rare enough to be an extremely strong signal of integrity. It’s the sort of signal which most people instinctively recognize, and which they instinctively respect a great deal. Overall, then, it’s not that unusual for mistake/misdeed + apology to add up to a net gain in social status—if one actually registers one’s update via the social apology-API.[2]

Now, I am not saying a Machiavellian should go around intentionally making mistakes or committing misdeeds and then apologizing for them. It’s high-variance at best, and besides, we all inevitably have plenty of mistakes/misdeeds to work with anyway. But I am saying that, even from a basically-Machiavellian perspective, ignoring all that “right thing to do” stuff, apologizing is very often a net-beneficial move, even compared to convincing people you didn’t make a mistake or commit a misdeed in the first place. It’s severely underexploited.

Apology-Adjacent Things

Beyond literal apologies, there are other apology-like things which seem to evoke similar emotions (both in the apologizer and the recipient/audience), and which “use the same API” in some sense.

Another example: a few years ago I wrote a post called Why Subagents?, arguing that a market/committee of utility maximizers is a better baseline model for agents than a monolithic utility maximizer. Nate Soares later convinced me that a core part of my argument was wrong: the subagent-systems I was talking about will tend toward monolithic utility maximization. So I eventually wrote another post: Why Not Subagents?. At the very beginning, I quoted the older post, and stated that Nate had convinced me the older argument was wrong.

That’s not literally an apology. This wasn’t the kind of situation where a literal apology made sense; there wasn’t a specific person or a few specific people who’d been harmed by my mistake. (Even had there been, it was a very nonobvious mistake, so I doubt most people would consider me particularly blameworthy.)

But it still had much of the structure of an apology: the admission “I was wrong” (not in those exact words, but clear enough), spelling out exactly what I was wrong about, then what I would believe differently going forward. All the standard apology-pieces.

And it felt like an apology, when writing it. It pulled the same internal levers in my brain, drew on the same skills. There was a feeling of… a choice to willingly “give up ground”, like I could have been defensive but chose otherwise. An instinctive feeling that I’d lose respect of a different kind (the “being the sort of person who explicitly admits mistake/misdeed” kind) if I tried to dig in my heels. The nudge of an ingrained habit from having apologized many times before, which pushed me to apologize a lot rather than a little—to be maximally upfront and explicit about my own errors, rather than try to portray them as minor. Because the more direct and explicit and comprehensive the apology, the more I gain that other kind of respect from having apologized.

That feels to me like a core part of what makes apologizing a skill, i.e. something one can improve at with practice: that feeling of leaning into the apology, being maximally direct and explicit about one’s mistakes/misdeeds rather than downplaying them, and growing the gut feel that such directness tends to be rewarded with its own kind of respect/status.

Takeaways

Wouldn’t it be great if social status mechanisms rewarded us for publicly updating, rather than for defending or spinning our mistakes/misdeeds? Wouldn’t it be great if integrity were better aligned with Machiavellian incentives?

Well, it often can be, if we invoke the right social API. Apologizing is the standard social status mechanism by which one can receive social credit for updating after making a mistake or committing a misdeed.

Even from a purely Machiavellian perspective, apologizing can often leave us better off than when we started. The people around us might trust us less for our mistake/misdeed, but we earn a different kind of respect by sending the strongest possible signal that we are the sort of person who is willing to admit their own mistakes/misdeeds and apologize for them—the sort of person who can explicitly and publicly update.

And apologizing is a skill which can be developed. We can build the instinct to “lean into it”—to be maximally upfront and explicit about our errors when apologizing, rather than try to downplay them. We can build the gut feel that being maximally forthright, rather than minimally, will maximize that other kind of respect which a good apology earns.

By building the skill of apologizing well, we can interface with standard social reality in a way more compatible with the virtue of Saying Oops.

  1. ^

    I wish to note here that Richard took this “as evidence that John would fail an intellectual Turing test for people who have different views than he does about how valuable incremental empiricism is”. Of course I couldn’t just ignore an outright challenge to my honor like that, so I wrote a brief reply which Richard himself called “a pretty good ITT”.

  2. ^

    In certain circumstances, apologizing can also be a countersignalling power-move, i.e. “I am so high status that I can grovel a bit without anybody mistaking me for a general groveller”. But that’s not really the type of move this post is focused on.