MIRI announces new “Death With Dignity” strategy

tl;dr: It’s obvious at this point that humanity isn’t going to solve the alignment problem, or even try very hard, or even go out with much of a fight. Since survival is unattainable, we should shift the focus of our efforts to helping humanity die with slightly more dignity.


Well, let’s be frank here. MIRI didn’t solve AGI alignment and at least knows that it didn’t. Paul Christiano’s incredibly complicated schemes have no chance of working in real life before DeepMind destroys the world. Chris Olah’s transparency work, at current rates of progress, will at best let somebody at DeepMind give a highly speculative warning about how the current set of enormous inscrutable tensors, inside a system that was recompiled three weeks ago and has now been training by gradient descent for 20 days, might possibly be planning to start trying to deceive its operators.

Management will then ask what they’re supposed to do about that.

Whoever detected the warning sign will say that there isn’t anything known they can do about that. Just because you can see the system might be planning to kill you, doesn’t mean that there’s any known way to build a system that won’t do that. Management will then decide not to shut down the project—because it’s not certain that the intention was really there or that the AGI will really follow through, because other AGI projects are hard on their heels, because if all those gloomy prophecies are true then there’s nothing anybody can do about it anyways. Pretty soon that troublesome error signal will vanish.

When Earth’s prospects are that far underwater in the basement of the logistic success curve, it may be hard to feel motivated about continuing to fight, since doubling our chances of survival will only take them from 0% to 0%.

That’s why I would suggest reframing the problem—especially on an emotional level—to helping humanity die with dignity, or rather, since even this goal is realistically unattainable at this point, die with slightly more dignity than would otherwise be counterfactually obtained.

Consider the world if Chris Olah had never existed. It’s then much more likely that nobody will even try and fail to adapt Olah’s methodologies to try and read complicated facts about internal intentions and future plans, out of whatever enormous inscrutable tensors are being integrated a million times per second, inside of whatever recently designed system finished training 48 hours ago, in a vast GPU farm that’s already helpfully connected to the Internet.

It is more dignified for humanity—a better look on our tombstone—if we die after the management of the AGI project was heroically warned of the dangers but came up with totally reasonable reasons to go ahead anyways.

Or, failing that, if people made a heroic effort to do something that could maybe possibly have worked to generate a warning like that but couldn’t actually in real life because the latest tensors were in a slightly different format and there was no time to readapt the methodology. Compared to the much less dignified-looking situation if there’s no warning and nobody even tried to figure out how to generate one.

Or take MIRI. Are we sad that it looks like this Earth is going to fail? Yes. Are we sad that we tried to do anything about that? No, because it would be so much sadder, when it all ended, to face our ends wondering if maybe solving alignment would have just been as easy as buckling down and making a serious effort on it—not knowing if that would’ve just worked, if we’d only tried, because nobody had ever even tried at all. It wasn’t subjectively overdetermined that the (real) problems would be too hard for us, before we made the only attempt at solving them that would ever be made. Somebody needed to try at all, in case that was all it took.

It’s sad that our Earth couldn’t be one of the more dignified planets that makes a real effort, correctly pinpointing the actual real difficult problems and then allocating thousands of the sort of brilliant kids that our Earth steers into wasting their lives on theoretical physics. But better MIRI’s effort than nothing. What were we supposed to do instead, pick easy irrelevant fake problems that we could make an illusion of progress on, and have nobody out of the human species even try to solve the hard scary real problems, until everybody just fell over dead?

This way, at least, some people are walking around knowing why it is that if you train with an outer loss function that enforces the appearance of friendliness, you will not get an AI internally motivated to be friendly in a way that persists after its capabilities start to generalize far out of the training distribution...

To be clear, nobody’s going to listen to those people, in the end. There will be more comforting voices that sound less politically incongruent with whatever agenda needs to be pushed forward that week. Or even if that ends up not so, this isn’t primarily a social-political problem, of just getting people to listen. Even if DeepMind listened, and Anthropic knew, and they both backed off from destroying the world, that would just mean Facebook AI Research destroyed the world a year(?) later.

But compared to being part of a species that walks forward completely oblivious into the whirling propeller blades, with nobody having seen it at all or made any effort to stop it, it is dying with a little more dignity, if anyone knew at all. You can feel a little incrementally prouder to have died as part of a species like that, if maybe not proud in absolute terms.

If there is a stronger warning, because we did more transparency research? If there’s deeper understanding of the real dangers and those come closer to beating out comfortable nonrealities, such that DeepMind and Anthropic really actually back off from destroying the world and let Facebook AI Research do it instead? If they try some hopeless alignment scheme whose subjective success probability looks, to the last sane people, more like 0.1% than 0? Then we have died with even more dignity! It may not get our survival probabilities much above 0%, but it would be so much more dignified than the present course looks to be!


Now of course the real subtext here, is that if you can otherwise set up the world so that it looks like you’ll die with enough dignity—die of the social and technical problems that are really unavoidable, after making a huge effort at coordination and technical solutions and failing, rather than storming directly into the whirling helicopter blades as is the present unwritten plan -

- heck, if there was even a plan at all -

- then maybe possibly, if we’re wrong about something fundamental, somehow, somewhere -

- in a way that makes things easier rather than harder, because obviously we’re going to be wrong about all sorts of things, it’s a whole new world inside of AGI -

- although, when you’re fundamentally wrong about rocketry, this does not usually mean your rocket prototype goes exactly where you wanted on the first try while consuming half as much fuel as expected; it means the rocket explodes earlier yet, and not in a way you saw coming, being as wrong as you were -

- but if we get some miracle of unexpected hope, in those unpredicted inevitable places where our model is wrong -

- then our ability to take advantage of that one last hope, will greatly depend on how much dignity we were set to die with, before then.

If we can get on course to die with enough dignity, maybe we won’t die at all...?

In principle, yes. Let’s be very clear, though: Realistically speaking, that is not how real life works.

It’s possible for a model error to make your life easier. But you do not get more surprises that make your life easy, than surprises that make your life even more difficult. And people do not suddenly become more reasonable, and make vastly more careful and precise decisions, as soon as they’re scared. No, not even if it seems to you like their current awful decisions are weird and not-in-the-should-universe, and surely some sharp shock will cause them to snap out of that weird state into a normal state and start outputting the decisions you think they should make.

So don’t get your heart set on that “not die at all” business. Don’t invest all your emotion in a reward you probably won’t get. Focus on dying with dignity—that is something you can actually obtain, even in this situation. After all, if you help humanity die with even one more dignity point, you yourself die with one hundred dignity points! Even if your species dies an incredibly undignified death, for you to have helped humanity go down with even slightly more of a real fight, is to die an extremely dignified death.

“Wait, dignity points?” you ask. “What are those? In what units are they measured, exactly?”

And to this I reply: Obviously, the measuring units of dignity are over humanity’s log odds of survival—the graph on which the logistic success curve is a straight line. A project that doubles humanity’s chance of survival from 0% to 0% is helping humanity die with one additional information-theoretic bit of dignity.
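For readers who want the arithmetic spelled out, here is a minimal sketch of that unit, assuming base-2 logarithms so that doubling the odds is worth one bit. The function names and the example probabilities (0.1% and 0.2%) are illustrative assumptions, not figures from this post.

```python
import math


def log_odds_bits(p: float) -> float:
    """Return the log odds of probability p in bits: log2(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return math.log2(p / (1.0 - p))


def dignity_gained(p_before: float, p_after: float) -> float:
    """Bits of dignity contributed by moving survival odds from p_before to p_after."""
    return log_odds_bits(p_after) - log_odds_bits(p_before)


# Hypothetical numbers: both probabilities still round to "0%" when displayed,
# yet the move between them is worth roughly one bit on the log-odds scale.
p_before, p_after = 0.001, 0.002
print(f"{p_before:.0%} -> {p_after:.0%}")
print(f"dignity gained: {dignity_gained(p_before, p_after):.2f} bits")
```

At small probabilities, doubling the probability and doubling the odds are nearly the same thing, which is why a jump that looks like "0% to 0%" on a rounded display still registers as about one bit.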

But if enough people can contribute enough bits of dignity like that, wouldn’t that mean we didn’t die at all? Yes, but again, don’t get your hopes up. Don’t focus your emotions on a goal you’re probably not going to obtain. Realistically, we find a handful of projects that contribute a few more bits of counterfactual dignity; get a bunch more not-specifically-expected bad news that makes the first-order object-level situation look even worse (where to second order, of course, the good Bayesians already knew that was how it would go); and then we all die.


With a technical definition in hand of what exactly constitutes dignity, we may now consider some specific questions about what does and doesn’t constitute dying with dignity.

Q1: Does ‘dying with dignity’ in this context mean accepting the certainty of your death, and not childishly regretting that or trying to fight a hopeless battle?

Don’t be ridiculous. How would that increase the log odds of Earth’s survival?

My utility function isn’t up for grabs, either. If I regret my planet’s death then I regret it, and it’s beneath my dignity to pretend otherwise.

That said, I fought hardest while it looked like we were in the more sloped region of the logistic success curve, when our survival probability seemed more around the 50% range; I borrowed against my future to do that, and burned myself out to some degree. That was a deliberate choice, which I don’t regret now; it was worth trying, I would not have wanted to die having not tried, I would not have wanted Earth to die without anyone having tried. But yeah, I am taking some time partways off, and trying a little less hard, now. I’ve earned a lot of dignity already; and if the world is ending anyways and I can’t stop it, I can afford to be a little kind to myself about that.

When I tried hard and burned myself out some, it was with the understanding, within myself, that I would not keep trying to do that forever. We cannot fight at maximum all the time, and some times are more important than others. (Namely, when the logistic success curve seems relatively more sloped; those times are relatively more important.)

All that said: If you fight marginally longer, you die with marginally more dignity. Just don’t undignifiedly delude yourself about the probable outcome.

Q2: I have a clever scheme for saving the world! I should act as if I believe it will work and save everyone, right, even if there’s arguments that it’s almost certainly misguided and doomed? Because if those arguments are correct and my scheme can’t work, we’re all dead anyways, right?

A: No! That’s not dying with dignity! That’s stepping sideways out of a mentally uncomfortable world and finding an escape route from unpleasant thoughts! If you condition your probability models on a false fact, something that isn’t true on the mainline, it means you’ve mentally stepped out of reality and are now living somewhere else instead.

There are more elaborate arguments against the rationality of this strategy, but consider this quick heuristic for arriving at the correct answer: That’s not a dignified way to die. Death with dignity means going on mentally living in the world you think is reality, even if it’s a sad reality, until the end; not abandoning your arts of seeking truth; dying with your commitment to reason intact.

You should try to make things better in the real world, where your efforts aren’t enough and you’re going to die anyways; not inside a fake world you can save more easily.

Q2: But what’s wrong with the argument from expected utility, saying that all of humanity’s expected utility lies within possible worlds where my scheme turns out to be feasible after all?

A: Most fundamentally? That’s not what the surviving worlds look like. The surviving worlds look like people who lived inside their awful reality and tried to shape up their impossible chances; until somehow, somewhere, a miracle appeared—the model broke in a positive direction, for once, as does not usually occur when you are trying to do something very difficult and hard to understand, but might still be so—and they were positioned with the resources and the sanity to take advantage of that positive miracle, because they went on living inside uncomfortable reality. Positive model violations do ever happen, but it’s much less likely that somebody’s specific desired miracle that “we’re all dead anyways if not...” will happen; these people have just walked out of the reality where any actual positive miracles might occur.

Also and in practice? People don’t just pick one comfortable improbability to condition on. They go on encountering unpleasant facts true on the mainline, and each time saying, “Well, if that’s true, I’m doomed, so I may as well assume it’s not true,” and they say more and more things like this. If you do this it very rapidly drives down the probability mass of the ‘possible’ world you’re mentally inhabiting. Pretty soon you’re living in a place that’s nowhere near reality. If there were an expected utility argument for risking everything on an improbable assumption, you’d get to make exactly one of them, ever. People using this kind of thinking usually aren’t even keeping track of when they say it, let alone counting the occasions.
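A rough sketch of how fast that probability mass shrinks, under the simplifying assumption (mine, not the post's) that each comfortable assumption is independent of the others. The function name and the 10% figure are hypothetical.

```python
def remaining_probability_mass(assumption_probs):
    """Multiply together the probabilities of every assumption being conditioned on.

    Assumes the assumptions are independent, which is a simplification
    made only for illustration.
    """
    mass = 1.0
    for p in assumption_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each probability must be between 0 and 1")
        mass *= p
    return mass


# Hypothetical case: five separate "I may as well assume it's not true" moves,
# each of which had only a 10% chance of being right.
mass = remaining_probability_mass([0.1] * 5)
print(f"probability of the world being mentally inhabited: {mass:.6%}")
```

Each additional assumption multiplies the total down again, so the mental world drifts away from the mainline geometrically rather than linearly.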

Also also, in practice? In domains like this one, things that seem to first-order like they “might” work… have essentially no chance of working in real life, to second-order after taking into account downward adjustments against optimism. AGI is a scientifically unprecedented experiment and a domain with lots of optimization pressures some of which work against you and unforeseeable intelligently selected execution pathways and with a small target to hit and all sorts of extreme forces that break things and that you couldn’t fully test before facing them. AGI alignment seems like it’s blatantly going to be an enormously Murphy-cursed domain, like rocket prototyping or computer security but worse.

In a domain like that, if you have a clever scheme for winning anyways that, to first-order theoretical theory, totally definitely seems like it should work, even to Eliezer Yudkowsky rather than somebody who just goes around saying that casually, then maybe there’s like a 50% chance of it working in practical real life after all the unexpected disasters and things turning out to be harder than expected.

If to first-order it seems to you like something in a complicated unknown untested domain has a 40% chance of working, it has a 0% chance of working in real life.

Also also also in practice? Harebrained schemes of this kind are usually actively harmful. Because they’re invented by the sort of people who’ll come up with an unworkable scheme, and then try to get rid of counterarguments with some sort of dismissal like “Well if not then we’re all doomed anyways.”

If nothing else, this kind of harebrained desperation drains off resources from those reality-abiding efforts that might try to do something on the subjectively apparent doomed mainline, and so position themselves better to take advantage of unexpected hope, which is what the surviving possible worlds mostly look like.

The surviving worlds don’t look like somebody came up with a harebrained scheme, dismissed all the obvious reasons it wouldn’t work with “But we have to bet on it working,” and then it worked.

That’s the elaborate argument about what’s rational in terms of expected utility, once reasonable second-order commonsense adjustments are taken into account. Note, however, that if you have grasped the intended emotional connotations of “die with dignity”, it’s a heuristic that yields the same answer much faster. It’s not dignified to pretend we’re less doomed than we are, or step out of reality to live somewhere else.

Q3: Should I scream and run around and go through the streets wailing of doom?

A: No, that’s not very dignified. Have a private breakdown in your bedroom, or a breakdown with a trusted friend, if you must.

Q3: Why is that bad from a coldly calculating expected utility perspective, though?

A: Because it associates belief in reality with people who act like idiots and can’t control their emotions, which worsens our strategic position in possible worlds where we get an unexpected hope.

Q4: Should I lie and pretend everything is fine, then? Keep everyone’s spirits up, so they go out with a smile, unknowing?

A: That also does not seem to me to be dignified. If we’re all going to die anyways, I may as well speak plainly before then. If into the dark we must go, let’s go there speaking the truth, to others and to ourselves, until the end.

Q4: Okay, but from a coldly calculating expected utility perspective, why isn’t it good to lie to keep everyone calm? That way, if there’s an unexpected hope, everybody else will be calm and oblivious and not interfering with us out of panic, and my faction will have lots of resources that they got from lying to their supporters about how much hope there was! Didn’t you just say that people screaming and running around while the world was ending would be unhelpful?

A: You should never try to reason using expected utilities again. It is an art not meant for you. Stick to intuitive feelings henceforth.

There are, I think, people whose minds readily look for and find even the slightly-less-than-totally-obvious considerations of expected utility, what some might call “second-order” considerations. Ask them to rob a bank and give the money to the poor, and they’ll think spontaneously and unprompted about insurance costs of banking and the chance of getting caught and reputational repercussions and low-trust societies and what if everybody else did that when they thought it was a good cause; and all of these considerations will be obviously-to-them consequences under consequentialism.

These people are well-suited to being ‘consequentialists’ or ‘utilitarians’, because their mind naturally sees all the consequences and utilities, including those considerations that others might be tempted to call by names like “second-order” or “categorical” and so on.

If you ask them why consequentialism doesn’t say to rob banks, they reply, “Because that actually realistically in real life would not have good consequences. Whatever it is you’re about to tell me as a supposedly non-consequentialist reason why we all mustn’t do that, seems to you like a strong argument, exactly because you recognize implicitly that people robbing banks would not actually lead to happy formerly-poor people and everybody living cheerfully ever after.”

Others, if you suggest to them that they should rob a bank and give the money to the poor, will be able to see the helped poor as a “consequence” and a “utility”, but they will not spontaneously and unprompted see all those other considerations in the formal form of “consequences” and “utilities”.

If you just asked them informally whether it was a good or bad idea, they might ask “What if everyone did that?” or “Isn’t it good that we can live in a society where people can store and transmit money?” or “How would it make effective altruism look, if people went around doing that in the name of effective altruism?” But if you ask them about consequences, they don’t spontaneously, readily, intuitively classify all these other things as “consequences”; they think that their mind is being steered onto a kind of formal track, a defensible track, a track of stating only things that are very direct or blatant or obvious. They think that the rule of consequentialism is, “If you show me a good consequence, I have to do that thing.”

If you present them with bad things that happen if people rob banks, they don’t see those as also being ‘consequences’. They see them as arguments against consequentialism; since, after all, consequentialism says to rob banks, which obviously leads to bad stuff, and so bad things would end up happening if people were consequentialists. They do not do a double-take and say “What?” That consequentialism leads people to do bad things with bad outcomes is just a reasonable conclusion, so far as they can tell.

People like this should not be ‘consequentialists’ or ‘utilitarians’ as they understand those terms. They should back off from this form of reasoning that their mind is not naturally well-suited for processing in a native format, and stick to intuitively informally asking themselves what’s good or bad behavior, without any special focus on what they think are ‘outcomes’.

If they try to be consequentialists, they’ll end up as Hollywood villains describing some grand scheme that violates a lot of ethics and deontology but sure will end up having grandiose benefits, yup, even while everybody in the audience knows perfectly well that it won’t work. You can only safely be a consequentialist if you’re genre-savvy about that class of arguments—if you’re not the blind villain on screen, but the person in the audience watching who sees why that won’t work.

Q4: I know EAs shouldn’t rob banks, so this obviously isn’t directed at me, right?

A: The people of whom I speak will look for and find the reasons not to do it, even if they’re in a social environment that doesn’t have strong established injunctions against bank-robbing specifically exactly. They’ll figure it out even if you present them with a new problem isomorphic to bank-robbing but with the details changed.

Which is basically what you just did, in my opinion.

Q4: But from the standpoint of cold-blooded calculation -

A: Calculations are not cold-blooded. What blood we have in us, warm or cold, is something we can learn to see more clearly with the light of calculation.

If you think calculations are cold-blooded, that they only shed light on cold things or make them cold, then you shouldn’t do them. Stay by the warmth in a mental format where warmth goes on making sense to you.

Q4: Yes yes fine fine but what’s the actual downside from an expected-utility standpoint?

A: If good people were liars, that would render the words of good people meaningless as information-theoretic signals, and destroy the ability for good people to coordinate with others or among themselves.

If the world can be saved, it will be saved by people who didn’t lie to themselves, and went on living inside reality until some unexpected hope appeared there.

If those people went around lying to others and paternalistically deceiving them—well, mostly, I don’t think they’ll have really been the types to live inside reality themselves. But even imagining the contrary, good luck suddenly unwinding all those deceptions and getting other people to live inside reality with you, to coordinate on whatever suddenly needs to be done when hope appears, after you drove them outside reality before that point. Why should they believe anything you say?

Q4: But wouldn’t it be more clever to -

A: Stop. Just stop. This is why I advised you to reframe your emotional stance as dying with dignity.

Maybe there’d be an argument about whether or not to violate your ethics if the world was actually going to be saved at the end. But why break your deontology if it’s not even going to save the world? Even if you have a price, should you be that cheap?

Q4: But we could maybe save the world by lying to everyone about how much hope there was, to gain resources, until -

A: You’re not getting it. Why violate your deontology if it’s not going to really actually save the world in real life, as opposed to a pretend theoretical thought experiment where your actions have only beneficial consequences and none of the obvious second-order detriments?

It’s relatively safe to be around an Eliezer Yudkowsky while the world is ending, because he’s not going to do anything extreme and unethical unless it would really actually save the world in real life, and there are no extreme unethical actions that would really actually save the world the way these things play out in real life, and he knows that. He knows that the next stupid sacrifice-of-ethics proposed won’t work to save the world either, actually in real life. He is a ‘pessimist’ - that is, a realist, a Bayesian who doesn’t update in a predictable direction, a genre-savvy person who knows what the viewer would say if there were a villain on screen making that argument for violating ethics. He will not, like a Hollywood villain onscreen, be deluded into thinking that some clever-sounding deontology-violation is bound to work out great, when everybody in the audience watching knows perfectly well that it won’t.

My ethics aren’t for sale at the price point of failure. So if it looks like everything is going to fail, I’m a relatively safe person to be around.

I’m a genre-savvy person about this genre of arguments and a Bayesian who doesn’t update in a predictable direction. So if you ask, “But Eliezer, what happens when the end of the world is approaching, and in desperation you cling to whatever harebrained scheme has Goodharted past your filters and presented you with a false shred of hope; what then will you do?”—I answer, “Die with dignity.” Where “dignity” in this case means knowing perfectly well that’s what would happen to some less genre-savvy person; and my choosing to do something else which is not that. But “dignity” yields the same correct answer and faster.

Q5: “Relatively” safe?

A: It’d be disingenuous to pretend that it wouldn’t be even safer to hang around somebody who had no clue what was coming, didn’t know any mental motions for taking a worldview seriously, thought it was somebody else’s problem to ever do anything, and would just cheerfully party with you until the end.

Within the class of people who know the world is ending and consider it to be their job to do something about that, Eliezer Yudkowsky is a relatively safe person to be standing next to. At least, before you both die anyways, as is the whole problem there.

Q5: Some of your self-proclaimed fans don’t strike me as relatively safe people to be around, in that scenario?

A: I failed to teach them whatever it is I know. Had I known then what I know now, I would have warned them not to try.

If you insist on putting it into terms of fandom, though, feel free to notice that Eliezer Yudkowsky is much closer to being a typical liberaltarian science-fiction fan, as was his own culture that actually birthed him, than he is a typical member of any subculture that might have grown up later. Liberaltarian science-fiction fans do not usually throw away all their ethics at the first sign of trouble. They grew up reading books where those people were the villains.

Please don’t take this as a promise from me to play nice, as you define niceness; the world is ending, and also people have varying definitions of what is nice. But I presently mostly expect to end up playing nice, because there won’t be any options worth playing otherwise.

It is a matter of some concern to me that all this seems to be an alien logic to some strange people who—this fact is still hard for me to grasp on an emotional level—don’t spontaneously generate all of this reasoning internally, as soon as confronted with the prompt. Alas.

Q5: Then isn’t it unwise to speak plainly of these matters, when fools may be driven to desperation by them? What if people believe you about the hopeless situation, but refuse to accept that conducting themselves with dignity is the appropriate response?

A: I feel like I’ve now tried to live my life that way for a while, by the dictum of not panicking people; and, like everything else I’ve tried, that hasn’t particularly worked? There are no plans left to avoid disrupting, now, with other people’s hypothetical panic.

I think we die with slightly more dignity—come closer to surviving, as we die—if we are allowed to talk about these matters plainly. Even given that people may then do unhelpful things, after being driven mad by overhearing sane conversations. I think we die with more dignity that way, than if we go down silent and frozen and never talking about our impending death for fear of being overheard by people less sane than ourselves.

I think that in the last surviving possible worlds with any significant shred of subjective probability, people survived in part because they talked about it; even if that meant other people, the story’s antagonists, might possibly hypothetically panic.

But still, one should present the story-antagonists with an easy line of retreat. So -

Q6: Hey, this was posted on April 1st. All of this is just an April Fool’s joke, right?

A: Why, of course! Or rather, it’s a preview of what might be needful to say later, if matters really do get that desperate. You don’t want to drop that on people suddenly and with no warning.

Q6: Oh. Really? That would be such a relief!

A: Only you can decide whether to live in one mental world or the other.

Q6: Wait, now I’m confused. How do I decide which mental world to live in?

A: By figuring out what is true, and by allowing no other considerations than that to enter; that’s dignity.

Q6: But that doesn’t directly answer the question of which world I’m supposed to mentally live in! Can’t somebody just tell me that?

A: Well, conditional on you wanting somebody to tell you that, I’d remind you that many EAs hold that it is very epistemically unvirtuous to just believe what one person tells you, and not weight their opinion and mix it with the weighted opinions of others?

Lots of very serious people will tell you that AGI is thirty years away, and that’s plenty of time to turn things around, and nobody really knows anything about this subject matter anyways, and there’s all kinds of plans for alignment that haven’t been solidly refuted so far as they can tell.

I expect the sort of people who are very moved by that argument, to be happier, more productive, and less disruptive, living mentally in that world.

Q6: Thanks for answering my question! But aren’t I supposed to assign some small probability to your worldview being correct?

A: Conditional on you being the sort of person who thinks you’re obligated to do that and that’s the reason you should do it, I’d frankly rather you didn’t. Or rather, seal up that small probability in a safe corner of your mind which only tells you to stay out of the way of those gloomy people, and not get in the way of any hopeless plans they seem to have.

Q6: Got it. Thanks again!

A: You’re welcome! Goodbye and have fun!