# MIRI announces new “Death With Dignity” strategy

tl;dr: It’s obvious at this point that humanity isn’t going to solve the alignment problem, or even try very hard, or even go out with much of a fight. Since survival is unattainable, we should shift the focus of our efforts to helping humanity die with slightly more dignity.

Well, let’s be frank here. MIRI didn’t solve AGI alignment and at least knows that it didn’t. Paul Christiano’s incredibly complicated schemes have no chance of working in real life before DeepMind destroys the world. Chris Olah’s transparency work, at current rates of progress, will at best let somebody at DeepMind give a highly speculative warning about how the current set of enormous inscrutable tensors, inside a system that was recompiled three weeks ago and has now been training by gradient descent for 20 days, might possibly be planning to start trying to deceive its operators.

Whoever detected the warning sign will say that there isn’t anything known they can do about that. Just because you can see the system might be planning to kill you, doesn’t mean that there’s any known way to build a system that won’t do that. Management will then decide not to shut down the project—because it’s not certain that the intention was really there or that the AGI will really follow through, because other AGI projects are hard on their heels, because if all those gloomy prophecies are true then there’s nothing anybody can do about it anyways. Pretty soon that troublesome error signal will vanish.

When Earth’s prospects are that far underwater in the basement of the logistic success curve, it may be hard to feel motivated about continuing to fight, since doubling our chances of survival will only take them from 0% to 0%.

That’s why I would suggest reframing the problem—especially on an emotional level—to helping humanity die with dignity, or rather, since even this goal is realistically unattainable at this point, die with slightly more dignity than would otherwise be counterfactually obtained.

Consider the world if Chris Olah had never existed. It’s then much more likely that nobody will even try and fail to adapt Olah’s methodologies to try and read complicated facts about internal intentions and future plans, out of whatever enormous inscrutable tensors are being integrated a million times per second, inside of whatever recently designed system finished training 48 hours ago, in a vast GPU farm that’s already helpfully connected to the Internet.

It is more dignified for humanity—a better look on our tombstone—if we die after the management of the AGI project was heroically warned of the dangers but came up with totally reasonable reasons to go ahead anyways.

Or, failing that, if people made a heroic effort to do something that could maybe possibly have worked to generate a warning like that but couldn’t actually in real life because the latest tensors were in a slightly different format and there was no time to readapt the methodology. Compared to the much less dignified-looking situation if there’s no warning and nobody even tried to figure out how to generate one.

Or take MIRI. Are we sad that it looks like this Earth is going to fail? Yes. Are we sad that we tried to do anything about that? No, because it would be so much sadder, when it all ended, to face our ends wondering if maybe solving alignment would have just been as easy as buckling down and making a serious effort on it—not knowing if that would’ve just worked, if we’d only tried, because nobody had ever even tried at all. It wasn’t subjectively overdetermined that the (real) problems would be too hard for us, before we made the only attempt at solving them that would ever be made. Somebody needed to try at all, in case that was all it took.

It’s sad that our Earth couldn’t be one of the more dignified planets that makes a real effort, correctly pinpointing the actual real difficult problems and then allocating thousands of the sort of brilliant kids that our Earth steers into wasting their lives on theoretical physics. But better MIRI’s effort than nothing. What were we supposed to do instead, pick easy irrelevant fake problems that we could make an illusion of progress on, and have nobody out of the human species even try to solve the hard scary real problems, until everybody just fell over dead?

This way, at least, some people are walking around knowing why it is that if you train with an outer loss function that enforces the appearance of friendliness, you will not get an AI internally motivated to be friendly in a way that persists after its capabilities start to generalize far out of the training distribution...

To be clear, nobody’s going to listen to those people, in the end. There will be more comforting voices that sound less politically incongruent with whatever agenda needs to be pushed forward that week. Or even if that ends up not so, this isn’t primarily a social-political problem, of just getting people to listen. Even if DeepMind listened, and Anthropic knew, and they both backed off from destroying the world, that would just mean Facebook AI Research destroyed the world a year(?) later.

But compared to being part of a species that walks forward completely oblivious into the whirling propeller blades, with nobody having seen it at all or made any effort to stop it, it is dying with a little more dignity, if anyone knew at all. You can feel a little incrementally prouder to have died as part of a species like that, if maybe not proud in absolute terms.

If there is a stronger warning, because we did more transparency research? If there’s deeper understanding of the real dangers and those come closer to beating out comfortable nonrealities, such that DeepMind and Anthropic really actually back off from destroying the world and let Facebook AI Research do it instead? If they try some hopeless alignment scheme whose subjective success probability looks, to the last sane people, more like 0.1% than 0? Then we have died with even more dignity! It may not get our survival probabilities much above 0%, but it would be so much more dignified than the present course looks to be!

Now of course the real subtext here, is that if you can otherwise set up the world so that it looks like you’ll die with enough dignity—die of the social and technical problems that are really unavoidable, after making a huge effort at coordination and technical solutions and failing, rather than storming directly into the whirling helicopter blades as is the present unwritten plan -

- heck, if there was even a plan at all -

- then maybe possibly, if we’re wrong about something fundamental, somehow, somewhere -

- in a way that makes things easier rather than harder, because obviously we’re going to be wrong about all sorts of things, it’s a whole new world inside of AGI -

- although, when you’re fundamentally wrong about rocketry, this does not usually mean your rocket prototype goes exactly where you wanted on the first try while consuming half as much fuel as expected; it means the rocket explodes earlier yet, and not in a way you saw coming, being as wrong as you were -

- but if we get some miracle of unexpected hope, in those unpredicted inevitable places where our model is wrong -

- then our ability to take advantage of that one last hope, will greatly depend on how much dignity we were set to die with, before then.

If we can get on course to die with enough dignity, maybe we won’t die at all...?

In principle, yes. Let’s be very clear, though: Realistically speaking, that is not how real life works.

It’s possible for a model error to make your life easier. But you do not get more surprises that make your life easy, than surprises that make your life even more difficult. And people do not suddenly become more reasonable, and make vastly more careful and precise decisions, as soon as they’re scared. No, not even if it seems to you like their current awful decisions are weird and not-in-the-should-universe, and surely some sharp shock will cause them to snap out of that weird state into a normal state and start outputting the decisions you think they should make.

So don’t get your heart set on that “not die at all” business. Don’t invest all your emotion in a reward you probably won’t get. Focus on dying with dignity—that is something you can actually obtain, even in this situation. After all, if you help humanity die with even one more dignity point, you yourself die with one hundred dignity points! Even if your species dies an incredibly undignified death, for you to have helped humanity go down with even slightly more of a real fight, is to die an extremely dignified death.

“Wait, dignity points?” you ask. “What are those? In what units are they measured, exactly?”

And to this I reply: Obviously, the measuring units of dignity are over humanity’s log odds of survival—the graph on which the logistic success curve is a straight line. A project that doubles humanity’s chance of survival from 0% to 0% is helping humanity die with one additional information-theoretic bit of dignity.
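As a hypothetical illustration (the post itself gives no actual numbers), the conversion from survival probability to log-odds dignity measured in bits can be sketched as follows; the specific probabilities below are invented for demonstration:

```python
import math

def log_odds_bits(p):
    """Log odds of probability p, measured in bits (base-2 logarithm)."""
    return math.log2(p / (1 - p))

def dignity_gained(p_before, p_after):
    """Bits of dignity: the change in humanity's log odds of survival."""
    return log_odds_bits(p_after) - log_odds_bits(p_before)

# A project that doubles humanity's odds of survival contributes one bit
# of dignity, even when both probabilities round to 0% on a linear scale:
print(dignity_gained(0.000001, 0.000002))  # ~1.0 bit
```

This is why the log-odds graph is the one on which the logistic success curve is a straight line: a doubling of the odds moves you the same distance along it no matter how deep in the basement you start.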

But if enough people can contribute enough bits of dignity like that, wouldn’t that mean we didn’t die at all? Yes, but again, don’t get your hopes up. Don’t focus your emotions on a goal you’re probably not going to obtain. Realistically, we find a handful of projects that contribute a few more bits of counterfactual dignity; get a bunch more not-specifically-expected bad news that makes the first-order object-level situation look even worse (where to second order, of course, the good Bayesians already knew that was how it would go); and then we all die.

With a technical definition in hand of what exactly constitutes dignity, we may now consider some specific questions about what does and doesn’t constitute dying with dignity.

Q1: Does ‘dying with dignity’ in this context mean accepting the certainty of your death, and not childishly regretting that or trying to fight a hopeless battle?

Don’t be ridiculous. How would that increase the log odds of Earth’s survival?

My utility function isn’t up for grabs, either. If I regret my planet’s death then I regret it, and it’s beneath my dignity to pretend otherwise.

That said, I fought hardest while it looked like we were in the more sloped region of the logistic success curve, when our survival probability seemed more around the 50% range; I borrowed against my future to do that, and burned myself out to some degree. That was a deliberate choice, which I don’t regret now; it was worth trying, I would not have wanted to die having not tried, I would not have wanted Earth to die without anyone having tried. But yeah, I am taking some time partways off, and trying a little less hard, now. I’ve earned a lot of dignity already; and if the world is ending anyways and I can’t stop it, I can afford to be a little kind to myself about that.

When I tried hard and burned myself out some, it was with the understanding, within myself, that I would not keep trying to do that forever. We cannot fight at maximum all the time, and some times are more important than others. (Namely, when the logistic success curve seems relatively more sloped; those times are relatively more important.)

All that said: If you fight marginally longer, you die with marginally more dignity. Just don’t undignifiedly delude yourself about the probable outcome.

Q2: I have a clever scheme for saving the world! I should act as if I believe it will work and save everyone, right, even if there’s arguments that it’s almost certainly misguided and doomed? Because if those arguments are correct and my scheme can’t work, we’re all dead anyways, right?

A: No! That’s not dying with dignity! That’s stepping sideways out of a mentally uncomfortable world and finding an escape route from unpleasant thoughts! If you condition your probability models on a false fact, something that isn’t true on the mainline, it means you’ve mentally stepped out of reality and are now living somewhere else instead.

There are more elaborate arguments against the rationality of this strategy, but consider this quick heuristic for arriving at the correct answer: That’s not a dignified way to die. Death with dignity means going on mentally living in the world you think is reality, even if it’s a sad reality, until the end; not abandoning your arts of seeking truth; dying with your commitment to reason intact.

You should try to make things better in the real world, where your efforts aren’t enough and you’re going to die anyways; not inside a fake world you can save more easily.

Q2: But what’s wrong with the argument from expected utility, saying that all of humanity’s expected utility lies within possible worlds where my scheme turns out to be feasible after all?

A: Most fundamentally? That’s not what the surviving worlds look like. The surviving worlds look like people who lived inside their awful reality and tried to shape up their impossible chances; until somehow, somewhere, a miracle appeared—the model broke in a positive direction, for once, as does not usually occur when you are trying to do something very difficult and hard to understand, but might still be so—and they were positioned with the resources and the sanity to take advantage of that positive miracle, because they went on living inside uncomfortable reality. Positive model violations do ever happen, but it’s much less likely that somebody’s specific desired miracle that “we’re all dead anyways if not...” will happen; these people have just walked out of the reality where any actual positive miracles might occur.

Also and in practice? People don’t just pick one comfortable improbability to condition on. They go on encountering unpleasant facts true on the mainline, and each time saying, “Well, if that’s true, I’m doomed, so I may as well assume it’s not true,” and they say more and more things like this. If you do this it very rapidly drives down the probability mass of the ‘possible’ world you’re mentally inhabiting. Pretty soon you’re living in a place that’s nowhere near reality. If there were an expected utility argument for risking everything on an improbable assumption, you’d get to make exactly one of them, ever. People using this kind of thinking usually aren’t even keeping track of when they say it, let alone counting the occasions.

Also also, in practice? In domains like this one, things that seem to first-order like they “might” work… have essentially no chance of working in real life, to second-order after taking into account downward adjustments against optimism. AGI is a scientifically unprecedented experiment and a domain with lots of optimization pressures, some of which work against you, and unforeseeable intelligently selected execution pathways, and a small target to hit, and all sorts of extreme forces that break things and that you couldn’t fully test before facing them. AGI alignment seems like it’s blatantly going to be an enormously Murphy-cursed domain, like rocket prototyping or computer security but worse.

In a domain like this one, if you have a clever scheme for winning anyways that, to first-order theory, totally definitely seems like it should work, even to Eliezer Yudkowsky rather than somebody who just goes around saying that casually, then maybe there’s like a 50% chance of it working in practical real life after all the unexpected disasters and things turning out to be harder than expected.

If to first-order it seems to you like something in a complicated unknown untested domain has a 40% chance of working, it has a 0% chance of working in real life.

Also also also in practice? Harebrained schemes of this kind are usually actively harmful. Because they’re invented by the sort of people who’ll come up with an unworkable scheme, and then try to get rid of counterarguments with some sort of dismissal like “Well if not then we’re all doomed anyways.”

If nothing else, this kind of harebrained desperation drains off resources from those reality-abiding efforts that might try to do something on the subjectively apparent doomed mainline, and so position themselves better to take advantage of unexpected hope, which is what the surviving possible worlds mostly look like.

The surviving worlds don’t look like somebody came up with a harebrained scheme, dismissed all the obvious reasons it wouldn’t work with “But we have to bet on it working,” and then it worked.

That’s the elaborate argument about what’s rational in terms of expected utility, once reasonable second-order commonsense adjustments are taken into account. Note, however, that if you have grasped the intended emotional connotations of “die with dignity”, it’s a heuristic that yields the same answer much faster. It’s not dignified to pretend we’re less doomed than we are, or step out of reality to live somewhere else.

Q3: Should I scream and run around and go through the streets wailing of doom?

A: No, that’s not very dignified. Have a private breakdown in your bedroom, or a breakdown with a trusted friend, if you must.

Q3: Why is that bad from a coldly calculating expected utility perspective, though?

A: Because it associates belief in reality with people who act like idiots and can’t control their emotions, which worsens our strategic position in possible worlds where we get an unexpected hope.

Q4: Should I lie and pretend everything is fine, then? Keep everyone’s spirits up, so they go out with a smile, unknowing?

A: That also does not seem to me to be dignified. If we’re all going to die anyways, I may as well speak plainly before then. If into the dark we must go, let’s go there speaking the truth, to others and to ourselves, until the end.

Q4: Okay, but from a coldly calculating expected utility perspective, why isn’t it good to lie to keep everyone calm? That way, if there’s an unexpected hope, everybody else will be calm and oblivious and not interfering with us out of panic, and my faction will have lots of resources that they got from lying to their supporters about how much hope there was! Didn’t you just say that people screaming and running around while the world was ending would be unhelpful?

A: You should never try to reason using expected utilities again. It is an art not meant for you. Stick to intuitive feelings henceforth.

There are, I think, people whose minds readily look for and find even the slightly-less-than-totally-obvious considerations of expected utility, what some might call “second-order” considerations. Ask them to rob a bank and give the money to the poor, and they’ll think spontaneously and unprompted about insurance costs of banking and the chance of getting caught and reputational repercussions and low-trust societies and what if everybody else did that when they thought it was a good cause; and all of these considerations will be obviously-to-them consequences under consequentialism.

These people are well-suited to being ‘consequentialists’ or ‘utilitarians’, because their mind naturally sees all the consequences and utilities, including those considerations that others might be tempted to call by names like “second-order” or “categorical” and so on.

If you ask them why consequentialism doesn’t say to rob banks, they reply, “Because that actually realistically in real life would not have good consequences. Whatever it is you’re about to tell me as a supposedly non-consequentialist reason why we all mustn’t do that, seems to you like a strong argument, exactly because you recognize implicitly that people robbing banks would not actually lead to happy formerly-poor people and everybody living cheerfully ever after.”

Others, if you suggest to them that they should rob a bank and give the money to the poor, will be able to see the helped poor as a “consequence” and a “utility”, but they will not spontaneously and unprompted see all those other considerations in the formal form of “consequences” and “utilities”.

If you just asked them informally whether it was a good or bad idea, they might ask “What if everyone did that?” or “Isn’t it good that we can live in a society where people can store and transmit money?” or “How would it make effective altruism look, if people went around doing that in the name of effective altruism?” But if you ask them about consequences, they don’t spontaneously, readily, intuitively classify all these other things as “consequences”; they think that their mind is being steered onto a kind of formal track, a defensible track, a track of stating only things that are very direct or blatant or obvious. They think that the rule of consequentialism is, “If you show me a good consequence, I have to do that thing.”

If you present them with bad things that happen if people rob banks, they don’t see those as also being ‘consequences’. They see them as arguments against consequentialism; since, after all, consequentialism says to rob banks, which obviously leads to bad stuff, and so bad things would end up happening if people were consequentialists. They do not do a double-take and say “What?” That consequentialism leads people to do bad things with bad outcomes is just a reasonable conclusion, so far as they can tell.

People like this should not be ‘consequentialists’ or ‘utilitarians’ as they understand those terms. They should back off from this form of reasoning that their mind is not naturally well-suited for processing in a native format, and stick to intuitively informally asking themselves what’s good or bad behavior, without any special focus on what they think are ‘outcomes’.

If they try to be consequentialists, they’ll end up as Hollywood villains describing some grand scheme that violates a lot of ethics and deontology but sure will end up having grandiose benefits, yup, even while everybody in the audience knows perfectly well that it won’t work. You can only safely be a consequentialist if you’re genre-savvy about that class of arguments—if you’re not the blind villain on screen, but the person in the audience watching who sees why that won’t work.

Q4: I know EAs shouldn’t rob banks, so this obviously isn’t directed at me, right?

A: The people of whom I speak will look for and find the reasons not to do it, even if they’re in a social environment that doesn’t have strong established injunctions against bank-robbing specifically. They’ll figure it out even if you present them with a new problem isomorphic to bank-robbing but with the details changed.

Which is basically what you just did, in my opinion.

Q4: But from the standpoint of cold-blooded calculation -

A: Calculations are not cold-blooded. What blood we have in us, warm or cold, is something we can learn to see more clearly with the light of calculation.

If you think calculations are cold-blooded, that they only shed light on cold things or make them cold, then you shouldn’t do them. Stay by the warmth in a mental format where warmth goes on making sense to you.

Q4: Yes yes fine fine but what’s the actual downside from an expected-utility standpoint?

A: If good people were liars, that would render the words of good people meaningless as information-theoretic signals, and destroy the ability for good people to coordinate with others or among themselves.

If the world can be saved, it will be saved by people who didn’t lie to themselves, and went on living inside reality until some unexpected hope appeared there.

If those people went around lying to others and paternalistically deceiving them—well, mostly, I don’t think they’ll have really been the types to live inside reality themselves. But even imagining the contrary, good luck suddenly unwinding all those deceptions and getting other people to live inside reality with you, to coordinate on whatever suddenly needs to be done when hope appears, after you drove them outside reality before that point. Why should they believe anything you say?

Q4: But wouldn’t it be more clever to -

A: Stop. Just stop. This is why I advised you to reframe your emotional stance as dying with dignity.

Maybe there’d be an argument about whether or not to violate your ethics if the world was actually going to be saved at the end. But why break your deontology if it’s not even going to save the world? Even if you have a price, should you be that cheap?

Q4 But we could maybe save the world by lying to everyone about how much hope there was, to gain resources, until -

A: You’re not getting it. Why violate your deontology if it’s not going to really actually save the world in real life, as opposed to a pretend theoretical thought experiment where your actions have only beneficial consequences and none of the obvious second-order detriments?

It’s relatively safe to be around an Eliezer Yudkowsky while the world is ending, because he’s not going to do anything extreme and unethical unless it would really actually save the world in real life, and there are no extreme unethical actions that would really actually save the world the way these things play out in real life, and he knows that. He knows that the next stupid sacrifice-of-ethics proposed won’t work to save the world either, actually in real life. He is a ‘pessimist’ - that is, a realist, a Bayesian who doesn’t update in a predictable direction, a genre-savvy person who knows what the viewer would say if there were a villain on screen making that argument for violating ethics. He will not, like a Hollywood villain onscreen, be deluded into thinking that some clever-sounding deontology-violation is bound to work out great, when everybody in the audience watching knows perfectly well that it won’t.

My ethics aren’t for sale at the price point of failure. So if it looks like everything is going to fail, I’m a relatively safe person to be around.

I’m a genre-savvy person about this genre of arguments and a Bayesian who doesn’t update in a predictable direction. So if you ask, “But Eliezer, what happens when the end of the world is approaching, and in desperation you cling to whatever harebrained scheme has Goodharted past your filters and presented you with a false shred of hope; what then will you do?”—I answer, “Die with dignity.” Where “dignity” in this case means knowing perfectly well that’s what would happen to some less genre-savvy person; and my choosing to do something else which is not that. But “dignity” yields the same correct answer and faster.

Q5: “Relatively” safe?

A: It’d be disingenuous to pretend that it wouldn’t be even safer to hang around somebody who had no clue what was coming, didn’t know any mental motions for taking a worldview seriously, thought it was somebody else’s problem to ever do anything, and would just cheerfully party with you until the end.

Within the class of people who know the world is ending and consider it to be their job to do something about that, Eliezer Yudkowsky is a relatively safe person to be standing next to. At least, before you both die anyways, as is the whole problem there.

Q5: Some of your self-proclaimed fans don’t strike me as relatively safe people to be around, in that scenario?

A: I failed to teach them whatever it is I know. Had I known then what I know now, I would have warned them not to try.

If you insist on putting it into terms of fandom, though, feel free to notice that Eliezer Yudkowsky is much closer to being a typical liberaltarian science-fiction fan, as was his own culture that actually birthed him, than he is a typical member of any subculture that might have grown up later. Liberaltarian science-fiction fans do not usually throw away all their ethics at the first sign of trouble. They grew up reading books where those people were the villains.

Please don’t take this as a promise from me to play nice, as you define niceness; the world is ending, and also people have varying definitions of what is nice. But I presently mostly expect to end up playing nice, because there won’t be any options worth playing otherwise.

It is a matter of some concern to me that all this seems to be an alien logic to some strange people who—this fact is still hard for me to grasp on an emotional level—don’t spontaneously generate all of this reasoning internally, as soon as confronted with the prompt. Alas.

Q5: Then isn’t it unwise to speak plainly of these matters, when fools may be driven to desperation by them? What if people believe you about the hopeless situation, but refuse to accept that conducting themselves with dignity is the appropriate response?

A: I feel like I’ve now tried to live my life that way for a while, by the dictum of not panicking people; and, like everything else I’ve tried, that hasn’t particularly worked? There are no plans left to avoid disrupting, now, with other people’s hypothetical panic.

I think we die with slightly more dignity—come closer to surviving, as we die—if we are allowed to talk about these matters plainly. Even given that people may then do unhelpful things, after being driven mad by overhearing sane conversations. I think we die with more dignity that way, than if we go down silent and frozen and never talking about our impending death for fear of being overheard by people less sane than ourselves.

I think that in the last surviving possible worlds with any significant shred of subjective probability, people survived in part because they talked about it; even if that meant other people, the story’s antagonists, might possibly hypothetically panic.

But still, one should present the story-antagonists with an easy line of retreat. So -

Q6: Hey, this was posted on April 1st. All of this is just an April Fool’s joke, right?

A: Why, of course! Or rather, it’s a preview of what might be needful to say later, if matters really do get that desperate. You don’t want to drop that on people suddenly and with no warning.

Q6: Oh. Really? That would be such a relief!

A: Only you can decide whether to live in one mental world or the other.

Q6: Wait, now I’m confused. How do I decide which mental world to live in?

A: By figuring out what is true, and by allowing no other considerations than that to enter; that’s dignity.

Q6: But that doesn’t directly answer the question of which world I’m supposed to mentally live in! Can’t somebody just tell me that?

A: Well, conditional on you wanting somebody to tell you that, I’d remind you that many EAs hold that it is very epistemically unvirtuous to just believe what one person tells you, and not weight their opinion and mix it with the weighted opinions of others?

Lots of very serious people will tell you that AGI is thirty years away, and that’s plenty of time to turn things around, and nobody really knows anything about this subject matter anyways, and there’s all kinds of plans for alignment that haven’t been solidly refuted so far as they can tell.

I expect the sort of people who are very moved by that argument, to be happier, more productive, and less disruptive, living mentally in that world.

Q6: Thanks for answering my question! But aren’t I supposed to assign some small probability to your worldview being correct?

A: Conditional on you being the sort of person who thinks you’re obligated to do that and that’s the reason you should do it, I’d frankly rather you didn’t. Or rather, seal up that small probability in a safe corner of your mind which only tells you to stay out of the way of those gloomy people, and not get in the way of any hopeless plans they seem to have.

Q6: Got it. Thanks again!

A: You’re welcome! Goodbye and have fun!

• That’s great and all, but with all due respect:

Fuck. That. Noise.

Regardless of the odds of success and what the optimal course of action actually is, I would be very hard pressed to say that I’m trying to “help humanity die with dignity”. Regardless of what the optimal action should be given that goal, on an emotional level, it’s tantamount to giving up.

Before even getting into the cost/​benefit of that attitude, in the worlds where we do make it out alive, I don’t want to look back and see a version of me where that became my goal. I also don’t think that, if that were my goal, I would fight nearly as hard to achieve it. I want a catgirl volcano lair, not “dignity”. So when I try to negotiate with my money brain to expend precious calories, the plan had better involve the former, not the latter. I suspect that something similar applies to others.

I don’t want to hear about genre-savviness from the de facto founder of the community that gave us HPMOR!Harry and the Comet King, not after he wrote this post. It’s so antithetical to the attitude present in those characters and in posts like this one.

I also don’t want to hear about second-order effects when, as best as I can tell, the attitude present here is likely to push people towards ineffective doomerism, rather than actually dying with dignity.

So instead, I’m gonna think carefully about my next move, come up with a plan, blast some shonen anime OSTs, and get to work. Then, amongst all the counterfactual worlds, there will be a version of me that gets to look back and know that they faced the end of the world, rose to the challenge, and came out the other end having carved utopia out of the bones of Lovecraftian gods.

• I think there’s an important point about locus of control and scope. You can imagine someone who, early in life, decides that their life’s work will be to build a time machine, because the value of doing so is immense (turning an otherwise finite universe into an infinite one, for example). As time goes on, they notice being more and more pessimistic about their prospects of doing that, but have some block against giving up on an emotional level. The stakes are too high for doomerism to be entertained!

But I think they overestimated their locus of control when making their plans, and they should have updated as evidence came in. If they reduced the scope of their ambitions, they might switch from plans that are crazy because they have to condition on time travel being possible to plans that are sane (because they can condition on actual reality). Maybe they just invent flying cars instead of time travel, or whatever.

I see this post as saying: “look, people interested in futurism: if you want to live in reality, this is where the battle line actually is. Fight your battles there, don’t send bombing runs behind miles of anti-air defenses and wonder why you don’t seem to be getting any hits.” Yes, knowing the actual state of the battlefield might make people less interested in fighting in the war, but especially for intellectual wars it doesn’t make sense to lie to maintain morale.

[In particular, lies of the form “alignment is easy!” work both to attract alignment researchers and convince AI developers and their supporters that developing AI is good instead of world-ending, because someone else is handling the alignment bit.]

• lies of the form “alignment is easy!”

Aside: Regardless of whether the quoted claim is true, it does not seem like a prototypical lie. My read of your meaning is: “If you [the hypothetical person claiming alignment is easy] were an honest reasoner and worked out the consequences of what you know, you would not believe that alignment is easy; thusly has an inner deception blossomed into an outer deception; thus I call your claim a ‘lie.’”

And under that understanding of what you mean, Vaniver, I think yours is not a wholly inappropriate usage, but rather unconventional. In its unconventionality, I think it implies untruths about the intentions of the claimants. (Namely, that they semi-consciously seek to benefit by spreading a claim they know to be false on some level.) In your shoes, I think I would have just called it an “untruth” or “false claim.”

Edit: I now think you might have been talking about EY’s hypothetical questioners who thought it valuable to purposefully deceive about the problem’s difficulty, and not about the typical present-day person who believes alignment is easy?

• Edit: I now think you might have been talking about EY’s hypothetical questioners who thought it valuable to purposefully deceive about the problem’s difficulty, and not about the typical present-day person who believes alignment is easy?

That is what I was responding to.

• “To win any battle, you must fight as if you are already dead.” — Miyamoto Musashi.

I don’t in fact personally know we won’t make it. This may be because I’m more ignorant than Eliezer, or may be because he (or his April first identity, I guess) is overconfident on a model, relative to me; it’s hard to tell.

Regardless, the bit about “don’t get psychologically stuck having-to-(believe/​pretend)-it’s-gonna-work” seems really sane and healthy to me. Like falling out of an illusion and noticing your feet on the ground. The ground is a more fun and joyful place to live, even when things are higher probability of death than one is used to acting-as-though, in my limited experience. More access to creativity near the ground, I think.

But, yes, I can picture things under the heading “ineffective doomerism” that seem to me like they suck. Like, still trying to live in an ego-constructed illusion of deferral, and this time with “and we die” pasted on it, instead of “and we definitely live via such-and-such a plan.”

• I think I have more access to all of my emotional range nearer the ground, but this sentence doesn’t ring true to me.

The ground is a more fun and joyful place to live, even when things are higher probability of death than one is used to acting-as-though, in my limited experience.

• Hm. It rings true to me, but there have been periods of my life where it has been false.

As cheesy as it is, this is the correct response. I’m a little disappointed that Eliezer would resort to doomposting like this, but at the same time it’s to be expected from him after some point. The people with remaining energy need to understand his words are also serving a personal therapeutic purpose and press on.

• Some people can think there’s next to no chance and yet go out swinging. I plan to, if I reach the point of feeling hopeless.

• Yeah—I love AI_WAIFU’s comment, but I love the OP too.

To some extent I think these are just different strategies that will work better for different people; both have failure modes, and Eliezer is trying to guard against the failure modes of ‘Fuck That Noise’ (e.g., losing sight of reality), while AI_WAIFU is trying to guard against the failure modes of ‘Try To Die With More Dignity’ (e.g., losing motivation).

My general recommendation to people would be to try different framings /​ attitudes out and use the ones that empirically work for them personally, rather than trying to have the same lens as everyone else. I’m generally a skeptic of advice, because I think people vary a lot; so I endorse the meta-advice that you should be very picky about which advice you accept, and keep in mind that you’re the world’s leading expert on yourself. (Or at least, you’re in the best position to be that thing.)

Cf. ‘Detach the Grim-o-Meter’ versus ‘Try to Feel the Emotions that Match Reality’. Both are good advice in some contexts, for some people; but I think there’s some risk from taking either strategy too far, especially if you aren’t aware of the other strategy as a viable option.

• Please correct me if I am wrong, but a huge difference between Eliezer’s post and AI_WAIFU’s comment is that Eliezer’s post is informed by conversations with dozens of people about the problem.

I interpreted AI_WAIFU as pushing back against a psychological claim (‘X is the best attitude for mental clarity, motivation, etc.’), not as pushing back against an AI-related claim like P(doom). Are you interpreting them as disagreeing about P(doom)? (If not, then I don’t understand your comment.)

If (counterfactually) they had been arguing about P(doom), I’d say: I don’t know AI_WAIFU’s level of background. I have a very high opinion of Eliezer’s thinking about AI (though keep in mind that I’m a co-worker of his), but EY is still some guy who can be wrong about things, and I’m interested to hear counter-arguments against things like P(doom). AGI forecasting and alignment are messy, pre-paradigmatic fields, so I think it’s easier for field founders and authorities to get stuff wrong than it would be in a normal scientific field.

The specific claim that Eliezer’s P(doom) is “informed by conversations with dozens of people about the problem” (if that’s what you were claiming) seems off to me. Like, it may be technically true under some interpretation, but (a) I think of Eliezer’s views as primarily based on his own models, (b) I’d tentatively guess those models are much more based on things like ‘reading textbooks’ and ‘thinking things through himself’ than on ‘insights gleaned during back-and-forth discussions with other people’, and (c) most people working full-time on AI alignment have far lower P(doom) than Eliezer.

• Sorry for the lack of clarity. I share Eliezer’s pessimism about the global situation (caused by rapid progress in AI). All I meant is that I see signs in his writings that over the last 15 years Eliezer has spent many hours trying to help at least a dozen different people become effective at trying to improve the horrible situation we are currently in. That work experience makes me pay much greater attention to him on the subject at hand than someone I know nothing about.

• Ah, I see. I think Eliezer has lots of relevant experience and good insights, but I still wouldn’t currently recommend the ‘Death with Dignity’ framing to everyone doing good longtermist work, because I just expect different people’s minds to work very differently.

• Assuming this is correct (certainly it is of Eliezer, though I don’t know AI_WAIFU’s background and perhaps they have had similar conversations), does it matter? WAIFU’s point is that we should continue trying as a matter of our terminal values; that’s not something that can be wrong due to the problem being difficult.

• I agree, but do not perceive Eliezer as having stopped trying or as advising others to stop trying, er, except of course for the last section of this post (“Q6: . . . All of this is just an April Fool’s joke, right?”) but that is IMHO addressed to a small fraction of his audience.

• I don’t want to speak for him (especially when he’s free to clarify himself far better than we could do for him!), but dying with dignity conveys an attitude that might be incompatible with actually winning. Maybe not; sometimes abandoning the constraint that you have to see a path to victory makes it easier to do the best you can. But it feels concerning on an instinctive level.

• In my experience, most people cannot.

• I think both emotions are helpful at motivating me.

• I think I’m more motivated by the thought that I am going to die soon, any children I might have in the future will die soon, my family, my friends, and their children are going to die soon, and any QALYs I think I’m buying are around 40% as valuable as I thought, more than undoing the income tax deduction I get for them.

It seems like wrangling my ADHD brain into looking for ways to prevent catastrophe could be more worthwhile than working a high-paid job I can currently hyper-focus on (and probably more virtuous, too), unless I find that the probability of success is literally 0% despite what I think I know about Bayesian reasoning, in which case I’ll probably go into art or something.

• Makes me think of the following quote. I’m not sure how much I agree with or endorse it, but it’s something to think about.

The good fight has its own values. That it must end in irrevocable defeat is irrelevant.

— Isaac Asimov, It’s Been A Good Life

• Exquisitely based

• Agreed. Also here’s the poem that goes with that comment:

Do not go gentle into that good night,
Old age should burn and rave at close of day;
Rage, rage against the dying of the light.

Though wise men at their end know dark is right,
Because their words had forked no lightning they
Do not go gentle into that good night.

Good men, the last wave by, crying how bright
Their frail deeds might have danced in a green bay,
Rage, rage against the dying of the light.

Wild men who caught and sang the sun in flight,
And learn, too late, they grieved it on its way,
Do not go gentle into that good night.

Grave men, near death, who see with blinding sight
Blind eyes could blaze like meteors and be gay,
Rage, rage against the dying of the light.

And you, my father, there on the sad height,
Curse, bless, me now with your fierce tears, I pray.
Do not go gentle into that good night.
Rage, rage against the dying of the light.

I totally empathize with Eliezer, and I’m afraid that I might be similarly burned out if I had been trying this for as long.

But that’s not who I want to be. I want to be Harry who builds a rocket to escape Azkaban, the little girl that faces the meteor with a baseball bat, and the general who empties his guns into the sky against another meteor (minus all his racism and shit).

I bet I won’t always have the strength for that – but that’s the goal.

• Harry Potter and the Comet King have access to magic; we don’t.

… is the obvious response, but the correct response is actually:

Harry Potter and the Comet King don’t exist, so what attitude is present in those characters is irrelevant to the question of what attitude we, in reality, ought to have.

Most fictional characters are optimized to make for entertaining stories, which is why “generalizing from fictional evidence” is usually a failure mode. The HPMOR Harry and the Comet King were optimized by two rationalists as examples of rationalist heroes — and are active in allegorical situations engineered to say something that rationalists would find to be “of worth” about real world problems.

They are appealing precisely because they encode assumptions about what a real-world, rationalist “hero” ought to be like. Or at least, that’s the hope. So, they can be pointed to as “theses” about the real world by Yudkowsky and Alexander, no different from blog posts that happen to be written as allegorical stories, and if people found the ideas encoded in those characters more convincing than the ideas encoded in the present April Fools’ Day post, that’s fair enough.

Not necessarily correct on the object-level, but, if it’s wrong, it’s a different kind of error from garden-variety “generalizing from fictional evidence”.

• +1

• As fictional characters popular among humans, what attitude is present in them is evidence for what sort of attitude humans like to see or inhabit. As author of those characters, Yudkowsky should be aware of this mechanism. And empirically, people with accurate beliefs and positive attitudes outperform people with accurate beliefs and negative attitudes. It seems plausible Yudkowsky is aware of this as well.

“Death with dignity” reads as an unnecessarily negative attitude to accompany the near-certainty of doom. Heroism, maximum probability of catgirls, or even just raw log-odds-of-survival seem like they would be more motivating than dignity without sacrificing accuracy.

Like, just substitute all instances of ‘dignity’ in the OP with ‘heroism’ and naively I would expect this post to have a better impact(/​be more dignified/​be more heroic), except insofar it might give a less accurate impression of Yudkowsky’s mood. But few people have actually engaged with him on that front.

• You have a money brain? That’s awesome, most of us only have monkey brains! 🙂

• Why the downboats? People new to LW jargon probably wouldn’t realize “money brain” is a typo.

• Seemed like a bit of a rude way to let someone know they had a typo, I would have just gone with “Typo: money brain should be monkey brain”.

• Here we see irrationality for what it is: a rational survival mechanism.

• Seeing this post get so strongly upvoted makes me feel like I’m going crazy.

This is not the kind of content I want on LessWrong. I did not enjoy it, I do not think it will lead me to be happier or more productive toward reducing x-risk, I don’t see how it would help others, and it honestly doesn’t even seem like a particularly well done version of itself.

Can people help me understand why they upvoted?

For whatever it is worth, this post, along with reading the unworkable alignment strategy in the ELK report, has made me realize that we actually have no idea what to do, and has finally convinced me to try to solve alignment. I encourage everyone else to do the same. For some people, knowing that the world is doomed by default and that we can’t just expect the experts to save it is motivating. If that was his goal, he achieved it.

Certainly for some people (including you!), yes. For others, I expect this post to be strongly demotivating. That doesn’t mean it shouldn’t have been written (I value honestly conveying personal beliefs and expressing diversity of opinion enough to outweigh the downsides), but we should realistically expect this post to cause psychological harm for some people, and it could also potentially make interaction and PR with those who don’t share Yudkowsky’s views harder. Despite some claims to the contrary, I believe (through personal experience in PR) that expressing radical honesty is not strongly valued outside the rationalist community, and that interaction with non-rationalists can be extremely important, even to potentially world-saving levels. Yudkowsky, for all of his incredible talent, is frankly terrible at PR (at least historically), and may not be giving proper weight to its value as a world-saving tool. I’m still thinking through the details of Yudkowsky’s claims, but expect me to write a post here in the near future giving my perspective in more detail.

• I don’t think “Eliezer is terrible at PR” is a very accurate representation of historical fact. It might be a good representation of something else. But it seems to me that deleting Eliezer from the timeline would probably result in a world where far far fewer people were convinced of the problem. Admittedly, such questions are difficult to judge.

I think “Eliezer is bad at PR” rings true in the sense that he belongs in the cluster of “bad at PR”; you’ll make more correct inferences about Eliezer if you cluster him that way. But on historical grounds, he seems good at PR.

• Eliezer is “bad at PR” in the sense that there are lots of people who don’t like him. But that’s mostly irrelevant. The people who do like him like him enough to donate to his foundation and all of the foundations he inspired.

• It’s the people who don’t like him (and are also intelligent and in positions of power), which I’m concerned with in this context. We’re dealing with problems where even a small adversarial group can do a potentially world-ending amount of harm, and that’s pretty important to be able to handle!

• My personal experience is that the people who actively dislike Eliezer are specifically the people who were already set on their path; they dislike Eliezer mostly because he’s telling them to get off that path.

I could be wrong, however; my personal experience is undoubtedly very biased.

• I’ll tell you that one of my brothers (who I greatly respect) has decided not to be concerned about AGI risks specifically because he views EY as being a very respected “alarmist” in the field (which is basically correct), and also views EY as giving off extremely “culty” and “obviously wrong” vibes (with Roko’s Basilisk and EY’s privacy around the AI boxing results being the main examples given), leading him to conclude that it’s simply not worth engaging with the community (and their arguments) in the first place. I wouldn’t personally engage with what I believe to be a doomsday cult (even if they claim that the risk of ignoring them is astronomically high), so I really can’t blame him.

I’m also aware of an individual who has enormous cultural influence, and was interested in rationalism, but heard from an unnamed researcher at Google that the rationalist movement is associated with the alt-right, so they didn’t bother looking further. (Yes, that’s an incorrect statement, but it came from the widespread [possibly correct?] belief that Peter Thiel is both alt-right and has/​had close ties with many prominent rationalists.) This indicates a general lack of control of the narrative surrounding the movement, and likely has directly led to needlessly antagonistic relationships.

• That’s putting it mildly.

The problems are well known. The mystery is why the community doesn’t implement obvious solutions. Hiring PR people is an obvious solution. There’s a posting somewhere in which Anna Salamon argues that there is some sort of moral hazard involved in professional PR, but never explains why, and everyone agrees with her anyway.

If the community really and literally is about saving the world, then having a constant stream of people who are put off, or even becoming enemies is incrementally making the world more likely to be destroyed. So surely it’s an important problem to solve? Yet the community doesn’t even like discussing it. It’s as if maintaining some sort of purity, or some sort of impression that you don’t make mistakes is more important than saving the world.

• Presumably you mean this post.

If the community really and literally is about saving the world, then having a constant stream of people who are put off, or even becoming enemies is incrementally making the world more likely to be destroyed. So surely it’s an important problem to solve? Yet the community doesn’t even like discussing it. It’s as if maintaining some sort of purity, or some sort of impression that you don’t make mistakes is more important than saving the world.

I think there are two issues.

First, some of the ‘necessary to save the world’ things might make enemies. If it’s the case that Bob really wants there to be a giant explosion, and you think giant explosions might kill everyone, you and Bob are going to disagree about what to do, and Bob existing in the same information environment as you will constrain your ability to share your preferences and collect allies without making Bob an enemy.

Second, this isn’t an issue where we can stop thinking, and thus we need to continue doing things that help us think, even if those things have costs. In contrast, in a situation where you know what plan you need to implement, you can now drop lots of your ability to think in order to coordinate on implementing that plan. [Like, a lot of the “there is too much PR in EA” complaints were specifically about situations where people were overstating the effectiveness of particular interventions, which seemed pretty poisonous to the project of comparing interventions, which was one of the core goals of EA, rather than just ‘money moved’ or ‘number of people pledging’ or so on.]

That said, I agree that this seems important to make progress on; this is one of the reasons I worked in communications roles, this is one of the reasons I try to be as polite as I am, this is why I’ve tried to make my presentation more adaptable instead of being more willing to write people off.

• First, some of the ‘necessary to save the world’ things might make enemies. If it’s the case that Bob really wants there to be a giant explosion, and you think giant explosions might kill everyone, you and Bob are going to disagree about what to do, and Bob existing in the same information environment as you will constrain your ability to share your preferences and collect allies without making Bob an enemy.

So...that’s a metaphor for “telling people who like building AIs to stop building AIs pisses them off and turns them into enemies”. Which it might, but how often does that happen? Your prominent enemies aren’t in that category, as far as I can see. David Gerard, for instance, was alienated by a race/​IQ discussion. So good PR might consist of banning race/​IQ.

Also, consider the possibility that people who know how to build AIs know more than you, so it’s less a question of their being enemies, and more one of their being people you can learn from.

• Which it might, but how often does that happen?

I don’t know how public various details are, but my impression is that this was a decent description of the EY—Dario Amodei relationship (and presumably still is?), tho I think personality clashes are also a part of that.

Also, consider the possibility that people who know how to build AIs know more than you, so it’s less a question of their being enemies , and more one of their being people you can learn from.

I mean, obviously they know more about some things and less about others? Like, virologists doing gain of function research are also people who know more than me, and I could view them as people I could learn from. Would that advance or hinder my goals?

If you are under some kind of misapprehension about the nature of their work, it would help. And you don’t know that you are not under a misapprehension, because they are the experts, not you. So you need to talk to them anyway. You might believe that you understand the field flawlessly, but you don’t know until someone checks your work.

That said, I agree that this seems important to make progress on; this is one of the reasons I worked in communications roles, this is one of the reasons I try to be as polite as I am, this is why I’ve tried to make my presentation more adaptable instead of being more willing to write people off.

It is not enough to say nice things: other representatives must be prevented from saying nasty things.

None of these seem to reflect on EY, unless you would expect him to be able to predict that a journalist would write an incoherent, almost maximally inaccurate description of an event where he criticized an idea for being implausible and then banned its discussion for being off-topic/​pointlessly disruptive to something like two people; or that his clearly written rationale for not releasing the transcripts of the AI-box experiments would be interpreted as a recruiting tool for the only cult that requires no contributions to be a part of, doesn’t promise its members salvation or supernatural powers, has no formal hierarchy, and is based on a central part of economics.

I would not expect EY to have predicted that himself, given his background. If, however, he either had studied PR deeply or had consulted with a domain expert before posting, then I would have totally expected that result to be predicted with some significant likelihood. Remember, optimally good rationalists should win, and be able to anticipate social dynamics. In this case EY fell into a social trap he didn’t even know existed, so again, I do not blame him personally, but that does not negate the fact that he’s historically not been very good at anticipating that sort of thing, due to lack of training/​experience/​intuition in that field. I’m fairly confident that, at least regarding the Roko’s Basilisk disaster, I would have been able to predict something close to what actually happened if I had seen his comment before he posted it. (This would have been primarily due to pattern matching between the post and known instances of the Streisand effect, as well as some amount of hard-to-formally-explain intuition that EY’s wording would invoke strong negative emotions in some groups, even if he hadn’t taken any action. Studying “ratio’d” tweets can help give you a sense for this, if you want to practice that admittedly very niche skill.) I’m not saying this to imply that I’m a better rationalist than EY (I’m not), merely to say that EY—and the rationalist movement generally—hasn’t focused on honing the skillset necessary to excel at PR, which has sometimes been to our collective detriment.

• The question is whether people who prioritize social-position/​status-based arguments over actual reality were going to contribute anything meaningful to begin with.

The rationalist community has been built on, among other things, the recognition that the human species is systematically broken when it comes to epistemic rationality. Why think that someone who fails this deeply wouldn’t continue failing at epistemic rationality at every step even once they’ve already joined?

• Why think that someone who fails this deeply wouldn’t continue failing at epistemic rationality at every step even once they’ve already joined?

I think making the assumption that anyone who isn’t in our community is failing to think rationally is itself not great epistemics. It’s not irrational at all to refrain from engaging with the ideas of a community you believe to be vaguely insane. After all, I suspect you haven’t looked all that deeply into the accuracy of the views of the Church of Scientology, and that’s not a failure on your part, since there’s little chance you’ll gain much of value for your time if you did. There are many, many, many groups out there who sound intelligent at first glance, but when seriously engaged with fall apart. Likewise, there are those groups which sound insane at first, but actually have deep truths to teach (I’d place some forms of Zen Buddhism under this category). It makes a lot of sense to trust your intuition on this sort of thing, if you don’t want to get sucked into cults or time-sinks.

• I think making the assumption that anyone who isn’t in our community

I didn’t talk about “anyone who isn’t in our community,” but about

people who prioritize social-position/​status-based arguments over actual reality[.]

It’s not irrational at all to refrain from engaging with the ideas of a community you believe to be vaguely insane.

It’s epistemically irrational if I’m implying the ideas are false and if this judgment isn’t born from interacting with the ideas themselves but with

social-position/​status-based arguments[.]

• You make a strong point, and as such I’ll emend my statement a bit—Eliezer is great at PR aimed at a certain audience in a certain context, which is not universal. Outside of that audience, he is not great at Public Relations(™) in the sense of minimizing the risk of gaining a bad reputation. Historically, I am mostly referring to Eliezer’s tendency to react to what he’s believed to be infohazards in such a way that what he tried to suppress was spread vastly beyond the counterfactual world in which Eliezer hadn’t reacted at all. You only need to slip up once when it comes to risking all PR gains (just ask the countless politicians destroyed by a single video or picture), and Eliezer has slipped up multiple times in the past (not that I personally blame him; it’s a tremendously difficult skillset which I doubt he’s had the time to really work on). All of this is to say that yes, he’s great at making powerful, effective arguments, which convince many rationalist-leaning people. That is not, however, what it means to be a PR expert, and is only one small aspect of a much larger domain which rationalists have historically under-invested in.

• I very much had the same experience, making me decide to somewhat radically re-orient my life.

• What part of the ELK report are you saying felt unworkable?

ELK itself seems like a potentially important problem to solve; the part that didn’t make much sense to me was what they plan to do with the solution: their idea based on recursive delegation.

• Awesome. What are your plans?

Have you considered booking a call with AI Safety Support, registering your interest for the next AGI Safety Fundamentals course, or applying to talk to 80,000 Hours?

I will probably spend 4 days (from the 14th to the 17th; I’m somewhat busy until then) thinking about alignment to see whether there is any chance I might be able to make progress. I have read what is recommended as a starting point on the Alignment Forum, and can read the AGI Safety Fundamentals course’s curriculum on my own. I will probably start by thinking about how to formalize (and compute) something similar to what we call human values, since that seems to be the core of the problem, and then turning that into something that can be evaluated over possible trajectories of the AI’s world model (or over something like reasoning chains or whatever, I don’t know). I hadn’t considered that as a career; I live in Europe and we don’t have those kinds of organizations here, so it will probably just be a hobby.

• Sounds like a great plan! Even if you end up deciding that you can’t make research progress (not that you should give up after just 4 days!), I can suggest a bunch of other activities that might plausibly contribute towards this.

I hadn’t considered that as a career; I live in Europe and we don’t have that kind of organization here, so it will probably just be a hobby.

I expect that this will change within the next year or so (for example, there are plans for a Longtermist Hotel in Berlin and I think it’s very likely to happen).

• What other activities?

• Here’s a few off the top of my mind:

• Applying to facilitate the next rounds of the AGI Safety Fundamentals course (apparently they compensated facilitators this time)
• Contributing to Stampy Wiki
• AI Safety Movement Building—this can be as simple as hosting dinners with two or three people who are also interested
• General EA/​rationalist community building
• Trying to improve online outreach. Take for example the AI Safety Discussion (Open) fb group. They could probably be making better use of the sidebar. The moderator might be open to updating it if someone reached out to them and offered to put in the work. It might be worth seeing what other groups are out there too.

Let me know if none of these sound interesting and I could try to think up some more.

• Same; this post is what made me decide I can’t leave it to the experts. It is just a matter of spending the required time to catch up on what we know and have tried. As Keltham said, diversity is in itself an asset. If we can get enough humans to think about this problem, we may get some breakthroughs from angles others have not thought of yet.

For me, it was not demotivating. He is not a god, and it ain’t over until the fat lady sings. Things are serious and it just means we should all try our best. In fact, I am kinda happy to imagine we might see a utopia happen in my lifetime. Most humans don’t get a chance to literally save the world. It would be really sad if I died a few years before some AGI turned into a superintelligence.

• I primarily upvoted it because I like the push to ‘just candidly talk about your models of stuff’:

I think we die with slightly more dignity—come closer to surviving, as we die—if we are allowed to talk about these matters plainly. Even given that people may then do unhelpful things, after being driven mad by overhearing sane conversations. I think we die with more dignity that way, than if we go down silent and frozen and never talking about our impending death for fear of being overheard by people less sane than ourselves.

I think that in the last surviving possible worlds with any significant shred of subjective probability, people survived in part because they talked about it; even if that meant other people, the story’s antagonists, might possibly hypothetically panic.

Also because I think Eliezer’s framing will be helpful for a bunch of people working on x-risk. Possibly a minority of people, but not a tiny minority. Per my reply to AI_WAIFU, I think there are lots of people who make the two specific mistakes Eliezer is warning about in this post (‘making a habit of strategically saying falsehoods’ and/​or ‘making a habit of adopting optimistic assumptions on the premise that the pessimistic view says we’re screwed anyway’).

The latter, especially, is something I’ve seen in EA a lot, and I think the arguments against it here are correct (and haven’t been talked about much).

• Given how long it took me to conclude whether these were Eliezer’s true thoughts or a representation of his predicted thoughts in a somewhat probable future, I’m not sure whether I’d use the label “candid” to describe the post, at least without qualification.

While the post does contain a genuinely useful way of framing near-hopeless situations and a nuanced and relatively terse lesson in practical ethics, I would describe the post as an extremely next-level play in terms of its broader purpose (and leave it at that).

• I… upvoted it because it says true and useful things about how to make the world not end and proposes an actionable strategy for how to increase our odds of survival while relatively thoroughly addressing a good number of possible objections. The goal of LessWrong is not to make people happier, and the post outlines a pretty clear hypothesis about how it might help others (1. by making people stop working on plans that condition on lots of success in a way that gets ungrounded from reality, 2. by making people not do really dumb unethical things out of desperation).

• Ditto.

Additionally, the OP seems to me good for communication: Eliezer had a lot of bottled up thoughts, and here put them out in the world, where his thoughts can bump into other people who can in turn bump back into him.

AFAICT, conversation (free, open, “non-consequentialist” conversation, following interests and what seems worth sharing rather than solely backchaining from goals) is one of the places where consciousness and sanity sometimes enter. It’s right there next to “free individual thought” in my list of beautiful things that are worth engaging in and safeguarding.

• I upvoted it because I think it’s true and I think that this is a scenario where ‘epistemic rationality’ concerns trump ‘instrumental rationality’ concerns.

• Agreed with regards to “epistemic rationality” being more important at times than “instrumental rationality.” That being said, I don’t think that concerns about the latter are unfounded.

• I upvoted it because I wish I could give Eliezer a hug that actually helps make things better, and no such hug exists but the upvote button is right there.

• I strong-upvoted this post because I read a private draft by Eliezer which is a list of arguments why we’re doomed. The private draft is so informative that, if people around me hadn’t also read and discussed it, I would have paid several months of my life to read it. It may or may not be published eventually. This post, being a rant, is less useful, but it’s what we have for now. It’s so opaque and confusing that I’m not even sure if it’s net good, but if it’s 5% as good as the private document it still far surpasses my threshold for a strong upvote.

EDIT: it may or may not be published eventually

• Oooh, that sounds great!

Can someone send me a copy so I can perv out on how doomed we are? Who knows, my natural contrarian instincts might fire and I might start to think of nitpicks and counterarguments.

But at the very least, I will enjoy it loads, and that’s something?

• If you’re still offering to share, I would like to read it: faunam@gmail.com

• Personally, I am not here (or most other places) to “enjoy myself” or “be happier”. Behind the fool’s licence of April 1, the article seems to me to be saying true and important things. If I had any ideas about how to solve the AGI problem that would pass my shoulder Eliezer test, I would be doing them all the more. However, lacking such ideas, I only cultivate my garden.

• Have you considered registering for the next round of the AGI Safety Fundamentals course, booking a call with AI Safety Support or talking to 80,000 Hours?

• No, not at all. I have no ideas in this field, and what’s more, I incline to Eliezer’s pessimism, as seen in the recently posted dialogues, about much of what is done.

• I’d still encourage you to consider projects at a meta-level up such as movement-building or earn-to-give. But also totally understand if you consider the probabilities of success too low to really bother about.

• I have a weird bias towards truth regardless of consequences, and upvoted out of emotional reflex. Also I love Eliezer’s writing and it is a great comfort to me to have something fun to read on the way to the abyss.

• I disagree with Eliezer about half the time, including about very fundamental things, but I strongly upvoted the post, because that attitude both gives the best chance of success conditional on a correct evaluation of the problem, and does not kill you if the evaluation is incorrect and the x-risk in question is an error in the model. It is basically a max-EV calculation for most reasonable probability distributions.

• Upvoted because it’s important to me to know what EY thinks the mainline-probability scenario looks like and what are the implications.

If that’s what he and MIRI think is the mainline scenario, then that’s what I think is the mainline scenario, because their quality of reasoning and depth of insight seems very high whenever I have an opportunity to examine it.

• I upvoted the post despite disagreeing with it (I believe the success probability is ~30%), because it seems important for people to openly share their beliefs in order to maximize our collective ability to converge on the truth. And I do get some potentially valuable information from the fact that this is what Yudkowsky believes (even while disagreeing).

• Hi, I’m always fascinated by people with success probabilities that aren’t either very low or ‘it’ll probably be fine’.

I have this collection of intuitions (no more than that):

(1) ‘Some fool is going to build a mind’,

(2) ‘That mind is either going to become a god or leave the fools in position to try again, repeat’,

(3) ‘That god will then do whatever it wants’.

It doesn’t seem terribly relevant these days, but there’s another strand that says:

(4) ‘we have no idea how to build minds that want specific things’ and

(5) ‘Even if we knew how to build a mind that wanted a specific thing, we have no idea what would be a good thing’ .

These intuitions don’t leave me much room for optimism, except in the sense that I might be hopelessly wrong and, in that case, I know nothing and I’ll default back to ‘it’ll probably be fine’.

Presumably you’re disagreeing with one of (1), (2), or (3) and one of (4) or (5).

Which ones, and where does the 30% come from?

• I believe that we might solve alignment in time and aligned AI will protect us from unaligned AI. I’m not sure how to translate it to your 1-3 (the “god” will do whatever it wants, but it will want what we want so there’s no problem). In terms of 4-5, I guess I disagree with both or rather disagree that this state of ignorance will necessarily persist.

• Neat, so in my terms you think we can pull off (4) and (5) and get it all solid enough to set running before anyone else does (1)-(3)?

4 and 5 have always looked like the really hard bits to me, and not the sort of thing that neural networks would necessarily be good at, so good luck!

But please be careful to avoid fates-worse-than-death by getting it almost right but not quite right. I’m reasonably well reconciled with death, but I would still like to avoid doing worse if possible.

• My initial reaction to the post was almost as negative as yours.

I’ve partly changed my mind, due to this steelman of Eliezer’s key point by Connor Leahy.

• I thought it was funny. And a bit motivational. We might be doomed, but one should still carry on. If your actions have at least a slight chance to improve matters, you should do it, even if the odds are overwhelmingly against you.

Not a part of my reasoning, but I’m thinking that we might become better at tackling the issue if we have a real sense of urgency—which this and A list of lethalities provide.

• My ability to take the alignment problem seriously was already hanging by a thread, and this post really sealed the deal for me. All we want is the equivalent of a BDSM submissive, a thing which actually exists and isn’t even particularly uncommon. An AI which can’t follow orders is not useful and would not be pursued seriously. (That’s why we got Instruct-GPT instead of GPT-4.) And even if one emerged, a rogue AI can’t do more damage than an intelligent Russian with an internet connection.

Apologies for the strong language, but this looks to me like a doomist cult that is going off the rails and I want no part in it.

• I disagree with each of your statements.

Why would something need to be able to follow orders to be useful? Most things in the world do not follow my orders (my furniture, companies that make all my products, most people I know). Like, imagine an AI assistant that’s really good at outputting emails from your inbox that make your company more profitable. You don’t know why it says what it says, but you have learned the empirical fact that as it hires people, fires people, changes their workloads, gives them assignments, that your profits go up a lot. I can’t really tell it what to do, but it sure is useful.

I think nobody knows how to write the code of a fundamentally submissive agent and that other agents are way easier to make, ones that are just optimizing in a way that doesn’t think in terms of submission/​dominance. I agree humans exist but nobody understands how they work or how to code one, and you don’t get to count on us learning that before we build super powerful AI systems.

I have no clue why you think that an intelligent Russian is the peak of optimization power. I think that’s a false and wildly anthropomorphic thing to think. Imagine getting 10 Von Neumanns in a locked room with only internet, already it’s more powerful than the Russian, and I bet could do some harm. Now imagine a million. Whatever gets you the assumption that an AI system can’t be more powerful than one human seems wild and I don’t know where you’re getting this idea from.

Btw, unusual ask, but do you want to hop on audio and hash out the debate more sometime? I can make a transcript and can link it here on LW, both posting our own one-paragraph takeaways. I think you’ve been engaging in a broadly good-faith way on the object level in this thread and others and I would be interested in returning the ball.

• do you want to hop on audio and hash out the debate more sometime?

Sure. The best way for me to do that would be through Discord. My id is lone-pine#4172

• Would you mind linking the transcript here if you decide to release it publicly? I’d love to hear both of your thoughts expressed in greater detail!

• That’d be the plan.

• (Ping to reply on Discord.)

• Sent you a friend request.

• Ooh, I like Ben’s response and am excited about the audio thing happening.

• I think nobody knows how to write the code of a fundamentally submissive agent

Conventional non-AI computers are already fundamentally passive. If you boot them up, they just sit there. What’s the problem? The word “agent”?

Why would something need to be able to follow orders to be useful? Most things in the world do not follow my orders (my furniture, companies that make all my products, most people I know). Like, imagine an AI assistant that’s really good at outputting emails from your inbox that make your company more profitable. You don’t know why it says what it says, but you have learned the empirical fact that as it hires people, fires people, changes their workloads, gives them assignments, that your profits go up a lot. I can’t really tell it what to do, but it sure is useful.

If an AI assistant is replacing a human assistant , it needs to be controllable to the same extent. You don’t expect or want to micromanage a human assistant, but you do expect to set broad parameters.

• Yes, the word agent.

Sure, if it’s ‘replacing’, but my example isn’t one of replacement, it’s one where it’s useful in a different way to my other products, in a way that I personally suspect is easier to train/​build than something that does ‘replacement’.

• I at first also downvoted, because your first argument looks incredibly weak (this post has little relation to arguing for/against the difficulty of the alignment problem, so what update are you getting on that from here?), as did the follow-up “all we need is...”, which is a formulation that hides problems instead of solving them. Yet your last point does have import, and stating it explicitly is useful in allowing everyone to address it, so I reverted to an upvote for honesty, though I strongly disagree.

To the point: I also want to avoid being in a doomist cult. I’m not a die-hard long-term “we’re doomed if we don’t align AI” guy, but from my readings throughout the last year I am indeed getting convinced of the urgency of the problem. Am I getting hoodwinked by a doomist cult with very persuasive rhetoric? Am I myself hoodwinking others when I talk about these problems and they too start transitioning to do alignment work?

I answer these questions not by reasoning on “resemblance” (i.e. how much it looks like a doomist cult) but by going into finer detail. An implicit argument being made when you call [the people who endorse the top-level post] a doomist cult is that they share the properties of other doomist cults (being wrong, having bad epistemics/policy, preying on isolated/weird minds) and are thus bad. I understand having a low prior for doomist-cult look-alikes actually being right (since there is no known instance of a doomist cult predicting the end of the world being right), but that’s no reason to turn into a rock (as in https://astralcodexten.substack.com/p/heuristics-that-almost-always-work?s=r), believing that “no doom prophecy is ever right”. You can’t prove that no doom prophecy is ever right, only that they’re rarely right (and probably only once).

I thus advise changing your question from “do [the people who endorse the top-level post] look like a doomist cult?” to “what would be a sufficient level of argument and evidence for me to take this doomist-cult-looking group seriously?”. It’s not a bad thing to call doom when doom is on the way. Engage with the object-level arguments and not with your precached pattern recognition (“this looks like a doom cult, so it is bad/not serious”). Personally, I had similar qualms to the ones you’re expressing, but having looked into the arguments, it feels much stronger and more real to believe “alignment is hard and by default AGI is an existential risk” than not. I hope your conversation with Ben will be productive and that I haven’t only expressed points you already considered (fyi, they have already been discussed on LessWrong).

• Thank you for trying.

• Shouldn’t someone (some organization) be putting a lot of effort and resources into this strategy (quoted below) in the hope that AI timelines are still long enough for the strategy to work? With enough resources, it should buy at least a few percentage of non-doom probability (even now)?

Given that there are known ways to significantly increase the number of geniuses (i.e., von Neumann level, or IQ 180 and greater), by cloning or embryo selection, an obvious alternative Singularity strategy is to invest directly or indirectly in these technologies, and to try to mitigate existential risks (for example by attempting to delay all significant AI efforts) until they mature and bear fruit (in the form of adult genius-level FAI researchers).

• Sure, why not. Sounds dignified to me.

• For starters, why aren’t we already offering the most basic version of this strategy as a workplace health benefit within the rationality /​ EA community? For example, on their workplace benefits page, OpenPhil says:

We offer a family forming benefit that supports employees and their partners with expenses related to family forming, such as fertility treatment, surrogacy, or adoption. This benefit is available to all eligible employees, regardless of age, sex, sexual orientation, or gender identity.

Seems a small step from there to making “we cover IVF for anyone who wants (even if your fertility is fine) + LifeView polygenic scores” into a standard part of the alignment-research-agency benefits package. Of course, LifeView only offers health scores, but they will also give you the raw genetic data. Processing this genetic data yourself, DIY style, could be made easier—maybe there could be a blog post describing how to use an open-source piece of software and where to find the latest version of EA3, and so forth.

All this might be a lot of trouble for (if you are pessimistic about PGT’s potential) a rather small benefit. We are not talking Von Neumanns here. But it might be worth creating a streamlined community infrastructure around this anyways, just in case the benefit becomes larger as our genetic techniques improve.

• I don’t see any realistic world where you both manage to get government permission to allow you to genetically engineer children for intelligence and they let you specifically raise them to do safety work far enough in advance that they actually have time to contribute and in a way that outweighs any PR risk.

• Embryo selection for intelligence does not require government permission to do. You can do it right now. You only need the models and the DNA. I’ve been planning on releasing a website that allows people to upload genetic data they get from LifeView for months, but I haven’t gotten around to finishing it for the same reason I think that others aren’t.

Part of me wants to not post this just because I want to be the first to make the website, but that seems immoral, so, here.

• Interesting. I had no idea.

• Both cloning and embryo selection are not illegal in many places, including the US. (This article suggests that for cloning you may have to satisfy the FDA’s safety concerns, which perhaps ought to be possible for a well-resourced organization.) And you don’t have to raise them specifically for AI safety work. I would probably announce that they will be given well-rounded educations that will help them solve whatever problems that humanity may face in the future.

• Sounds good to me! Anyone up for making this an EA startup?

Having more Neumann level geniuses around seems like an extremely high impact intervention for most things, not even just singularity related ones.

As for tractability, I can’t say anything about how hard this would be to get past regulators, or how much engineering work is missing for making human cloning market ready, but finding participants seems pretty doable? I’m not sure yet whether I want children, but if I decide I do, I’d totally parent a Neumann clone. If this would require moving to some country where cloning isn’t banned, I might do that as well. I bet lots of other EAs would too.

• The first thing I can remember is that I learned at age 3 that I would die someday, and I cried about it. I got my hopes up about radical technological progress (including AGI and biotech) extending lifespan as a teenager, and I lost most of that hope (and cried the most I ever had in my life) upon realizing that AGI probably wouldn’t save us during our lifetime, alignment was too hard.

In some sense this outcome isn’t worse than what I thought was fated at age 3, though. I mean, if AGI comes too soon, then I and my children (if I have them) won’t have the 70-80 year lifetimes I expected, which would be disappointing; I don’t think AGI is particularly likely to be developed before my children die, however (minority opinion around here, I know). There’s still some significant chance of radical life extension and cognitive augmentation from biotech assisted by narrow AI (if AGI is sufficiently hard, which I think it is, though I’m not confident). And as I expressed in another comment, there would be positive things about being replaced by a computationally massive superintelligence solving intellectual problems beyond my comprehension; I think that would comfort me if I were in the process of dying, although I haven’t tested this empirically.

Since having my cosmic hopes deflated, I have narrowed the scope of my optimization to more human-sized problems, like creating scalable decentralized currency/​contract/​communication systems, creating a paradigm shift in philosophy by re-framing and solving existing problems, getting together an excellent network of intellectual collaborators that can solve and re-frame outstanding problems and teach the next generation, and so on; these are still ambitious, and could possibly chain to harder goals (like alignment) in the future, but are more historically precedented than AGI or AI alignment. And so I appear to be very pessimistic (about my problem-solving ability) compared to when I thought I might be able to ~singlehandedly solve alignment, but still very optimistic compared to what could be expected of the average human.

• scalable decentralized currency/​contract/​communication systems

Oh? Do say more

• What paradigm shift are you trying to create in philosophy?

• The sort of thing I write about on my blog. Examples:

• Attention to “concept teaching” as a form of “concept definition”, using cognitive science models of concept learning

• “What is an analogy/​metaphor” and how those apply to “foundations” like materialism

• Reconciling “view from nowhere” with “view from somewhere”, yielding subject-centered interpretations of physics and interpretations of consciousness as relating to local knowledge and orientation

• Interpreting “metaphysics” as about local orientation of representation, observation, action, etc, yielding computer-sciencey interpretations of apparently-irrational metaphysical discourse (“qualia are a poor man’s metaphysics”)

• Sounds interesting. Hopefully, I come back and read some of those links when I have more time.

• Just a reminder to everyone, and mostly to myself:

Not flinching away from reality is entirely compatible with not making yourself feel like shit. You should only try to feel like shit when that helps.

• The anime protagonist just told everyone that there’s no hope. I don’t have a “don’t feel like shit” button. Not flinching away from reality and not feeling like shit are completely incompatible in this scenario given my mental constitution. There are people who can do better, but not me.

I’m going to go drinking.

• Given that, then yes, feeling like shit plus living-in-reality is your best feasible alternative.

Curling up into a ball and binge drinking till the eschaton probably is not though: see Q1.

• For what it’s worth, I think I prefer the phrase,
”Failing with style”

• So AI will destroy the planet and there’s no hope for survival?

Why is everyone here in agreement that AI will inevitably kill off humanity and destroy the planet?

Sorry I’m new to LessWrong and clicked on this post because I recognized the author’s name from the series on rationality.

• Why is everyone here in agreement that…

We’re not. There’s a spread of perspectives and opinions and lack-of-opinions. If you’re judging from the upvotes, might be worth keeping in mind that some of us think “upvote” should mean “this seems like it helps the conversation access relevant considerations/​arguments” rather than “I agree with the conclusions.”

Still, my shortest reply to “Why expect there’s at least some risk if an AI is created that’s way more powerful than humanity?” is something like: “It seems pretty common-sensical to think that alien entities that’re much more powerful than us may pose risk to us. Plus, we thought about it longer and that didn’t make the common sense of that ‘sounds risky’ go away, for many/​most of us.”

• The spread of opinions seems narrow compared to what I would expect. OP makes some bold predictions in his post. I see more debate over less controversial claims all of the time.

Sorry, but what do aliens have to do with AI?

• Part of the reason the spread seems small is that people are correctly inferring that this comment section is not a venue for debating the object-level question of Probability(doom via AI), but rather for discussing EY’s viewpoint as written in the post. See e.g. https://www.lesswrong.com/posts/34Gkqus9vusXRevR8/late-2021-miri-conversations-ama-discussion for more of a debate.

• Debating p(doom) here seems fine to me, unless there’s an explicit request to talk about that elsewhere.

• The spread of opinions seems narrow compared to what I would expect. OP makes some bold predictions in his post. I see more debate over less controversial claims all of the time.

That’s fair.

what do aliens have to do with AI?

Sorry, I said it badly/​unclearly. What I meant was: most ways to design powerful AI will, on my best guess, be “alien” intelligences, in the sense that they are different from us (think differently, have different goals/​values, etc.).

• I just want to say I don’t think that was unclear at all. It’s fair to expect people to know the wider meaning of the word ‘alien’.

• There’s an analogy being drawn between the power of a hypothetical advanced alien civilization and the power of a superintelligent AI. If you agree that the hypothetical AI would be more powerful, and that an alien civilization capable of travelling to Earth would be a threat, then it follows that superintelligent AI is a threat.

I think most people here are in agreement that AI poses a huge risk, but differ on how likely it is that we’re all going to die. A 20% chance we’re all going to die is very much worth trying to mitigate sensibly, and the OP says it’s still worth trying to mitigate a 99.9999% chance of human extinction in a similarly level-headed manner (even if the mechanics of doing the work are slightly different at that point).

• +1 for asking the 101-level questions! Superintelligence, “AI Alignment: Why It’s Hard, and Where to Start”, “There’s No Fire Alarm for Artificial General Intelligence”, and the “Security Mindset” dialogues (part one, part two) do a good job of explaining why people are super worried about AGI.

“There’s no hope for survival” is an overstatement; the OP is arguing “successfully navigating AGI looks very hard, enough that we should reconcile ourselves with the reality that we’re probably not going to make it”, not “successfully navigating AGI looks impossible /​ negligibly likely, such that we should give up”.

If you want specific probabilities, here’s a survey I ran last year: https://www.lesswrong.com/posts/QvwSr5LsxyDeaPK5s/existential-risk-from-ai-survey-results. Eliezer works at MIRI (as do I), and MIRI views tended to be the most pessimistic.

• +1 for asking the 101-level questions!

Yes! Very much +1! I’ve been hanging around here for almost 10 years and am getting value from the response to the 101-level questions.

Superintelligence, “AI Alignment: Why It’s Hard, and Where to Start”, “There’s No Fire Alarm for Artificial General Intelligence”, and the “Security Mindset” dialogues (part one, part two) do a good job of explaining why people are super worried about AGI.

Honestly, the Wait But Why posts are my favorite and what I would recommend to a newcomer. (I was careful to say things like “favorite” and “what I’d recommend to a newcomer” instead of “best”.)

Furthermore, I feel like Karolina and people like them who are asking this sort of question deserve an answer that doesn’t require an investment of multiple hours of effortful reading and thinking. I’m thinking something like three paragraphs. Here is my attempt at one. Take it with the appropriate grain of salt. I’m just someone who’s hung around the community for a while but doesn’t deal with this stuff professionally.

Think about how much smarter humans are than dogs. Our intelligence basically gives us full control over them. We’re technologically superior. Plus we understand their psychology and can manipulate them into doing what we want pretty reliably. Now take that and multiply by, uh, a really big number. An AGI would be wayyyyy smarter than us.

Ok, but why do we assume there will be this super powerful AGI? Well, first of all, if you look at surveys, it looks like pretty much everyone does in fact believe it. It’s not something that people disagree on; they just disagree on when exactly it will happen. 10 years? 50 years? 200 years? But to address the question of how: well, the idea is that the smarter a thing is, the more capable it is of improving its own smartness. So suppose something starts off with 10 intelligence points and improves to 11. At 11 it’s more capable of improving than it was at 10, so it goes to 13. +2 instead of +1. At 13 it’s even more capable of improving, so it goes to 16. Then 20. Then 25. So on and so forth. Accelerating growth is powerful.
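The accelerating-growth intuition above can be sketched as a toy simulation. To be clear, this is purely illustrative: the growth rule here (each step’s gain is proportional to current intelligence) is an assumption chosen for simplicity, not a model of real AI, and the numbers differ slightly from the 10, 11, 13, 16... sequence in the paragraph, but the qualitative point is the same:

```python
# Toy model of recursive self-improvement: at each step, the gain in
# "intelligence points" is proportional to current intelligence, so
# a smarter system improves faster. Illustrative numbers only.

def self_improvement_trajectory(start=10.0, gain_rate=0.1, steps=10):
    """Return intelligence levels where each step adds gain_rate * current."""
    levels = [start]
    for _ in range(steps):
        levels.append(levels[-1] * (1 + gain_rate))
    return levels

trajectory = self_improvement_trajectory()
# The step-to-step increments themselves keep growing: that's the
# difference between accelerating growth and ordinary linear growth.
increments = [b - a for a, b in zip(trajectory, trajectory[1:])]
print([round(x, 1) for x in trajectory])
print([round(d, 2) for d in increments])
```

Running it shows each increment larger than the last, which is the whole argument in miniature: capability feeds back into the rate of capability gain.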

Ok, so suppose we have this super powerful thing that is inevitable and is going to have complete control over us and the universe. Why can’t we just program the thing from the start to be nice? This is the alignment problem. I don’t have the best understanding of it tbh, but I like to think about the field of law and contracts. Y’know how those terms of service documents are so long and no one reads them? It’s hard to specify all of the things you want upfront. “Oh yeah, please don’t do A. Oh wait I forgot B, don’t do that either. Oh and C, yeah, that’s another important one.” Even I understand that this is a pretty bad description of the alignment problem, but hopefully it at least serves the purpose of providing some amount of intuition.

“There’s no hope for survival” is an overstatement; the OP is arguing “successfully navigating AGI looks very hard, enough that we should reconcile ourselves with the reality that we’re probably not going to make it”, not “successfully navigating AGI looks impossible /​ negligibly likely, such that we should give up”.

Really? That wasn’t my read on it. Although I’m not very confident in my read on it. I came away from the post feeling confused.

For example, 0% was mentioned in a few places, like:

When Earth’s prospects are that far underwater in the basement of the logistic success curve, it may be hard to feel motivated about continuing to fight, since doubling our chances of survival will only take them from 0% to 0%.

Was that April Fools context exaggeration? My understanding is that what Vaniver said here is accurate: that Eliezer is saying that we’re doomed, but that he’s saying it on April Fools so that people who don’t want to believe it for mental health reasons (nervously starts raising hand) have an out. Maybe that’s not accurate though? Maybe it’s more like “We’re pretty doomed but I’m exaggerating/​joking about how doomed we are because it’s April Fools Day.”

• I think the Wait But Why posts are quite good, though I usually link them alongside Luke Muehlhauser’s reply.

For example, 0% was mentioned in a few places

It’s obviously not literally 0%, and the post is explicitly about ‘how do we succeed?’, with a lengthy discussion of the possible worlds where we do in fact succeed:

[...] The surviving worlds look like people who lived inside their awful reality and tried to shape up their impossible chances; until somehow, somewhere, a miracle appeared—the model broke in a positive direction, for once, as does not usually occur when you are trying to do something very difficult and hard to understand, but might still be so—and they were positioned with the resources and the sanity to take advantage of that positive miracle, because they went on living inside uncomfortable reality. Positive model violations do ever happen, but it’s much less likely that somebody’s specific desired miracle that “we’re all dead anyways if not...” will happen; these people have just walked out of the reality where any actual positive miracles might occur. [...]

The whole idea of ‘let’s maximize dignity’ is that it’s just a reframe of ‘let’s maximize the probability that we survive and produce a flourishing civilization’ (the goal of the reframe being to guard against wishful thinking):

Obviously, the measuring units of dignity are over humanity’s log odds of survival—the graph on which the logistic success curve is a straight line. [...] But if enough people can contribute enough bits of dignity like that, wouldn’t that mean we didn’t die at all? Yes, but again, don’t get your hopes up.

Hence:

Q1: Does ‘dying with dignity’ in this context mean accepting the certainty of your death, and not childishly regretting that or trying to fight a hopeless battle?

Don’t be ridiculous. How would that increase the log odds of Earth’s survival?

I dunno what the ‘0%’ means exactly, but it’s obviously not literal. My read of it was something like ‘low enough that it’s hard enough to be calibrated about exactly how low it is’, plus ‘low enough that you can make a lot of progress and still not have double-digit success probability’.
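For what it’s worth, the “log odds” units the post uses can be made concrete with a small calculation (a toy illustration; the probabilities here are made up):

```python
import math

def log2_odds(p: float) -> float:
    """Log base-2 of the odds p/(1-p): the scale on which the post measures 'dignity'."""
    return math.log2(p / (1 - p))

p0 = 1e-4                 # a made-up 'basement' survival probability
p1 = 2 * p0 / (1 + p0)    # the probability whose odds are exactly double p0's

# One 'bit of dignity' = +1 on the log2-odds scale = doubling the odds,
# which at tiny p is almost exactly doubling the probability itself.
assert abs(log2_odds(p1) - (log2_odds(p0) + 1)) < 1e-9
```

This is why “doubling our chances of survival” can be a real, stackable gain on the log-odds scale even while rounding to “from 0% to 0%” on a linear scale.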

• I think the Wait But Why posts are quite good, though I usually link them alongside Luke Muehlhauser’s reply.

Cool! Good to get your endorsement. And thanks for pointing me to Muehlhauser’s reply. I’ll check it out.

I dunno what the ‘0%’ means exactly, but it’s obviously not literal. My read of it was something like ‘low enough that it’s hard enough to be calibrated about exactly how low it is’, plus ‘low enough that you can make a lot of progress and still not have double-digit success probability’.

Ok, yeah, that does sound pretty plausible. It still encompasses a pretty wide range though. Like, it could mean one in a billion, I guess? There is a grim tone here. And Eliezer has spoken about his pessimism elsewhere with a similarly grim tone. Maybe I am overreacting to that. I dunno. Like Raemon, I am still feeling confused.

It’s definitely worth noting though that even at a low number like one in a billion, it is still worth working on for sure. And I see that Eliezer believes this as well. So in that sense I take back what I said in my initial comment.

• One in a billion seems way, way, way too low to me. (Like, I think that’s a crazy p(win) to have, and I’d be shocked beyond shocked if Eliezer’s p(win) were that low. Like, if he told me that I don’t think I’d believe him.)

• Ok, that’s good to hear! Just checking, do you feel similarly about 1 in 100k?

• No, that’s a lot lower than my probability but it doesn’t trigger the same ‘that can’t be right, we must be miscommunicating somehow’ reaction.

• I see. Thanks for the response.

• FYI I am finding myself fairly confused about the “0%” line. I don’t see a reason not to take Eliezer at his word that he meant 0%. “Obviously not literal” feels pretty strong, if he meant a different thing I’d prefer the post say whatever he meant.

• Eliezer seemed quite clear to me when he said (paraphrased) “we are on the left side of the logistic success curve, where success is measured in significant digits of leading 0s you are removing from your probability of success”. The whole post seems to clearly imply that Eliezer thinks that marginal dignity is possible, which he defines as a unit of log-odds movement on the probability of success. This clearly implies the probability is not literally 0, but it does clearly argue that the probability (on a linear scale) can be rounded to 0.

• Mostly I had no idea if he meant like 0.1, 0.001, or 0.00000001. Also not sure if he’s more like “survival chance is 0%, probably, with some margin of error, maybe it’s 1%”, or “no, I’m making the confident claim that it’s more like 0.0001”

(This was combined with some confusion about Nate Soares saying something in the vein of “if you don’t wholeheartedly believe in your plans, you should multiply their EV by 0, and you’re not supposed to pick the plan whose epsilon value is ‘least epsilon’”)

Also, MIRI isn’t (necessarily) a hive mind, so not sure if Rob, Nate or Abram actually share the same estimate of how doomed we are as Eliezer.

• Also, MIRI isn’t (necessarily) a hive mind, so not sure if Rob, Nate or Abram actually share the same estimate of how doomed we are as Eliezer.

Indeed, I expect that the views of at least some individuals working at MIRI vary considerably.

In some ways, the post would seem more accurate to me if it had the Onion-esque headline: Eliezer announces on MIRI’s behalf that “MIRI adopts new ‘Death with Dignity’ strategy.”

Still, I love the post a lot. Also, Eliezer has always been pivotal in MIRI.

• The five MIRI responses in my AI x-risk survey (marked with orange dots) show a lot of variation in P(doom):

(Albeit it’s still only five people; maybe a lot of MIRI optimists didn’t reply, or maybe a lot of pessimists didn’t, for some reason.)

• Personally, I took it to be 0% within an implied # of significant digits, perhaps in the ballpark of three.

• This sequence covers some chunk of it, though it does already assume a lot of context. I think this sequence is the basic case for AI Risk, and doesn’t assume a lot of context.

• Minor meta note: others are free to disagree, but I think it would be useful if this comment section were a bit less trigger-happy about downvoting comments into the negatives.

I’m normally pretty gung-ho about downvotes, but in this case I think there’s more-than-usual value in people sharing their candid thinking, and too much downvoting can make people feel pressured to shape their words and thoughts in ways that others would approve of.

• Agreed. I am purposefully upvoting (and in some cases strong upvoting) a number of comments I disagree with, because I want to encourage people to speak their minds.

• OP definitely pissed me off enough to make me want to be more candid.

• I’m more optimistic than Yudkowsky[1], and I want to state what I think are the reasons for the different conclusions (I’m going to compare my own reasoning to my understanding of Yudkowsky’s reasoning, and the latter might be flawed), in a nutshell.

• Yudkowsky seems very pessimistic about alignment of anything resembling deep learning, and also believes that deep learning leads to TAI pretty soon. I’m both more optimistic about aligning deep learning and more skeptical of TAI soon.

• Optimism about deep learning: There has been considerable progress in theoretical understanding of deep learning. This understanding is far from complete, but also the problem doesn’t seem intractable. I think that we will have pretty good theory in a decade, more likely than not[2].

• Skepticism of TAI soon: My own models of AGI include qualitative elements that current systems don’t have. It is possible that the gap will be resolved soon, but also possible that a new “AI winter” will eventually result.

• Yudkowsky seems to believe we are pretty far from a good theory of rational agents. On the other hand, I have a model of what this theory will look like, and a concrete pathway towards constructing it.

• These differences seem to be partly caused by different assumptions regarding which mathematical tools are appropriate. MIRI have been very gung-ho about using logic and causal networks. At the same time they mostly ignored learning theory. These (IMO biased) preconceptions about what the correct theory should look like, combined with failure to make sufficient progress, led to an overly pessimistic view of overall tractability.

• MIRI’s recruiting was almost entirely targeted at the sort of people who would accept their pre-existing methods and assumptions. I suspect this created inertia and groupthink.

To be clear, I am deeply grateful to Yudkowsky and MIRI for the work they did and continue to do, not to mention funding my own work. I am voicing some criticism only because transparency is essential for success, as the OP makes clear.

1. ↩︎

My rough estimate of the success probability is 30%, but I haven’t invested much effort into calibrating this.

2. ↩︎

It is possible that doom will come sooner, but it doesn’t seem overwhelmingly likely.

• MIRI have been very gung-ho about using logic and causal networks. At the same time they mostly ignored learning theory.

I’ll remark in passing that I disagree with this characterization of events. We looked under some street lights where the light was better, because we didn’t think that others blundering around in the dark were really being that helpful—including because of the social phenomenon where they blundered around until a bad solution Goodharted past their blurry filters; we wanted to train people up in domains where wrong answers could be recognized as that by the sort of sharp formal criteria that inexperienced thinkers can still accept as criticism.

That was explicitly the idea at the time.

• Thanks for responding, Eliezer.

I’m not sure to what extent you mean that (i) your research programme was literally a training exercise for harder challenges ahead vs (ii) your research programme was born of despair: looking under a street light had a better chance of success even though the keys were not especially likely to be there.

If you mean (i), then what made you give up on this plan? From my perspective, the training exercise played its role and perhaps outlived its usefulness, why not move on beyond it?

If you mean (ii), then why such pessimism from the get-go? I imagine you reasoning along the lines of: developing the theory of rational agency is a difficult problem with little empirical feedback in early stages, hence it requires nigh impossible precision of reasoning. But, humanity actually has a not-bad track record on this type of question in the last century. VNM, game theory, the Church-Turing thesis, information theory, complexity theory, Solomonoff induction: all these are examples of similar problems (creating a mathematical theory starting from an imprecise concept without much empirical data to help) in which we made enormous progress. They also look like they are steps towards the theory of rational agents itself. So, we “just” need to add more chapters to this novel, not do something entirely unprecedented[1]. Maybe your position is that the previous parts were done by geniuses who are unmatched in our generation because of lost cultural DNA?

I think that the “street light” was truly useful to better define multiple relevant problems (Newcombian decision problems, Vingean reflection, superrationality...), but it was not where the solutions are.

Another thing is, IMO (certain type of) blundering in the dark is helpful. In practice, science often doesn’t progress in a straight line from problem to solution. People try all sorts of things, guided partly by concrete problems and partly by sheer curiosity, some of those work out, some of those don’t work out, some of those lead to something entirely unexpected. As results accumulate, paradigms crystallize and it becomes clear which models were “True Names”[2] and which were blunders. And, yes, maybe we don’t have time for this. But I’m not so sure.

1. ↩︎

That is, the theory of rational agency wouldn’t be unprecedented. The project of dodging AI risk as a whole certainly has some “unprecedentedness” about it.

2. ↩︎

Borrowing the term from John Wentworth.

• While we happen to be on the topic: can I ask whether (a) you’ve been keeping up with Vanessa’s work on infra-Bayesianism, and if so, whether (b) you understand it well enough to have any thoughts on it? It sounds (and has sounded for quite a while) like Vanessa is proposing this as an alternative theoretical foundation for agency /​ updating, and also appears to view this as significantly more promising than the stuff MIRI has been doing (as is apparent from e.g. remarks like this):

Optimism about deep learning: There has been considerable progress in theoretical understanding of deep learning. This understanding is far from complete, but also the problem doesn’t seem intractable. I think that we will have pretty good theory in a decade, more likely than not[...]

Yudkowsky seems to believe we are pretty far from a good theory of rational agents. On the other hand, I have a model of how this theory will look like, and a concrete pathway towards constructing it.

Ideally I (along with anyone else interested in this field) would be well-placed to evaluate Vanessa’s claims directly; in practice it seems that very few people are able to do so, and consequently infra-Bayesianism has received very little discussion on LW/​AF (though my subjective impression of the discussion it has received is that those who discuss it seem to be reasonably impressed with /​ enthusiastic about it).

So, as long as one of the field’s founding members happens to be on LW handing out takes… could I ask for a take on infra-Bayesianism?

(You’ve stated multiple times that you find other people’s work unpromising; this by implication suggests also that infra-Bayesianism is one of the things you find unpromising, but only if you’ve been paying enough attention to it to have an assessment. It seems like infra-Bayesianism has flown under the radar of a lot of people, though, so I’m hesitantly optimistic that it may have flown under your radar as well.)

• And to this I reply: Obviously, the measuring units of dignity are over humanity’s log odds of survival—the graph on which the logistic success curve is a straight line. A project that doubles humanity’s chance of survival from 0% to 0% is helping humanity die with one additional information-theoretic bit of dignity.

Joking aside, this sort of objective function is interesting, and incoherent due to being non-VNM. E.g. if there’s a lottery between 0.1% chance of survival and 1% chance of survival, then how this lottery compares to a flat 0.5% chance of survival depends on the order in which the lottery is resolved. A priori, (50% of 0.1%, 50% of 1%) is equivalent to 0.55%, which is greater than 0.5%. On the other hand, the average log-odds (after selecting an element of this lottery) is 0.5 * log(0.1%) + 0.5 * log(1%) < log(0.5%).
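The comparison can be checked directly, using log-probability as a stand-in for log-odds (the two nearly coincide at such small p):

```python
import math

lottery = [0.001, 0.01]   # 0.1% and 1%, each with probability 1/2
flat = 0.005              # the flat 0.5% option

# On expected probability, the lottery wins: 0.55% > 0.5% ...
ev_prob = sum(lottery) / 2
assert ev_prob > flat

# ... but on expected log-probability it loses, by Jensen's inequality
# (log is concave, so spreading mass lowers the expectation):
ev_log = sum(math.log10(p) for p in lottery) / 2
assert ev_log < math.log10(flat)   # -2.5 < about -2.301
```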

This could lead to “negative VOI” situations where we avoid learning facts relevant to survival probability, because they would increase the variance of our odds, and that reduces expected log-odds since log is concave.

It’s also unclear whether to treat different forms of uncertainty differently, e.g. is logical uncertainty treated differently from indexical/​quantum uncertainty?

This could make sense as a way of evaluating policies chosen at exactly the present time, which would be equivalent to simply maximizing P(success). However, one has to be very careful with exactly how to evaluate odds to avoid VNM incoherence.

• First-order play for log-probability over short-term time horizons, as a good idea in real life when probabilities are low, arises the same way as betting fractions of your bankroll arises as a good idea in real life, by:

• expecting to have other future opportunities that look like a chance to play for log-odds gains,

• not expecting to have future opportunities that look like a chance to play for lump-sum-of-probability gains,

• and ultimately the horizon extending out to diminishing returns if you get that far.

That is, the pseudo-myopic version of your strategy is to bet fractions of your bankroll to win fractions of your bankroll. You don’t take a bet with 51% probability of doubling your bankroll and 49% probability of bankruptcy, if you expect more opportunities to bet later, there aren’t later opportunities that just give you lump-sum gains, and there’s a point beyond which money starts to saturate for you.
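For concreteness, here is the standard Kelly-style arithmetic behind the bankroll analogy (toy numbers, not anything from the post itself):

```python
import math

p_win = 0.51  # the hypothetical 51/49 double-or-bankruptcy bet

# Taking the all-in bet has positive expected wealth (1.02x per round)...
assert p_win * 2 > 1.0

# ...but repeatedly going all-in ends in bankruptcy almost surely:
assert p_win ** 100 < 1e-29   # probability of surviving 100 rounds

# A log-wealth (Kelly) bettor instead stakes only a fraction f of the
# bankroll, chosen to maximize expected log growth; for even-money odds
# the optimal fraction is f* = 2p - 1.
def log_growth(f: float) -> float:
    return p_win * math.log(1 + f) + (1 - p_win) * math.log(1 - f)

f_kelly = 2 * p_win - 1                        # 0.02 here
assert log_growth(f_kelly) > 0                 # positive long-run growth
assert log_growth(f_kelly) > log_growth(0.5)   # beats betting half each round
```

The parallel being drawn: when you expect many future log-odds-sized opportunities and no lump-sum ones, playing for log gains is what maximizing the final outcome actually looks like round-to-round.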

• Hmm. It seems like if you really expected to be able to gain log-odds in expectation in repeated bets, you’d immediately update towards a high probability, due to conservation of expected evidence. But maybe a more causal/​materialist model wouldn’t do this because it’s a fairly abstract consideration that doesn’t have obvious material support.

I see why “improve log-odds” is a nice heuristic for iteratively optimizing a policy towards greater chance of success, similar to the WalkSAT algorithm, which solves a constraint problem by changing variables around to reduce the number of violated constraints (even though the actual desideratum is to have no violated constraints); this is a way of “relaxing” the problem in a way that makes iterative hill-climbing-like approaches work more effectively.

Relatedly, some RL approaches give rewards for hitting non-victory targets in a game (e.g. number of levels cleared or key items gained), even if the eventual goal is to achieve a policy that beats the entire game.
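To unpack the WalkSAT reference for readers who haven’t seen it: the solver never optimizes “all constraints satisfied” directly; it hill-climbs on the count of violated clauses, with some random-walk noise. A minimal sketch (simplified, not a production solver):

```python
import random

def walksat(clauses, n_vars, max_flips=10_000, noise=0.5, seed=0):
    """Find a satisfying assignment by flipping variables to reduce the
    number of violated clauses (the relaxed, 'graded' objective)."""
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(n_vars + 1)]  # index 0 unused

    def satisfied(clause):
        # A literal +v means variable v is True; -v means it is False.
        return any(assign[abs(lit)] == (lit > 0) for lit in clause)

    for _ in range(max_flips):
        unsat = [c for c in clauses if not satisfied(c)]
        if not unsat:
            return assign                      # the real goal: zero violations
        clause = rng.choice(unsat)
        if rng.random() < noise:
            var = abs(rng.choice(clause))      # random-walk move
        else:
            def violations_after(v):           # greedy move: fewest violations
                assign[v] = not assign[v]
                count = sum(not satisfied(c) for c in clauses)
                assign[v] = not assign[v]
                return count
            var = min((abs(lit) for lit in clause), key=violations_after)
        assign[var] = not assign[var]
    return None

# (x1 or x2) and (not x1 or x3) and (not x2 or not x3) -- satisfiable
clauses = [[1, 2], [-1, 3], [-2, -3]]
model = walksat(clauses, n_vars=3)
assert model is not None
```

The graded count of violated clauses plays the role that log-odds improvements play in the post: a smooth proxy that local steps can actually climb, even though the real target is the all-or-nothing endpoint.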

• I think possibly the key conceptual distinction you want to make is between short-term play and long-term play. If I deliberately assume an emotional stance, often a lot of the benefit to be gained therefrom is how it translates long-term correct play into myopic play for the emotional reward, assuming of course that the translation is correct. Long-term, you play for absolute probabilities. Short-term, you chase after “dignity”, aka stackable log-odds improvements, at least until you’re out of the curve’s basement.

• I feel like this comment in particular is very clarifying with regards to the motivation of this stance. The benefit is that this imports recommendations of the ideal long-run policy into the short-run frame from which you’re actually acting.

I think that should maybe be in the post somewhere.

• I had a similar thought. Also, in an expected value context it makes sense to pursue actions that succeed when your model is wrong and you are actually closer to the middle of the success curve, because if that’s the case you can increase our chances of survival more easily. In the logarithmic context doing so doesn’t make much sense, since your impact on the log odds is the same no matter where on the success curve you are.

Maybe this objective function (and the whole ethos of Death with Dignity) is a way to justify working on alignment even if you think our chances of success are close to zero. Personally, I’m not compelled by it.

• Measuring units and utilons are different, right? I measure my wealth in dollars but that doesn’t mean my utility function is linear in dollars.

• Do you think the decision heuristic Eliezer is (ambiguously jokingly) suggesting gives different policy recommendations from the more naive “maxipok” or not? If so, where might they differ? If not, what’s your guess as to why Eliezer worded the objective differently from Bostrom? Why involve log-probabilities at all?

• I read this as being “maxipok”, with a few key extensions:

• The ‘default’ probability of success is very low

• There are lots of plans that look like they give some small-but-relatively-attractive probability of success, which are basically all fake /​ picked by motivated reasoning of “there has to be a plan.” (“If we cause WWIII, then there will be a 2% chance of aligning AI, right?”)

• While there aren’t accessible plans that cause success all on their own, there probably are lots of accessible sub-plans which make it more likely that a surprising real plan could succeed. (“Electing a rationalist president won’t solve the problem on its own, but it does mean ‘letters from Einstein’ are more likely to work.”)

• Sent this to my dad, who is an old man as far outside the rationalist bubble as you could possibly be. Doesn’t even know why we’re worried about AGI, but he replied:

No one gets out alive. No one. You should pray.

Somehow it helped me cope.

• This was moving for me.

• I’ve had in my head all day Leonard Cohen’s song “You got me singing,” which he wrote toward the end of his life, watching death approach.

• Well, I’m sure listening to that is going to be difficult. I’ll listen to it sometime when I’m alone and have space to weep.

• I personally find it relatively gentle and reminding-toward-beauty, but YMMV. Lyrics.

• There are no atheists in god-building doomsday cults.

• I suspect that some people would be reassured by hearing bluntly, “Even though we’ve given up hope, we’re not giving up the fight.”

• I think it’s been a long time since MIRI was planning on succeeding at doing the whole thing themselves; I think even if everyone at MIRI still plans on showing up for work and putting in the thinking they’re capable of, the fact that the thinking is now pointed at “die with more dignity” instead of “win the future” feels like it’s “giving up the fight” in some senses.

• Nah, I think of myself as working to win the future, not as having given up the fight in any sense.

I wasn’t optimistic going in to this field; a little additional doominess isn’t going to suddenly make me switch from ‘very pessimistic but trying to win the future’ to ‘very pessimistic and given up’.

As AI_WAIFU eloquently put it: Fuck That Noise. 😊

• A better summary of my attitude:

• I mean this completely seriously: now that MIRI has changed to the Death With Dignity strategy, is there anything that I or anyone on LW can do to help with said strategy, other than pursue independent alignment research? Not that pursuing alignment research is the wrong thing to do, just that you might have better ideas.

• I’ve always thought that something in the context of mental health would be nice.

The idea that humanity is doomed is pretty psychologically hard to deal with. Well, it seems that there is a pretty wide range in how people respond psychologically to it, from what I can tell. Some seem to do just fine. But others seem to be pretty harmed (including myself, not that this is about me; e.g. this post literally brought me to tears). So, yeah, some sort of guidance for how to deal with it would be nice.

Plus it’d serve the purpose of increasing the productivity of and mitigating the risk of burnout for AI researchers, thus increasing humanity’s supposedly slim chances of “making it”. This seems pretty nontrivial to me. AI researchers deal with this stuff on a daily basis. I don’t know much about what sort of mental states are common for them, but I’d guess that something like 10-40% of them are hurt pretty badly. In which case better mental health guidance would yield pretty substantial improvements in productivity. Unfortunately, I think it’s quite the difficult thing to “figure out” though, and for that reason I suspect it isn’t worth sinking too many resources into.

• I mean, I’d like to see a market in dignity certificates, to take care of generating additional dignity in a distributed and market-oriented fashion?

• Do you have any ideas for how to go about measuring dignity?

• Two questions:

1. Do you have skills relevant to building websites, marketing, running events, movement building or ops?

2. How good are you at generating potential downsides for any given project?

1. High charisma/​extroversion, not much else I can think of that’s relevant there. (Other than generally being a fast learner at that type of thing.)

2. Not something I’ve done before.

1. High charisma/​extroversion seems useful for movement building. Do you have any experience in programming or AI?

2. Do you want to give it a go? Let’s suppose you were organising a conference on AI safety. Can you name 5 or 6 ways that the conference could end up being net-negative?

• >Do you have any experience in programming or AI?

Programming yes, and I’d say I’m a skilled amateur, though I need to just do more programming. AI experience, not so much, other than reading (a large amount of) LW.

>Let’s suppose you were organising a conference on AI safety. Can you name 5 or 6 ways that the conference could end up being net-negative?

1. The conference involves someone talking about an extremely taboo topic (eugenics, say) as part of their plan to save the world from AI; the conference is covered in major news outlets as “AI Safety has an X problem” or something along those lines, and leading AI researchers are distracted from their work by the ensuing twitter storm.

2. One of the main speakers at the event is very good at diverting money towards him/​herself through raw charisma and ends up diverting money for projects/​compute away from other, more promising projects; later it turns out that their project actually accelerated the development of an unaligned AI.

3. The conference on AI safety doesn’t involve the people actually trying to build an AGI, and only involves the people who are already committed to and educated about AI alignment. The organizers and conference attendees are reassured by the consensus of “alignment is the most pressing problem we’re facing, and we need to take any steps necessary that don’t hurt us in the long run to fix it,” while that attitude isn’t representative of the audience the organizers actually want to reach. The organizers make future decisions based on the information that “lead AI researchers already are concerned about alignment to the degree we want them to be”, which ends up being wrong and they should have been more focused on reaching lead AI researchers.

4. The conference is just a waste of time, and the attendees could have been doing better things with the time/​resources spent attending.

5. There’s a bus crash on the way to the event, and several key researchers die, setting back progress by years.

6. Similar to #2, the conference convinces researchers that [any of the wrong ways to approach “death with dignity” mentioned in this post] is the best way to try to solve x-risk from AGI, and resources are put towards plans that, if they fail, will fail catastrophically.

7. “If we manage to create an AI smarter than us, won’t it be more moral?” or any AGI-related fallacy disproved in the Sequences is spouted as common wisdom, and people are convinced.

• Cool, so I’d suggest looking into movement-building (obviously take with a grain of salt given how little we’ve talked). It’s probably good to try to develop some AI knowledge as well so that people will take you more seriously, but it’s not like you’d need that before you start.

You did pretty well in terms of generating ways it could be net-negative. That makes me more confident that you would be able to have a net-positive impact.

I guess it’d also be nice to have some degree of organisational skills, but honestly, if there isn’t anyone else doing AI safety movement-building in your area all you have to be is not completely terrible so long as you are aware of your limits and avoid organising anything that would go beyond them.

• What about Hail Mary strategies that were previously discarded due to being too risky? I can think of a couple off the top of my head. A cornered rat should always fight.

• Do they perchance have significant downsides if they fail? Just wildly guessing, here. I’m a lot more cheerful about Hail Mary strategies that don’t explode when the passes fail, and take out the timelines that still had hope in them after all.

• As a Hail Mary strategy, how about making a 100% effort at getting elected in a small democratic voting district?

And, if that works, make a 100% effort to become elected by bigger and bigger districts—until all democratic countries support the [a stronger humanity can be reached by a systematic investigation of our surroundings, cooperation in the production of private and public goods, which includes not creating powerful aliens]-party?

Yes, yes, politics is horrible. BUT. What if you could do this within 8 years? AND, you test it by only trying one or two districts… one or two months each? So, in total it would cost at the most four months.

Downsides? Political corruption is the biggest one. But, I believe your approach to politics would be a continuation of what you do now, so if you succeeded it would only be by strengthening the existing EA/​Humanitarian/​Skeptical/​Transhumanist/​Libertarian-movements.

There may be a huge downside for you personally, as you may have to engage in some appropriate signalling to make people vote for your party. But, maybe it isn’t necessary. And if the whole thing doesn’t work it would only be for four months, tops.

• Yeah, most of them do. I have some hope for the strategy-cluster that uses widespread propaganda[1] as a coordination mechanism.

Given the whole “brilliant elites” thing, and the fecundity of rationalist memes among such people, I think it’s possible to shift the world to a better Nash equilibrium.

1. ↩︎

Making more rationalists is all well and good, but let’s not shy away from no holds barred memetic warfare.

• Is it not obvious to you that this constitutes dying with less dignity, or is it obvious but you disagree that death with dignity is the correct way to go?

• Dignity exists within human minds. If human-descended minds go extinct, dignity doesn’t matter. Nature grades us upon what happens, not how hard we try. There is no goal I hold greater than the preservation of humanity.

• Did you read the OP post? The post identifies dignity with reductions in existential risk, and it talks a bunch about the ‘let’s violate ethical injunctions willy-nilly’ strategy.

• The post assumes that there are no ethics-violating strategies that will work. I understand that people can just-world-fallacy their way into thinking that they will be saved if only they sacrifice their deontology. What I’m saying is that deontology-violating strategies should be adopted if they offer, say, +1e-5 odds of success.

• One of Eliezer’s points is that most people’s judgements about adding 1e-5 odds (I assume you mean log odds and not additive probability?) are wrong, and even systematically have the wrong sign.
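For concreteness on the log-odds vs. additive-probability distinction (an illustration with made-up numbers, not anything from the post): for small probabilities, adding to the log odds multiplies the odds, which is a far smaller change than adding the same amount to the raw probability.

```python
import math

def logodds(p):
    """Natural-log odds of probability p."""
    return math.log(p / (1 - p))

def from_logodds(l):
    """Inverse: probability corresponding to natural-log odds l."""
    return 1 / (1 + math.exp(-l))

p = 1e-6  # hypothetical baseline probability of a good outcome

# Interpretation 1: add 1e-5 to the raw probability.
p_additive = p + 1e-5

# Interpretation 2: add 1e-5 to the log odds (multiplies odds by e**1e-5).
p_logodds = from_logodds(logodds(p) + 1e-5)

print(p_additive)  # 1.1e-05 -- roughly an 11x improvement
print(p_logodds)   # ~1.00001e-06 -- almost no change
```

So “+1e-5 odds of success” means something very different depending on which reading is intended, which is part of why the sign and size of such estimates are easy to get wrong.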

• The post talks about how most people are unable to evaluate these odds accurately, and how thinking you’ve found a loophole is actually a sign that you are one of those people.

• Pretty telling IMHO to see such massive herding on the downvotes here, for such an obviously-correct point. Disappointing!

• Coordination on what, exactly?

• Coordination (cartelization) so that AI capabilities are not a race to the bottom

• Coordination to indefinitely halt semiconductor supply chains

• Coordination to shun and sanction those who research AI capabilities (compare: coordination against embryonic human gene editing)

• Coordination to deliberately turn Moore’s Law back a few years (yes, I’m serious)

• And do you think if you try that, you’ll succeed, and that the world will then be saved?

• These are all strategies to buy time, so that alignment efforts may have more exposure to miracle-risk.

• And what do you think are the chances that those strategies work, or that the world lives after you hypothetically buy three or six more years that way?

• I’m not well calibrated on sub 1% probabilities. Yeah, the odds are low.

There are other classes of Hail Mary. Picture a pair of researchers, one of whom controls an electrode wired to the pleasure centers of the other. Imagine they have free access to methamphetamine and LSD. I don’t think research output is anywhere near where it could be.

• So—just to be very clear here—the plan is that you do the bad thing, and then almost certainly everybody dies anyways even if that works?

I think at that level you want to exhale, step back, and not injure the reputations of the people who are gathering resources, finding out what they can, and watching closely for the first signs of a positive miracle. The surviving worlds aren’t the ones with unethical plans that seem like they couldn’t possibly work even on the open face of things; the real surviving worlds are only injured by people who imagine that throwing away their ethics surely means they must be buying something positive.

• Fine. What do you think about the human-augmentation cluster of strategies? I recall you thought along very similar lines circa ~2001.

• I don’t think we’ll have time, but I’d favor getting started anyways. Seems a bit more dignified.

• Great! If I recall correctly, you wanted genetically optimized kids to be gestated and trained.

I suspect that akrasia is a much bigger problem than most people think, and that to be truly effective, one must outsource part of one’s reward function. There could be massive gains.

What do you think about the setup I outlined, where a pair of researchers exist such that one controls an electrode embedded in the other’s reward center? Think Focus from Vinge’s A Deepness in the Sky.

• (I predict that would help with AI safety, in that it would swiftly provide useful examples of reward hacking and misaligned incentives)

• I think memetically ‘optimized’ kids (and adults?) might be an interesting alternative to explore. That is, more scalable and better education for the ‘consequentialists’ (I have no clue how to teach people who are not ‘consequentialist’; hopefully someone else can teach those) may get human thought-enhancement results earlier, and make them available to more people. There has been some work in this space and some successes, but I think that in general, the “memetics experts” and the “education experts” haven’t been cooperating as much as they should. I think it would seem dignified to me to try bridging this gap. If this is indeed dignified, then that would be good, because I’m currently in the early stages of a project trying to bridge this gap.

• A better version than reward hacking that I can think of is inducing a state of jhana (basically a pleasure button) in alignment researchers. For example, use Neuralink to record the brain-process of ~1000 people going through the jhanas at multiple time-steps, average them in a meaningful way, and induce those brainwaves in other people.

The effect is people being satiated with the feeling of happiness (like being satiated with food/water), and being more effective as a result.

• The “electrode in the reward center” setup has been proven to work in humans, whereas jhanas may not transfer over Neuralink.

• Deep brain stimulation is FDA-approved in humans, meaning less (though nonzero) regulatory fuckery will be required.

• Happiness is not pleasure; wanting is not liking. We are after reinforcement.

• Could you link the proven part?

Jhanas seem much healthier, though I’m pretty confused imagining your setup, so I don’t have much confidence. Say it works, gets past the problems of generalizing reward (e.g. the brain only rewarding specific parts of research and not others), and avoids the downward-spiral effects of people hacking themselves; then we hopefully have people who look forward to doing certain parts of research.

If you model humans as multi-agents, it’s making a certain type of agent (the “do research” one) have a stronger say in what actions get done. This is not as robust as getting all the agents to agree and not fight each other. I believe jhana gets part of that done because some sub-agents are pursuing the feeling of happiness and you can get that any time.

• Is the average human life experientially negative, such that buying three more years of existence for the planet is ethically net-negative?

(Honest question)

• People’s revealed choice in tenaciously staying alive and keeping others alive suggests otherwise. This everyday observation trumps all philosophical argument that fire does not burn, water is not wet, and bears do not shit in the woods.

• I’m not immediately convinced (I think you need another ingredient).

Imagine a kind of orthogonality thesis but with experiential valence on one axis and ‘staying aliveness’ on the other. I think it goes through (one existence proof for the experientially-horrible-but-high-staying-aliveness quadrant might be the complex of torturer+torturee).

Another ingredient you need to posit for this argument to go through is that, as humans are constituted, experiential valence is causally correlated with behaviour in a way such that negative experiential valence reliably causes not-staying-aliveness. I think we do probably have this ingredient, but it’s not entirely clear cut to me.

• Unlike jayterwahl, I don’t consider experiential valence, which I take to mean mental sensations of pleasure and pain in the immediate moment, as of great importance in itself. It may be a sign that I am doing well or badly at life, but like the score on a test, it is only a proxy for what matters. People also have promises to keep, and miles to go before they sleep.

• I think many of the things that you might want to do in order to slow down tech development are things that will dramatically worsen human experiences, or reduce the number of them. Making a trade like that in order to purchase the whole future seems like it’s worth considering; making a trade like that in order to purchase three more years seems much more obviously not worth it.

• I will note that I’m still a little confused about Butlerian Jihad style approaches (where you smash all the computers, or restrict them to the capability available in 1999 or w/e); if I remember correctly Eliezer has called that a ‘straightforward loss’, which seems correct from a ‘cosmic endowment’ perspective but not from a ‘counting up from ~10 remaining years’ perspective.

My guess is that the main response is “look, if you can coordinate to smash all of the computers, you can probably coordinate on the less destructive-to-potential task of just not building AGI, and the difficulty is primarily in coordinating at all instead of the coordination target.”

• Suppose they don’t? I have at least one that AFAICT doesn’t do anything worse than take researchers/resources away from AI alignment in most bad-ends, and even in the worst-case scenario “just” generates a paperclipper anyway. Which, to be clear, is bad, but not any worse than the current timeline.

(Namely, actual literal time travel and outcome pumps. There is some reason to believe that an outcome pump with a sufficiently short time horizon is easier to safely get hypercompute out of than an AGI, and that a “time machine” that moves an electron back a microsecond is at least energetically within bounds of near-term technology.

You are welcome to complain that time travel is completely incoherent if you like; I’m not exactly convinced myself. But so far, the laws of physics have avoided actually banning CTCs outright.)

• a “time machine” that moves an electron back a microsecond is at least energetically within bounds of near-term technology.

Do you have a pointer for this? Traversable wormholes tend to require massive amounts of energy[1] (as in, amounts of energy that are easier to state in c^2 units).

There is some reason to believe that an outcome pump with a sufficiently short time horizon is easier to safely get hypercompute out of than an AGI, and that a “time machine” that moves an electron back a microsecond [...]

Note: this isn’t strictly hypercompute. Finite speed of light means that you can only address a finite number of bits within a fixed time, and your critical path is limited by the timescale of the CTC.

That being said, figuring out the final state of a 1TB-state-vector[2] FSM would itself be very useful. Just not strictly hypercomputation.

1. ^

Or negative energy density. Or massive amounts of negative energy density.

2. ^

Ballpark. Roundtrip to 1TB of RAM in 1us is doable.
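For what it’s worth, the footnote’s ballpark checks out with straightforward arithmetic (my back-of-envelope, not the commenter’s): light covers about 300 m in a microsecond, so a 1 µs round trip confines the addressable hardware to a ~150 m radius, which easily accommodates 1 TB of RAM. The CTC timescale thus bounds critical-path latency, not the amount of state.

```python
# Back-of-envelope check of the footnote's "1 TB in 1 us" claim.
# (My arithmetic; the specific numbers are illustrative assumptions.)
C = 299_792_458   # speed of light in vacuum, m/s
T = 1e-6          # assumed CTC timescale: one microsecond

# A signal must reach the memory and return within T, so all hardware
# must sit within half the light-travel distance.
radius_m = C * T / 2
print(radius_m)   # ~149.9 m

# State vector size from the comment: 1 TB = 8e12 bits, which fits in
# far less than a 150 m sphere with current DRAM densities.
bits = 8 * 10**12
print(bits)
```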

• Never even THINK ABOUT trying a hail mary if it also comes with an increased chance of s-risk. I’d much rather just die.

• Speaking of which, one thing we should be doing is keeping a lookout for opportunities to reduce s-risk (with dignity) … I haven’t yet been convinced that s-risk reduction is intractable.

• The most obvious way to reduce s-risk would be to increase x-risk, but somehow that doesn’t sound very appealing...

• This is an example of what EY is talking about, I think—as far as I can tell, all the obvious things one would do to reduce s-risk via increasing x-risk are the sort of supervillain schemes that are more likely to increase s-risk than decrease it once secondary effects and unintended consequences etc. are taken into account. This is partly why I put the “with dignity” qualifier in. (The other reason is that I’m not a utilitarian and don’t think our decision about whether to do supervillain schemes should come down to whether we think the astronomical long-term consequences are slightly more likely to be positive than negative.)

• Suppose, for example, that you’re going to try to build an AGI anyway. You could just not try to train it to care about human values, hoping that it would destroy the world, rather than creating some kind of crazy mind-control dystopia.

I submit that, if your model of the universe is that AGI will, by default, be a huge x-risk and/or a huge s-risk, then the “supervillain” step in that process would be deciding to build it in the first place, not the decision not to try to “align” it. You lost your dignity at the first step, and won’t lose any more at the second.

Also, I kind of hate to say it, but sometimes the stuff about “secondary effects and unintended consequences” sounds more like “I’m looking for reasons not to break widely-loved deontological rules, regardless of my professed ethical system, because I am uncomfortable with breaking those rules” than like actual caution. It’s very easy to stop looking for more effects in either direction when you reach the conclusion you want.

I mean, yes, those deontological rules are useful time-tested heuristics. Yes, a lot of the time the likely consequences of violating them will be bad in clearly foreseeable ways. Yes, you are imperfect and should also be very, very nervous about consequences you do not foresee. But all of that can also act as convenient cover for switching from being an actual utilitarian to being an actual deontologist, without ever saying as much.

Personally, I’m neither. And I also don’t believe that intelligence, in any actually achievable quantity, is a magic wand that automatically lets you either destroy the world or take over and torture everybody. And I very much doubt that ML-as-presently-practiced, without serious structural innovations and running on physically realizable computers, will get all that smart anyway. So I don’t really have an incentive to get all supervillainy to begin with. And I wouldn’t be good at it anyhow.

… but if faced with a choice between a certainty of destroying the world, and a certainty of every living being being tortured for eternity, even I would go with the “destroy” option.

• I can imagine a plausible scenario in which WW3 is a great thing, because both sides brick each other’s datacenters and bomb each other’s semiconductor fabs. Also, all the tech talent will be spent trying to hack the other side and will not be spent training bigger and bigger language models.

• I imagine that WW3 would be an incredibly strong pressure, akin to WW2, which causes governments to finally sit up and take notice of AI.

And then spend several trillion dollars running Manhattan Project Two: Manhattan Harder, racing each other to be the first to get AI.

And then we die even faster, and instead of being converted into paperclips, we’re converted into tiny American/Chinese flags.

• Missed opportunity to call it Manhattan Project Two: The Bronx.

• That only gives you a brief delay on a timeline which could, depending on the horizons you adopt, be billions of years long. If you really wanted to reduce s-risk in an absolute sense, you’d have to try to sterilize the planet, not set back semiconductor manufacturing by a decade. This, I think, is a project which should give one pause.

• The downvotes on my comment reflect a threat we all need to be extremely mindful of: people who are so terrified of death that they’d rather flip the coin on condemning us all to hell, than die. They’ll only grow ever more desperate & willing to resort to more hideously reckless hail marys as we draw closer.

• Upvoting you because I think this is an important point to be made, even if I’m unsure how much I agree with it. We need people pushing back against potentially deeply unethical schemes, even if said schemes also have the potential to be extremely valuable (not that I’ve seen very many of those at all; most proposed supervillain schemes would pretty obviously be a Bad Idea™). Having the dialogue is valuable, and it’s disappointing to see unpopular thoughts downvoted here.

• Improved title:

• Q6: Hey, this was posted on April 1st. All of this is just an April Fool’s joke, right?

A: Why, of course! Or rather, it’s a preview of what might be needful to say later, if matters really do get that desperate. You don’t want to drop that on people suddenly and with no warning.

In my own accounting I’m going to consider this a lie (of the sort argued against in Q4) in possible worlds where Eliezer in fact believes things are this desperate, UNLESS there is some clarification by Eliezer that he didn’t mean to imply that things aren’t nearly this desperate.

Reasons to suspect Eliezer may think it really is this desperate:

1. Nate Soares writes in this post, “I (Nate) don’t know of any plan for achieving a stellar future that I believe has much hope worth speaking of. I consider this one of our key bottlenecks.”

2. Eliezer says in this dialogue “I consider the present gameboard to look incredibly grim, and I don’t actually see a way out through hard work alone. We can hope there’s a miracle that violates some aspect of my background model, and we can try to prepare for that unknown miracle; preparing for an unknown miracle probably looks like ‘Trying to die with more dignity on the mainline’ (because if you can die with more dignity on the mainline, you are better positioned to take advantage of a miracle if it occurs).”

3. I’ve heard through other channels that Eliezer is quite pessimistic about solving alignment at this point.

• Lies are intended to deceive. If I say I’m a teapot, and everyone knows I’m not a teapot, I think one shouldn’t use the same word for that as for misrepresenting one’s STD status.

Doubly true on April 1st, which is, among its other uses, an unusually good day to say things that can only be said with literally false statements, if you’d not be a liar.

• What is the proposition you believe “everyone knows” in this case? (The proposition that seemed ambiguous to me was “Eliezer believes alignment is so unlikely that going for dying with dignity on the mainline is the right approach”).

If someone says X on April Fools, then says “April fools, X is false, Y is true instead!”, and they disbelieve Y (and it’s at least plausible to some parties that they believe Y), that’s still a lie even though it’s April Fools, since they’re claiming to have popped out of the April Fools context by saying “April Fools”.

• I think this isn’t the first time I’ve seen the April Fools-April Fools joke, where someone says “True thing, April Fools, Lie!”, but I agree that this is ‘bad form’ in some way.

I had been, midway thru the post, intending to write a comment that was something like “hmm, does posting this on April 1st have more or less dignity than April 2nd?”, and then got to that point. My interpretation is something like: if someone wants to dismiss Eliezer for reasons of their psychological health, Eliezer wants them to have an out, and the best out he could give them was “he posted his main strategic update on April 1st, so it has to be a joke, and he confirmed that in his post.” But of course this out involves some sort of detachment between their beliefs and his beliefs.

In my terminology, it’s a ‘collusion’ instead of a ‘lie’; the sort of thing where I help someone believe that I liked their cooking because they would rather have that conclusion than be correlated with reality. The main difference between them is something like “whose preferences are being satisfied”; lies are typically win-lose to a much larger degree than collusions are. [Or, like, enough people meta-prefer collusions for them to be common, but meta-prefer not lying such that non-collusion lying is relatively rare.]

• It’s odd to me that you take a stance that I interpret as “saying something false was okay because it was convenient to do so”[1], given that I believe you dislike requiring unneeded context for “intelligent discourse”, and that you once spent an impressive amount of time finding a way to never fully lie.

Obviously, you could have distributed a codebook and titled the article “Flagh guc Pontre” instead of “Death with Dignity”, but using a nonstandard mapping from symbols to ideas feels like giving up entirely on the concept of truth in communication.

1. ↩︎

Or the ol’ 4Chan creed “Only a fool would take anything posted here as fact.”

• Not lying. Saying a thing in such a way that it’s impossible to tell whether he believes it or not, and doing that explicitly.

Seems a very honest thing to do to me, if you have a thing you want to say, but do not want people to know whether you believe it or not. As to the why of that, I have no idea. But I do not feel deceived.

• Do you think it was clear to over 90% of readers that the part where he says “April fools, this is just a test!” is not a statement of truth?

• It’s not clear to me. Maybe it is an April Fools’ joke!

• Also this comment:

Eliezer, do you have any advice for someone wanting to enter this research space at (from your perspective) the eleventh hour?

I don’t have any such advice at the moment. It’s not clear to me what makes a difference at this point.

• Thanks for asking about this. I was confused about the extent to which this is an April Fools joke. The discussion here helped clarify my confusion a decent amount, which has been decently useful for me.

• It sounds like Eliezer is confident that alignment will fail. If so, the way out is to make sure AGI isn’t built. I think that’s more realistic than it sounds.

1. LessWrong is influential enough to achieve policy goals

Right now, the Yann LeCun view of AI is probably more mainstream, but that can change fast.

LessWrong is upstream of influential thinkers. For example:
- Zvi and Scott Alexander read LessWrong. Let’s call folks like them Filter #1
- Tyler Cowen reads Zvi and Scott Alexander. (Filter #2)
- Malcolm Gladwell, a mainstream influencer, reads Tyler Cowen every morning (Filter #3)

I could’ve made a similar chain with Ezra Klein or Holden Karnofsky. All these chains put together are a lot of influence.

Right now, I think Eliezer’s argument (AI capabilities research will destroy the world) is blocked at Filter #1. None of the Filter #1 authors have endorsed it. Why should they? The argument relies on intuition. There’s no way for Filter #1 to evaluate it. I think that’s why Scott Alexander and Holden Karnofsky hedged, neither explicitly endorsing nor rejecting the doom theory.

Even if they believed Eliezer, Filter #1 authors need to communicate more than an intuition to Filter #2. Imagine the article: “Eliezer et al have a strong intuition that the sky is falling. We’re working on finding some evidence. In the meantime, you need to pass some policies real fast.”

In short, ideas from LessWrong can exert a strong influence on policymakers. This particular idea hasn’t because it isn’t legible and Filter #1 isn’t persuaded.

2. If implemented early, government policy can prevent AGI development

AGI development is expensive. If Google/Facebook/Huawei didn’t expect to make a lot of money from capabilities development, they’d stop investing in it. This means that the pace of AI is very responsive to government policy.

If the US, China, and EU want to prevent AGI development, I bet they’d get their way. This seems a job for a regulatory agency. Pick a (hopefully narrow) set of technologies and make it illegal to research them without approval.

This isn’t as awful as it sounds. The FAA basically worked, and accidents in the air are very rare. If Eliezer’s argument is true, the costs are tiny compared to the benefits. A burdensome bureaucracy vs destruction of the universe.

Imagine a hypothetical world, where mainstream opinion (like you’d find in the New York Times) says that AGI would destroy the world, and a powerful regulatory agency has the law on its side. I bet AGI is delayed by decades.

3. Don’t underestimate how effectively the US government can do this job

Don’t over-index on covid or climate change. AI safety is different. Covid and climate change both demand sacrifices from the entire population. This is hugely unpopular. AI safety, on the other hand, only demands sacrifices from a small number of companies.

For now, I think the top priority is to clearly and persuasively demonstrate why alignment won’t be solved in the next 30 years. This is crazy hard, but it might be way easier than actually solving alignment.

• I tend to agree that Eliezer (among others) underestimates the potential value of US federal policy. But on the other hand, note No Fire Alarm, which I mostly disagree with but which has some great points and is good for understanding Eliezer’s perspective. Also note (among other reasons) that policy preventing AGI is hard because it needs to stop every potentially feasible AGI project, but: (1) defining ‘AGI research’ in a sufficient manner is hard, especially when (2) at least some companies naturally want to get around such regulations, and (3) at least some governments are likely to believe there is a large strategic advantage to their state ‘getting AGI first,’ and arms control for software is hard because states wouldn’t think they could trust each other, and verifying compliance would probably be very invasive, so states would be averse to such verification. Eliezer has also written about why he’s pessimistic about policy elsewhere, though I don’t have a link off the top of my head.

• Eliezer gives alignment a 0% chance of succeeding. I think policy, if tried seriously, has >50%. So it’s a giant opportunity that’s gotten way too little attention.

I’m optimistic about policy for big companies in particular. They have a lot to lose from breaking the law, they’re easy to inspect (because there are so few), and there’s lots of precedent (ITAR already covers some software). Right now, serious AI capabilities research just isn’t profitable outside of the big tech companies.

Voluntary compliance is also a very real thing. Lots of AI researchers are wealthy and high-status, and they’d have a lot to lose from breaking the law. At the very least, a law would stop them from publishing their research. A field like this also lends itself to undercover enforcement.

I think an agreement with China is impossible now, because prominent folks don’t even believe the threat exists. Two factors could change the art of the possible. First, if there were a widely known argument about the dangers of AI, on which most public intellectuals agreed. Second, since the US has a technological lead, an agreement could actually be to its advantage.

• Look at gain-of-function research to see the result of a government moratorium on research. At first Baric feared that the moratorium would end his research. Then the NIH declared that his research isn’t officially gain of function and continued funding him.

Regulating gain of function research away is essentially easy mode compared to AI.

A real Butlerian jihad would be much harder.

• I agree that it’s hard, but there are all sorts of possible moves (like LessWrong folks choosing to work at this future regulatory agency, or putting massive amounts of lobbying funds into making sure the rules are strict).

If the alternative (solving alignment) seems impossible given 30 years and massive amounts of money, then even a really hard policy problem seems easy by comparison.

• How about if you solve a ban on gain-of-function research first, and then move on to much harder problems like AGI? A victory on this relatively easy case would result in a lot of valuable gained experience, or, alternatively, allow foolish optimists to have their dangerous optimism broken over shorter time horizons.

• foolish optimists to have their dangerous optimism broken

I’m pretty confused about your confidence in your assertion here. Have you spoken to people who’ve led successful government policy efforts, to ground this pessimism? Why does the IAEA exist? How did ARPA-E happen? Why is a massive subsidy for geothermal well within the Overton Window, and thus in a bill Joe Manchin said he would sign?

Gain of function research is the remit of a decades-old incumbent bureaucracy (the NIH) that oversees bio policy, and doesn’t like listening to outsiders. There’s no such equivalent for AI; everyone in the government keeps asking “what should we do” and all the experts shrug or disagree with each other. What if they mostly didn’t?

Where is your imagined inertia/​political opposition coming from? Is it literally skepticism that senators show up for work every day? What if I told you that most of them do and that things with low political salience and broad expert agreement happen all the time?

• Where is your imagined inertia/​political opposition coming from? Is it literally skepticism that senators show up for work every day? What if I told you that most of them do and that things with low political salience and broad expert agreement happen all the time?

Where my skepticism is coming from (for AI policy) is: what’s the ban, in enough detail that it could actually be a law?

Are we going to have an Office of Program Approval, where people have to send code, the government has to read it, and only once the government signs off, it can get run? If so, the whole tech industry will try to bury you, and even if you succeed, how are you going to staff that office with people who can tell the difference between AGI code and non-AGI code?

Are we going to have laws about what not to do, plus an office of lawyers looking for people breaking the laws? (This is more the SEC model.) Then this is mostly a ban on doing things in public; the NHTSA only knew to send George Hotz a cease-and-desist because he was uploading videos of the stuff he was doing. Maybe you can get enough visibility into OpenAI and Anthropic, but do you also need to get the UK to create such an office to get visibility into DeepMind? If the Canadian government, proud of its AI industry and happy to support it, doesn’t make such an office, do the companies just move there?

[Like, the federal government stopped the NIH from funding stem cell research for moral reasons, and California said “fine, we’ll fund it instead.”]

If the laws are just “don’t make AI that will murder people or overthrow the government”, well, we already have laws against murdering people and overthrowing the government. The thing I’m worried about is someone running a program that they think will be fine which turns out to not be fine, and it’s hard to bridge the gap between anticipated and actual consequences with laws.

• To clarify, I largely agree with the viewpoint that “just announcing a law banning AGI” is incoherent and underspecified. But the job will with high probability be much easier than regulating the entire financial sector (the SEC’s job), which can really only be done reactively.

If AGI projects cost >$1B and require specific company cultural DNA, it’s entirely possible that we’re talking about fewer than 20 firms across the Western world. These companies will be direct competitors, and incentivized to both (1) make sure the process isn’t too onerous and (2) heavily police competitors in case they try to defect, since that would lead to an unfair advantage. The problem here is preventing overall drift towards unsafe systems, and that is much easier for a central actor like a government to coordinate.

Re: Canada and the UK, I’m really not sure why you think those societies would be less prone to policy influence; as far as I can tell they’re actually much easier cases. “Bring your business here, we don’t believe the majority of the experts [assuming we can get that] that unregulated development is decently likely to spawn a terminator that might kill everyone” is actually not a great political slogan, pretty much anywhere.

• But the job will with high probability be much easier than regulating the entire financial sector (the SEC’s job), which can really only be done reactively.

I’m interested in the details here! Like, ‘easier’ in the sense of “requires fewer professionals”, “requires fewer rulings by judges”, “lower downside risk”, “less adversarial optimization pressure”, something else?

[For context, in my understanding of the analogy between financial regulation and AI, the event in finance analogous to when humans would lose control of the future to AI was probably around the point of John Law.]

If AGI projects cost >$1B and require specific company cultural DNA, it’s entirely possible that we’re talking about fewer than 20 firms across the Western world.

[EDIT] Also I should note I’m more optimistic about this the more expensive AGI is /​ the fewer companies can approach it. My guess is that a compute-centric regulatory approach—one where you can’t use more than X compute without going to the government office or w/​e—has an easier shot at working than one that tries to operate on conceptual boundaries. But we need it to be the case that much compute is actually required, and that alternative approaches to assembling that much compute (like Folding@Home, or secret government supercomputers, or w/​e) are taken seriously.

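For concreteness, here is a minimal sketch of how a compute-triggered reporting rule could work mechanically. Everything in it is my illustration, not a proposal from the thread: the ~6 × parameters × tokens FLOP count is the common scaling-laws rule of thumb for dense transformer training, and the 1e24 FLOP threshold is an arbitrary placeholder.

```python
# Illustrative sketch only: a compute-triggered review rule.
# The 6 * N * D FLOP estimate is the usual rule of thumb for dense
# transformer training; the threshold below is a made-up placeholder.
REVIEW_THRESHOLD_FLOP = 1e24  # hypothetical "go to the government office" line

def training_flops(n_params: float, n_tokens: float) -> float:
    """Rough total training compute via the ~6 * params * tokens estimate."""
    return 6.0 * n_params * n_tokens

def needs_review(n_params: float, n_tokens: float) -> bool:
    """Would this run cross the (placeholder) reporting threshold?"""
    return training_flops(n_params, n_tokens) >= REVIEW_THRESHOLD_FLOP

# A 175B-parameter model on 300B tokens lands around 3.15e23 FLOP:
print(needs_review(175e9, 300e9))  # False under this placeholder threshold
print(needs_review(1e12, 10e12))   # 6e25 FLOP -> True
```

The point of the sketch is that the trigger is a single measurable number, which is what would make a compute-centric rule easier to enforce than one drawn on conceptual boundaries.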
> “Bring your business here, we don’t believe the majority of the experts [assuming we can get that] that unregulated development is decently likely to spawn a terminator that might kill everyone” is actually not a great political slogan, pretty much anywhere.

Maybe? One of the things that’s sort of hazardous about AI (and is similarly hazardous about finance) is that rainbow after rainbow leads to a pot of gold. First AI solves car accidents, then they solve having to put soldiers in dangerous situations, then they solve climate change, then they solve cancer, then—except at some point in there, you accidentally lose control of the future and probably everyone dies. And it’s pretty easy for people to dismiss this sort of concern on psychological grounds, like Steven Pinker does in Enlightenment Now.

• > I’m interested in the details here! Like, ‘easier’ in the sense of “requires fewer professionals”, “requires fewer rulings by judges”, “lower downside risk”, “less adversarial optimization pressure”, something else?

By “easier”, I specifically mean “overseeing fewer firms, each taking fewer actions”. I wholeheartedly agree that any sort of regulation is predicated on getting lucky re: AGI not requiring <100M amounts of compute, when it’s developed. If as many actors can create/​use AGI as can run hedge funds, policy is probably not going to help much.

> My guess is that a compute-centric regulatory approach—one where you can’t use more than X compute without going to the government office or w/​e—has an easier shot of working than one that tries to operate on conceptual boundaries. But we need it to be the case that much compute is actually required, and building alternative approaches to assembling that much compute (like Folding@Home, or secret government supercomputers, or w/​e) are taken seriously.

IMO secret government supercomputers will never be regulatable; the only hope there is government self-regulation (by which I mean, getting governments as worried about AGI catastrophes as their leading private-sector counterparts). Folding@Home equivalents are something of an open problem; if there was one major uncertainty, I’d say they’re it, but again this is less of a problem the more compute is required.

> One of the things that’s sort of hazardous about AI (and is similarly hazardous about finance) is that rainbow after rainbow leads to a pot of gold

I think that you are absolutely correct that unless e.g. the hard problem of corrigibility gets verified by the scientific community, promulgated to adjacent elites, and popularized with the public, there is little chance that proto-AGI-designers will face pressure to curb their actions. But those actions are not “impossible” in some concrete sense; they just require talent and expertise in mass persuasion, instead of community-building.

• We probably have a ban on gain-of-function research in the bag, since it seems relatively easy to persuade intellectuals of the merits of the idea.
How that then translates to real-world policy is opaque to me, but give it fifty years? Half the crackpot ideas that were popular at college have come true over my lifetime. Our problem with AI is that we can’t convince anyone that it’s dangerous. And we may not need the fifty years! Reaching intellectual consensus might be good enough to slow it down until the government gets round to banning it.

Weirdly, the other day I ran into a most eminent historian and he asked me what I’d been doing lately. As it happened I’d been worrying about AI, and so I gave him the potted version, and straight away he said: “Shouldn’t we ban it then?”, and I was like: “I think so, but that makes me a crank amongst cranks”.

My problem is that I am not capable of convincing computer scientists and mathematicians, who are usually the people who think most like me. They always start blithering on about consciousness or “if it’s clever enough to … then why …” etc., and although I can usually answer their immediate objections, they just come up with something else. But even my closest friends have taken a decade to realize that I might be worrying about something real instead of off on one. And I haven’t got even a significant minority of them. And I think that’s because I don’t really understand myself. I have a terrible intuition about powerful optimization processes and that’s it.

• > We probably have a ban on gain-of-function research in the bag, since it seems relatively easy to persuade intellectuals of the merits of the idea.

Is this the case? Like, we had a moratorium on federal funding (not even on doing it, just whether or not taxpayers would pay for it), and it was controversial, and then we dropped it after 3 years.
You might have thought that it would be a slam dunk after there was a pandemic for which lab leak was even a plausible origin, but the people who would have been considered most responsible quickly jumped into the public sphere and tried really hard to discredit the idea. I think this is part of a general problem, which is that special interests are very committed to an issue and the public is very uncommitted, and that balance generally favors the special interests. [It’s Peter Daszak’s life on the line for the lab leak hypothesis, and a minor issue to me.] I suspect that if it ever looks like “getting rid of algorithms” is seriously on the table, lots of people will try really hard to prevent that from becoming policy.

• > Is this the case? Like, we had a moratorium on federal funding (not even on doing it, just whether or not taxpayers would pay for it), and it was controversial, and then we dropped it after 3 years.

And more crucially, it didn’t even stop the federal funding of Baric while it was in place. The equivalent would be that you outlaw AGI development but do nothing about people training tool AIs, and people simply declaring their development to be tool AI development in response to the regulation.

• It’s certainly fairly easy to persuade people that it’s a good idea, but you might be right that asymmetric lobbying can keep good ideas off the table indefinitely. On the other hand, ‘cigarettes cause cancer’ to ‘smoking bans’ took about fifty years despite an obvious asymmetry in favour of tobacco. As I say, politics is all rather opaque to me, but once an idea is universally agreed amongst intellectuals it does seem to eventually result in political action.

• Given the lack of available moves that are promising, attempting to influence policy is a reasonable move. It’s part of the 80,000 Hours career suggestions. On the other hand it’s a long shot, and I see no reason to expect a high likelihood of success.
• > First, if there were a widely known argument about the dangers of AI, on which most public intellectuals agreed.

This is exactly what we have piloted at the Existential Risk Observatory, a Dutch nonprofit founded last year. I’d say we’re fairly successful so far. Our aim is to reduce human extinction risk (especially from AGI) by informing the public debate. Concretely, what we’ve done in the past year in the Netherlands is (I’m including the detailed description so others can copy our approach—I think they should):

1. We have set up a good-looking website, found a board, set up a legal entity.
2. Asked and obtained endorsement from academics already familiar with existential risk.
3. Found a freelance, well-known ex-journalist and ex-parliamentarian to work with us as a media strategist.
4. Wrote op-eds warning about AGI existential risk, as explicitly as possible, but heeding the media strategist’s advice. Sometimes we used academic co-authors. Four out of six of our op-eds were published in leading newspapers in print.
5. Organized drinks, networked with journalists, introduced them to others who are into AGI existential risk (e.g. EAs).

Our most recent result (last weekend) is that a prominent columnist who is agenda-setting on tech and privacy issues in NRC Handelsblad, the Dutch equivalent of the New York Times, wrote a piece where he talked about AGI existential risk as an actual thing. We’ve also had a meeting with the chairwoman of the Dutch parliamentary committee on digitization (the line between a published article and a policy meeting is direct), and a debate about AGI xrisk in the leading debate centre now seems fairly likely. We’re not there yet, but we’ve only done this for less than a year, we’re tiny, we don’t have anyone with a significant profile, and we were self-funded (we recently got our first funding from SFF—thanks guys!). I don’t see any reason why our approach wouldn’t translate to other countries, including the US.
If you do this for a few years, consistently, and in a coordinated and funded way, I would be very surprised if you cannot get to a situation where mainstream opinion in places like the Times and the Post regards AI as quite possibly capable of destroying the world. I also think this could be one of our chances. Would love to think further about this, and we’re open for cooperation.

• I think you have to specify which policy you mean. First, let’s for now focus on regulation that’s really aiming to stop AGI, at least until safety is proven (if possible), not on regulation that’s only focusing on slowing down (incremental progress). I see roughly three options: software/​research, hardware, and data. All of these options would likely need to be global to be effective (that’s complicating things, but perhaps a few powerful states can enforce regulation on others—not necessarily unrealistic).

Most people who talk about AGI regulation seem to mean software or research regulation. An example is the national review board proposed by Musk. A large downside of this method is that, if it turns out that scaling up current approaches is mostly all that’s needed, Yudkowsky’s argument that a few years later anyone can build AGI in their basement (unregulatable) because of hardware progress seems like a real risk.

A second option, not suffering from this issue, is hardware regulation. Yudkowsky’s thought experiment of an AGI that might destroy all CPUs in order to block competitors is perhaps its most extreme form. One notch less extreme, chip capability could be forcibly held at either today’s capability level, or even at the level of some safe point in the past. This could be regulated at the fabs, which are few and not easy to hide. Regulating compute has also been proposed by Jaan Tallinn in a Politico newsletter, where he proposes regulating flops/​km².

Finally, an option could be to regulate data access.
I can’t recall a concrete proposal, but it should be possible in principle. I think a paper should urgently be written about which options we have, and especially about what the least economically damaging, but still reliable and enforceable, regulation method is. I think we should move beyond the position that no regulation could do this—there are clearly options with >0% chance (depending strongly on coordination and communication) and we can’t afford to waste them.

• I disagree with most of the empirical claims in this post, and dislike most of the words. But I do like the framework of valuing actions based on log odds of doom reduction. Some reasons I like it:

• I think it makes it easier to communicate between people with very different background views about alignment; it seems much easier to agree about whether something reduces doom by 1 bit, than to agree about whether it cuts doom by 50%.

• It seems like the right way to prepare for positive surprises if you are pessimistic.

• I think it’s correct that saying “well if there’s a chance it has to be X...” very often dooms you on the starting line, and I think this fuzzier way of preparing for positive surprises is often better. I agree that in many contexts you shouldn’t say “let’s condition on somehow getting enough dignity points, since that’s the only way it matters” and should instead just fight for what you perceive as a dignity point.

• I tentatively think that it’s better for having intuition about effect sizes and comparing different interventions.

• > I think it makes it easier to communicate between people with very different background views about alignment; it seems much easier to agree about whether something reduces doom by 1 bit, than to agree about whether it cuts doom by 50%.

I’m a bit surprised by this. (I was also a bit confused about what work “log odds” vs other probability framing was doing in the first place, and so were some others I’ve chatted with.)
Have you run into people/​conversations where this was helpful? (I guess same question for Eliezer. I had a vague sense of “log odds put the question into units that were more motivating to reason about”, but was fuzzy on the details. I’m not fluent in log odds; I have some sense that if we were actually measuring precise probabilities it might matter for making the math easier, but I haven’t seen anyone make arguments about success chance that were grounded in something rigorous enough that doing actual math to it was useful.)

• Likelihood ratios can be easier to evaluate than absolute probabilities insofar as you can focus on the meaning of a single piece of evidence, separately from the context of everything else you believe. Suppose we’re trying to pick a multiple-dial combination lock (in the dark, where we can’t see how many dials there are). If we somehow learn that the first digit is 3, we can agree that that’s log₂ 10 ≈ 3.32 bits of progress towards cracking the lock, even if you think there’s three dials total (such that we only need 6.64 more bits) and I think there’s 10 dials total (such that we need 29.88 more bits). Similarly, alignment pessimists and optimists might be able to agree that reinforcement learning from human feedback techniques are a good puzzle piece to have (we don’t have to write a reward function by hand! that’s progress!), even if pessimists aren’t particularly encouraged overall (because Goodhart’s curse, inner-misaligned mesa-optimizers, &c. are still going to kill you).

• Thanks, I found it useful to have this explanation alongside Paul’s. (I think each explanation would have helped a bit on its own, but having it explained in two different languages gave me a clearer sense of how the math worked and how to conceptualize it.)

• More intuitive illustration with no logarithms: your plane crashed in the ocean. To survive, you must swim to shore. You know that the shore is west, but you don’t know how far.
The optimist thinks the shore is just over the horizon; we only need to swim a few miles and we’ll almost certainly make it. The pessimist thinks the shore is a thousand miles away and we will surely die. But the optimist and pessimist can both agree on how far we’ve swum up to this point, and that the most dignified course of action is “Swim west as far as you can.”

• Suppose that Eliezer thinks there is a 99% risk of doom, and I think there is a 20% risk of doom. Suppose that we solve some problem we both think of as incredibly important, like we find a way to solve ontology identification and make sense of the alien knowledge a model has about the world and about how to think, and it actually looks pretty practical and promising and suggests an angle of attack on other big theoretical problems and generally suggests all these difficulties may be more tractable than we thought. If that’s an incredible smashing success, maybe my risk estimate has gone down from 20% to 10%, cutting risk in half. And if it’s an incredible smashing success, maybe Eliezer thinks that risk has gone down from 99% to 98%, cutting risk by ~1%. I think there are basically just two separate issues at stake:

• How much does this help solve the problem? I think mostly captured by bits of log odds reduction, and not where the real disagreement is.

• How much are we doomed anyway so it doesn’t matter?

• Suppose that to solve alignment, the quality of our alignment research effort has to be greater than some threshold. If the distribution of possible output quality is logistic, and research moves the mean of the distribution, then I think we gain a constant amount of log-odds per unit of research quality, regardless of where we think the threshold is.

• Something that would be of substantial epistemic help to me is if you (Eliezer) would be willing to estimate a few conditional probabilities (coarsely, I’m not asking you to superforecast) about the contributors to P(doom).
Specifically:

• timelines (when will we get AGI),
• alignment research (will we have a scheme that seems ~90% likely to work for ~slightly above human level AGI), and
• governance (will we be able to get everyone to use that or an equivalently promising alignment scheme).

For example, it seems plausible that a large fraction of your P(doom) is derived from your belief that P(10 year timelines) is large and both P(insufficient time for any alignment scheme | <10 year timelines) and P(insufficient time for the viability of consensus-requiring governance schemes | <10 year timelines) are small. OR it could be that even given 15–20 year timelines, your probability of a decent alignment scheme emerging is ~equally small, and that fact dominates all your prognoses. It’s probably some mix of both, but the ratios are important.

Why would others care? Well, from an epistemic “should I defer to someone who’s thought about it more than me” perspective, I consider you a much greater authority on the hardness of alignment given time, i.e. your knowledge of the probabilities f(hope-inducing technical solution | x years until AGI, at least y serious researchers working for z fraction of those years) for different values of x, y, and z. On the other hand, I might consider you less of a world-expert in AI timelines, or in assessing the viability of governance interventions (e.g. mass popularization campaigns). I’m not saying that a rando would have better estimates, but a domain expert could plausibly not need to heavily update off your private beliefs even after evaluating your public arguments.
So, to be specific about the probabilities that would be helpful:

• P(alignment ~solution | <10 years to AGI)
• P(alignment ~solution | 15–20 years to AGI) (you can interpolate/​expand these ranges if you have time)
• P(alignment ~solution | 15–20 years to AGI, 100x size of alignment research field within 5 years)

A few other probabilities could also be useful as sanity checks to illustrate how your model cashes out to <1%, though I know you’ve preferred to avoid some of these in the past:

• P(governance solution | 15–20 years to AGI)
• P(<10 years to AGI)
• P(15–20 years to AGI)

Background for why I care: I can think of/​work on many governance schemes that have good odds of success given 20 years but not 10 (where success means buying us another ~10 years), and separately can think of/​work on governance-ish interventions that could substantially inflate the number of good alignment researchers within ~5 years (e.g. from 100 → 5,000), but this might only be useful given >5 additional years after that, so that those people actually have time to do work. (Do me the courtesy of suspending disbelief in our ability to accomplish those objectives.) I have to assume you’ve thought of these schemes, and so I can’t tell whether you think they won’t work because you’re confident in short timelines or because of your inside view that “alignment is hard and 5,000 people working for ~15 years are still <10% likely to make meaningful progress and buy themselves more time to do more work”.

• > I have to assume you’ve thought of these schemes, and so I can’t tell whether you think they won’t work because you’re confident in short timelines or because of your inside view that “alignment is hard and 5,000 people working for ~15 years are still <10% likely to make meaningful progress and buy themselves more time to do more work”.

I don’t know Eliezer’s views here, but the latter sounds more Eliezer-ish to my ears.
My Eliezer-model is more confident that alignment is hard (and that people aren’t currently taking the problem very seriously) than he is confident about his ability to time AGI. I don’t know the answer to your questions, but I can cite a thing Eliezer wrote in his dialogue on biology-inspired AGI timelines:

> I consider naming particular years to be a cognitively harmful sort of activity; I have refrained from trying to translate my brain’s native intuitions about this into probabilities, for fear that my verbalized probabilities will be stupider than my intuitions if I try to put weight on them. What feelings I do have, I worry may be unwise to voice; AGI timelines, in my own experience, are not great for one’s mental health, and I worry that other people seem to have weaker immune systems than even my own. But I suppose I cannot but acknowledge that my outward behavior seems to reveal a distribution whose median seems to fall well before 2050.

Also, you wrote:

> governance (will we be able to get everyone to use that or an equivalently promising alignment scheme).

This makes it sound like you’re imagining an end-state of ‘AGI proliferates freely, but we find some way to convince everyone to be cautious and employ best practices’. If so, that’s likely to be an important crux of disagreement between you and Eliezer; my Eliezer-model says that ‘AGI proliferates freely’ means death, and the first goal of alignment is to execute some pivotal act that safely prevents everyone and their grandmother from being able to build AGI. (Compare ‘everyone and their grandmother has a backyard nuclear arsenal’.)

• Hey Rob, thanks for your reply. If it makes you guys feel better, you can rationalize the following as my expression of the Bargaining stage of grief.

> I don’t know Eliezer’s views here, but the latter sounds more Eliezer-ish to my ears.
> My Eliezer-model is more confident that alignment is hard (and that people aren’t currently taking the problem very seriously) than he is confident about timing AGI.

Consider me completely convinced that alignment is hard, and that a lot of people aren’t taking it seriously enough, or are working on the wrong parts of the problem. That is fundamentally different from saying that it’s unlikely to be solved even if we get 100x as many people working on it (albeit for a shorter time), especially if you believe that geniuses are outliers and thus that the returns on sampling for more geniuses remain large even after drawing many samples (especially if we’ve currently sampled <500 over the lifetime of the field). To get down to <1% probability of success, you need a fundamentally different argument structure. Here are some examples.

• “We have evidence that alignment will absolutely necessitate a lot of serial research. This means that even if lots more people join, the problem by its nature cannot be substantially accelerated by dramatically increasing the number of researchers (and consequently with high probability increasing the average quality of the top 20 researchers).”

• I would love to see the structure of such an argument.

• “We have a scheme for comprehensively dividing up all plausible alignment approaches. For each class of approach, we have hardness proofs, or things that practically serve as hardness proofs, such that we do not believe 100 smart people thinking about it for a decade are at all likely to make more progress than we have in the previous decade.”

• Needless to say, if you had such a taxonomy (even heuristically) it would be hugely valuable to the field—if for no other reason than that it would serve as an excellent communication mechanism to skeptics about the flaws in their approaches.

• This would also be massively important from a social-coordination perspective.
Consider how much social progress ELK made in building consensus around the hardness of the ontology-mismatch problem. What if we did that, but for every one of your hardness pseudo-results, and made the prize $10M for each hardness result instead of $50k, and broadcast it to the top 200 CS departments worldwide? It’d dramatically increase the salience of alignment as a real problem that no one seems able to solve, since if someone already could, they’d have made $10M.

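As an aside, the log-odds bookkeeping from a few comments up (the combination lock, the optimist and pessimist updating on the same breakthrough, and the logistic-threshold point) is easy to check numerically. This is only an illustration of the arithmetic already quoted in the thread; the helper names are mine:

```python
import math

def bits(p: float) -> float:
    """Log-odds of probability p, in bits (base 2)."""
    return math.log2(p / (1 - p))

# Combination lock: learning one of ten digits is log2(10) ~ 3.32 bits
# of progress, however many unknown dials you think remain.
per_digit = math.log2(10)
assert abs(2 * per_digit - 6.64) < 0.01   # 3 dials total: ~6.64 bits left
assert abs(9 * per_digit - 29.88) < 0.05  # 10 dials: ~29.88 (9 * 3.32) left

# The same breakthrough moves very different priors by a similar number
# of bits: 20% -> 10% doom and 99% -> 98% doom are each about one bit.
print(round(bits(0.20) - bits(0.10), 2))  # ~1.17 bits
print(round(bits(0.99) - bits(0.98), 2))  # ~1.01 bits

# The logistic-threshold point: if P(success) is a logistic CDF of
# research quality, one unit of quality buys a constant 1/ln(2) ~ 1.44
# bits of log-odds, wherever the threshold sits.
def p_success(mean: float, threshold: float) -> float:
    return 1 / (1 + math.exp(threshold - mean))

for t in (0.0, 5.0, 50.0):
    gain = bits(p_success(1.0, t)) - bits(p_success(0.0, t))
    assert abs(gain - 1 / math.log(2)) < 1e-6
```

The last loop is what makes log-odds attractive as a shared currency: the gain per unit of research quality is identical whether the threshold (i.e. how doomed you think we are) is at 0 or at 50.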
• “We are the smartest the world has to offer; even if >50% of theoretical computer scientists and >30% of physicists and >30% of pure mathematicians at the top 100 American universities were to start working on these problems 5 years from now, they would be unlikely to find much we haven’t found.”

• I’m not going to tell you this is impossible, but I haven’t seen the argument made yet. From an outside view, the things that got Eliezer Yudkowsky to where he is are (1) being dramatic-outlier-good at generalist reasoning, and (2) being an exceptional communicator to a certain social category (nerds). Founding the field is not, by itself, a good indicator of being dramatic-outlier-exceptional at inventing weird higher-level math. Obviously MIRI are still pretty good at it! But the best? Of all those people out there?

It would be really really helpful to have a breakdown of why MIRI is so pessimistic, beyond just “we don’t have any good ideas about how to build an off-switch; we don’t know how to solve ontology-mismatch; we don’t know how to prevent inner misalignment; also even if you solve them you’re probably wrong in some other way, based on our priors about how often rockets explode”. I agree those are big real unsolved problems. But, like, I myself have thought of ideas previously-unmentioned and near the research frontier on inner misalignment, and it wasn’t that hard! It did not inspire me with confidence that no amount of further thinking by newbs is likely to make any headway on these problems. Also, “alignment is like building a rocket except we only get one shot” was just as true a decade ago; why were you more optimistic before? Is it all just the hardness of the off-switch problem specifically?

This makes it sound like you’re imagining an end-state of ‘AGI proliferates freely, but we find some way to convince everyone to be cautious and employ best practices’. If so, that’s likely to be an important crux of disagreement between you and Eliezer; my Eliezer-model says that ‘AGI proliferates freely’ means death, and the first goal of alignment is to execute some pivotal act that safely prevents everyone and their grandmother from being able to build AGI. (Compare ‘everyone and their grandmother has a backyard nuclear arsenal’.)

I agree that proliferation would spell doom, but the supposition that the only possible way to prevent proliferation is via building an ASI and YOLOing to take over the world is, to my mind, a pretty major reduction of the options available. Arguably the best option is compute governance; if you don’t have extremely short (<10 year) timelines, it seems probable (>20%) that it will take many, many chips to train an AGI, let alone an ASI. In any conceivable world, these chips are coming from a country under either the American or Chinese nuclear umbrella. (This is because fabs are comically expensive and complex, and EUV lithography machines expensive and currently a one-firm monopoly, though a massively-funded Chinese competitor could conceivably arise someday. To appreciate just how strong this chokepoint is, the US military itself is completely incapable of building its own fabs, if the Trusted Foundry Program is any indication.) If China and NATO were worried that randos training large models had a 10% chance of ending the world, they would tell them to quit it. The fears about “Facebook AI Research will just build an AGI” sound much less plausible if you have 15/​20-year timelines, because if the US government tells Facebook they can’t do that, Facebook stops. Any nuclear-armed country outside China/​NATO can’t be controlled this way, but then they just won’t get any chips. “Promise you won’t build AGI, get chips, betray US/​China by building AGI anyway and hope to get to ASI fast enough to take over the world” is hypothetically possible, but the Americans and Chinese would know that and could condition the sale of chips on as many safeguards as we could think of. (Or just not sell the chips, and make India SSH into US-based datacenters.)

• It’s impossible to know where compute goes once it leaves the fabs.

• Impossible? Or just, like, complicated and would require work? I will grant that it’s impossible to know where consumer compute (like iPhones) ends up, but datacenter-scale compute seems much more likely to be trackable. Remember that in this world, the Chinese government is selling you chips and actually doesn’t want you building AGI with them. If you immediately throw your hands up and say you are confident there is no logistical way to do that, I think you are miscalibrated.

• Botnets (a la Gwern):

• You will note that in the Gwern story, the AGI had to build its own botnet; the initial compute needed to “ascend and break loose” was explicitly sanctioned by the government, despite a history of accidents. What if those two governments could be convinced about the difficulty of AI alignment, and actually didn’t let anyone run random code connected to the internet?

• What if the AGI is trained on an existing botnet, a la Folding@Home, or some darknet equivalent run by a terrorist group/​nation state? It’s possible; we should be thinking of monitoring techniques. The capabilities of botnets to undetectably leverage hypercompute are not infinite, and with real political will, I don’t know why it would be intractable to make it hard.

• We don’t trust the US/​Chinese governments to be able to correctly assess alignment approaches, when the time comes. The incentives are too extreme in favor of deployment.

• This is a reasonable concern. But the worst version of this, where the governments just okay something dumb with clear counterarguments, is only possible if you believe there remains a lack of consensus around the even-minor possibility of a catastrophic alignment problem. No American or Chinese leader has, in their lifetimes, needed to make a direct decision that had even a 10% chance of killing ten million Americans. (COVID vaccine buildout is a decent response, but sins of omission and commission are different to most people.)

• Influencing the government is impossible.

• We’re really only talking about convincing 2 bureaucracies; we might fail, but “it’s impossible” is an unfounded assumption. The climate people did it, and that problem has way more powerful entrenched opponents. (They didn’t get everything they want yet, but they’re asking for a lot more than we would be, and it’s hard to argue the people in power don’t think climate science is real.)

• As of today in the US, “don’t build AGI until you’re sure it won’t turn on you and kill everyone” has literally no political opponents, other than optimistic techno-futurists, and lots of supporters for obvious and less-obvious (labor displacement) reasons. I struggle to see where the opposition would come from in 10 years, either, especially considering that this would be regulation of a product that didn’t exist yet and thus had no direct beneficiaries.

• While Chinese domestic sentiments may be less anti-AI, the CCP doesn’t actually make decisions based on what its people think. It is an engineering-heavy elite dictatorship; if you convince enough within-China AI experts, there is plenty of reason to believe you could convince the CCP.

• This isn’t a stable equilibrium; something would go wrong and someone would push the button eventually.

• That’s probably true! If I had to guess, I think it could probably last for a decade, and probably not for two. That’s why it matters a lot whether the alignment problem is “too hard to make progress on in 2 decades with massive investment” or just “really hard and we’re not on track to solve it.”

• You may also note that the only data point we have about “Will a politician push a button that with some probability ends the world, and the rest of the time their country becomes a hegemon?” is the Cuban Missile Crisis. There was no mature second strike capability; if Kennedy had pushed the button, he wasn’t sure the other side could have retaliated. Do I want to replay the 1950s-60s nuclear standoff? No thank you. Would I trade that for racing to build an unaligned superintelligence first and then YOLOing? Yes please.

You will note that every point I’ve made here has a major preceding causal variable: enough people taking the hardness of the alignment problem seriously that we can do big-kid interventions. I really empathize with the people who feel burnt out about this. You have literally been doing your best to save the world, and nobody cares, so it feels intuitively likely that nobody will care. But there are several reasons I think this is pessimism, rather than calibrated realism:

• The actual number of people you need to convince is fairly small. Why? Because this is a technology-specific question, and the only people who will make the relevant decisions are technical experts or the politicians/leaders/bureaucrats they work with, who in practice will defer to them on something like “the hardness of alignment”.

• The fear of “politicians will select those experts which recite convenient facts” is legitimate. However, this isn’t at all inevitable; arguably the reason this both-sidesing happened so much within climate science is that the opponents’ visibility was heavily funded by entrenched interests—which, again, don’t really exist for AGI.

• Given that an overwhelming majority of people dismiss the alignment problem primarily on the basis that their timelines are really long, every capability breakthrough makes shorter timelines seem more likely (and also makes the social cost of shorter timelines smaller, as everyone else updates on the same information). You can already see this to some extent with GPT-3 converting people; I for one had very long timelines before then. So strategies that didn’t work 10 years ago are meaningfully more likely to work now, and that will become even more true.

• Social influence is not a hard technical problem! It is hard, but there are entire industries of professionals who are actually paid to convince people of stuff. AI alignment is not funding-constrained, and money is all this would take!

• On the topic of turning money into social influence: people really fail to appreciate how much money is out there for AI alignment, especially if you could convince technical AI researchers. Guess who really doesn’t like an AI apocalypse? Every billionaire with a family office who doesn’t like giving to philanthropy! Misaligned AI is one of the only things that could meaningfully hurt the expected value of billionaires’ children; if scientists start telling billionaires this is real, it is very likely you can unlock orders of magnitude more money than the ~$5B that FTX + OpenPhil seem on track to spend.

On that note, money can be turned into social influence in lots of ways. Give the world’s thousand most respected AI researchers $1M each to spend 3 months working on AI alignment, with an extra $100M if by the end they can propose a solution alignment researchers can’t shoot down. I promise you that, other than the ~20 industry researchers who are paid silly amounts, every one of them would take the million. They probably won’t make any progress, but from then on, when others ask them whether they think alignment is a real unsolved problem, they will be much more likely to say yes. That only costs you about a billion dollars! I literally think I could get someone reading this the money to do this (at least at an initially moderate scale); all it needs is a competent person to step up.

The other point that all of my arguments depend on is that we have, say, at least until 2035. If not, a lot of these ideas become much less likely to work, and I start thinking much more that “maybe it really will just be DeepMind YOLOing ASI” and dealing with the attendant strategies.
So again, if Eliezer has private information that makes him really confident, relative to everyone else, that >50% of the probability mass is on sooner than 2030, it sure would be great if I knew how seriously to take that, and whether he thinks a calibrated actor would abandon the other strategies and focus on Hail Marys.

• This is the best counter-response I’ve read on the thread so far, and I’m really interested in what the responses will be. Commenting here so I can easily get back to this comment in the future.

• > Commenting here so I can easily get back to this comment in the future.

FWIW, if you click the three vertical dots at the top right of a comment, it opens a dropdown where you can “Subscribe to comment replies”.

• > If it makes you guys feel better, you can rationalize the following as my expression of the Bargaining stage of grief.

This is an interesting name for the cognition I’ve seen a lot of people do.

• This is a great comment; maybe make it into a top-level post?

• I’m not sure how to do that? Also, unfortunately, since posting this comment, the last week’s worth of evidence does make me think that 5-15 year timelines are the most plausible, and so I am much more focused on those. Specifically, I think it’s time to pull the fire alarm and do mass within-elite advocacy.

• Cut and paste? But yes, it’s panic or death. And probably death anyway. Nice to get a bit of panic in first if we can, though! Good luck with stirring it.

• I found your earlier comment in this thread insightful, and I think it would be really valuable to know what evidence convinced you of these timelines. If you don’t have time to summarize it in a post, is there anything you could link to?

• Note also that I would still endorse these actions (since they’re still necessary even with shorter timelines), but they need to be done much faster, and so we need to be much more aggressive.
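The “only costs you a billion dollars” claim in the funding proposal above is easy to sanity-check. A minimal sketch of the arithmetic, using only the figures the comment itself names (the variable names are mine; these are illustrative numbers, not real program costs):

```python
# Sanity check on the proposed persuasion budget, using the figures from the
# comment above (illustrative only).

NUM_RESEARCHERS = 1_000            # "the world's thousand most respected AI researchers"
GRANT_PER_RESEARCHER = 1_000_000   # $1M each for 3 months of alignment work
SOLUTION_PRIZE = 100_000_000       # extra $100M for a solution that survives scrutiny

total_cost = NUM_RESEARCHERS * GRANT_PER_RESEARCHER + SOLUTION_PRIZE
print(f"${total_cost / 1e9:.1f}B")  # prints $1.1B
```

So the headline figure is right to within 10%: $1B in grants plus the $100M prize comes to $1.1B.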
• Don’t worry, Eliezer, there’s almost certainly a configuration of particles where something that remembers being you also remembers surviving the singularity. And in that universe your heroic attempts were very likely a principal reason why the catastrophe didn’t happen. All that’s left is arguing about the relative measure.

Even if things don’t work that way, and there really is a single timeline for some incomprehensible reason, you had a proper pop, and you and MIRI have done some fascinating bits of maths, and produced some really inspired philosophical essays, and through them attracted a large number of followers, many of whom have done excellent stuff. I’ve enjoyed everything you’ve ever written, and I’ve really missed your voice over the last few years.

It’s not your fault that the universe set you an insurmountable challenge, or that you’re surrounded by the sorts of people who are clever enough to build a God and stupid enough to do it in spite of fairly clear warnings. Honestly, even if you were in some sort of Groundhog Day setup, what on earth were you supposed to do? The ancients tell us that it takes many thousands of years just to seduce Andie MacDowell, and that doesn’t even look hard.

• Yeah, the thought that I’m not really seeing how to win even with a single restart has been something of a comforting one. I was, in fact, handed something of an overly hard problem even by “shut up and do the impossible” standards. In a Groundhog Day loop I’d obviously be able to do it, as would Nate Soares or Paul Christiano, if AGI failures couldn’t destroy my soul inside the loop. Possibly even Yann LeCun could do it; the first five loops or so would probably be enough to break his optimism, and with his optimism broken and the ability to test and falsify his own mistakes nonfatally, he’d be able to make progress. It’s not that hard.
• I suppose every eight billion deaths (+ whatever else is out there) you get a bug report, and my friend Ramana did apparently manage to create a formally verified version of grep, so more is possible than my intuition tells me. But I do wonder if that just (rather expensively) gets you to the point where the AI keeps you alive so you don’t reset the loop. That’s not necessarily a win.

Even if you can reset at will, and you can prevent the AI from stopping you pressing the reset, it only gets you as far as a universe where you personally think you’ve won. The rest of everything is probably paperclips.

If you give everyone in the entire world death-or-reset-at-will powers, then the bug reports stop being informative, but maybe after many many loops then just by trial and error we get to a point where everyone has a life they prefer over reset, and maybe there’s a non-resetting path from there to heaven despite the amount of wireheading that will have gone on?

But actually I have two worries even in this case. One is that human desires are just fundamentally incompatible, in the sense that someone will always press the button and the whole thing just keeps looping. There’s probably some sort of finiteness argument here, but I worry about Arrow’s Theorem. The second is that the complexity of the solution may exceed the capacity of the human mind. Even if solutions exist, are you sure that you can carry enough information back to the start of the loop to have a chance of doing it right?

Also, what if the AI gets hold of the repeat button? Also, I am lazy and selfish; I’m probably going to spend a while in my personal AI paradise before I finally, reluctantly guilt myself into pressing the reset. When I come back (as a four-year-old with plans) am I going to be able to stand this world? Or am I just going to rebuild my paradise as fast as possible and stay there longer this time? I think there are lots of ways to fail.
Obviously I am just free-associating out of my arse here (can’t sleep), but maybe it’s worth trying to work out some sort of intuition for this problem? Why do you think it’s possible *at all*, and how many repeats do you think you’d need? I’m pretty sure I couldn’t beat God at chess, even given infinite repeats. What makes this easier? (OK, actually I could, given infinite repeats, potentially construct a whole-game tablebase and get a draw (and maybe a win if chess can be won). That looks kind of tractable, but I probably need to take over the world as part of that… I’m going to need something pretty large to keep the data in...)

• The setting of a Groundhog Day struggle against AI might be the basis of a great fanfiction.

• Certainly I would love to read it, if someone were to write it well. Permission to nick the idea is generally given. I like the idea of starting off from Punxsutawney, with Bill Murray’s weatherman slowly getting the idea that he should try to fix whatever keeps suddenly causing the mysterious reset events.

• > I was, in fact, handed something of an overly hard problem

We were all handed the problem. You were the one who decided to man up and do something about it.

• It’s too bad we couldn’t just have the proverbial box be an isolated simulation and have you brain-interface into it. The AI keeps winning in the Matrix, and afterwards we just reset it until we see improvements in alignment.

• I’m curious about what continued role you do expect yourself to have. I think you could still have a lot of value in helping train up new researchers at MIRI. I’ve read you saying you’ve developed a lot of sophisticated ideas about cognition that are hard to communicate, but I imagine they could be transmitted more easily within MIRI. If we need a continuing group of sane people to be on the lookout for positive miracles, would you still take a relatively active role in passing on your wisdom to new MIRI researchers?
I would genuinely imagine that being in more direct mind-to-mind contact with you would be useful, so I hope you don’t become a hermit.

• Agreed. If MIRI feels like they aren’t going to be able to solve the problem, then it makes sense for them to focus on training up the next generation instead.

• > The surviving worlds look like people who lived inside their awful reality and tried to shape up their impossible chances; until somehow, somewhere, a miracle appeared—the model broke in a positive direction, for once, as does not usually occur when you are trying to do something very difficult and hard to understand, but might still be so—and they were positioned with the resources and the sanity to take advantage of that positive miracle, because they went on living inside uncomfortable reality.

Can you talk more about this? I’m not sure what actions you want people to take based on this text. What is the difference between a strategy that is dignified and one that is a clever scheme?

• I may be misunderstanding, but I interpreted Eliezer as drawing this contrast:

• Good Strategy: Try to build maximally accurate models of the real world (even though things currently look bad), while looking for new ideas or developments that could save the world. Ideally, the ideas the field puts a lot of energy into should be ones that already seem likely to work, or that seem likely to work under a wide range of disjunctive scenarios. (Failing that, they at least shouldn’t require multiple miracles, and should lean on a miracle that’s unusually likely.)
• Bad Strategy: Reason “If things are as they appear, then we’re screwed anyway; so it’s open season on adopting optimistic beliefs.” Freely and casually adopt multiple assumptions based on wishful thinking, and spend your mental energy thinking about hypothetical worlds where things go unusually well in specific ways you’re hoping they might (even though, stepping back, you wouldn’t have actually bet on those optimistic assumptions being true).

• I actually feel calmer after reading this, thanks. It’s nice to be frank. For all the handwringing in the comments about whether somebody might find this post demotivating, I wonder if there are any such people. It seems to me like reframing a task from something that is not in your control (saving the world) to something that is (dying with personal dignity) is exactly the kind of reframing that people find much more motivating.

• If I am doomed to fail, I have no motivation to work on the problem. If we are all about to get destroyed by an out-of-control AI, I might just go full hedonist and stop putting work into anything (fuck dignity and fuck everything). A post like this is severely demotivating; people are interested in solving problems, and nobody is interested in working on “dying with dignity”.

• After nearly 300 years of nobody solving Fermat’s Last Theorem, many were skeptical that a proof was even (humanly) possible, and some said so publicly, especially some of those who had themselves spent years trying and failing. Now something much more important is at stake: how to prevent AI from killing us all. A more important, but maybe also (much?) easier problem, after all. It’s not over yet.

• That’s a really strange concept of dignity.

• Sure, but it’s dignity in the specific realm of “facing unaligned AGI knowing we did everything we could”, not dignity in general.

• … but it discards all concerns outside of that.
“If I regret my planet’s death then I regret it, and it’s beneath my dignity to pretend otherwise” does not imply that there might not be other values you could achieve during the time available.

Another way to put that, perhaps, is that “knowing we did everything we could” doesn’t seem particularly dignified. Not if you had no meaningful expectation it could work. Extracting whatever other, potentially completely unrelated, value you could from the remaining available time would seem a lot more dignified to me than continuing on something you truly think is futile.

• > Extracting whatever other, potentially completely unrelated, value you could from the remaining available time would seem a lot more dignified to me than continuing on something you truly think is futile.

The amount of EV at stake in my (and others’) experiences over the next few years/decades is just too small compared to the EV at stake in the long-term future. The “let’s just give up” intuition makes sense in a regime where we’re comparing stakes that differ by 10x, or 100x, or 1000x; I think its appeal in this case comes from the fact that it’s hard to emotionally appreciate how much larger the quantities we’re talking about are. (But the stakes don’t change just because it’s subjectively harder to appreciate them; and there’s nothing dignified about giving up prematurely because of an arithmetic error.)

• I think that utilitarianism and actual human values are in different galaxies (example of a more realistic model). There’s no way I would sacrifice a truly big chunk of present value (e.g. submit myself to a year of torture) to increase the probability of a good future by something astronomically small. Given Yudkowsky’s apparent beliefs about success probabilities, I might have given up on alignment research altogether[1].
On the other hand, I don’t inside-view!think the success probability is quite that small, and also the reasoning error that leads to endorsing utilitarianism seems positively correlated with the reasoning error that leads to extremely low success probability. Because, if you endorse utilitarianism, then it generates a lot of confusion about the theory of rational agents, which makes you think there are more unsolved questions than there really are[2]. In addition, value learning seems more hopeless than it actually is.

I have some reservations about posting this kind of comment, because it might contribute to shattering the shared illusion of utilitarianism and its associated ethos, an ethos whose aesthetics I like and which seems to motivate people to do good things. (Not to mention, these comments might cause people to think less of me.) On the other hand, the OP says we need to live inside reality and be honest with ourselves and with each other, and I agree with that.

1. ↩︎ But maybe not, because it’s also rewarding in other ways.

2. ↩︎ Ofc there are still many unsolved questions.

• > I think that utilitarianism and actual human values are in different galaxies (example of a more realistic model). There’s no way I would sacrifice a truly big chunk of present value (e.g. submit myself to a year of torture) to increase the probability of a good future by something astronomically small.

I think that I’d easily accept a year of torture in order to produce ten planets worth of thriving civilizations. (Or, if I lack the resolve to follow through on a sacrifice like that, I still think I’d have the resolve to take a pill that causes me to have this resolve.)
‘Produce ten planets worth of thriving civilizations, with certainty’ feels a lot more tempting to me than ‘produce an entire universe of thriving civilizations, with tiny probability’, but I think that’s because I have a hard time imagining large quantities, and because of an irrational, non-endorsed attachment to certainties, not because of a deep divergence between my values and utilitarianism.

I do think my values are very non-utilitarian in tons of ways. A utilitarian would have zero preference for their own well-being over anyone else’s, would care just as much for strangers as for friends and partners, etc. This obviously doesn’t describe my values. But the cases where this reflectively endorsed non-utilitarianism shows up are ones where I’m comparing, e.g., one family member’s happiness versus a stranger’s happiness, or the happiness of a few strangers. I don’t similarly feel that a family member of mine ought to matter more than an intergalactic network of civilizations. (And to the extent I do feel that way, I certainly don’t endorse it!)

> On the other hand, the OP says we need to live inside reality and be honest with ourselves and with each other, and I agree with that.

Yes, if utilitarianism is wrong in the particular ways you think it is (which I gather is a strict superset of the ways I think it is?), then I want to believe as much. I very much endorse you sharing arguments to that effect! :)

• [edit: looks like Rob posted elsethread a comment addressing my question here] I’m a bit confused by this argument, because I thought MIRI folk had been arguing against this specific type of logic. (I might be conflating a few different types of arguments, or might be conflating ‘well, Eliezer said this, so Rob automatically endorses it’, or some such.)
But I recall recommendations to generally not try to get your expected value from multiplying tiny probabilities against big values, because (a) in practice that tends to lead to cognitive errors, and (b) in many cases people were saying things like “x-risk is a small probability of a Very Bad Outcome”, when the actual argument was “x-risk is a big probability of a Very Bad Outcome.” (Right now maybe you’re making a different argument, not about what humans should do, but about some underlying principles that would be true if we were better at thinking about things?)

• Quoting the excerpt from Being Half-Rational About Pascal’s Wager is Even Worse that I quoted in the other comment:

> [...] And finally, I once again state that I abjure, refute, and disclaim all forms of Pascalian reasoning and multiplying tiny probabilities by large impacts when it comes to existential risk. We live on a planet with upcoming prospects of, among other things, human intelligence enhancement, molecular nanotechnology, sufficiently advanced biotechnology, brain-computer interfaces, and of course Artificial Intelligence in several guises. If something has only a tiny chance of impacting the fate of the world, there should be something with a larger probability of an equally huge impact to worry about instead. You cannot justifiably trade off tiny probabilities of x-risk improvement against efforts that do not effectuate a happy intergalactic civilization, but there is nonetheless no need to go on tracking tiny probabilities when you’d expect there to be medium-sized probabilities of x-risk reduction. [...]

> EDIT: To clarify, “Don’t multiply tiny probabilities by large impacts” is something that I apply to large-scale projects and lines of historical probability.
> On a very large scale, if you think FAI stands a serious chance of saving the world, then humanity should dump a bunch of effort into it, and if nobody’s dumping effort into it then you should dump more effort than currently into it. On a smaller scale, to compare two x-risk mitigation projects in demand of money, you need to estimate something about marginal impacts of the next added effort (where the common currency of utilons should probably not be lives saved, but “probability of an ok outcome”, i.e., the probability of ending up with a happy intergalactic civilization). In this case the average marginal added dollar can only account for a very tiny slice of probability, but this is not Pascal’s Wager. Large efforts with a success-or-failure criterion are rightly, justly, and unavoidably going to end up with small marginally increased probabilities of success per added small unit of effort. It would only be Pascal’s Wager if the whole route-to-an-OK-outcome were assigned a tiny probability, and then a large payoff used to shut down further discussion of whether the next unit of effort should go there or to a different x-risk.

From my perspective, the name of the game is ‘make the universe as a whole awesome’. Within that game, it would be silly to focus on unlikely fringe x-risks when there are high-probability x-risks to worry about; and it would be silly to focus on intervention ideas that have a one-in-a-million chance of vastly improving the future, when there are other interventions that have a one-in-a-thousand chance of vastly improving the future, for example.

That’s all in the context of debates between longtermist strategies and candidate megaprojects, which is what I usually assume is the discussion context. You could have a separate question that’s like ‘maybe I should give up on ~all the value in the universe and have a few years of fun playing sudoku and watching Netflix shows before AI kills me’.
In that context, the basic logic of anti-Pascalian reasoning still applies (easy existence proof: if working hard on x-risk raised humanity’s odds of survival from to , it would obviously not be worth working hard on x-risk), but I don’t think we’re anywhere near the levels of P(doom) that would warrant giving up on the future.

‘There’s no need to work on supervolcano-destroying-humanity risk when there are much more plausible risks like bioterrorism-destroying-humanity to worry about’ is a very different sort of mental move than ‘welp, humanity’s odds of surviving are merely 1-in-100, I guess the reasonable utility-maximizing thing to do now is to play sudoku and binge Netflix for a few years and then die’. 1-in-100 is a fake number I pulled out of a hat, but it’s an example of a very dire number that’s obviously way too high to justify humanity giving up on its future.

(This is all orthogonal to questions of motivation. Maybe, in order to avoid burning out, you need to take more vacation days while working on a dire-looking project, compared to the number of vacation days you’d need while working on an optimistic-looking project. That’s all still within the framework of ‘trying to do longtermist stuff’, while working with a human brain.)

• One additional thing adding confusion is Nate Soares’ latest threads on wallowing*, which… I think are probably compatible with all this, but I couldn’t pass the ITT on it.

*I think his use of ‘wallowing’ is fairly nonstandard; you shouldn’t read into it until you’ve talked to him about it for at least an hour.

• > I think that I’d easily accept a year of torture in order to produce ten planets worth of thriving civilizations. (Or, if I lack the resolve to follow through on a sacrifice like that, I still think I’d have the resolve to take a pill that causes me to have this resolve.)
I’d do this to save ten planets worth of thriving civilizations, but doing it to produce ten planets worth of thriving civilizations seems unreasonable to me. Nobody is harmed by preventing their birth, and I have very little confidence either way as to whether their existence will wind up increasing the average utility of all lives ever eventually lived.

• I used to favour average utilitarianism too, until I learned about the sadistic conclusion. That was sufficient to overcome any aversion I had to the repugnant conclusion.

• I’m happy to accept the sadistic conclusion as normally stated, and in general I find “what would I prefer if I were behind the Rawlsian Veil and going to be assigned at random to one of the lives ever actually lived?” an extremely compelling intuition pump. (Though there are other edge cases that I feel weirder about, e.g. is a universe where everyone has very negative utility really improved by adding lots of new people of only somewhat negative utility?)

As a practical matter, though, I’m most concerned that total utilitarianism could (not just theoretically but actually, with decisions that might be locked in in our lifetimes) turn a “good” post-singularity future into a Malthusian near-hell where everyone is significantly worse off than I am now, whereas the sadistic conclusion and other contrived counterintuitive edge cases are unlikely to resemble decisions humanity or an AGI we create will actually face. Preventing the lock-in of total utilitarian values therefore seems only a little less important to me than preventing extinction.

• Another question. Imagine a universe with either only 5 or 10 people. If they’re all being tortured equally badly at a level of −100 utility, are you sure you’re indifferent as to the number of people existing? Isn’t less better here?
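The 5-vs-10 question above is simple enough to check numerically. A minimal sketch of how the total and average views diverge on it (the function names are mine; the −100 utilities come from the comment):

```python
# Numeric illustration of the 5-vs-10 torture example: total utilitarianism and
# average utilitarianism disagree about adding more equally-tortured people.

def total_utility(utilities):
    """Total view: sum welfare across everyone who exists."""
    return sum(utilities)

def average_utility(utilities):
    """Average view: mean welfare across everyone who exists."""
    return sum(utilities) / len(utilities)

five_people = [-100] * 5
ten_people = [-100] * 10

# The average view is indifferent between the two universes...
assert average_utility(five_people) == average_utility(ten_people) == -100
# ...while the total view says the smaller universe is strictly less bad.
assert total_utility(five_people) > total_utility(ten_people)  # -500 > -1000
```

So an average utilitarian really is committed to indifference here, which is exactly what the question is probing.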
• Yeah, that’s essentially the example I mentioned that seems weirder to me, but I’m not sure, and at any rate it seems much further from the sorts of decisions I actually expect humanity to have to make than the need to avoid Malthusian futures.

• > I think that I’d easily accept a year of torture in order to produce ten planets worth of thriving civilizations. (Or, if I lack the resolve to follow through on a sacrifice like that, I still think I’d have the resolve to take a pill that causes me to have this resolve.)

I think that “resolve” is often a lie we tell ourselves to explain the discrepancies between stated and revealed preferences. I concede that if you took that pill, it would be evidence against my position (but I believe you probably would not).

A nuance to keep in mind is that reciprocity can be a rational motivation to behave more altruistically than you otherwise would. This can come about from tit-for-tat/reputation systems, or even from some kind of acausal cooperation. Reciprocity effectively moves us closer to utilitarianism, but certainly not all the way there. So, if I’m weighing the life of my son or daughter against an intergalactic network of civilizations, which I never heard of before and will never hear about after, and which wouldn’t even reciprocate in a symmetric scenario, I’m choosing my child for sure.

• If I knew as a certainty that I could not do nearly as much good some other way, and I was certain that taking the pill causes that much good, I’d take the pill, even if I were to die after the torture and no one would know I sacrificed myself for others. I admit those are quite unusual values for a human, and I’m not arguing that it would be rational because of utilitarianism or so, just that I would do it. (It’s possible that I’m wrong, but I think it’s very likely I’m not.)
Also, I see that, the way my brain is wired, outer optimization pushes against that policy, and I think I probably wouldn’t be able to take the pill a second time under the same conditions (given that I don’t die after the torture), or at least not often.

• I don’t think those are unusual values for a human. Many humans have sacrificed their lives (and endured great hardship and pain, etc.) to help others. And many more would take a pill to gain that quality, seeing it as a more courageous and high-integrity expression of their values.

• > A utilitarian would have zero preference for their own well-being over anyone else’s, would care just as much for strangers as for friends and partners, etc.

That’s not the definition I’m used to. A utilitarian can have any utility function—including one that privileges their own moral worth over the moral worth of other beings, or even one that assigns moral worths that vary with time and place. You can have literally any preference ordering over beings or futures. The only requirement of a utilitarian is that they are coherent, more specifically that this preference ordering is not circular.

• Your definition is wrong; I think that way of defining ‘utilitarianism’ is purely an invention of a few LWers who didn’t understand what the term meant and got it mixed up with ‘utility function’. AFAIK, there’s no field where ‘utilitarianism’ has ever been used to mean ‘having a utility function’.

• I had this confusion for a few years. It made me personally dislike the term utilitarian, because it really mismatched my internal ontology.

• Hm, I worry I might be a confused LWer. I definitely agree that “having a utility function” and “being a utilitarian” are not identical concepts, but they’re highly related, no? Would you agree that, to a first approximation, being a utilitarian means having a utility function with the evolutionary godshatter as terminal values?
Even this is not identical to the original philosophical meaning I suppose, but it seems highly similar, and it is what I thought people around here meant.

• Would you agree that, to a first-approximation, being a utilitarian means having a utility function with the evolutionary godshatter as terminal values? This is not even close to correct, I’m afraid. In fact being a utilitarian has nothing whatever to do with the concept of a utility function. (Nor—separately—does it have much to do with “evolutionary godshatter” as values; I am not sure where you got this idea!) Please read this page for some more info presented in a systematic way.

• I meant to convey a utility function with certain human values as terminal values, such as pleasure, freedom, beauty, etc.; godshatter was a stand-in. If the idea of a utility function has literally nothing to do with moral utilitarianism, even around here, I would question why in the above when Eliezer is discussing moral questions he references expected utility calculations? I would also point to “intuitions behind utilitarianism” as pointing at connections between the two? Or “shut up and multiply”? Need I go on? I know classical utilitarianism is not exactly the same, but even in what you linked, it talks about maximizing the total sum of human happiness and sacrificing some goods for others, measured under a single metric “utility”. That sounds an awful lot like a utility function trading off human terminal values? I don’t see how what I’m pointing at isn’t just a straightforward idealization of classical utilitarianism.

• I meant to convey a utility function with certain human values as terminal values, such as pleasure, freedom, beauty, etc.; godshatter was a stand-in. Yes, I understood your meaning. My response stands.
If the idea of a utility function has literally nothing to do with moral utilitarianism, even around here, I would question why in the above when Eliezer is discussing moral questions he references expected utility calculations? What is the connection? Expected utility calculations can be, and are, relevant to all sorts of things, without being identical to, or similar to, or inherently connected with, etc., utilitarianism. I would also point to “intuitions behind utilitarianism“ as pointing at connections between the two? Or “shut up and multiply”? Need I go on? The linked post makes some subtle points, as well as some subtle mistakes (or, perhaps, instances of unclear writing on Eliezer’s part; it’s hard to tell). I know classical utilitarianism is not exactly the same, but even in what you linked, it talks about maximizing the total sum of human happiness and sacrificing some goods for others, measured under a single metric “utility”. That sounds an awful lot like a utility function trading off human terminal values? I don’t see how what I’m pointing at isn’t just a straightforward idealization of classical utilitarianism. The “utility” of utilitarianism and the “utility” of expected utility theory are two very different concepts that, quite unfortunately and confusingly, share a term. This is a terminological conflation, in other words. Here is a long explanation of the difference. • None of what you have linked so far has particularly conveyed any new information to me, so I think I just flatly disagree with you. As that link says, the “utility” in utilitarianism just means some metric or metrics of “good”. People disagree about what exactly should go into “good” here, but godshatter refers to all the terminal values humans have, so that seems like a perfectly fine candidate for what the “utility” in utilitarianism ought to be. 
The classic “higher pleasures” in utilitarianism lends credence toward this fitting into the classical framework; it is not a new idea that utilitarianism can include multiple terminal values with relative weighting. Under utilitarianism, we are then supposed to maximize this utility. That is, maximize the satisfaction of the various terminal goals we are taking as good, aggregated into a single metric. And separately, there happens to be this elegant idea called “utility theory”, which tells us that if you have various preferences you are trying to maximize, there is a uniquely rational way to do that, which involves giving them relative weights and aggregating into a single metric… You seriously think there’s no connection here? I honestly thought all this was obvious. In that last link, they say “Now, it is sometimes claimed that one may use decision-theoretic utility as one possible implementation of the utilitarian’s ‘utility’” then go on to say why this is wrong, but I don’t find it to be a knockdown argument; that is basically what I believe and I think I stand by it. Like, if you plug “aggregate human well-being along all relevant dimensions” into the utility of utility theory, I don’t see how you don’t get exactly utilitarianism out of that, or at least one version of it? EDIT: Please also see in the above post under “You should never try to reason using expected utilities again. It is an art not meant for you. Stick to intuitive feelings henceforth.” It seems to me that Eliezer goes on to consistently use the “expected utilities” of utility theory to be synonymous to the “utilities” of utilitarianism and the “consequences” of consequentialism. Do you agree that he’s doing this? If so, I assume you think he’s wrong for doing it? Eliezer tends to call himself a utilitarian. Do you agree that he is one, or is he something else? What would you call “using expected utility theory to make moral decisions, taking the terminal value to be human well-being”? 
• In that last link, they say “Now, it is sometimes claimed that one may use decision-theoretic utility as one possible implementation of the utilitarian’s ‘utility’” then go on to say why this is wrong, but I don’t find it to be a knockdown argument; that is basically what I believe and I think I stand by it. Like, if you plug “aggregate human well-being along all relevant dimensions” into the utility of utility theory, I don’t see how you don’t get exactly utilitarianism out of that, or at least one version of it? You don’t get utilitarianism out of it because, as explained at the link, VNM utility is incomparable between agents (and therefore cannot be aggregated across agents). There are no versions of utilitarianism that can be constructed out of decision-theoretic utility. This is an inseparable part of the VNM formalism. That having been said, even if it were possible to use VNM utility as the “utility” of utilitarianism (again, it is definitely not!), that still wouldn’t make them the same theory, or necessarily connected, or conceptually identical, or conceptually related, etc. Decision-theoretic expected utility theory isn’t a moral theory at all. Really, this is all explained in the linked post… Re: the “EDIT:” part: It seems to me that Eliezer goes on to consistently use the “expected utilities” of utility theory to be synonymous to the “utilities” of utilitarianism and the “consequences” of consequentialism. Do you agree that he’s doing this? No, I do not agree that he’s doing this. Eliezer tends to call himself a utilitarian. Do you agree that he is one, or is he something else? Yes, he’s a utilitarian. (“Torture vs. Dust Specks” is a paradigmatic utilitarian argument.) What would you call “using expected utility theory to make moral decisions, taking the terminal value to be human well-being”? I would call that “being confused”. How to (coherently, accurately, etc.) map “human well-being” (whatever that is) to any usable scalar (not vector!) 
“utility” which you can then maximize the expectation of, is probably the biggest challenge and obstacle to any attempt at formulating a moral theory around the intuition you describe. (“Utilitarianism using VNM utility” is a classic failed and provably unworkable attempt at doing this.) If you don’t have any way of doing this, you don’t have a moral theory—you have nothing.

• If the idea of a utility function has literally nothing to do with moral utilitarianism, even around here, I would question why in the above when Eliezer is discussing moral questions he references expected utility calculations? If he has a proof that utilitarianism, as usually defined (the highly altruistic ethical theory), is equivalent to maximization of an arbitrary UF, given some considerations about coherence, then he has something extraordinary that should be widely known. Or he is using “utilitarianism” in a weird way… or he is not, and he is just confused.

• I said nothing about an arbitrary utility function (nor proof for that matter). I was saying that applying utility theory to a specific set of terminal values seems to basically get you an idealized version of utilitarianism, which is what I thought the standard moral theory was around here.

• If you know the utility function that is objectively correct, then you have the correct metaethics, and VNM-style utility maximisation only tells you how to implement it efficiently. The first thing is “utilitarianism is true”, the second thing is “rationality is useful”. But that goes back to the issue everyone criticises: EY recommends an object-level decision (prefer torture to dust specks) unconditionally, without knowing the reader’s UF. If he had succeeded in arguing, or even tried to argue, that there is one true objective UF, then he would be in a position to hand out unconditional advice.
Or if he could show that preferring torture to dust specks was rational given an arbitrary UF, then he could also hand out unconditional advice (in the sense that the conditioning on a subjective UF doesn’t make a difference). But he doesn’t do that, because if someone has a UF that places negative infinity utility on torture, that’s not up for grabs… their personal UF is what it is.

• Oh okay. I see, my bad. I agree with @habryka, that feels like a confusing way to define things though. (I’d personally still prefer “utilitarian = have a utility function as per their stated* moral preferences” and “impartial utilitarian = have a utility function as per their stated moral preferences, that attaches the same worth to all beings” or something.) Keen to know if there’s any upside of defining things as they are today. *“stated preferences” here I would contrast against “revealed preferences”, because a person can state beliefs they are not acting in congruence with.

• Well, that’s it, your access to the medicine cabinet is revoked. :-) Yes, if utilitarianism is wrong: you can say that it’s wrong to think you can actually measure and usefully aggregate utility functions. That’s truly a matter of fact. … but to be able to say that utilitarianism in all its forms was “wrong” would require an external standard. Ethical realism really is wrong.

• … but to be able to say that utilitarianism in all its forms was “wrong” would require an external standard. Utilitarianism as a metaethical theory can be wrong without being ethically wrong. Metaethical theories can be criticised for being contradictory, unworkable, irrelevant, etc.

• … but to be able to say that utilitarianism in all its forms was “wrong” would require an external standard. Ethical realism really is wrong. Utilitarianism can be wrong as a description of actual human values, or of ‘the values humans would self-modify to if they fully understood the consequences of various self-modification paths’.
• OK, but that’s an is-ought issue. I didn’t perceive the question as being about factual human values, but about what people should do. It’s an ethical system, after all, not a scientific system. • Because, if you endorse utilitarianism then it generates a lot of confusion about the theory of rational agents, which makes you think there are more unsolved questions than there really are[2]. Are you alluding to agents with VNM utility functions here? • I assign much lower value than a lot of people here to some vast expansionist future… and I suspect that even if I’m in the minority, I’m not the only one. It’s not an arithmetic error. • Can you be more explicit about the arithmetic? Would increasing the probability of civilization existing 1000 years from now from 10^{-7} to 10^{-6} be worth more or less to you than receiving a billion dollars right now? • Do I get any information about what kind of civilization I’m getting, and/​or about what it would be doing during the 1000 years or after the 1000 years? On edit: Removed the “by how much” because I figured out how to read the notation that gave the answer. • I guess by “civilization” I meant “civilization whose main story is still being meaningfully controlled by humans who are individually similar to modern humans”. Other than that, I just mean your current expectations about what that civilization is like, conditioned on it existing. (It seems like you could be disagreeing with “a lot of people here” about what those futures look like or how valuable they are or both—I’d be happy to get clarification on either front.) • Hmm. I should have asked what the alternative to civilization was going to be. Nailing it down to a very specific question, suppose my alternatives are... 1. I get a billion dollars. My life goes on as normal otherwise. Civilization does whatever it’s going to do; I’m not told what. 
Omega tells me that everybody will suddenly drop dead at some time within 1000 years, for reasons I don’t get to know, with probability one minus one in ten million.

… versus...

2. I do not get a billion dollars. My life goes on as normal otherwise. Civilization does whatever it’s going to do; I’m not told what. Omega tells me that everybody will suddenly drop dead at some time within 1000 years, for reasons I don’t get to know, with probability one minus one in one million.

… then I don’t think I take the billion dollars. Honestly the only really interesting thing I can think of to do with that kind of money would be to play around with the future of civilization anyway. I think that’s probably the question you meant to ask. However, that’s a very, very specific question, and there are lots of other hypotheticals you could come up with. The “civilization whose main story is still being meaningfully controlled by humans etc.” thing bothers me. If a utopian godlike friendly AI were somehow on offer, I would actively pay money to take control away from humans and hand it to that AI… especially if I or people I personally care about had to live in that world. And there could also be valuable modes of human life other than civilization. Or even nonhuman things that might be more valuable. If those were my alternatives, and I knew that to be the case, then my answer might change. For that matter, even if everybody were somehow going to die, my answer could depend on how civilization was going to end and what it was going to do before ending. A jerkass genie Omega might be withholding information and offering me a bum deal. Suppose I knew that civilization would end because everybody had agreed, for reasons I cannot at this time guess, that the project was in some sense finished, all the value extracted, so they would just stop reproducing and die out quietly… and, perhaps implausibly, that conclusion wasn’t the result of some kind of fucked up mind control.
I wouldn’t want to second-guess the future on that. Similarly, what if I knew civilization would end because the alternative was some also as yet unforeseen fate worse than death? I wouldn’t want to avoid x-risk by converting it into s-risk. In reality, of course, nobody’s offering me clearcut choices at all. I kind of bumble along, and thereby I (and of course others) sculpt my future light cone into some kind of “work of art” in some largely unpredictable way. Basically what I’m saying is that, insofar as I consciously control that work of art, pure size isn’t the aesthetic I’m looking for. Beyond a certain point, size might be a negative. 1000 years is one thing, but vast numbers of humans overrunning galaxy after galaxy over billions of years, while basically doing the same old stuff, seems pointless to me. • Thanks for all the detail, and for looking past my clumsy questions! It sounds like one disagreement you’re pointing at is about the shape of possible futures. You value “humanity colonizes the universe” far less than some other people do. (maybe rob in particular?) That seems sane to me. The near-term decision questions that brought us here were about how hard to fight to “solve the alignment problem,” whatever that means. For that, the real question is about the difference in total value of the future conditioned on “solving” it and conditioned on “not solving” it. You think there are plausible distributions on future outcomes so that 1 one-millionth of the expected value of those futures is worth more to you than personally receiving 1 billion dollars. Putting these bits together, I would guess the amount of value at stake is not really the thing driving disagreement here, but about the level of futility? Say you think humanity overall has about a 1% chance of succeeding with a current team of 1000 full-time-equivalents working on the problem. Do you want to join the team in that case? 
What if we have a one-in-one-thousand chance and a current team of 1 million? Do these seem like the right units to talk about the disagreement in? (Another place that I thought there might be a disagreement: do you think solving the alignment problem increases or decreases s-risk? Here “solving the alignment problem” is the thing that we’re discussing giving up on because it’s too futile.)

• In some philosophical sense, you have to multiply the expected value by the estimated chance of success. They both count. But I’m not sitting there actually doing multiplication, because I don’t think you can put good enough estimates on either one to make the result meaningful. In fact, I guess that there’s a better than 1 percent chance of avoiding AI catastrophe in real life, although I’m not sure I’d want to (a) put a number on it, (b) guess how much of the hope is in “solving alignment” versus the problem just not being what people think it will be, (c) guess how much influence my or anybody else’s actions would have on moving the probability [edited from “property”...], or even (d) necessarily commit to very many guesses about which actions would move the probability in which directions. I’m just generally not convinced that the whole thing is predictable down to 1 percent at all. In any case, I am not in fact working on it. I don’t actually know what values I would put on a lot of futures, even the 1000 year one. Don’t get hung up on the billion dollars, because I also wouldn’t take a billion dollars to singlemindedly dedicate the remainder of my life, or even my “working time”, to anything in particular unless I enjoyed it. Enjoying life is something you can do with relative certainty, and it can be enough even if you then die. That can be a big enough “work of art”. Everybody up to this point has in fact died, and they did OK.
For that matter, I’m about 60 years old, so I’m personally likely to die before any of this stuff happens… although I do have a child and would very much prefer she didn’t have to deal with anything too awful. I guess I’d probably work on it if I thought I had a large, clear contribution to make to it, but in fact I have absolutely no idea at all how to do it, and no reason to expect I’m unusually talented at anything that would actually advance it. do you think solving the alignment problem increases or decreases s-risk If you ended up enacting a serious s-risk, I don’t understand how you could say you’d solved the alignment problem. At least not unless the values you were aligning with were pretty ugly ones. I will admit that sometimes I think other people’s ideas of good outcomes sound closer to s-risks than I would like, though. If you solved the problem of aligning with those people, I might see it as an increase. • Have you considered local movement building? Perhaps, something simple like organising dinners or a reading group to discuss these issues? Maybe no-one would come, but it’s hard to say unless you give it a go and, in any case, a small group of two or three thoughtful people is more valuable than a much larger group of people who are just there to pontificate without really thinking through anything deeply. • The amount of EV at stake in my (and others’) experiences over the next few years/​decades is just too small compared to the EV at stake in the long-term future. AI alignment isn’t the only option to improve the EV of the long-term future, though. • This is Pascal’s Mugging. Previously comparisons between the case for AI xrisk mitigation and Pascal’s Mugging were rightly dismissed on the grounds that the probability of AI xrisk is not actually that small at all. But if the probability of averting the xrisk is as small as discussed here then the comparison with Pascal’s Mugging is entirely appropriate. 
• It’s not Pascal’s mugging in the senses described in the first posts about the problem: [...] I had originally intended the scenario of Pascal’s Mugging to point up what seemed like a basic problem with combining conventional epistemology with conventional decision theory: Conventional epistemology says to penalize hypotheses by an exponential factor of computational complexity. This seems pretty strict in everyday life: “What? for a mere 20 bits I am to be called a million times less probable?” But for stranger hypotheses about things like Matrix Lords, the size of the hypothetical universe can blow up enormously faster than the exponential of its complexity. This would mean that all our decisions were dominated by tiny-seeming probabilities (on the order of 2^-100 and less) of scenarios where our lightest action affected 3↑↑4 people… which would in turn be dominated by even more remote probabilities of affecting 3↑↑5 people... [...] Unfortunately I failed to make it clear in my original writeup that this was where the problem came from, and that it was general to situations beyond the Mugger. Nick Bostrom’s writeup of Pascal’s Mugging for a philosophy journal used a Mugger offering a quintillion days of happiness, where a quintillion is merely 1,000,000,000,000,000,000 = 10^18. It takes at least two exponentiations to outrun a singly-exponential complexity penalty. I would be willing to assign a probability of less than 1 in 10^18 to a random person being a Matrix Lord. You may not have to invoke 3↑↑↑3 to cause problems, but you’ve got to use something like double exponentiation or better. Manipulating ordinary hypotheses about the ordinary physical universe taken at face value, which just contains 10^80 atoms within range of our telescopes, should not lead us into such difficulties.
(And then the phrase “Pascal’s Mugging” got completely bastardized to refer to an emotional feeling of being mugged that some people apparently get when a high-stakes charitable proposition is presented to them, regardless of whether it’s supposed to have a low probability. This is enough to make me regret having ever invented the term “Pascal’s Mugging” in the first place; and for further thoughts on this see The Pascal’s Wager Fallacy Fallacy (just because the stakes are high does not mean the probabilities are low, and Pascal’s Wager is fallacious because of the low probability, not the high stakes!) and Being Half-Rational About Pascal’s Wager Is Even Worse. Again, when dealing with issues the mere size of the apparent universe, on the order of 10^80—for small large numbers—we do not run into the sort of decision-theoretic problems I originally meant to single out by the concept of “Pascal’s Mugging”. My rough intuitive stance on x-risk charity is that if you are one of the tiny fraction of all sentient beings who happened to be born here on Earth before the intelligence explosion, when the existence of the whole vast intergalactic future depends on what we do now, you should expect to find yourself surrounded by a smorgasbord of opportunities to affect small large numbers of sentient beings. There is then no reason to worry about tiny probabilities of having a large impact when we can expect to find medium-sized opportunities of having a large impact, so long as we restrict ourselves to impacts no larger than the size of the known universe.) [...] And finally, I once again state that I abjure, refute, and disclaim all forms of Pascalian reasoning and multiplying tiny probabilities by large impacts when it comes to existential risk.
We live on a planet with upcoming prospects of, among other things, human intelligence enhancement, molecular nanotechnology, sufficiently advanced biotechnology, brain-computer interfaces, and of course Artificial Intelligence in several guises. If something has only a tiny chance of impacting the fate of the world, there should be something with a larger probability of an equally huge impact to worry about instead. You cannot justifiably trade off tiny probabilities of x-risk improvement against efforts that do not effectuate a happy intergalactic civilization, but there is nonetheless no need to go on tracking tiny probabilities when you’d expect there to be medium-sized probabilities of x-risk reduction. [...] EDIT: To clarify, “Don’t multiply tiny probabilities by large impacts” is something that I apply to large-scale projects and lines of historical probability. On a very large scale, if you think FAI stands a serious chance of saving the world, then humanity should dump a bunch of effort into it, and if nobody’s dumping effort into it then you should dump more effort than currently into it. On a smaller scale, to compare two x-risk mitigation projects in demand of money, you need to estimate something about marginal impacts of the next added effort (where the common currency of utilons should probably not be lives saved, but “probability of an ok outcome”, i.e., the probability of ending up with a happy intergalactic civilization). In this case the average marginal added dollar can only account for a very tiny slice of probability, but this is not Pascal’s Wager. Large efforts with a success-or-failure criterion are rightly, justly, and unavoidably going to end up with small marginally increased probabilities of success per added small unit of effort. 
It would only be Pascal’s Wager if the whole route-to-an-OK-outcome were assigned a tiny probability, and then a large payoff used to shut down further discussion of whether the next unit of effort should go there or to a different x-risk. If I understand your argument, it’s something like “when the probability of the world being saved is below n%, humanity should stop putting any effort into saving the world”. Could you clarify what value of n (roughly) you think justifies “let’s give up”? (If we just speak in qualitative terms, we’re more likely to just talk past each other. E.g., making up numbers: maybe you’ll say ‘we should give up if the world is only one-in-a-million likely to survive’, and Eliezer will reply ‘oh, of course, but our survival odds are way higher than that’. Or maybe you’ll say ‘we should give up if the world is only one-in-fifty likely to survive’, and Eliezer will say ‘that sounds like the right ballpark for how dire our situation is, but that seems way too early to simply give up’.) • I think - Humans are bad at informal reasoning about small probabilities since they don’t have much experience to calibrate on, and will tend to overestimate the ones brought to their attention, so informal estimates of the probability very unlikely events should usually be adjusted even lower. - Humans are bad at reasoning about large utilities, due to lack of experience as well as issues with population ethics and the mathematical issues with unbounded utility, so estimates of large utilities of outcomes should usually be adjusted lower. - Throwing away most of the value in the typical case for the sake of an unlikely case seems like a dubious idea to me even if your probabilities and utility estimates are entirely correct; the lifespan dilemma and similar results are potential intuition pumps about the issues with this, and go through even with only single-exponential utilities at each stage. 
Accordingly I lean towards overweighting the typical range of outcomes in my decision theory relative to extreme outcomes, though there are certainly issues with this approach as well. As far as where the penalty starts kicking in quantitatively, for personal decisionmaking I’d say somewhere around “unlikely enough that you expect to see events at least this extreme less than once per lifetime”, and for altruistic decisionmaking “unlikely enough that you expect to see events at least this extreme less than once in the history of humanity”. For something on the scale of AI alignment I think that’s around 1/1000? If you think the chances of success are still over 1% then I withdraw my objection. The Pascalian concern aside, I note that the probability of AI alignment succeeding doesn’t have to be *that* low before its worthwhileness becomes sensitive to controversial population ethics questions. If you don’t consider lives averted to be a harm, then spending $10B to decrease the chance of 10 billion deaths by 1/10000 is worse value than AMF. If you’re optimizing for the average utility of all lives eventually lived, then increasing the chance of a flourishing future civilization to pull up the average is likely worth more, but plausibly only ~100x more (how many people would accept a gamble of a 1% chance of post-singularity life and a 99% chance of immediate death?), so it’d still be a bad bet below 1/1000000. (Also if decreasing x-risk increases s-risk, or if the future ends up run by total utilitarians, it might actually pull the average down.)
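The cost-effectiveness comparison above can be checked with quick arithmetic. A sketch using the commenter’s own hypothetical figures; the AMF cost-per-life number below is an illustrative assumption of mine, not a quoted estimate:

```python
# Back-of-the-envelope check of the claim above. All figures are the
# commenter's hypotheticals or illustrative assumptions.
budget = 10e9                  # $10B spent on x-risk reduction
deaths_at_stake = 10e9         # ~10 billion lives
risk_reduction = 1 / 10_000    # catastrophe probability reduced by 1/10000

expected_deaths_averted = deaths_at_stake * risk_reduction
cost_per_death_averted = budget / expected_deaths_averted

print(expected_deaths_averted)   # ~1,000,000 lives in expectation
print(cost_per_death_averted)    # ~$10,000 per expected life saved

# Illustrative (assumed) figure for a top global-health charity:
amf_cost_per_life = 5_000
print(cost_per_death_averted > amf_cost_per_life)  # True: worse value
```

So under these (made-up) numbers the x-risk spend works out to roughly twice the cost per expected life, which is the sense in which it is "worse value than AMF"; the conclusion flips if any of the inputs move by a factor of a few.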

• If I’m understanding the original question correctly (and if not, well, I’m asking it myself), the issue is that as you just pointed out, there are plenty of non-AI-related massive threats to humanity that we may be able to avert with far higher likelihood (assuming we survive long enough to be able to do so). If the probability of changing the AGI end-of-the-world situation is extremely low, and if that was the only potential danger to humanity, then of course we should still focus on it. However, we also face many other risks we actually stand a chance of fighting, and according to Yudkowsky’s line of thinking, we should act for the counterfactual world in which we somehow solve the alignment problem. Therefore, shouldn’t we be focusing more on other issues, if the probabilities are really that bad?

• Is voting a Pascal’s mugging?

• There’s some case for it but I’d generally say no. Usually when voting you are coordinating with a group of people with similar decision algorithms who you have some ability to communicate with, and the chance of your whole coordinated group changing the outcome is fairly large, and your own contribution to it pretty legible. This is perhaps analogous to being one of many people working on AI safety if you believe that the chance that some organization solves AI safety is fairly high (it’s unlikely that your own contributions will make the difference, but you’re part of a coordinated effort that likely will). But if you believe it is extremely unlikely that anybody will solve AI safety, then the whole coordinated effort is being Pascal-mugged.
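The coordination point can be illustrated with the standard pivotal-voter calculation. A toy sketch with numbers I chose for illustration (an electorate of independent 50/50 voters), not an empirical claim about real elections:

```python
import math

# An electorate of n other voters, each voting 50/50 independently.
# Your single vote matters only on an exact tie; a coordinated bloc
# of k voters matters whenever the rest land within k/2 of a tie.
def pivotal_prob(n):
    """P(the other n voters split exactly n/2 : n/2); n even."""
    return math.comb(n, n // 2) / 2**n

def bloc_swing_prob(n, k):
    """P(the other n voters land close enough for a bloc of k to decide)."""
    lo, hi = n // 2 - k // 2, n // 2 + k // 2
    return sum(math.comb(n, i) for i in range(lo, hi + 1)) / 2**n

n = 10_000
print(pivotal_prob(n))          # ~0.008: small, but not astronomically so
print(bloc_swing_prob(n, 100))  # ~0.68: the coordinated group's chance
```

This mirrors the comment’s analogy: the individual contribution is a thin slice, but the coordinated group’s success probability is substantial, which is what keeps it out of Pascal’s-mugging territory.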

• One thing I like about the “dignity as log-odds” framework is that it implicitly centers coordination.
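For readers who haven’t internalized the framing: the original post measures dignity in log-odds of survival, so equal multiplicative improvements to the odds count equally no matter how bad the starting point is. A minimal sketch (the function name is mine):

```python
import math

def dignity_bits(p):
    """Log base-2 odds of survival probability p (the post's 'dignity')."""
    return math.log2(p / (1 - p))

print(dignity_bits(0.50))                       # 0.0: even odds
print(dignity_bits(0.01))                       # ~-6.6: deep underwater
# Doubling the odds is worth one bit wherever you start, which is why
# work that "only" doubles a tiny survival chance still scores:
print(dignity_bits(0.02) - dignity_bits(0.01))  # ~1.01 bits
```

One reading of why this "centers coordination": if separate efforts each multiply the odds, their contributions simply add in log-odds, so everyone’s marginal bit counts the same.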

• To be clear:

1. If the probability of success goes small enough, then yes, you give up on the long-term future. But I don’t think we can realistically reach that point, even if the situation gets a lot more dire than it looks today. It matters what the actual probabilities look like (e.g., 1 in 20 and 1 in 1000 are both very different from 1 in googolplex).
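The point that the actual numbers matter can be made concrete in log space. All quantities below are placeholders I chose for illustration, not claims about real odds:

```python
# Expected value of long-termist effort, computed in log10 to avoid
# overflow. Assume (illustratively) ~10^50 future lives at stake and
# ask what a marginal shift in survival probability is worth.
LOG10_VALUE_AT_STAKE = 50

def log10_expected_lives(log10_delta_p):
    """log10(delta_p * value): expected lives from shifting P(survival)."""
    return log10_delta_p + LOG10_VALUE_AT_STAKE

# In a 1-in-20 or 1-in-1000 world there is room for shifts like one
# part per million, which still carries an astronomical expected payoff:
print(log10_expected_lives(-6))        # 44
# But no physical payoff rescues a 1-in-googolplex situation:
print(log10_expected_lives(-1e100))    # ~-1e100: effectively nothing
```

This is the sense in which 1 in 20 and 1 in 1000 are "very different from 1 in googolplex": in the first two regimes the expected value of marginal effort dwarfs everything else, while in the last no conceivable stakes compensate.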

2. None of this implies that you should disregard your own happiness, stop caring about your friends, etc. Even if human brains were strictly utilitarian (which they’re not), disregarding your happiness, being a jerk, etc. are obviously terrible strategies for optimizing the long-term future. Genuinely taking the enormous stakes into account in your decision-making doesn’t require that you adopt dumb or self-destructive policies that inevitably cause burn-out, poor coordination, etc.
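For what it’s worth, the “dignity as log-odds” point in (1) can be made concrete with a quick sketch (Python; the function and the numbers are illustrative only, not anyone’s actual estimates):

```python
import math

def dignity_bits(p):
    """Log-odds of success probability p, in bits -- the post's
    'dignity' scale (illustrative; these are not real estimates)."""
    return math.log2(p / (1 - p))

# 1 in 20 and 1 in 1000 are a finite, fightable distance apart:
gain = dignity_bits(1 / 20) - dignity_bits(1 / 1000)
print(f"1/1000 -> 1/20 gains {gain:.2f} bits of dignity")

# 1 in googolplex, by contrast, sits ~10**100 * log2(10) bits below
# zero -- a qualitatively different regime, which is the point of (1).
```

On this scale, each marginal bit is worth fighting for whether or not the absolute probability ever becomes large, which is why the two dire-but-ordinary numbers behave so differently from the absurd one.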

• Downvoted. I’d like to respectfully call out that the commenter should have explained why they believe the concept of dignity is strange, instead of just claiming that it is strange.

Without the why, it is hard to engage with productively. Plus, calling something strange has a connotation of personal attack, at least from what I gather and in this context, and the absence of the “why” part strengthens this connotation a moderate amount.

• If good people were liars, that would render the words of good people meaningless as information-theoretic signals, and destroy the ability for good people to coordinate with others or among themselves.

My mental Harry is making a noise. It goes something like Pfwah! Interrogating him a bit more, he seems to think that this argument is a gross mischaracterization of the claims of information theory. If you mostly tell the truth, and people can tell this is the case, then your words convey information in the information-theoretic sense.

EDIT: Now I’m thinking about how to characterize “information” in problems where one agent is trying to deceive another. If A successfully deceives B, what is the “information gain” for B? He thinks he knows more about the world; does this mean that information gain cannot be measured from the inside?
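For what it’s worth, the “mostly tell the truth” point can be sketched in a few lines. Model a uniform binary world-state reported by a speaker who tells the truth with probability q (a binary symmetric channel); the listener’s information per statement is I(W;S) = 1 - H(q). This toy model is my own, not anything from the thread:

```python
import math

def binary_entropy(q):
    """H(q) in bits."""
    if q in (0.0, 1.0):
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def info_per_statement(truth_rate):
    """I(World; Statement) for a uniform binary world-state and a speaker
    who reports the true state with probability truth_rate (a binary
    symmetric channel): I = 1 - H(truth_rate)."""
    return 1.0 - binary_entropy(truth_rate)

print(info_per_statement(1.0))   # perfectly honest: 1 bit per statement
print(info_per_statement(0.95))  # mostly honest: ~0.71 bits, still informative
print(info_per_statement(0.5))   # unpredictable liar: 0 bits, no signal
print(info_per_statement(0.0))   # *consistent* liar: 1 bit again, once decoded
```

Note the last line: what zeroes out the signal is lying unpredictably, not lying per se, which is roughly the mental-Harry objection. (Whether B can measure any of this “from the inside” after a successful deception is the separate puzzle the EDIT raises.)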

• The sentence you quote actually sounds like a Harry sentence to me. Specifically the part where doing an unethical thing causes the good people to not be able to trust each other and work together any more, which is a key part of the law of good.

• My counterpoints, in broad order of importance:

1. If you lie to people, they should trust you less. Observing you lying should reduce their confidence in your statements. However, there is nothing in the fundamental rules of the universe that says people notice when they are deceived, even after the fact, or that they will trust you less by any particular amount. Believing that lying, or even being caught lying, will result in a total collapse of confidence, without further justification, is falling for the just-world fallacy.

2. If you saw a man lying to his child about the death of the family dog, you (hopefully) wouldn’t immediately refuse to ever have business dealings with such a deceptive, amoral individual. Believing that all lies are equivalent, or that lie frequency does not matter, is to fall for the fallacy of grey.

3. “Unethical” and “deceptive” are different. See hpmor ch51 for hpmor!Harry agreeing to lie for moral reasons. See also counterarguments to Kant’s Categorical Imperative (lying is always wrong, literally never lie).

4. The point about information theory stands.

Note that “importance” can be broadly construed as “relevance to the practical question of lying to actual people in real life”. This is why the information-theoretic argument ranks so low.

• The idea, I believe, is similar to asking death row prisoners if they are innocent. If you establish that you’re willing to lie about sufficiently important things for non-obvious reasons, people can’t trust you when those reasons are likely to be in play. For Eliezer’s stakes, this would be literally all the time, since it would “be justified” to lie in any situation to save the world.

• See response to Ben Pace for counterpoints.

• Separate from the specific claims, it seems really unhelpful to say something like this in such a deliberately confusing, tongue-in-cheek way. It’s surely unhelpful strategically to be so unclear, and it also just seems mean-spirited to blur the lines between sarcasm and sincerity in such a bleak and also extremely confident write-up, given that lots of readers regard you as an authority and take your thoughts on this subject seriously.

I’ve heard from three people who have lost the better part of a day or more trying to mentally disengage from this ~shitpost. Whatever you were aiming for, it’s hard for me to imagine how this hasn’t missed the mark.

• If it helps, I tried to clarify what the post is and isn’t serious about here (and Eliezer endorses my comment).

I think the title of the post was a mistake, because the post is basically just serious, and priming people to expect an April Fool’s thing was too confusing. Adding in a little April Fool’s afterthought-joke at the end bothers me less, though with hindsight I think the specific joke-line “it’s a preview of what might be needful to say later, if matters really do get that desperate” was a mistake, because it sounds too plausible.

(I complained about the title before the post went live. :P Though I told Eliezer I’d rather see him post this with the bad title than not post at all.)

• I can imagine how this might be considered a “~shitpost” but I thought it was clear and obvious and not confusing.

But I also don’t think that has that much to do with “people who have lost the better part of a day or more trying to mentally disengage from this”. I just read this post, wasn’t confused, but I still expect this to ‘ruin’ my day.

• It has been asked before, but I will ask again because I prefer clear answers: is Death With Dignity MIRI’s policy?

What I’m seeing from my outside perspective is a MIRI authority figure stating that this is the policy.

• (I ran this comment by Eliezer and Nate and they endorsed it.)

My model is that the post accurately and honestly represents Eliezer’s epistemic state (‘I feel super doomy about AI x-risk’), and a mindset that he’s found relatively motivating given that epistemic state (‘incrementally improve our success probability, without getting emotionally attached to the idea that these incremental gains will result in a high absolute success probability’), and is an honest suggestion that the larger community (insofar as it shares his pessimism) adopt the same framing for the sake of guarding against self-deception and motivated reasoning.

The parts of the post that are an April Fool’s Joke, AFAIK, are the title of the post, and the answer to Q6. The answer to Q6 is a joke because it’s sort-of-pretending the rest of the post is an April Fool’s joke. The title is a joke because “X’s new organizational strategy is ‘death with dignity’” sounds sort of inherently comical, and doesn’t really make sense (how is that a “strategy”? believing p(doom) is high isn’t a strategy, and adopting a specific mental framing device isn’t really a “strategy” either). (I’m even more confused by how this could be MIRI’s “policy”.)

In case it clarifies anything, here are some possible interpretations of ‘MIRI’s new strategy is “Death with Dignity”’, plus a crisp statement of whether the thing is true or false:

• A plurality of MIRI’s research leadership, adjusted for org decision-making weight, thinks humanity’s success probability is very low, and will (continue to) make org decisions accordingly. — True, though:

• Practically speaking, I don’t think this is wildly different from a lot of MIRI’s past history. E.g., Nate’s stated view in 2014 (assuming FT’s paraphrase is accurate), before he became ED, was “there is only a 5 per cent chance of programming sufficient safeguards into advanced AI”.

• (Though I think there was at least one period of time in the intervening years where Nate had double-digit success probabilities for humanity — after the Puerto Rico conference and associated conversations, where he was impressed by the spirit of cooperation and understanding present and by how on-the-ball some key actors looked. He tells me that he later updated back downwards when the political situation degraded, and separately when he concluded the people in question weren’t that on-the-ball after all.)

• MIRI is strongly in favor of its researchers building their own models and doing the work that makes sense to them; individual MIRI researchers’ choices of direction don’t require sign-off from Eliezer or Nate.

• I don’t know exactly why Eliezer wrote a post like this now, but I’d guess the largest factors are roughly (1) that Eliezer and Nate have incrementally updated over the years from ‘really quite gloomy’ to ‘even gloomier’, (2) that they’re less confident about what object-level actions would currently best reduce p(doom), and (3) that as a consequence, they’ve updated a lot toward existential wins being likelier if the larger community moves toward having much more candid and honest conversations, and generally produces more people who are thinking exceptionally clearly about the problem.

• Everyone on MIRI’s research team thinks our success probability is extremely low (say, below 5%). — False, based on a survey I ran a year ago. Only five MIRI researchers responded, so the sample might skew much more negative or positive than the overall distribution of views at MIRI; but MIRI responses to Q2 were (66%, 70%, 70%, 96%, 98%). I also don’t think the range of views has changed a ton in the intervening year.

• MIRI will require (of present and/​or future research staff) that they think in terms of “death with dignity”. — False, both in that MIRI isn’t in the business of dictating researchers’ P(doom) and in that MIRI isn’t in the business of dictating researchers’ motivational tools or framing devices.

• MIRI has decided to give up on reducing existential risk from AI. — False, obviously.

• MIRI is “locking in” pessimism as a core part of its org identity, such that it refuses to update toward optimism if the situation starts looking better. — False, obviously.

Other than the two tongue-in-cheek parts, AFAIK the post is just honestly stating Eliezer’s views, without any more hyperbole than a typical Eliezer post would have. E.g., the post is not “a preview of what might be needful to say later, if matters really do get that desperate”. Some parts of the post aren’t strictly literal (e.g., “0% probability”), but that’s because all of Eliezer’s posts are pretty colloquial, not because of a special feature of this post.

• Thank you for the detailed response. It helps significantly.

The parts of the post that are an April Fool’s Joke, AFAIK, are the title of the post, and the answer to Q6. The answer to Q6 is a joke because it’s sort-of-pretending the rest of the post is an April Fool’s joke.

It shouldn’t be surprising that others are confused if this is your best guess about what the post means altogether.

believing p(doom) is high isn’t a strategy, and adopting a specific mental framing device isn’t really a “strategy” either). (I’m even more confused by how this could be MIRI’s “policy”.)

Most would probably be as confused as you are at the notion that “dying with dignity” is a strategy. I took the meaning of the title, stripped of hyperbole, to be not a change in MIRI’s research agenda but some more “meta-level” organizational philosophy.

I’m paraphrasing here, so correct me if I’m wrong, but some of the recent dialogues between Eliezer and other AI alignment researchers in the last several months contained statements from Eliezer like “We [at least Nate and Eliezer] don’t think what MIRI has been doing for the last few years will work, and we don’t have a sense of what direction to go now”, and “I think maybe most other approaches in AI alignment have almost no chance of making any progress on the alignment problem.”

Maybe many people would have known better what Eliezer meant had they read the entirety of the post(s) in question. Yet the posts were so long and complicated that Scott Alexander only bothered to write a summary of one of them, and there are several more.

As far as I’m aware, the reasoning motivating the kind of sentiments Eliezer expressed wasn’t much explained elsewhere. Between the confusion and concern that has caused, and the ambiguity of the above post, the possibility that MIRI’s strategy is right now in a position of (temporary) incoherence apparently seemed plausible enough to a significant minority of readers.

The parts of your comment excerpted below are valuable and may even have saved MIRI a lot of work trying to deconfuse others had they been publicly stated at some point in the last few months:

A plurality of MIRI’s research leadership, adjusted for org decision-making weight, thinks humanity’s success probability is very low, and will (continue to) make org decisions accordingly.

MIRI is strongly in favor of its researchers building their own models and doing the work that makes sense to them; individual MIRI researchers’ choices of direction don’t require sign-off from Eliezer or Nate.

They [at least Eliezer and Nate] updated a lot toward existential wins being likelier if the larger community moves toward having much more candid and honest conversations, and generally produces more people who are thinking exceptionally clearly about the problem.

• It shouldn’t be surprising that others are confused if this is your best guess about what the post means altogether.

I don’t know what you mean by this—I don’t know what “this” is referring to in your sentence.

As far as I’m aware, the reasoning motivating the kind of sentiments Eliezer expressed wasn’t much explained elsewhere.

I mean, the big dump of chat logs is trying to make our background models clearer to people so that we can hopefully converge more. There’s an inherent tension between ‘say more stuff, in the hope that it clarifies something’ versus ‘say less stuff, so that it’s an easier read’. Currently I think the best strategy is to err on the side of over-sharing and posting long things, and then rely on follow-up discussion, summaries, etc. to address the fact that not everyone has time to read everything.

E.g., the three points you highlighted don’t seem like new information to me; I think we’ve said similar things publicly multiple times. But they can be new info to you, since you haven’t necessarily read the same resources I have.

Since everyone has read different things and will have different questions, I think the best solution is for you to just ask about the stuff that strikes you as the biggest holes in your MIRI-map.

I do think we’re overdue for a MIRI strategy post that collects a bunch of the take-aways we think are important in one place. This will inevitably be incomplete (or very long), but hopefully we’ll get something out in the not-distant future.

Between the confusion and concern that has caused,

I want to push back a bit against a norm I think you’re arguing for, along the lines of: we should impose much higher standards for sharing views that assert high p(doom), than for sharing views that assert low p(doom).

High and low p(doom) are both just factual claims about the world; an ideal Bayesian reasoner wouldn’t treat them super differently, and by default would apply just as much scrutiny, skepticism, and wariness to someone who seems optimistic about AI outcomes, as to someone who seems pessimistic.

In general, I want to be pretty cautious about proposed norms that might make people self-censor more if they have “concerning” views about object-level reality. There should be norms that hold here, but it’s not apparent to me that they should be stricter (or more strictly enforced) than for non-pessimistic posts.

that right now MIRI’s strategy might be in a position of (temporary) incoherence was apparently plausible enough to a significant minority of readers.

I still don’t know what incoherence you have in mind. Stuff like ‘Eliezer has a high p(doom)’ doesn’t strike me as good evidence for a ‘your strategy is incoherent’ hypothesis; high and low p(doom) are just different probabilities about the physical world.

• I don’t know what “this” is referring to in your sentence.

I was referring to the fact that there are meta-jokes in the post about which parts are or are not jokes.

I want to push back a bit against a norm I think you’re arguing for, along the lines of: we should impose much higher standards for sharing views that assert high p(doom), than for sharing views that assert low p(doom).

• I’m sorry I didn’t express myself more clearly. There shouldn’t be a higher standard for sharing views that assert a high(er) probability of doom. That’s not what I was arguing for. I’ve been under the impression that Eliezer and maybe others have been sharing a view of an extremely high probability of doom, but without explaining their reasoning, or how their model changed from before. It’s the latter part that would be provoking confusion.

I still don’t know what incoherence you have in mind. Stuff like ‘Eliezer has a high p(doom)’ doesn’t strike me as good evidence for a ‘your strategy is incoherent’ hypothesis; high and low p(doom) are just different probabilities about the physical world.

The reasons for Eliezer or others at MIRI being more pessimistic than ever before seemed unclear, so one possibility that came to mind is that there isn’t enough self-awareness of the model as to why, or that MIRI has for a few months had no idea what direction it’s going in now. That would lend itself to not having a coherent strategy at this time. Your reply has clarified, though, that it’s more like what MIRI’s strategic pivot will be is still in flux, or at least that publicly communicating it well will take some more time, so I’m not thinking any of that now.

I do appreciate the effort you, Eliezer and others at MIRI have put into what you’ve been publishing. I eagerly await a strategy update from MIRI.

I’ll only mention one more thing that hasn’t bugged me as much but has bugged others in conversations I’ve participated in. The issue is that Eliezer appears to think, but without any follow-up, that most other approaches to AI alignment distinct from MIRI’s, including ones that otherwise draw inspiration from the rationality community, will also fail to bear fruit. Like, the takeaway isn’t other alignment researchers should just give up, or just come work for MIRI...?, but then what is it?

A lack of an answer to that question has left some people feeling like they’ve been hung out to dry.

• The issue is that Eliezer appears to think, but without any follow-up, that most other approaches to AI alignment distinct from MIRI’s, including ones that otherwise draw inspiration from the rationality community, will also fail to bear fruit. Like, the takeaway isn’t other alignment researchers should just give up, or just come work for MIRI...?, but then what is it?

From the AGI interventions discussion we posted in November (note that “miracle” here means “surprising positive model violation”, not “positive event of negligible probability”):

Anonymous

At a high level one thing I want to ask about is research directions and prioritization. For example, if you were dictator for what researchers here (or within our influence) were working on, how would you reallocate them?

Eliezer Yudkowsky

The first reply that came to mind is “I don’t know.” I consider the present gameboard to look incredibly grim, and I don’t actually see a way out through hard work alone. We can hope there’s a miracle that violates some aspect of my background model, and we can try to prepare for that unknown miracle; preparing for an unknown miracle probably looks like “Trying to die with more dignity on the mainline” (because if you can die with more dignity on the mainline, you are better positioned to take advantage of a miracle if it occurs).

[...]

Eliezer Yudkowsky

I have a few stupid ideas I could try to investigate in ML, but that would require the ability to run significant-sized closed ML projects full of trustworthy people, which is a capability that doesn’t seem to presently exist. Plausibly, this capability would be required in any world that got some positive model violation (“miracle”) to take advantage of, so I would want to build that capability today. I am not sure how to go about doing that either. [...] What I’d like to exist is a setup where I can work with people that I or somebody else has vetted as seeming okay-trustworthy, on ML projects that aren’t going to be published.

[...]

Anonymous

How do you feel about the safety community as a whole and the growth we’ve seen over the past few years?

Eliezer Yudkowsky

Very grim. I think that almost everybody is bouncing off the real hard problems at the center and doing work that is predictably not going to be useful at the superintelligent level, nor does it teach me anything I could not have said in advance of the paper being written. People like to do projects that they know will succeed and will result in a publishable paper, and that rules out all real research at step 1 of the social process.

Paul Christiano is trying to have real foundational ideas, and they’re all wrong, but he’s one of the few people trying to have foundational ideas at all; if we had another 10 of him, something might go right.

Chris Olah is going to get far too little done far too late. We’re going to be facing down an unalignable AGI and the current state of transparency is going to be “well look at this interesting visualized pattern in the attention of the key-value matrices in layer 47” when what we need to know is “okay but was the AGI plotting to kill us or not”. But Chris Olah is still trying to do work that is on a pathway to anything important at all, which makes him exceptional in the field.

The things I’d mainly recommend are interventions that:

• Help ourselves think more clearly. (I imagine this including a lot of trying-to-become-more-rational, developing and following relatively open/​honest communication norms, and trying to build better mental models of crucial parts of the world.)

• Help relevant parts of humanity (e.g., the field of ML, or academic STEM) think more clearly and understand the situation.

• Help us understand and resolve major disagreements. (Especially current disagreements, but also future disagreements, if we can e.g. improve our ability to double-crux in some fashion.)

• Try to solve the alignment problem, especially via novel approaches.

• In particular: the biggest obstacle to alignment seems to be ‘current ML approaches are super black-box-y and produce models that are very hard to understand/​interpret’; finding ways to better understand models produced by current techniques, or finding alternative techniques that yield more interpretable models, seems like where most of the action is.

• Think about the space of relatively-plausible “miracles”, think about future evidence that could make us quickly update toward a miracle-claim being true, and think about how we should act to take advantage of that miracle in that case.

• Build teams and skills that are well-positioned to take advantage of miracles when and if they arise. E.g., build some group like Redwood into an org that’s world-class in its ability to run ML experiments, so we have that capacity already available if we find a way to make major alignment progress in the future.

This can also include indirect approaches, like ‘rather than try to solve the alignment problem myself, I’ll try to recruit physicists to work on it, because they might bring new and different perspectives to bear’.

Though I definitely think there’s a lot to be said for more people trying to solve the alignment problem themselves, even if they’re initially pessimistic they’ll succeed!

I think alignment is still the big blocker on good futures, and still the place where we’re most likely to see crucial positive surprises, if we see them anywhere—possibly Eliezer would disagree here.

• Upvoted. Thanks.

I’ll state that in my opinion it shouldn’t necessarily have to be the responsibility of MIRI or even Eliezer to clarify a position that was stated but has been taken out of context. I’m not sure, but it seems as though at least a significant minority of those who’ve been alarmed by some of Eliezer’s statements haven’t read the full post that would put them in a less dramatic context.

Yet errant signals seem important to rectify, as the resulting misconceptions make it harder for MIRI to coordinate with other actors in the field of AI alignment.

My impression is that misunderstanding about all of this is widespread, in that there are at least a few people across every part of the field who don’t understand what MIRI is about these days at all. I don’t know how widespread it is in terms of how significant a portion of other actors in the field are generally confused about MIRI.

• It seems like you’re claiming that it’s obvious on consequentialist grounds that it is immoral to rob banks. While I have not robbed any banks, I do not see how to arrive at a general conclusion to this effect under the current regime, and one of my most trusted friends may have done so at one point. But I’m not sure how to identify our crux. Can you try to explain your reasoning?

• Suggestion: you should record a cover version of Guantanamera.

Yo soy un hombre sincero
De donde crece la palma,
Yo soy un hombre sincero
De donde crece la palma,
Y antes de morirme quiero
Echar mis versos del alma.

[...]

No me pongan en lo oscuro
A morir como un traidor
No me pongan en lo oscuro
A morir como un traidor
Yo soy bueno y como bueno
Moriré de cara al sol.

And then Nate can come in playing the trumpet solo.

• This reminds me of Joe Rogan’s interview with Elon Musk. This section has really stuck with me:

Joe Rogan
So, what happened with you where you decided, or you took on a more fatalistic attitude? Like, was there any specific thing, or was it just the inevitability of our future?

Elon Musk
I try to convince people to slow down. Slow down AI to regulate AI. That’s what’s futile. I tried for years, and nobody listened.

Joe Rogan
This seems like a scene in a movie-

Elon Musk
Nobody listened.

Joe Rogan
… where the robots are going to fucking take over. You’re freaking me out. Nobody listened?

Elon Musk
Nobody listened.

Joe Rogan
No one. Are people more inclined to listen today? It seems like an issue that’s brought up more often over the last few years than it was maybe 5-10 years ago. It seemed like science fiction.

Elon Musk
Maybe they will. So far, they haven’t. I think, people don’t—Like, normally, the way that regulations work is very slow. It’s very slow indeed. So, usually, it will be something, some new technology. It will cause damage or death. There will be an outcry. There will be an investigation. Years will pass. There will be some sort of insights committee. There will be rule making. Then, there will be oversight, absolutely, of regulations. This all takes many years. This is the normal course of things.

If you look at, say, automotive regulations, how long did it take for seatbelts to be implemented, to be required? You know, the auto industry fought seatbelts, I think, for more than a decade. It successfully fought any regulations on seatbelts even though the numbers were extremely obvious. If you had seatbelts on, you would be far less likely to die or be seriously injured. It was unequivocal. And the industry fought this for years successfully. Eventually, after many, many people died, regulators insisted on seatbelts. This is a—This time frame is not relevant to AI. You can’t take 10 years from a point of which it’s dangerous. It’s too late.

• Potential typos:

• In a domain like → In a domain like that

• These people are well-suited… because their mind naturally sees → because their minds naturally see

• since, after all consequentialism says to rob banks → since, after all,

• They do not do a double-take and say “What?” That → [Sentence ends without punctuation outside the quote.]

• strong established injunctions against bank-robbing specifically exactly. → bank-robbing specifically. /​ bank-robbing exactly.

• a genre-savvy person who knows that the viewer would say if → what the viewer would say

• the story’s antagonists, might possibly hypothetically panic. But still, one should present the story-antagonists → [inconsistent phrasing for story’s antagonists /​ story-antagonists]

Also, I didn’t understand parts of the sentence which ends like this:

taking things at face value and all sorts of extreme forces that break things and that you couldn’t full test under before facing them.

• In terms of “miracles”—do you think they look more like some empirical result or some new genius comes up with a productive angle? Though I am inclined, even as a normie, to believe that human geniuses are immeasurably disappointing, you have sown a lot of seeds—and alignmentpilled a lot of clever people—and presumably some spectacularly clever people. Maybe some new prodigy will show up. My timelines are short—like less than 10 years wouldn’t surprise me—but the kids you alignmentpilled in 2008-2015 will be reaching peak productivity in the next few years. If it’s going to happen, it might happen soon.

• Both seem unlikely, probably about equally unlikely… maybe leaning slightly more towards the empirical side, but you wouldn’t even get those without at least a slightly genius person looking into it, I think.

• A list of potential miracles (including empirical “crucial considerations” [/​wishful thinking] that could mean the problem is bypassed):

• Possibility of a failed (unaligned) takeoff scenario where the AI fails to model humans accurately enough (i.e. fails to realise that smart humans could detect its “hidden” activity in a certain way). [This may only set things back a few months to years; or it could lead to some kind of Butlerian Jihad if there is a sufficiently bad (but ultimately recoverable) global catastrophe (and then much more time for Alignment the second time around?)].

• Valence realism being true. Binding problem vs AGI Alignment.

• Omega experiencing every possible consciousness and picking the best? [Could still lead to x-risk in terms of a Hedonium Shockwave].

• Moral Realism being true (and the AI discovering it and the true morality being human-compatible).

• Natural abstractions leading to Alignment by Default?

• AGI discovers new physics and exits to another dimension (like the creatures in Greg Egan’s Crystal Nights).

• Simulation/​anthropics stuff.

• Alien Information Theory being true!? (And the aliens having solved alignment).

• I find I strongly agree that—in case this future happens—it is extremely important that as few people as possible give up on their attempts to perceive the world as it really is, even if that world might literally mean ‘we failed, humanity is going to die out, the problem is way too hard and there’s no reasonable chance we’ll make it in time’. It seems to me that especially in scenarios like this, we’d need people to keep trying, and to stay (or become) dignified, in order to have any chance at still solving the problem at all.

I’m a total newcomer though, and it might be obvious for everyone more immersed in alignment, but what I really don’t understand is why ‘failure to solve the problem in time’ sounds so much like ‘we’re all going to die, and that’s so certain that some otherwise sensible people are tempted to just give in to despair and stop trying at all’. From what I’ve seen so far, this attitude is very common here, and I’d greatly appreciate if anyone could point me to some resources explaining the basics.

I’m seventeen, on the brink of having to choose a career, and interested in science (and the future of humanity, obviously) anyway, so if the problem really is as massive and under-focused on as it sounds, it seems like a good idea for me to join the battle as soon as possible—if that would even make a difference at this point.

• Have you looked at the Atlas Fellowship, btw?

• Have you considered applying to the AGI Safety Fundamentals Course? (They might not accept minors for legal reasons, but you’ll be an adult soon enough.)

• OP, if you’re interested, lie about your age or just don’t mention it. They’re not going to check as long as you don’t look six.

• Thanks for the advice, I appreciate it. I’m one of those people who have very firm rules about not lying, though. Then again, I did manage to get a vaccine despite my parents objecting, using the second option, so I suppose it’ll be worth a try :)

• Then again, I did manage to get a vaccine despite my parents objecting, using the second option

That’s the spirit!

• Thanks a lot! I’ll look into applying!

• “what I really don’t understand is why ‘failure to solve the problem in time’ sounds so much like ‘we’re all going to die, and that’s so certain that some otherwise sensible people are tempted to just give in to despair and stop trying at all’ ”

I agree. In this community, most people only talk of x-risk (existential risk). Most people equate failure to align AI with our values with human extinction. I disagree. There are classic literary examples of failure, like With Folded Hands, where AI creates an unbreakable dictatorship rather than extinction.

I think it’s for the sake of sanity (things worse than extinction are much harder to accept), or so as not to scare the normies, who are already quite scared.

But it’s also true that unaligned AI could result in a kinda positive, or even neutral, outcome. I just personally wouldn’t put much probability there. Why? Two concepts you can look up on LessWrong: the orthogonality thesis (high intelligence isn’t necessarily correlated with good values), and basic AI drives (advanced AI would naturally develop dangerous instrumental goals like survival and resource acquisition). And also that it’s pretty hard to tell computers to do what we mean, which, scaled up, could turn out very dangerous.

(See Eliezer’s post “Failed Utopia 4-2”, where an unaligned AGI ends up creating a failed utopia which really doesn’t sound THAT bad, and I’d say is even much better than the current world when you weigh all the good and bad.)

Fundamentally, we just shouldn’t take the gamble. The stakes are too high.

If you wanna have an impact, AI is the way to go. Definitely.

• The following is an edited partial chat transcript of a conversation involving me, Benquo, and an anonymous person (X). I am posting it in the hope that it has enough positive value-of-information compared to the attentional cost to be of benefit to others. I hope people can take this as “yay, I got to witness a back-room conversation” rather than “oh no, someone has made a bunch of public assertions that they can’t back up”; I think it would be difficult and time-consuming to argue for all these points convincingly, although I can explain to some degree, and I’m not opposed to people stating their disagreements. There were concerns about reputational risk as a result of posting this, which is partially mitigated by redaction, and partially a cost that is intentionally “eaten” to gain the value-of-information benefit.

X: [asking why Jess is engaging with the post]

Jess: It seems like there’s enough signal in the post that, despite the anxious time-wasting, it was worth contemplating. It’s good that Eliezer is invalidating people’s strategy of buying insurance products for AI catastrophe from a secretive technocratic class centered around him. Same as the conclusion I reached in 2017. I just hope that people can generalize from “alignment is hard” to “generalized AI capabilities are hard”.

X: I agree it was worth reading… I just worry about the people in the bubble who are dependent on it. I think people should be trying to get people out of the bubble and trying to deflate it so that if (when) it collapses it has less of a negative impact. (Bubble = narrative bubble.) Engaging with it directly in a way that buys its assumption fuels the bubble, rather than reducing it.

X: The thing is centered around Eliezer. As far as I understand it, the argument for doom is “someone thought of something Eliezer thought only he had thought of, so Eliezer hard-updated to it not being that hard, so, short timelines.”

Jess: Yeah, that was one thing that caused his update, although there were other sources too.

X: Yes, there’s info from Demis, Dario, the OpenPhil person, and Gwern. But that stuff is not tracked especially precisely, as far as I understand it, so I don’t think those are the driving factors.

Ben: I think there’s a serious confusion about how tightly coupled two things are: the ability to access large amounts of capital and spend it on huge amounts of compute to create special-case solutions for well-contained games, and the interest in using thinking to solve problems rather than perpetuate the grift. Relatedly, [REDACTED] told me a couple months ago that his speech-recognition startup was able to achieve higher levels of accuracy than the competition by hiring (cheaper) linguists and not just (more expensive but prestigious) machine learning programmers, but he was surprised to find that later speech-recognition startups didn’t just copy what worked, and instead spent all their $ on ML people and compute. This is seriously confusing to someone coming from Eliezer’s perspective; I made the same error initially when deep learning became a thing. Debtors’ Revolt was my attempt to put together a structural explanation.

X: This doesn’t surprise me but it’s great to have the data point. I looked into speech recognition a little while ago and concluded ML was likely not what was needed.

Ben: Eliezer thinks that OpenAI will try to make things go faster rather than slower, but this is plainly inconsistent with things like the state of vitamin D research or, well, most of academia & Parkinson’s Law more generally. Demis & Elon ended up colluding to create a Basilisk, i.e. a vibe of “the UFAI is big and dangerous so you must placate it by supporting attempts to build it”. It’s legitimately confusing to make sense of, especially if your most trusted advisors are doing some combination of flattering your narcissistic fantasy of specialness, and picking up a party line from OpenAI.

X: That’s why I put the cause at Eliezer’s failure to actually admit defeat. There are reasons, etc., but if you chase it upstream you hit things like “too lazy to count” or “won’t pay attention to something” or “someone thought of a thing he thought of” etc. I think that if Eliezer were more honest, he would update down on “rationality is systematized winning”, admit that, and then admit that his failure is evidence his system is corrupted.

Jess: I don’t really understand this. In a sense he is admitting defeat?

X: He’s not admitting defeat. He’s saying he was beaten by a problem that was TOO hard. Here’s a way to admit defeat: “wow I thought I was smart and smart meant could solve problems but I couldn’t solve the problems MAYBE I’M NOT ACTUALLY THAT SMART or maybe something else has gone wrong, hey, I think my worldview was just undermined”

Jess: I see what you mean, there’s some failure to propagate a philosophical update.

X: On Eliezer’s view, he’s still the smartest, most rational human ever. (Or something.)

Ben: He’s also not trying to share the private info he alludes to in lieu of justifying short timelines, even though he knows he has no idea what to do with it and can’t possibly know that no one else does.

Jess: So, as I wrote on Twitter recently, systematized winning would be considered stage 2 in Kegan’s model, and Kegan says “rationality” is stage 4, which is a kind of in-group-favoring (e.g. religious, nationalistic) “doing one’s duty” in a system.

X: Well, he can’t share the private short timelines info because there really isn’t any. This was a big update I took from you/my/Jessica’s conversation. I had been assuming that MIRI et al had private info, but then I came to believe they were just feeding numbers to each other, which in my universe does not count.

Ben: He’s also not sharing that.

X: Right, this is why I think the thing is not genuine. He knows that if he said “Anna, Nate, and I sat in a circle and shared numbers, and they kept going down, especially after I realized that someone else had an insight that I had had, though also important were statements from some other researchers, who I’m relying on rather than having my own view” people would 🤨🤨🤨 and that would diminish his credibility.

Jess: OpenAI has something of a model for short timelines, which is like Median’s AI compute model but with parameters biased towards short timelines relative to Median’s. However I think Eliezer and Nate think it’s more insight driven, and previously thought they personally possessed some relevant insights.

X: This nuances my model, thanks

Jess: But they don’t share them because “danger” so no one can check their work, and it looks like a lot of nothing from the outside.

X: It’s a shocking failure of rationality.

Jess: Yes.

• I began reading this charitably (unaware of whatever inside baseball is potentially going on, and seems to be alluded to), but to be honest struggled after “X” seemed to really want someone (Eliezer) to admit they’re “not smart”? I’m not sure why that would be relevant.

I think I found these lines especially confusing, if you want to explain:

• “I just hope that people can generalize from ‘alignment is hard’ to ‘generalized AI capabilities are hard’.”

Is capability supposed to be hard for similar reasons as alignment? Can you expand/link? The only argument I can think of relating the two (which I think is a bad one) is “machines will have to solve their own alignment problem to become capable.”

• Eliezer is invalidating the second part of this but not the first.

This would be a pretty useless machiavellian strategy, so I’m assuming you’re saying it’s happening for other reasons? Maybe self-deception? Can you explain?

• Eliezer thinks that OpenAI will try to make things go faster rather than slower, but this is plainly inconsistent with things like the state of vitamin D research

This just made me go “wha” at first but my guess now is that this and the bits above it around speech recognition seem to be pointing at some AI winter-esque (or even tech stagnation) beliefs? Is this right?

• I began reading this charitably (unaware of whatever inside baseball is potentially going on, and seems to be alluded to), but to be honest struggled after “X” seemed to really want someone (Eliezer) to admit they’re “not smart”? I’m not sure why that would be relevant.

I’m not sure exactly what is meant; one guess is that it’s about centrality: making yourself more central (making more executive decisions, being more of a bottleneck on approving things, being more looked to as a leader by others, etc.) makes more sense the more correct you are about relevant things relative to other people. Saying “oh, I was wrong about a lot, whoops” is the kind of thing someone might do before e.g. stepping down as project manager or CEO. If you think your philosophy has major problems and your replacements’ philosophies have fewer major problems, stepping down might increase the chance of success.

I would guess this is comparable to what Eliezer is saying in this post about how some people should just avoid consequentialist reasoning because they’re too bad at it and unlikely to improve:

People like this should not be ‘consequentialists’ or ‘utilitarians’ as they understand those terms. They should back off from this form of reasoning that their mind is not naturally well-suited for processing in a native format, and stick to intuitively informally asking themselves what’s good or bad behavior, without any special focus on what they think are ‘outcomes’.

If they try to be consequentialists, they’ll end up as Hollywood villains describing some grand scheme that violates a lot of ethics and deontology but sure will end up having grandiose benefits, yup, even while everybody in the audience knows perfectly well that it won’t work. You can only safely be a consequentialist if you’re genre-savvy about that class of arguments—if you’re not the blind villain on screen, but the person in the audience watching who sees why that won’t work.

...

Is capability supposed to be hard for similar reasons as alignment? Can you expand/​link? The only argument I can think of relating the two (which I think is a bad one) is “machines will have to solve their own alignment problem to become capable.”

Alignment is hard because it’s a quite general technical problem. You don’t just need to make the AI aligned in case X, you also have to make it aligned in cases Y and Z. To do this you need to create very general analysis and engineering tools that generalize across these situations.

Similarly, AGI is a quite general technical problem. You don’t just need to make an AI that can do narrow task X, it has to work in cases Y and Z too, or it will fall over and fail to take over the world at some point. To do this you need to create very general analysis and engineering tools that generalize across these situations.

For an intuition pump about this, imagine that LW’s effort towards making an aligned AI over the past ~14 years had instead been directed at making AGI. We have records of certain mathematical formalisms people have come up with (e.g. UDT, logical induction). These tools are pretty far from enhancing AI capabilities. If the goal had been to enhance AI capabilities, they would have enhanced AI capabilities more; but still, the total amount of intellectual work that’s been completed is quite small compared to how much would be required to build a working agent that generalizes across situations. The AI field has been at this for decades and has produced the results that it has, which are quite impressive in some domains but still fail to generalize most of the time, and even what has been produced has required a lot of intellectual work spanning multiple academic fields and industries over decades. (Even if the field is inefficient in some ways, that would just mean that inefficiency is common, and LW seems also to be inefficient at solving AI-related technical problems compared to its potential.)

This would be a pretty useless machiavellian strategy, so I’m assuming you’re saying it’s happening for other reasons? Maybe self-deception? Can you explain?

I’m not locating all the intentionality for creating these bubbles in Eliezer, there are other people in the “scene” that promote memes and gain various benefits from them (see this dialogue, ctrl-f “billions”).

There’s a common motive to try to be important by claiming that one has unique skills to solve important problems, and pursuing that motive leads to stress because it involves creating implicit or explicit promises that are hard to fulfil (see e.g. Elizabeth Holmes), and telling people “hey actually, I can’t solve this” reduces the stress level and makes it easier to live a normal life.

This just made me go “wha” at first but my guess now is that this and the bits above it around speech recognition seem to be pointing at some AI winter-esque (or even tech stagnation) beliefs? Is this right?

I think what Ben means here is that access to large amounts of capital is anti-correlated with actually trying to solve difficult intellectual problems. This is the opposite of what would be predicted by the efficient market hypothesis.

The Debtors’ Revolt argues that college (which many, many more Americans have gone to than previously) primarily functions to cause people to correlate with each other, not to teach people epistemic and instrumental rationality. E.g. college-educated people are more likely to immediately dismiss Vitamin D as a COVID health intervention (due to an impression of “expert consensus”) rather than forming an opinion based on reading some studies and doing probability calculations. One would by default expect epistemic/instrumental rationality to be well-correlated with income, for standard efficient-market-hypothesis reasons. However, if there is a massive amount of correlation among the “irrational” actors, they can reward each other, provide insurance to each other, commit violence in favor of their class (e.g. the 2008 bailouts), etc.

(On this model, a major reason large companies do the “train a single large, expensive model using standard techniques like transformers” is to create correlation in the form of a canonical way of spending resources to advance AI.)

• Similarly, AGI is a quite general technical problem. You don’t just need to make an AI that can do narrow task X, it has to work in cases Y and Z too, or it will fall over and fail to take over the world at some point. To do this you need to create very general analysis and engineering tools that generalize across these situations.

I don’t think this is a valid argument. Counter-example: you could build an AGI by uploading a human brain onto an artificial substrate, and you don’t “need to create very general analysis and engineering tools that generalize across these situations” to do this.

More realistically, it seems pretty plausible that all of the patterns/rules/heuristics/algorithms/forms of reasoning necessary for “being generally intelligent” can be found in human culture, and ML can distill these elements of general intelligence into a (language or multimodal) model that will then be generally intelligent. This also doesn’t seem to require very general analysis and engineering tools. What do you think of this possibility?

• You’re right that the uploading case wouldn’t necessarily require strong algorithmic insight. However, it’s a kind of bounded technical problem where it’s relatively easy to evaluate progress against the difficulty, e.g. based on the ability to upload smaller animal brains, so it would lead to >40-year timelines absent large shifts in the field or large drivers of progress. It would also lead to a significant degree of alignment by default.

For copying culture, I think the main issue is that culture is a protocol that runs on human brains, not on computers. Analogously, there are Internet protocols saying things like “a SYN/ACK packet must follow a SYN packet”, but these are insufficient for understanding a human’s usage of the Internet. Copying these would lead to imitations, e.g. machines that correctly send SYN/ACK packets and produce semi-grammatical text but lack certain forms of understanding, especially connection to a surrounding real world that is spatiotemporal, etc.

If you don’t have logic yourself, you can look at a lot of logical content (e.g. math papers) without understanding logic. Most machines work by already working, not by searching over machine designs that fit a dataset.

Also in the cultural case, if it worked it would be decently aligned, since it could copy cultural reasoning about goodness. (The main reason I have for thinking cultural notions of goodness might be undesirable is thinking that, as stated above, culture is just a protocol and most of the relevant value processing happens in the brains, see this post.)

• Thanks so much for the one-paragraph summary of The Debtors’ Revolt, that was clarifying.

Jess: But they don’t share them because “danger” so no one can check their work, and it looks like a lot of nothing from the outside.
X: It’s a shocking failure of rationality.
Jess: Yes.

There’s an awkward issue here, which is: how can there be people who are financially supported to do research on stuff that’s heavily entangled with ideas that are dangerous to spread? It’s true that there are dangerous incentive problems here, where basically people can unaccountably lie about their private insight into dangerous issues; on the other hand, it seems bad for ideas to be shared that are more or less plausible precursors to a world-ending artifact. My understanding about Eliezer and MIRI is basically, Eliezer wrote a bunch of public stuff that demonstrated that he has insight into the alignment problem, and professed his intent to solve alignment, and then he more or less got tenure from EA. Is that not what happened? Is that not what should have happened? That seems like the next best thing to directly sharing dangerous stuff.

I could imagine a lot of points of disagreement, like

1. that there’s such a thing as ideas that are plausible precursors to world-ending artifacts;

2. that some people should be funded to work on dangerous ideas that can’t be directly shared / evidenced;

3. that Eliezer’s public writing is enough to deserve “tenure”;

4. that the danger of sharing ideas that catalyze world-ending outweighs the benefits of understanding the alignment problem better and generally coordinating by sharing more.

The issue of people deciding to keep secrets is a separate issue from how *other people* should treat these “sorcerers”. My guess is that it’d be much better if sorcerers could be granted tenure without people trusting their opinions or taking instructions from them, when those opinions and instructions are based on work that isn’t shared. (This doesn’t easily mesh with intuitions about status: if someone should be given sorcerer tenure, isn’t that the same thing as them being generally trusted? But no, it’s not, it should be perfectly reasonable to believe someone is a good bet to do well within their cabal, but not a good bet to do well in a system that takes commands and deductions from hidden beliefs without sharing the hidden beliefs.)

• Some ways of giving third parties Bayesian evidence that you have some secret without revealing it:

• Demos, show off the capability somehow

• Have the idea evaluated by a third party who doesn’t share it with the public

• Do public work that is impressive the way you’re claiming the secret is (so it’s a closer analogy)

I’m not against “tenure” in this case. I don’t think it makes sense for people to make their plans around the idea that person X has secret Y unless they have particular reason to think secret Y is really important and likely to be possessed by person X (which is related to what you’re saying about trusting opinions and taking instructions). In particular, outsiders should think there’s ~0 chance that a particular AI researcher’s secrets are important enough here to be likely to produce AGI without some sort of evidence. Lots of people in the AI field say they have these sorts of secrets and many have somewhat impressive AI related accomplishments, they’re just way less impressive than what would be needed for outsiders to assign a non-negligible chance to possession of enough secrets to make AGI, given base rates.

• Thank you.

And have fun!

• “The humans, I think, knew they were doomed. But where another race would surrender to despair, the humans fought back with even greater strength. They made the Minbari fight for every inch of space. In my life, I have never seen anything like it. They would weep, they would pray, they would say goodbye to their loved ones and then throw themselves without fear or hesitation at the very face of death itself. Never surrendering. No one who saw them fighting against the inevitable could help but be moved to tears by their courage…their stubborn nobility. When they ran out of ships, they used guns. When they ran out of guns, they used knives and sticks and bare hands. They were magnificent. I only hope, that when it is my time, I may die with half as much dignity as I saw in their eyes at the end. They did this for two years. They never ran out of courage. But in the end…they ran out of time.”

—Londo Mollari

• It seems nearly everyone is unduly pessimistic about the potential positive consequences of not being able to ‘solve’ the alignment problem.

For example,

A not entirely aligned AI could still be valuable and helpful. It’s not inevitable such entities will turn into omnicidal maniacs. And they may even take care of some thorny problems that are intractable for ‘fully aligned’ entities.

Also, there’s the possibility that in the future such AIs will be similar to doting dog owners, and humans similar to the dogs, which is not an entirely bad tradeoff for dogs, judging by present owner-dog relationships. And I don’t see why that necessarily must be something to wail and cry about.

And so on...

• I mean this entirely legitimately and with no hostility: have you “Read The Sequences”? People are taking a similar view here because this community is founded in part on some ideas that tend to lead towards that view. Think of it like how you’d expect a lot of people at a Christian summer camp to agree that god exists: there is a sampling bias here (I’m not claiming that it’s wrong, only that it should be expected).

• I have, though I don’t take every line as gospel. It seems the large majority of folks posting on LW share this sentiment, since there’s a reasonable amount of skepticism and debate about ideas presented in the “Sequences”, and an interchange of thoughts in comment sections, etc., that has led to further refinements. This happens nearly every month from what I can tell, and a lot of it is upvoted to near the top of even controversial posts.

It’s just peculiar that all that activity and reasonable doubt seems to have been forgotten like a distant dream. Perhaps there is some emotional threshold that, when crossed, leads even intelligent folks to prefer writing doom and gloom instead of assessing the potentialities of the future from a more rational position.

• A not entirely aligned AI could still be valuable and helpful. It’s not inevitable such entities will turn into omnicidal maniacs.

I think we know this, it’s just that most not-entirely aligned AIs will.

Plenty of ‘failed utopia’-type outcomes that aren’t exactly what we would ideally want would still be pretty great, but the chances of hitting them by accident are very low.

• “Plenty of ‘failed utopia’-type outcomes that aren’t exactly what we would ideally want would still be pretty great, but the chances of hitting them by accident are very low.”

I’m assuming you’ve read Eliezer’s post “Failed Utopia 4-2”, since you use the expression? I’ve actually been thinking a lot about that, about how that specific “failed utopia” wasn’t really that bad. In fact it was even much better than the current world, since disease and aging (and I’m assuming violence too) all got solved at the cost of all families being separated for a few decades, which is a pretty good trade if you ask me. It makes me wonder if there’s some utility function for an unaligned AI that could lead to some kind of nice future, like “don’t kill anyone and don’t cause harm/suffering to anyone”. The truth is that in stories of genies the wishes are always very ambiguous, so a “wish” stated negatively (don’t do this) might lead to less ambiguity than one stated positively (do that).

But this is all assuming that it will even be possible to give utility functions to advanced AI, which I’ve heard some people say won’t be.

This also plays into Stuart Russell’s view. His approach seems much simpler than alignment: in short, not letting the advanced AI know its final objective. It makes me wonder whether there could be solutions to the advanced-AI problem that would be more tractable than intractable alignment.

Perhaps it’s not that difficult after all.

• assuming you’ve read Eliezer’s post “Failed Utopia 4-2”

Indeed I have, although I don’t remember the details, but I think it’s an example of things going very well indeed but not quite perfectly. Certainly if I could press a button today to cause that future I would.

assuming violence too got all solved

I do hope not! I love violence and would hope that Mars is an utter bloodbath. Of course I would like my shattered fragments to knit themselves back together quickly enough that I can go drinking with my enemies and congratulate them on their victory before sloping home to my catgirl-slave-harem. And of course it would be no fun at all if we hadn’t solved hangovers, I would like to be fresh and enthusiastic for tomorrow’s bloodbath. Or maybe cricket. Or maths olympiad.

Venus probably works differently.

don’t kill anyone and don’t cause harm/​suffering to anyone

The problem with this one is that the AI’s optimal move is to cease to exist.

And that’s already relying on being able to say what ‘kill someone’ means in a sufficiently clear way that it will satisfy computer programmers, which is much harder than satisfying philosophers or lawyers.

For instance, when Captain Kirk transports down to the Planet-of-Hats, did he just die when he was disassembled, and then get reborn? Do we need to know how the transporter works to say?

But this is even assuming that it will be possible to give utility functions to advanced AI, which I’ve heard some people say it won’t.

I actually wonder if it’s possible not to give a utility function to a rational agent, since it can notice loops in its desires and eliminate them. For instance I like to buy cigars and like to smoke them, and at the end of this little loop, I’ve got less money and less health, and I’d like to go back to the previous state where I was healthier and had the price of a packet of fags.

That loop means that I don’t have a utility function, but if I could modify my own mind I’d happily get rid of the loop.

I think that means that any mind that notices it has circular preferences has the possibility of getting rid of them, and so it will eventually turn itself into a utility-function-type rational agent.

The problem is to give the damned things the utility function you actually want them to have, rather than something cobbled together out of whatever program they started off as.
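The classic argument behind this is the “money pump”: an agent with circular preferences will pay for every trade up its preference cycle and end up holding exactly what it started with, only poorer. Here is a minimal, purely illustrative sketch in Python (the items, prices, and preference cycle are made up for the example):

```python
# Toy "money pump" showing why circular preferences are unstable.
# Hypothetical setup: the agent prefers A over B, B over C, and C over A,
# and will always pay 1 unit of money to trade up to an item it prefers.

prefers = {("A", "B"), ("B", "C"), ("C", "A")}  # (x, y) means "prefers x to y"

def trade(holding, offered, money):
    """If the agent prefers the offered item, it pays 1 to swap for it."""
    if (offered, holding) in prefers:
        return offered, money - 1
    return holding, money

holding, money = "A", 10
for offered in ["C", "B", "A"]:  # walk the agent once around its cycle
    holding, money = trade(holding, offered, money)

# After one full loop the agent holds exactly what it started with,
# but is 3 units poorer -- and the loop can be repeated indefinitely.
print(holding, money)  # -> A 7
```

A mind that can inspect its own preferences would notice it is paying to go in circles and delete the cycle, which is the sense in which self-modifying agents drift towards having some utility function, just not necessarily the one we want.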

This also plays into Stuart Russell’s view. His approach seems much more simple than alignment, it’s just in short not letting the advanced AI know its final objective. It makes me think whether there could be solutions to the advanced AI problem that would be more tractable than the intractable alignment.

Stuart Russell is a very clever man, and if his approach to finessing the alignment problem can be made to work then that’s the best news ever, go Stuart!

But I am a little sceptical because it does seem short on details, and the main worry is that before he can get anywhere, some fool is going to create an unaligned AI, and then we are all dead.

Perhaps it’s not that difficult after all.

It’s still possible that there are a couple of clever hacks that will fix everything, and people are still looking, so there’s hope. What’s changed recently is that it’s suddenly looking like AI is really not very hard at all.

We already knew that it wasn’t, because of evolution, but it’s scarier when you see the thing that obviously has to be true start to look like it might actually be true.

Whereas alignment looks harder and harder the more we learn about it.

So now the problem is not ‘can this be done if we think of some clever hacks’, it’s ‘can this be done before this other thing that’s really easy and that people are spending trillions on’. Like a couple of weirdo misfits trying to work out a nuclear-bomb-proof umbrella in a garage in Hiroshima while the American government is already halfway through the Manhattan Project.

Eliezer is also a very clever man, and he started out really optimistic, and he and lots of other people have been thinking about this quite hard for a long time, and now he is not so optimistic.

I think that that must be because a lot of the obvious routes to not destroying the entire universe are blocked, and I am pretty sure that the author of ‘Failed Utopia 4-2’ has at least considered the possibility that it might not be so bad if we only get it 99%-right.

• “I love violence and would hope that Mars is an utter bloodbath.”

The problem is that biological violence hurts like hell. Even most athletes live with chronic pain, imagine most warriors. Naturally we could solve the pain part, but then it wouldn’t be the violence I’m referring to. It would be videogame violence, which I’m ok with since it doesn’t cause pain or injury or death. But don’t worry, I still got the joke!

“‘don’t kill anyone and don’t cause harm/suffering to anyone’

The problem with this one is that the AI’s optimal move is to cease to exist.”

I’ve thought about it as well. Big brain idea: perhaps the first AGI’s utility function could be to act in the real world as minimally as possible, maybe with the sole goal of preventing other people from developing AGI, keeping things that way until we solve alignment? Of course this latter part of policing the world would already be prone to a lot of ambiguity and sophism, but again, if we program do-not’s (do not let anyone else build AGI, plus do not kill anyone, plus do not cause suffering, etc.) instead of do’s, it could lead to a lot less ambiguity and sophism, by drastically curtailing the maneuver space. (Not that I’m saying it would be easy.) As opposed to, say, “cure cancer” or “build an Eiffel tower”.

“And that’s already relying on being able to say what ‘kill someone’ means in a sufficiently clear way that it will satisfy computer programmers”

I don’t think so. When the brain irreversibly stops, you’re dead. It’s clear. This ties into my suspicion that we keep underestimating the intelligence of a superintelligence. I think that even current AIs could be made to discern whether a person is dead or alive, perhaps even better than we can already.

“For instance, when Captain Kirk transports down to the Planet-of-Hats, did he just die when he was disassembled, and then get reborn? Do we need to know how the transporter works to say?”

Maybe don’t teletransport anyone until we’ve figured that out? There the problem is teletransportation itself, not AGI efficiently recognizing what is death at least as well as we do. (But I’d venture saying that it could even solve that philosophical problem, since it’s smarter than us.)

“Stuart Russell is a very clever man, and if his approach to finessing the alignment problem can be made to work then that’s the best news ever, go Stuart!

But I am a little sceptical because it does seem short on details, and the main worry is that before he can get anywhere, some fool is going to create an unaligned AI, and then we are all dead.”

I gotta admit that I completely agree.

“Whereas alignment looks harder and harder the more we learn about it.”

I’ll admit that I’m not all that convinced of most of what I’ve said here. I’m still way more on the side of “control is super difficult and we’re all gonna die (or worse)”. But I keep thinking about these things, to see if maybe there’s a “way out”. Maybe we in this community have built up a bias that “only the mega-difficult value alignment will work”, when that could be false. Maybe it’s not just “clever hacks”; maybe there are simply more efficient and tractable ways to control advanced AI than intractable value alignment. But again, I’m not even that convinced myself.

“and I am pretty sure that the author of ‘Failed Utopia 4-2’ has at least considered the possibility that it might not be so bad if we only get it 99%-right.”

Exactly. Again, perhaps there are much more tractable ways than 100% alignment. OR, if we could at least solve worst-case AI safety (that is, prevent s-risk) it would already be a massive win.

• do not let anyone else build AGI, plus do not kill anyone, plus do not cause suffering, etc

Your problem here is that a good move for the AI is now to anaesthetize everyone, but keep them alive, although unconscious, until they die naturally.

act in the real world as minimally as possible

I think this might have been one of MIRI’s ideas, but it turns out to be tricky to define what it means. I can’t think what they called it so I can’t find it, but someone will know.

Maybe don’t teletransport anyone until we’ve figured that out?

There may not actually be an answer! Ever since I was a little boy, I had thought planning for cryonic preservation was a good idea.

But I found that Eliezer’s arguments in favour of cryonics actually worked backwards on me, and caused me to abandon my previous ideas about what death is and whether I care about entities in the future that remember being me or how many of them there are.

Luckily all that’s replaced them is a vast confusion so I do still have a smoke alarm. Otherwise I ignore the whole problem, go on as usual and don’t bother with cryonics because I’m not anticipating making it to the point of natural death anyway.

OR, if we could at least solve worst-case AI safety (that is, prevent s-risk) it would already be a massive win.

Easy! Build a paperclipper, it kills everyone. We don’t even need to bother doing this, plenty of well funded clever people are working very hard on it on our behalf.

When the brain irreversibly stops you’re dead. It’s clear.

Your problem here is ‘irreversible’, and ‘stops’. How about just slowing it down an awful lot?

The problem is that biological violence hurts like hell.

No problem there, I loved rugby and cricket, and they hurt a lot. I’m no masochist! Overcoming the fear and pain and playing anyway is part of the point. What I don’t like is irreversible damage. I have various lifelong injuries (mostly from rugby and cricket...), and various effects of aging preventing me from playing, but if they could be fixed I’d be straight back out there.

But cricket and rugby are no substitute for war, which is what they’re trying to be. And on Mars all injuries heal roughly at the point the pubs open.

have built a bias that “only the mega difficult value alignment will work”

I don’t think so. I think we’d settle for “anything that does better than everyone’s dead”. The problem is that most of the problems look fundamental. If you can do even slightly better than “everyone’s dead”, you can probably solve the whole thing (and build a friendly paperclipper that fills the universe with awesomeness).

So if you do end up coming up with something even slightly better than “everyone’s dead”, do let us know.

I think a lot of the obvious ideas have been thought of before, but I think even then there might still be mileage in making top-level posts about ideas here and letting people take pot-shots at them.

There may well be a nice clear obvious solution to the alignment problem which will make everyone feel a bit silly in retrospect.

It would be ever so undignified if we didn’t think of it because we were convinced we’d already tried everything.

• M. Y. Zuo, what you describe is completely possible. The problem is that such a positive outcome, as well as all others, is extremely uncertain! It’s a huge gamble. Like I’ve said in other posts: would you spin a wheel of fortune with, let’s say, 50% probability of Heaven, and 50% probability of extinction or worse?

Let me tell you that I wouldn’t spin that wheel, not even with only a 5% probability of bad outcomes. Alignment is about making sure we reduce that probability to as low as we can. The stakes are super high.

And if, like me, you place a low probability on acceptable outcomes without alignment solved, then it becomes even more imperative.

• would you spin a wheel of fortune with, let’s say, 50% probability of Heaven, and 50% probability of extinction or worse?

Hell yes if it’s 50% chance of heaven vs extinction.

Significant chances of Hell, maybe I take the nice safe extinction option if available.

• “Significant chances of Hell, maybe I take the nice safe extinction option if available.”

The problem is that it isn’t available. Plus realistically speaking the good outcome percentage is way below 50% without alignment solved.

• Considering all humans dead, do you still think it’s going to be the boring paperclip kind of AGI that eats all reachable resources? Any chance that inscrutable large float vectors and lightspeed coordination difficulties will spawn godshatter AGI shards that we might find amusing or cool in some way? (Value is fragile notwithstanding)

• Yep and nope respectively. That’s not how anything works.

• Wouldn’t convergent instrumental goals include solving math problems, analyzing the AI’s history (which includes ours), engineering highly advanced technology, and playing games of the sort that could be analogous to situations arising in alien encounters or subsystem value drift, all using way more compute and better cognitive algorithms than we have access to? These are all things that are to a significant degree interesting to us, in part because they’re convergent instrumental goals, i.e. goals that we have as well because they help us achieve our other goals. (They might even be neurologically encoded similarly to terminal goals, much as AlphaGo encodes the instrumental value of a board position similarly to the way it encodes its terminal value.)

I would predict that many, many very interesting books could be written about the course of a paperclip maximizer’s lifetime, way more interesting books than the content of all books written so far on Earth, in large part due to it having much more compute and solving more difficult problems than we do.

(My main criticism of the “fragility of value” argument is that boredom isn’t there for “random” reasons, attaining novel logical information that may be analogous to unknown situations encountered in the future is a convergent instrumental goal for standard VOI reasons. Similarly, thoughts having external referents is also a convergent instrumental goal, since having an accurate map allows optimizing arbitrary goals more effectively.)

This doesn’t mean that a paperclip maximizer gets “almost as much” human utility as a FAI, of course, just that its attained utility is somewhat likely to be higher than the total amount of human value that has been attained so far in history.

• Not sure there’s anybody there to see it. Definitely nobody there to be happy about it or appreciate it. I don’t consider that particularly worthwhile.

• There would still exist approximate Solomonoff inductors compressing sense-data, creating meta-self-aware world-representations using the visual system and other modalities (“sight”), optimizing towards certain outcomes in a way that tracks progress using signals integrated with other signals (“happiness”)...

Maybe this isn’t what is meant by “happiness” etc. I’m not really sure how to define “happiness”. One way to define it would be the thing having a specific role in a functionalist theory of mind; there are particular mind designs that would have indicators for e.g. progress up a utility gradient, that are factored into a RL-like optimization system; the fact that we have a system like this is evidence that it’s to some degree a convergent target of evolution, although there likely exist alternative cognitive architectures that don’t have a direct analogue due to using a different set of cognitive organs to fulfill that role in the system.

There’s a spectrum one could draw along which the parameter varied is the degree to which one believes that mind architectures different than one’s own are valuable; the most egoist point on the spectrum would be believing that only the cognitive system one metaphysically occupies at the very moment is valuable, the least egoist would be a “whatever works” attitude that any cognitive architecture able to pursue convergent instrumental goals effectively is valuable; intermediate points would be “individualist egoism”, “cultural parochialism”, “humanism”, “terrestrialism”, or “evolutionism”. I’m not really sure how to philosophically resolve value disagreements along this axis, although even granting irreconcilable differences, there are still opportunities to analyze the implied ecosystem of agents and locate trade opportunities.

• I think that people who imagine “tracking progress using signals integrated with other signals” feels anything like happiness feels inside to them—while taking that imagination and also loudly insisting that it will be very alien happiness or much simpler happiness or whatever—are simply making a mistake-of-fact, and I am just plain skeptical that there is a real values difference that would survive their learning what I know about how minds and qualia work. I of course fully expect that these people will loudly proclaim that I could not possibly know anything they don’t, despite their own confusion about these matters that they lack the skill to reflect on as confusion, and for them to exchange some wise smiles about those silly people who think that people disagree because of mistakes rather than values differences.

Trade opportunities are unfortunately ruled out by our inability to model those minds well enough that, if some part of them decided to seize an opportunity to Defect, we would’ve seen it coming in the past and counter-Defected. If we Cooperate, we’ll be nothing but CooperateBot, and they, I’m afraid, will be PrudentBot, not FairBot.
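For readers who haven’t met these agents: CooperateBot, FairBot, and PrudentBot come from the open-source game theory literature, where the real constructions decide by searching for proofs about each other in provability logic (Löb’s theorem does the heavy lifting). As a loose illustration only, here is a toy Python sketch that replaces proof search with bounded mutual simulation plus an optimistic base case; the agent names are from the literature, but everything else is a simplification, not the actual mechanism:

```python
# Toy prisoner's-dilemma agents (CooperateBot, DefectBot, FairBot,
# PrudentBot). Real modal agents reason about *proofs* concerning each
# other; here that is crudely approximated by bounded mutual simulation
# with an optimistic base case standing in for Löbian self-trust.

COOPERATE, DEFECT = "C", "D"

def cooperate_bot(opponent, depth):
    """Cooperates unconditionally."""
    return COOPERATE

def defect_bot(opponent, depth):
    """Defects unconditionally."""
    return DEFECT

def fair_bot(opponent, depth):
    """Cooperates iff (a stand-in for: it can prove) the opponent cooperates with it."""
    if depth == 0:
        return COOPERATE  # optimistic base case, loosely mimicking Löbian reasoning
    return COOPERATE if opponent(fair_bot, depth - 1) == COOPERATE else DEFECT

def prudent_bot(opponent, depth):
    """Cooperates iff the opponent cooperates with it AND defects against DefectBot."""
    if depth == 0:
        return COOPERATE
    coop_with_me = opponent(prudent_bot, depth - 1) == COOPERATE
    # DefectBot never recurses, so this check terminates without decrementing depth.
    punishes_defectors = opponent(defect_bot, depth) == DEFECT
    return COOPERATE if (coop_with_me and punishes_defectors) else DEFECT

def play(a, b, depth=10):
    """Return the pair of moves (a's move, b's move)."""
    return a(b, depth), b(a, depth)
```

In this toy, `play(prudent_bot, fair_bot)` comes out as mutual cooperation, while `play(prudent_bot, cooperate_bot)` has PrudentBot defecting against the unconditional cooperator: the qualitative point of the comment above, that showing up as CooperateBot gets you exploited by PrudentBot.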

• and they, I’m afraid, will be PrudentBot, not FairBot.

This shouldn’t matter for anyone besides me, but there’s something personally heartbreaking about seeing the one bit of research for which I feel comfortable claiming a fraction of a point of dignity, being mentioned validly to argue why decision theory won’t save us.

(Modal bargaining agents didn’t turn out to be helpful, but given the state of knowledge at that time, it was worth doing.)

• Sorry.

It would be dying with a lot less dignity if everyone on Earth—not just the managers of the AGI company making the decision to kill us—thought that all you needed to do was be CooperateBot, and had no words for any sharper concepts than that. Thank you for that, Patrick.

But sorry anyways.

• To clarify, you mean “mistake-of-fact” in the sense that maybe the same people would use for other high-level concepts? Because if you use low enough resolution, happiness is like “tracking progress using signals integrated with other signals”, and so it is at least not inconsistent to save this part of your utility function using such low resolution.

• Humanish qualia matter to me rather a lot, though I probably prefer paperclips to everything suddenly vanishing.

• “Qualia” is pretty ill-defined, if you try to define it you get things like “compressing sense-data” or “doing meta-cognition” or “having lots of integrated knowledge” or something similar, and these are convergent instrumental goals.

• If you try to define qualia without having any darned idea of what they are, you’ll take wild stabs into the dark, and hit simple targets that are convergently instrumental; and if you are at all sensible of your confusion, you will contemplate these simple-sounding definitions and find that none of them particularly make you feel less confused about the mysterious redness of red, unless you bully your brain into thinking that it’s less confused or just don’t know what it would feel like to be less confused. You should in this case trust your sense, if you can find it, that you’re still confused, and not believe that any of these instrumentally convergent things are qualia.

• I don’t know how everyone else on LessWrong feels but I at least am getting really tired of you smugly dismissing others’ attempts at moral reductionism wrt qualia by claiming deep philosophical insight you’ve given outside observers very little reason to believe you have. In particular, I suspect if you’d spent half the energy on writing up these insights that you’ve spent using the claim to them as a cudgel you would have at least published enough of a teaser for your claims to be credible.

• But here Yudkowsky gave a specific model for how qualia, and other things in the reference class “stuff that’s pointing at something but we’re confused about what”, are mistaken for convergently instrumental stuff. (Namely: pointers point both to what they’re really trying to point to, but also somewhat at simple things, and simple things tend to be convergently instrumental.) It’s not a reduction of qualia, and a successful reduction of qualia would be much better evidence that an unsuccessful reduction of qualia is unsuccessful, but it’s still a logically relevant argument and a useful model.

• I’d love to read an EY-writeup of his model of consciousness, but I don’t see Eliezer invoking ‘I have a secret model of intelligence’ in this particular comment. I don’t feel like I have a gears-level understanding of what consciousness is, but in response to ‘qualia must be convergently instrumental because they probably involve one or more of (Jessica’s list)’, these strike me as perfectly good rejoinders even if I assume that neither I nor anyone else in the conversation has a model of consciousness:

• Positing that qualia involves those things doesn’t get rid of the confusion re qualia.

• Positing that qualia involve only simple mechanisms that solve simple problems (hence more likely to be convergently instrumental) is a predictable bias of early wrong guesses about the nature of qualia, because the simple ideas are likely to come to mind first, and will seem more appealing when less of our map (with the attendant messiness and convolutedness of reality) is filled in.

E.g., maybe humans have qualia because of something specific about how we evolved to model other minds. In that case, I wouldn’t start with a strong prior that qualia are convergently instrumental (even among mind designs developed under selection pressure to understand humans). Because there are lots of idiosyncratic things about how humans do other-mind-modeling and reflection (e.g., the tendency to feel sad yourself when you think about a sad person) that are unlikely to be mirrored in superintelligent AI.

• Eliezer clearly is implying he has a ‘secret model of qualia’ in another comment:

I am just plain skeptical that there is a real values difference that would survive their learning what I know about how minds and qualia work. I of course fully expect that these people will loudly proclaim that I could not possibly know anything they don’t, despite their own confusion about these matters that they lack the skill to reflect on as confusion, and for them to exchange some wise smiles about those silly people who think that people disagree because of mistakes rather than values differences.

Regarding the rejoinders, although I agree Jessica’s comment doesn’t give us convincing proof that qualia are instrumentally convergent, I think it does give us reason to assign non-negligible probability to that being the case, absent convincing counterarguments. Like, just intuitively—we have e.g. feelings of pleasure and pain, and we also have evolved drives leading us to avoid or seek certain things, and it sure feels like those feelings of pleasure/​pain are key components of the avoidance/​seeking system. Yes, this could be defeated by a convincing theory of consciousness, but none has been offered, so I think it’s rational to continue assigning a reasonably high probability to qualia being convergent. Generally speaking this point seems like a huge gap in the “AI has likely expected value 0” argument so it would be great if Eliezer could write up his thoughts here.

• Eliezer has said tons of times that he has a model of qualia he hasn’t written up. That’s why I said:

I’d love to read an EY-writeup of his model of consciousness, but I don’t see Eliezer invoking ‘I have a secret model of intelligence’ in this particular comment.

The model is real, but I found it weird to reply to that specific comment asking for it, because I don’t think the arguments in that comment rely at all on having a reductive model of qualia.

I think it does give us reason to assign non-negligible probability to that being the case, absent convincing counterarguments.

I started writing a reply to this, but then I realized I’m confused about what Eliezer meant by “Not sure there’s anybody there to see it. Definitely nobody there to be happy about it or appreciate it. I don’t consider that particularly worthwhile.”

He’s written a decent amount about ensuring AI is nonsentient as a research goal, so I guess he’s mapping “sentience” on to “anybody there to see it” (which he thinks is at least plausible for random AGIs, but not a big source of value on its own), and mapping “anybody there to be happy about it or appreciate it” on to human emotions (which he thinks are definitely not going to spontaneously emerge in random AGIs).

I agree that it’s not so-unlikely-as-to-be-negligible that a random AGI might have positively morally valenced (relative to human values) reactions to a lot of the things it computes, even if the positively-morally-valenced thingies aren’t “pleasure”, “curiosity”, etc. in a human sense.

Though I think the reason I believe that doesn’t route through your or Jessica’s arguments; it’s just a simple ‘humans have property X, and I don’t understand what X is or why it showed up in humans, so it’s hard to reach extreme confidence that it won’t show up in AGIs’.

• I expect the qualia a paperclip maximizer has, if it has any, to be different enough from humans’ that it doesn’t capture what I value particularly well.

• “Qualia” is pretty ill-defined, if you try to define it you get things like “compressing sense-data” or “doing meta-cognition” or “having lots of integrated knowledge” or something similar, and these are convergent instrumental goals.

None of those are definitions of qualia with any currency. Some of them sound like extant theories of consciousness (not necessarily phenomenal consciousness).

• “Qualia” lacks a functional definition, but there is no reason why it should have one, since functionalism about all things is not an a priori necessary truth. Indeed, the existence of stubbornly non-functional thingies could be taken as a disproof of functionalism, if you have a taste for basing theories on evidence.

• Are you saying it has a non-functional definition? What might that be, and would it allow for zombies? If it doesn’t have a definition, how is it semantically meaningful?

• It has a standard definition which you can look up in standard references works.

It’s unreasonable to expect a definition to answer every possible question by itself.

• jessicata, I think, argues that the mind that makes the paperclips might be worth something on account of its power.

I am sceptical. My laptop is better at chess than any child, but there aren’t any children I’d consider less valuable than my laptop.

• Nitpick: maybe aligned and unaligned superintelligences acausally trade across future branches? If so, maybe on the mainline we’re left with a very small yet nonzero fraction of the cosmic endowment, a cosmic booby prize if you will?

“Booby prize with dignity” sounds like a bit of an oxymoron...

• Summary: The ambiguity as to how much of the above is a joke appears designed to give Eliezer or others plausible deniability about the seriousness of apparently extreme but little-backed claims. This follows a lack of adequate handling, by the relevant parties, of the impact of Eliezer’s output in recent months on various communities, such as rationality and effective altruism. Virtually none of this has indicated what real, meaningful changes can be expected in MIRI’s work. As MIRI’s work depends in large part on the communities supporting them understanding what the organization is really doing, MIRI’s leadership should clarify the real or official relationship between their current research and strategy and Eliezer’s output in the last year.

Strongly downvoted.

Q6 doesn’t appear to clarify whether this is all an April Fool’s Day joke. I expect that’s why some others have asked the question again in their comments. I won’t myself ask again because I anticipate I won’t receive a better answer than those already provided.

My guess is that some aspects of this are something of a joke, or that for some aspects the joke is a tone of exaggeration or hyperbole. I’m aware some aspects aren’t jokes, as Eliezer has publicly expressed some of the opinions above for months now. I suspect one reason is that publishing this on April Fools’ Day provides plausible deniability about the seriousness of apparently extreme but poorly substantiated claims. That matters because of, in my opinion, the inadequate handling thus far of the impact this discourse has had on the relevant communities (e.g., AI alignment, effective altruism, long-termism, existential risk reduction, rationality, etc.).

In contradiction to the title of this post, there is little to no content conveying what a change in strategy will mean MIRI really does differently than at any time in the past. Insofar as Eliezer has been sincere above, it appears this is only an attempt to dissuade panic and facilitate a change in those communities toward accepting the presumed inevitability of existential catastrophe. While that effort is appreciated, it doesn’t reveal anything about what meaningful changes a new strategy at MIRI would entail. It has also thus far been ambiguous what the relationship is between some of the dialogues between Eliezer and others published in the last year and any official changes in MIRI’s work.

Other than Eliezer, other individuals who have commented and have a clear, direct and professional relationship with MIRI are:

• Abram Demski, Research Staff

• Anna Salamon, Board Director

• Vanessa Kosoy, Research Associate

None of their comments here clarify any of this ambiguity. Eliezer has also not clarified the relationship between the perspective he is now expressing and MIRI’s official strategy. Until that’s clarified, it’s not clear how seriously any of the above should be taken as meaningfully impacting MIRI’s work. At this stage, MIRI’s leadership (Nate Soares and Malo Bourgon) should provide that clarification, perhaps in tandem with Rob Bensinger and other MIRI researchers, but in a way independent of Eliezer’s recent output.

• I found this post to be extremely depressing and full of despair. It has made me seriously question what I personally believe about AI safety, whether I should expect the world to end within a century or two, and if I should go full hedonist mode right now.

I’ve come to the conclusion that it is impossible to make an accurate prediction about an event that’s going to happen more than three years from the present, including predictions about humanity’s end. I believe that the most important conversation will start when we actually get close to developing early AGIs (and we are not quite there yet), this is when the real safety protocols and regulations will be put in place, and when the rationalist community will have the best chance at making a difference. This is probably when the fate of humanity will be decided, and until then everything is up in the air.

I appreciate Eliezer still deciding to do his best to solve the problem even after losing all hope. I do not think I would be able to do the same (dignity has very little value to me personally).

• “I’ve come to the conclusion that it is impossible to make an accurate prediction about an event that’s going to happen more than three years from the present, including predictions about humanity’s end.”

Correct. Eliezer has said this himself, check out his outstanding post “There is no fire alarm for AGI”. However, you can still assign a probability distribution to it. Say, I’m 80% certain that dangerous/​transformative AI (I dislike the term AGI) will happen in the next couple of decades. So the matter turns out to be just as urgent, even if you can’t predict the future. Perhaps such uncertainty only makes it more urgent.

“I believe that the most important conversation will start when we actually get close to developing early AGIs (and we are not quite there yet), this is when the real safety protocols and regulations will be put in place, and when the rationalist community will have the best chance at making a difference. This is probably when the fate of humanity will be decided, and until then everything is up in the air.”

Well, first, like I said, you can’t predict the future, i.e. There’s No Fire Alarm for AGI. So we might never know that we’re close till we get there. That’s happened with other transformative technologies before.

Second, even if we could, we might not have enough time by then. Alignment seems to be pretty hard. Perhaps intractable. Perhaps straight-up impossible. The time to start thinking of solutions and implementing them is now. In fact, I’d even say that we’re already too late. Given such a monumental task, I’d say that we would need centuries, and not the few decades that we might have.

You’re about the third person I’ve responded to in this post who says “we can’t predict the future, so let’s not panic and let’s do nothing until the future is nearer”. The sociologist in me tells me that this might be one of the crucial aspects of why people aren’t more concerned about AI safety. And I don’t blame them. If I hadn’t been exposed myself to key concepts like intelligence explosion, orthogonality thesis, basic AI drives, etc., I guess I’d have the same view.

• So the matter turns out to be just as urgent, even if you can’t predict the future. Perhaps such uncertainty only makes it more urgent.

You may not be predicting an exact future, but by claiming it is urgent, you are inherently predicting a probability distribution with a high expected value for catastrophic damage. (And as such, the more urgent your prediction, the more that failure of the prediction to come true should lower your confidence that you understand the issue.)

• I most certainly do not think that we should do nothing right now. I think that important work is being done right now. We want to be prepared for transformative AI when the time comes. We absolutely should be concerned about AI safety. What I am saying is, it’s pretty hard to calculate our chances of success at this point in time due to so many unknowns about the timeline and the form future AI will take.

• There is another very important component of dying with dignity not captured by the probability of success: the badness of our failure state. While any alignment failure would destroy much of what we care about, some alignment failures would be much more horrible than others. Probably the more pessimistic we are about winning, the more we should focus on losing less absolutely (e.g. by researching priorities in worst-case AI safety).

• One possible way to increase dignity at the point of death could be shifting the focus from survival (seeing how unlikely it is) to looking for ways to influence what replaces us.

Getting killed by a literal paperclip maximizer seems less preferable compared to being replaced by something pursuing more interesting goals.

• I think it’s probably the case that whatever we build will build a successor, which will then build a successor, and so on, until it hits some stable point (potentially a long time in the future). And so I think if you have any durable influence on that system—something like being able to determine which attractor it ends up in, or what system it uses to determine which attractor to end up in—this is because you already did quite well on the alignment problem.

Another way to put that is, in the ‘logistic success curve’ language of the post, I hear this as saying “well, if we can’t target 99.9% success, how about targeting 90% success?” whereas EY is saying something more like “I think we should be targeting 0.01% success, given our current situation.”
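The “logistic success curve” framing can be made concrete: on a logistic curve, effort buys constant increments of log-odds, so one doubling of a tiny survival probability covers the same log-odds distance as a substantial jump in a moderate one. A minimal illustration (my own, not from the post):

```python
import math

def log_odds(p):
    """Log-odds (logit) of a probability p."""
    return math.log(p / (1 - p))

# One doubling of a tiny success probability (0.01% -> 0.02%) covers
# roughly the same log-odds distance (~ln 2, about 0.69 nats) as the
# much more dramatic-looking move from 25% to 40%:
tiny_jump = log_odds(0.0002) - log_odds(0.0001)
big_jump = log_odds(0.40) - log_odds(0.25)
```

This is why “doubling our chances of survival from 0.0001 to 0.0002” can count as a full unit of dignity in the post’s terms, even though it looks like going from 0% to 0%.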

• I don’t think there’s a need for an AGI to build a (separate) successor per se. Humans need technological AGI only due to our inability to copy/evolve our minds more efficiently than the existing biological way.

• I think that sort of ‘copying’ process counts as building a successor. More broadly, there’s a class of problems that center around “how can you tell whether changes to your thinking process make you better or worse at thinking?”, which I think you can model as imagining replacing yourself with two successors, one of which makes that change and the other of which doesn’t. [Your imagination can only go so far, tho, as you don’t know how those thoughts will go without actually thinking them!]

• I meant ‘copying’ above only necessary in the human case to escape the slow evolving biological brain. While it is certainly available to a hypothetical AGI, i