# Discussion with Eliezer Yudkowsky on AGI interventions

The following is a partially redacted and lightly edited transcript of a chat conversation about AGI between Eliezer Yudkowsky and a set of invitees in early September 2021. By default, all other participants are anonymized as “Anonymous”.

I think this Nate Soares quote (excerpted from Nate’s response to a report by Joe Carlsmith) is a useful context-setting preface regarding timelines, which weren’t discussed as much in the transcript:

[...] My odds [of AGI by the year 2070] are around 85% [...]

I can list a handful of things that drive my probability of AGI-in-the-next-49-years above 80%:

1. 50 years ago was 1970. The gap between AI systems then and AI systems now seems pretty plausibly greater than the remaining gap, even before accounting for the recent dramatic increase in the rate of progress, and potential future increases in rate-of-progress as it starts to feel within-grasp.

2. I observe that, 15 years ago, everyone was saying AGI is far off because of what it couldn’t do—basic image recognition, Go, StarCraft, Winograd schemas, programmer assistance. But basically all that has fallen. The gap between us and AGI is made mostly of intangibles. (Computer Programming That Is Actually Good? Theorem proving? Sure, but on my model, “good” versions of those are a hair’s breadth away from full AGI already. And the fact that I need to clarify that “bad” versions don’t count, speaks to my point that the only barriers people can name right now are intangibles.) That’s a very uncomfortable place to be!

3. When I look at the history of invention, and the various anecdotes about the Wright brothers and Enrico Fermi, I get an impression that, when a technology is pretty close, the world looks a lot like how our world looks.

• Of course, the trick is that when a technology is a little far, the world might also look pretty similar!

• Though when a technology is very far, the world does look different—it looks like experts pointing to specific technical hurdles. We exited that regime a few years ago.

4. Summarizing the above two points, I suspect that I’m in more-or-less the “penultimate epistemic state” on AGI timelines: I don’t know of a project that seems like it’s right on the brink; that would put me in the “final epistemic state” of thinking AGI is imminent. But I’m in the second-to-last epistemic state, where I wouldn’t feel all that shocked to learn that some group has reached the brink. Maybe I won’t get that call for 10 years! Or 20! But it could also be 2, and I wouldn’t get to be indignant with reality. I wouldn’t get to say “but all the following things should have happened first, before I made that observation”. I have made those observations.

5. It seems to me that the Cotra-style compute-based model provides pretty conservative estimates. For one thing, I don’t expect to need human-level compute to get human-level intelligence, and for another, I think there’s a decent chance that insight and innovation have a big role to play, especially on 50-year timescales.

6. There has been a lot of AI progress recently. When I tried to adjust my beliefs so that I was positively surprised by AI progress just about as often as I was negatively surprised by AI progress, I ended up expecting a bunch of rapid progress. [...]

Further preface by Eliezer:

In some sections here, I sound gloomy about the probability that coordination between AGI groups succeeds in saving the world. Andrew Critch reminds me to point out that gloominess like this can be a self-fulfilling prophecy—if people think successful coordination is impossible, they won’t try to coordinate. I therefore remark in retrospective advance that it seems to me like at least some of the top AGI people, say at Deepmind and Anthropic, are the sorts who I think would rather coordinate than destroy the world; my gloominess is about what happens when the technology has propagated further than that. But even then, anybody who would rather coordinate and not destroy the world shouldn’t rule out hooking up with Demis, or whoever else is in front, if that person also seems to prefer not to completely destroy the world. (Don’t be too picky here.) Even if the technology proliferates and the world ends a year later when other non-coordinating parties jump in, it’s still better to take the route where the world ends one year later instead of immediately. Maybe the horse will sing.

Eliezer Yudkowsky

Hi and welcome. Points to keep in mind:

- I’m doing this because I would like to learn whichever actual thoughts this target group may have, and perhaps respond to those; that’s part of the point of anonymity. If you speak an anonymous thought, please have that be your actual thought that you are thinking yourself, not something where you’re thinking “well, somebody else might think that...” or “I wonder what Eliezer’s response would be to...”

- Eliezer’s responses are uncloaked by default. Everyone else’s responses are anonymous (not pseudonymous) and neither I nor MIRI will know which potential invitee sent them.

- Please do not reshare or pass on the link you used to get here.

- I do intend that parts of this conversation may be saved and published at MIRI’s discretion, though not with any mention of who the anonymous speakers could possibly have been.

Eliezer Yudkowsky

(Thank you to Ben Weinstein-Raun for building chathamroom.com, and for quickly adding some features to it at my request.)

Eliezer Yudkowsky

It is now 2PM; this room is now open for questions.

Anonymous

How long will it be open for?

Eliezer Yudkowsky

In principle, I could always stop by a couple of days later and answer any unanswered questions, but my basic theory had been “until I got tired”.

Anonymous

At a high level one thing I want to ask about is research directions and prioritization. For example, if you were dictator for what researchers here (or within our influence) were working on, how would you reallocate them?

Eliezer Yudkowsky

The first reply that came to mind is “I don’t know.” I consider the present gameboard to look incredibly grim, and I don’t actually see a way out through hard work alone. We can hope there’s a miracle that violates some aspect of my background model, and we can try to prepare for that unknown miracle; preparing for an unknown miracle probably looks like “Trying to die with more dignity on the mainline” (because if you can die with more dignity on the mainline, you are better positioned to take advantage of a miracle if it occurs).

Anonymous

I’m curious if the grim outlook is currently mainly due to technical difficulties or social/coordination difficulties. (Both avenues might have solutions, but maybe one seems more recalcitrant than the other?)

Eliezer Yudkowsky

Technical difficulties. Even if the social situation were vastly improved, on my read of things, everybody still dies because there is nothing that a handful of socially coordinated projects can do, or even a handful of major governments who aren’t willing to start nuclear wars over things, to prevent somebody else from building AGI and killing everyone 3 months or 2 years later. There’s no obvious winnable position into which to play the board.

Anonymous

just to clarify, that sounds like a large scale coordination difficulty to me (i.e., we—as all of humanity—can’t coordinate to not build that AGI).

Eliezer Yudkowsky

I wasn’t really considering the counterfactual where humanity had a collective telepathic hivemind? I mean, I’ve written fiction about a world coordinated enough that they managed to shut down all progress in their computing industry and only manufacture powerful computers in a single worldwide hidden base, but Earth was never going to go down that route. Relative to remotely plausible levels of future coordination, we have a technical problem.

Anonymous

Curious about why building an AGI aligned to its users’ interests isn’t a thing a handful of coordinated projects could do that would effectively prevent the catastrophe. The two obvious options are: it’s too hard to build it, vs. it wouldn’t stop the other group anyway. For “it wouldn’t stop them”, two lines of reply are: (a) nobody actually wants an unaligned AGI (they just don’t foresee the consequences and are pursuing the benefits of automated intelligence, so they can be defused by providing those benefits) (maybe not entirely true: omnicidal maniacs), and (b) an aligned AGI could help in stopping them. Is your take more on the “too hard to build” side?

Eliezer Yudkowsky

Because it’s too technically hard to align some cognitive process that is powerful enough, and operating in a sufficiently dangerous domain, to stop the next group from building an unaligned AGI in 3 months or 2 years. Like, they can’t coordinate to build an AGI that builds a nanosystem because it is too technically hard to align their AGI technology in the 2 years before the world ends.

Anonymous

Summarizing the threat model here (correct if wrong): The nearest competitor for building an AGI is at most N (<2) years behind, and building an aligned AGI, even when starting with the ability to build an unaligned AGI, takes longer than N years. So at some point some competitor who doesn’t care about safety builds the unaligned AGI. How does “nobody actually wants an unaligned AGI” fail here? It takes >N years to get everyone to realise that they have that preference and that it’s incompatible with their actions?

Eliezer Yudkowsky

Many of the current actors seem like they’d be really gung-ho to build an “unaligned” AGI because they think it’d be super neat, or they think it’d be super profitable, and they don’t expect it to destroy the world. So if this happens in anything like the current world—and I neither expect vast improvements, nor have very long timelines—then we’d see Deepmind get it first; and, if the code was not immediately stolen and rerun with higher bounds on the for loops, by China or France or whoever, somebody else would get it in another year; if that somebody else was Anthropic, I could maybe see them also not amping up their AGI; but then in 2 years it starts to go to Facebook AI Research and home hobbyists and intelligence agencies stealing copies of the code from other intelligence agencies and I don’t see how the world fails to end past that point.

Anonymous

What does trying to die with more dignity on the mainline look like? There’s a real question of prioritisation here between solving the alignment problem (and various approaches within that), and preventing or slowing down the next competitor. I’d personally love more direction on where to focus my efforts (obviously you can only say things generic to the group).

Eliezer Yudkowsky

I don’t know how to effectively prevent or slow down the “next competitor” for more than a couple of years even in plausible-best-case scenarios. Maybe some of the natsec people can be grownups in the room and explain why “stealing AGI code and running it” is as bad as “full nuclear launch” to their foreign counterparts in a realistic way. Maybe more current AGI groups can be persuaded to go closed; or, if more than one has an AGI, to coordinate with each other and not rush into an arms race. I’m not sure I believe these things can be done in real life, but it seems understandable to me how I’d go about trying—though, please do talk with me a lot more before trying anything like this, because it’s easy for me to see how attempts could backfire; it’s not clear to me that we should be inviting more attention from natsec folks at all. None of that saves us without technical alignment progress. But what are other people supposed to do about researching alignment when I’m not sure what to try there myself?

Anonymous

thanks! on researching alignment, you might have better meta ideas (how to do research generally) even if you’re also stuck on object level. and you might know/foresee dead ends that others don’t.

Eliezer Yudkowsky

I definitely foresee a whole lot of dead ends that others don’t, yes.

Anonymous

Does pushing for a lot of public fear about this kind of research, that makes all projects hard, seem hopeless?

Eliezer Yudkowsky

What does it buy us? 3 months of delay at the cost of a tremendous amount of goodwill? 2 years of delay? What’s that delay for, if we all die at the end? Even if we then got a technical miracle, would it end up impossible to run a project that could make use of an alignment miracle, because everybody was afraid of that project? Wouldn’t that fear tend to be channeled into “ah, yes, it must be a government project, they’re the good guys” and then the government is much more hopeless and much harder to improve upon than Deepmind?

Anonymous

I imagine lack of public support for genetic manipulation of humans has slowed that research by more than three months.

Anonymous

‘would it end up impossible to run a project that could make use of an alignment miracle, because everybody was afraid of that project?’

...like, maybe, but not with near 100% chance?

Eliezer Yudkowsky

I don’t want to sound like I’m dismissing the whole strategy, but it sounds a lot like the kind of thing that backfires because you did not get exactly the public reaction you wanted, and the public reaction you actually got was bad; and it doesn’t sound like that whole strategy actually has a visualized victorious endgame, which makes it hard to work out what the exact strategy should be; it seems more like the kind of thing that falls under the syllogism “something must be done, this is something, therefore this must be done” than like a plan that ends with humane life victorious.

Regarding genetic manipulation of humans, I think the public started out very unfavorable to that, and had a reaction that was not at all exact or channeled—one that does not allow for any ‘good’ forms of human genetic manipulation regardless of circumstances, and that drove the science into other countries. It is not a case in point of the intelligentsia being able to successfully cunningly manipulate the fear of the masses to some supposed good end, to put it mildly, so I’d be worried about deriving that generalization from it. The reaction may more be that the fear of the public is a big powerful uncontrollable thing that doesn’t move in the smart direction—maybe the public fear of AI gets channeled by opportunistic government officials into “and that’s why We must have Our AGI first so it will be Good and we can Win”. That seems to me much more like a thing that would happen in real life than “and then we managed to manipulate public panic down exactly the direction we wanted to fit into our clever master scheme”, especially when we don’t actually have the clever master scheme it fits into.

Eliezer Yudkowsky

I have a few stupid ideas I could try to investigate in ML, but that would require the ability to run significant-sized closed ML projects full of trustworthy people, which is a capability that doesn’t seem to presently exist. Plausibly, this capability would be required in any world that got some positive model violation (“miracle”) to take advantage of, so I would want to build that capability today. I am not sure how to go about doing that either.

Anonymous

if there’s a chance this group can do something to gain this capability I’d be interested in checking it out. I’d want to know more about what “closed” and “trustworthy” mean for this (and “significant-size” I guess too). E.g., which ones does Anthropic fail?

Eliezer Yudkowsky

What I’d like to exist is a setup where I can work with people that I or somebody else has vetted as seeming okay-trustworthy, on ML projects that aren’t going to be published. Anthropic looks like it’s a package deal. If Anthropic were set up to let me work with 5 particular people at Anthropic on a project boxed away from the rest of the organization, that would potentially be a step towards trying such things. It’s also not clear to me that Anthropic has either the time to work with me, or the interest in doing things in AI that aren’t “stack more layers” or close kin to that.

Anonymous

That setup doesn’t sound impossible to me—at DeepMind or OpenAI or a new org specifically set up for it (or could be MIRI) -- the bottlenecks are access to trustworthy ML-knowledgeable people (but finding 5 in our social network doesn’t seem impossible?) and access to compute (can be solved with more money—not too hard?). I don’t think DM and OpenAI are publishing everything—the “not going to be published” part doesn’t seem like a big barrier to me. Is infosec a major bottleneck (i.e., who’s potentially stealing the code/data)?

Anonymous

Do you think Redwood Research could be a place for this?

Eliezer Yudkowsky

Maybe! I haven’t ruled RR out yet. But they also haven’t yet done (to my own knowledge) anything demonstrating the same kind of AI-development capabilities as even GPT-3, let alone AlphaFold 2.

Eliezer Yudkowsky

I would potentially be super interested in working with Deepminders if Deepmind set up some internal partition for “Okay, accomplished Deepmind researchers who’d rather not destroy the world are allowed to form subpartitions of this partition and have their work not be published outside the subpartition let alone Deepmind in general, though maybe you have to report on it to Demis only or something.” I’d be more skeptical/worried about working with OpenAI-minus-Anthropic because the notion of “open AI” continues to sound to me like “what is the worst possible strategy for making the game board as unplayable as possible while demonizing everybody who tries a strategy that could possibly lead to the survival of humane intelligence”, and now a lot of the people who knew about that part have left OpenAI for elsewhere. But, sure, if they changed their name to “ClosedAI” and fired everyone who believed in the original OpenAI mission, I would update about that.

Eliezer Yudkowsky

Context that is potentially missing here and should be included: I wish that Deepmind had more internal closed research, and internally siloed research, as part of a larger wish I have about the AI field, independently of what projects I’d want to work on myself.

The present situation can be seen as one in which a common resource, the remaining timeline until AGI shows up, is incentivized to be burned by AI researchers because they have to come up with neat publications and publish them (which burns the remaining timeline) in order to earn status and higher salaries. The more they publish along the spectrum that goes {quiet internal result → announced and demonstrated result → paper describing how to get the announced result → code for the result → model for the result}, the more timeline gets burned, and the greater the internal and external prestige accruing to the researcher.

It’s futile to wish for everybody to act uniformly against their incentives. But I think it would be a step forward if the relative incentive to burn the commons could be reduced; or to put it another way, the more researchers have the option to not burn the timeline commons, without them getting fired or passed up for promotion, the more that unusually intelligent researchers might perhaps decide not to do that. So I wish in general that AI research groups in general, but also Deepmind in particular, would have affordances for researchers who go looking for interesting things to not publish any resulting discoveries, at all, and still be able to earn internal points for them. I wish they had the option to do that. I wish people were allowed to not destroy the world—and still get high salaries and promotion opportunities and the ability to get corporate and ops support for playing with interesting toys; if destroying the world is prerequisite for having nice things, nearly everyone is going to contribute to destroying the world, because, like, they’re not going to just not have nice things; that is not human nature for almost all humans.

When I visualize how the end of the world plays out, I think it involves an AGI system which has the ability to be cranked up by adding more computing resources to it; and I think there is an extended period where the system is not aligned enough that you can crank it up that far, without everyone dying. And it seems extremely likely that if factions on the level of, say, Facebook AI Research, start being able to deploy systems like that, then death is very automatic. If the Chinese, Russian, and French intelligence services all manage to steal a copy of the code, and China and Russia sensibly decide not to run it, and France gives it to three French corporations which I hear the French intelligence service sometimes does, then again, everybody dies. If the builders are sufficiently worried about that scenario that they push too fast too early, in fear of an arms race developing very soon if they wait, again, everybody dies.

At present we’re very much waiting on a miracle for alignment to be possible at all, even if the AGI-builder successfully prevents proliferation and has 2 years in which to work. But if we get that miracle at all, it’s not going to be an instant miracle. There’ll be some minimum time-expense to do whatever work is required. So any time I visualize anybody trying to even start a successful trajectory of this kind, they need to be able to get a lot of work done, without the intermediate steps of AGI work being published, or demoed at all, let alone having models released. Because if you wait until the last months when it is really really obvious that the system is going to scale to AGI, in order to start closing things, almost all the prerequisites will already be out there. Then it will only take 3 more months of work for somebody else to build AGI, and then somebody else, and then somebody else; and even if the first 3 factions manage not to crank up the dial to lethal levels, the 4th party will go for it; and the world ends by default on full automatic.

If ideas are theoretically internal to “just the company”, but the company has 150 people who all know, plus everybody with the “sysadmin” title having access to the code and models, then I imagine—perhaps I am mistaken—that those ideas would (a) inevitably leak outside due to some of those 150 people having cheerful conversations over a beer with outsiders present, and (b) be copied outright by people of questionable allegiances once all hell started to visibly break loose. As with anywhere that handles really sensitive data, the concept of “need to know” has to be a thing, or else everyone (and not just in that company) ends up knowing.

So, even if I got run over by a truck tomorrow, I would still very much wish that in the world that survived me, Deepmind would have lots of penalty-free affordance internally for people to not publish things, and to work in internal partitions that didn’t spread their ideas to all the rest of Deepmind. Like, actual social and corporate support for that, not just a theoretical option you’d have to burn lots of social capital and weirdness points to opt into, and then get passed up for promotion forever after.

Anonymous

What’s RR?

Anonymous

It’s a new alignment org, run by Nate Thomas and ~co-run by Buck Shlegeris and Bill Zito, with maybe 4-6 other technical folks so far. My take: the premise is to create an org with ML expertise and general just-do-it competence that’s trying to do all the alignment experiments that something like Paul+Ajeya+Eliezer all think are obviously valuable and wish someone would do. They expect to have a website etc in a few days; the org is a couple months old in its current form.

Anonymous

How likely really is hard takeoff? Clearly, we are touching the edges of AGI with GPT and the like. But I’m not feeling this will that easily be leveraged into very quick recursive self improvement.

Eliezer Yudkowsky

Compared to the position I was arguing in the Foom Debate with Robin, reality has proved way to the further Eliezer side of Eliezer along the Eliezer-Robin spectrum. It’s been very unpleasantly surprising to me how little architectural complexity is required to start producing generalizing systems, and how fast those systems scale using More Compute. The flip side of this is that I can imagine a system being scaled up to interesting human+ levels, without “recursive self-improvement” or other of the old tricks that I thought would be necessary, and argued to Robin would make fast capability gain possible. You could have fast capability gain well before anything like a FOOM started. Which in turn makes it more plausible to me that we could hang out at interesting not-superintelligent levels of AGI capability for a while before a FOOM started. It’s not clear that this helps anything, but it does seem more plausible.

Anonymous

I agree reality has not been hugging the Robin kind of scenario this far.

Anonymous

Going past human level doesn’t necessarily mean going “foom”.

Eliezer Yudkowsky

I do think that if you get an AGI significantly past human intelligence in all respects, it would obviously tend to FOOM. I mean, I suspect that Eliezer fooms if you give an Eliezer the ability to backup, branch, and edit himself.

Anonymous

It doesn’t seem to me that an AGI significantly past human intelligence necessarily tends to FOOM.

Eliezer Yudkowsky

I think in principle we could have, for example, an AGI that was just a superintelligent engineer of proteins, and of nanosystems built by nanosystems that were built by proteins, and which was corrigible enough not to want to improve itself further; and this AGI would also be dumber than a human when it came to eg psychological manipulation, because we would have asked it not to think much about that subject. I’m doubtful that you can have an AGI that’s significantly above human intelligence in all respects, without it having the capability-if-it-wanted-to of looking over its own code and seeing lots of potential improvements.

Anonymous

Alright, this makes sense to me, but I don’t expect an AGI to want to manipulate humans that easily (unless designed to). Maybe a bit.

Eliezer Yudkowsky

Manipulating humans is a convergent instrumental strategy if you’ve accurately modeled (even at quite low resolution) what humans are and what they do in the larger scheme of things.

Anonymous

Yes, but human manipulation is also the kind of thing you need to guard against with even mildly powerful systems. Strong impulses to manipulate humans should be weeded out.

Eliezer Yudkowsky

I think that, by default, if you trained a young AGI to expect that 2+2=5 in some special contexts, and then scaled it up without further retraining, a generally superhuman version of that AGI would be very likely to ‘realize’ in some sense that SS0+SS0=SSSS0 was a consequence of the Peano axioms. There’s a natural/convergent/coherent output of deep underlying algorithms that generate competence in some of the original domains; when those algorithms are implicitly scaled up, they seem likely to generalize better than whatever patch on those algorithms said ‘2 + 2 = 5’.
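(As an aside, the arithmetic claim here can be made fully concrete. Below is a minimal sketch in Lean 4—the `PNat` and `add` names are illustrative, not from the discussion—showing that SS0+SS0=SSSS0 falls out of Peano-style definitions by pure computation, with no room for a learned patch to say otherwise.)

```lean
-- Peano-style naturals, built from scratch so the axioms are explicit.
inductive PNat where
  | zero : PNat
  | succ : PNat → PNat

open PNat

-- Addition defined by the usual Peano recursion equations:
--   a + 0 = a,   a + S(b) = S(a + b)
def add : PNat → PNat → PNat
  | a, zero   => a
  | a, succ b => succ (add a b)

-- SS0 + SS0 = SSSS0 holds by definitional unfolding alone (`rfl`):
-- the result is forced by the axioms, not by any training data.
example : add (succ (succ zero)) (succ (succ zero))
        = succ (succ (succ (succ zero))) := rfl
```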

In the same way, suppose that you take weak domains where the AGI can’t fool you, and apply some gradient descent to get the AGI to stop outputting actions of a type that humans can detect and label as ‘manipulative’. And then you scale up that AGI to a superhuman domain. I predict that deep algorithms within the AGI will go through consequentialist dances, and model humans, and output human-manipulating actions that can’t be detected as manipulative by the humans, in a way that seems likely to bypass whatever earlier patch was imbued by gradient descent, because I doubt that earlier patch will generalize as well as the deep algorithms. Then you don’t get to retrain in the superintelligent domain after labeling as bad an output that killed you and doing a gradient descent update on that, because the bad output killed you. (This is an attempted very fast gloss on what makes alignment difficult in the first place.)

Anonymous

[i appreciate this gloss—thanks]

Anonymous

“deep algorithms within it will go through consequentialist dances, and model humans, and output human-manipulating actions that can’t be detected as manipulative by the humans”

This is true if it is rewarding to manipulate humans. If the humans are on the lookout for this kind of thing, it doesn’t seem that easy to me.

Going through these “consequentialist dances” appears to me to presume that mistakes which should be apparent haven’t been addressed at simpler levels. It seems highly unlikely to me that you would have a system that appears to follow human requests and human values, and it would suddenly switch at some powerful level. I think there will be signs beforehand. Of course, if the humans are not paying attention, they might miss it. But, say, in the current milieu, I find it plausible that they will pay enough attention.

“because I doubt that earlier patch will generalize as well as the deep algorithms”

That would depend on how “deep” your earlier patch was. Yes, if you’re just doing surface patches to apparent problems, this might happen. But it seems to me that useful and intelligent systems will require deep patches (or deep designs from the start) in order to be apparently useful to humans at solving complex problems enough. This is not to say that they would be perfect. But it seems quite plausible to me that they would in most cases prevent the worst outcomes.

Eliezer Yudkowsky

“If you’ve got a general consequence-modeling-and-searching algorithm, it seeks out ways to manipulate humans, even if there are no past instances of a random-action-generator producing manipulative behaviors that succeeded and got reinforced by gradient descent over the random-action-generator. It invents the strategy de novo by imagining the results, even if there’s no instances in memory of a strategy like that having been tried before.” Agree or disagree?

Anonymous

Creating strategies de novo would of course be expected of an AGI.

“If you’ve got a general consequence-modeling-and-searching algorithm, it seeks out ways to manipulate humans, even if there are no past instances of a random-action-generator producing manipulative behaviors that succeeded and got reinforced by gradient descent over the random-action-generator. It invents the strategy de novo by imagining the results, even if there’s no instances in memory of a strategy like that having been tried before.” Agree or disagree?

I think whether the AI will “seek out ways to manipulate humans” will depend on what kind of goals the AI has been designed to pursue.

Manipulating humans is definitely an instrumentally useful kind of method for an AI, for a lot of goals. But it’s also counter to a lot of the things humans would direct the AI to do—at least at a “high level”. “Manipulation”, such as marketing, for lower level goals, can be very congruent with higher level goals. An AI could clearly be good at manipulating humans, while not manipulating its creators or the directives of its creators.

If you are asking me to agree that the AI will generally seek out ways to manipulate the high-level goals, then I will say “no”, because it seems to me that faults of this kind in the AI design are likely to be caught by the designers earlier. (This isn’t to say that this kind of fault couldn’t happen.) It seems to me that manipulation of high-level goals will be one of the most apparent kinds of fault in this kind of system.

Anonymous

RE: “I’m doubtful that you can have an AGI that’s significantly above human intelligence in all respects, without it having the capability-if-it-wanted-to of looking over its own code and seeing lots of potential improvements.”

It seems plausible (though unlikely) to me that this would be true in practice for the AGI we build—but also that the potential improvements it sees would be pretty marginal. This is coming from the same intuition that current learning algorithms might already be approximately optimal.

Eliezer Yudkowsky

If you are asking me to agree that the AI will generally seek out ways to manipulate the high-level goals, then I will say “no”. Because it seems to me that faults of this kind in the AI design are likely to be caught by the designers earlier.

I expect that when people are trying to stomp out convergent instrumental strategies by training at a safe dumb level of intelligence, this will not be effective at preventing convergent instrumental strategies at smart levels of intelligence; also note that at very smart levels of intelligence, “hide what you are doing” is also a convergent instrumental strategy of that substrategy.

I don’t know however if I should be explaining at this point why “manipulate humans” is convergent, why “conceal that you are manipulating humans” is convergent, why you have to train in safe regimes in order to get safety in dangerous regimes (because if you try to “train” at a sufficiently unsafe level, the output of the unaligned system deceives you into labeling it incorrectly and/or kills you before you can label the outputs), or why attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (qualitatively new thought processes, things being way out of training distribution, and, the hardest part to explain, corrigibility being “anti-natural” in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off).

Anonymous

My (unfinished) idea for buying time is to focus on applying AI to well-specified problems, where constraints can come primarily from the action space and additionally from process-level feedback (i.e., human feedback providers understand why actions are good before endorsing them, and reject anything weird even if it seems to work on some outcomes-based metric). This is basically a form of boxing, with application-specific boxes. I know it doesn’t scale to superintelligence but I think it can potentially give us time to study and understand proto AGIs before they kill us. I’d be interested to hear devastating critiques of this that imply it isn’t even worth fleshing out more and trying to pursue, if they exist.
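A minimal sketch of what an “application-specific box” of this kind might look like in code—purely illustrative, with hypothetical action names and a reject-by-default rule, not a claim about how any real system is built:

```python
# Illustrative sketch of boxing via a well-specified action space: the AI
# may only act through explicitly whitelisted, well-understood actions, and
# anything outside the whitelist is rejected by default. All names here are
# hypothetical.

from typing import Any, Callable, Dict


class BoxedActionSpace:
    def __init__(self) -> None:
        self._allowed: Dict[str, Callable[..., Any]] = {}

    def allow(self, name: str, handler: Callable[..., Any]) -> None:
        """Register a well-specified action the AI is permitted to take."""
        self._allowed[name] = handler

    def execute(self, name: str, **kwargs: Any) -> Any:
        # Reject-by-default: any action not explicitly whitelisted is
        # refused, even if it "seems to work" on some outcome metric.
        if name not in self._allowed:
            raise PermissionError(f"action {name!r} is outside the box")
        return self._allowed[name](**kwargs)


box = BoxedActionSpace()
box.allow("label_image", lambda path: f"label for {path}")

print(box.execute("label_image", path="cat.png"))  # permitted
try:
    box.execute("send_email", to="anyone")         # rejected by default
except PermissionError as e:
    print(e)
```

The constraint here comes from the action space itself rather than from understanding the model, which is why (as discussed below) it does not scale to systems smart enough to route around the box.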

Anonymous

(I think it’s also similar to CAIS in case that’s helpful.)

Eliezer Yudkowsky

There’s lots of things we can do which don’t solve the problem and involve us poking around with AIs having fun, while we wait for a miracle to pop out of nowhere. There’s lots of things we can do with AIs which are weak enough to not be able to fool us and to not have cognitive access to any dangerous outputs, like automatically generating pictures of cats. The trouble is that nothing we can do with an AI like that (where “human feedback providers understand why actions are good before endorsing them”) is powerful enough to save the world.

Eliezer Yudkowsky

In other words, if you have an aligned AGI that builds complete mature nanosystems for you, that is enough force to save the world; but that AGI needs to have been aligned by some method other than “humans inspect those outputs and vet them and their consequences as safe/aligned”, because humans cannot accurately and unfoolably vet the consequences of DNA sequences for proteins, or of long bitstreams sent to protein-built nanofactories.

Anonymous

When you mention nanosystems, how much is this just a hypothetical superpower vs. something you actually expect to be achievable with AGI/superintelligence? If expected to be achievable, why?

Eliezer Yudkowsky

The case for nanosystems being possible, if anything, seems even more slam-dunk than the already extremely slam-dunk case for superintelligence, because we can set lower bounds on the power of nanosystems using far more specific and concrete calculations. See eg the first chapters of Drexler’s Nanosystems, which are mandatory first-step reading for anyone who would otherwise doubt that there’s plenty of room above biology and that it is possible to have artifacts the size of bacteria with much higher power densities. I have this marked down as “known lower bound” not “speculative high value”, and since Nanosystems has been out since 1992 and subjected to attemptedly-skeptical scrutiny, without anything I found remotely persuasive turning up, I do not have a strong expectation that any new counterarguments will materialize.

If, after reading Nanosystems, you still don’t think that a superintelligence can get to and past the Nanosystems level, I’m not quite sure what to say to you, since the models of superintelligences are much less concrete than the models of molecular nanotechnology.

I’m on record as early as 2008 as saying that I expected superintelligences to crack protein folding, some people disputed that and were all like “But how do you know that’s solvable?” and then AlphaFold 2 came along and cracked the protein folding problem they’d been skeptical about, far below the level of superintelligence.

I can try to explain how I was mysteriously able to forecast this truth at a high level of confidence—not the exact level where it became possible, to be sure, but that superintelligence would be sufficient—despite this skepticism; I suppose I could point to prior hints, like even human brains being able to contribute suggestions to searches for good protein configurations; I could talk about how if evolutionary biology made proteins evolvable then there must be a lot of regularity in the folding space, and that this kind of regularity tends to be exploitable.

But of course, it’s also, in a certain sense, very obvious that a superintelligence could crack protein folding, just like it was obvious years before Nanosystems that molecular nanomachines would in fact be possible and have much higher power densities than biology. I could say, “Because proteins are held together by van der Waals forces that are much weaker than covalent bonds,” to point to a reason how you could realize that after just reading Engines of Creation and before Nanosystems existed, by way of explaining how one could possibly guess the result of the calculation in advance of building up the whole detailed model. But in reality, precisely because the possibility of molecular nanotechnology was already obvious to any sensible person just from reading Engines of Creation, the sort of person who wasn’t convinced by Engines of Creation wasn’t convinced by Nanosystems either, because they’d already demonstrated immunity to sensible arguments; an example of the general phenomenon I’ve elsewhere termed the Law of Continued Failure.

Similarly, the sort of person who was like “But how do you know superintelligences will be able to build nanotech?” in 2008, will probably not be persuaded by the demonstration of AlphaFold 2, because it was already clear to anyone sensible in 2008, and so anyone who can’t see sensible points in 2008 probably also can’t see them after they become even clearer. There are some people on the margins of sensibility who fall through and change state, but mostly people are not on the exact margins of sanity like that.

Anonymous

“If, after reading Nanosystems, you still don’t think that a superintelligence can get to and past the Nanosystems level, I’m not quite sure what to say to you, since the models of superintelligences are much less concrete than the models of molecular nanotechnology.”

I’m not sure if this is directed at me or the https://en.wikipedia.org/wiki/Generic_you, but I’m only expressing curiosity on this point, not skepticism :)

Anonymous

some form of “scalable oversight” is the naive extension of the initial boxing thing proposed above that claims to be the required alignment method—basically, make the humans vetting the outputs smarter by providing them AI support for all well-specified (level-below)-vettable tasks.

Eliezer Yudkowsky

I haven’t seen any plausible story, in any particular system design being proposed by the people who use terms about “scalable oversight”, about how human-overseeable thoughts or human-inspected underlying systems, compound into very powerful human-non-overseeable outputs that are trustworthy. Fundamentally, the whole problem here is, “You’re allowed to look at floating-point numbers and Python code, but how do you get from there to trustworthy nanosystem designs?” So saying “Well, we’ll look at some thoughts we can understand, and then from out of a much bigger system will come a trustworthy output” doesn’t answer the hard core at the center of the question. Saying that the humans will have AI support doesn’t answer it either.

Anonymous

the kind of useful thing humans (assisted-humans) might be able to vet is reasoning/arguments/proofs/explanations. without having to generate either the trustworthy nanosystem design or the reasons it is trustworthy, we could still check them.

Eliezer Yudkowsky

If you have an untrustworthy general superintelligence generating English strings meant to be “reasoning/arguments/proofs/explanations” about eg a nanosystem design, then I would not only expect the superintelligence to be able to fool humans in the sense of arguing for things that were not true in a way that fooled the humans, I’d expect the superintelligence to be able to covertly directly hack the humans in ways that I wouldn’t understand even after having been told what happened. So you must have some prior belief about the superintelligence being aligned before you dared to look at the arguments. How did you get that prior belief?

Anonymous

I think I’m not starting with a general superintelligence here to get the trustworthy nanodesigns. I’m trying to build the trustworthy nanosystems “the hard way”, i.e., if we did it without ever building AIs, and then speed that up using AI for automation of things we know how to vet (including recursively). Is a crux here that you think nanosystem design requires superintelligence?

(tangent: I think this approach works even if you accidentally built a more-general or more-intelligent than necessary foundation model as long as you’re only using it in boxes it can’t outsmart. The better-specified the tasks you automate are, the easier it is to secure the boxes.)

Eliezer Yudkowsky

I think that China ends the world using code they stole from Deepmind that did things the easy way, and that happens 50 years of natural R&D time before you can do the equivalent of “strapping mechanical aids to a horse instead of building a car from scratch”.

I also think that the speedup step in “iterated amplification and distillation” will introduce places where the fast distilled outputs of slow sequences are not true to the original slow sequences, because gradient descent is not perfect and won’t be perfect and it’s not clear we’ll get any paradigm besides gradient descent for doing a step like that.

Anonymous

How do you feel about the safety community as a whole and the growth we’ve seen over the past few years?

Eliezer Yudkowsky

Very grim. I think that almost everybody is bouncing off the real hard problems at the center and doing work that is predictably not going to be useful at the superintelligent level, nor does it teach me anything I could not have said in advance of the paper being written. People like to do projects that they know will succeed and will result in a publishable paper, and that rules out all real research at step 1 of the social process.

Paul Christiano is trying to have real foundational ideas, and they’re all wrong, but he’s one of the few people trying to have foundational ideas at all; if we had another 10 of him, something might go right.

Chris Olah is going to get far too little done far too late. We’re going to be facing down an unalignable AGI and the current state of transparency is going to be “well look at this interesting visualized pattern in the attention of the key-value matrices in layer 47” when what we need to know is “okay but was the AGI plotting to kill us or not”. But Chris Olah is still trying to do work that is on a pathway to anything important at all, which makes him exceptional in the field.

Stuart Armstrong did some good work on further formalizing the shutdown problem, an example case in point of why corrigibility is hard, which so far as I know is still resisting all attempts at solution.

Various people who work or worked for MIRI came up with some actually-useful notions here and there, like Jessica Taylor’s expected utility quantilization.
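As an aside for readers unfamiliar with quantilization: the core idea fits in a few lines. This is an illustrative toy version (sample uniformly from the top q-fraction of a trusted base distribution, as ranked by a utility proxy), not MIRI’s formal treatment; the proxy and numbers are made up.

```python
# Minimal sketch of expected-utility quantilization: instead of taking the
# argmax of a learned utility proxy (which maximally exploits the proxy's
# errors), sample uniformly from the top q-fraction of actions drawn from a
# trusted base distribution. This bounds how far the chosen action can
# stray from "typical" base-distribution behavior.

import random


def quantilize(base_samples, utility, q=0.1, rng=random):
    """Return one action sampled uniformly from the top q-quantile of
    base_samples, as ranked by the (possibly mis-specified) utility proxy."""
    ranked = sorted(base_samples, key=utility, reverse=True)
    cutoff = max(1, int(len(ranked) * q))
    return rng.choice(ranked[:cutoff])


actions = list(range(100))            # stand-in base distribution
proxy = lambda a: -abs(a - 42)        # stand-in (imperfect) utility proxy
choice = quantilize(actions, proxy, q=0.1)
print(choice)  # one of the 10 highest-proxy-utility actions
```

With q=1 this degenerates to sampling from the base distribution; as q shrinks toward a single action it approaches ordinary argmax, recovering the usual Goodhart risk.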

And then there is, so far as I can tell, a vast desert full of work that seems to me to be mostly fake or pointless or predictable.

It is very, very clear that at present rates of progress, adding that level of alignment capability as grown over the next N years, to the AGI capability that arrives after N years, results in everybody dying very quickly. Throwing more money at this problem does not obviously help because it just produces more low-quality work.

Anonymous

“doing work that is predictably not going to be really useful at the superintelligent level, nor does it teach me anything I could not have said in advance of the paper being written”

I think you’re underestimating the value of solving small problems. Big problems are solved by solving many small problems. (I do agree that many academic papers do not represent much progress, however.)

Eliezer Yudkowsky

By default, I suspect you have longer timelines and a smaller estimate of total alignment difficulty, not that I put less value than you on the incremental power of solving small problems over decades. I think we’re going to be staring down the gun of a completely inscrutable model that would kill us all if turned up further, with no idea how to read what goes on inside its head, and no way to train it on humanly scrutable and safe and humanly-labelable domains in a way that seems like it would align the superintelligent version, while standing on top of a whole bunch of papers about “small problems” that never got past “small problems”.

Anonymous

“I think we’re going to be staring down the gun of a completely inscrutable model that would kill us all if turned up further, with no idea how to read what goes on inside its head, and no way to train it on humanly scrutable and safe and humanly-labelable domains in a way that seems like it would align the superintelligent version”

This scenario seems possible to me, but not very plausible. GPT is not going to “kill us all” if turned up further. No amount of computing power (at least before AGI) would cause it to. I think this is apparent, without knowing exactly what’s going on inside GPT. This isn’t to say that there aren’t AI systems that wouldn’t. But what kind of system would? (A GPT combined with sensory capabilities at the level of Tesla’s self-driving AI? That still seems too limited.)

Eliezer Yudkowsky

Alpha Zero scales with more computing power, I think AlphaFold 2 scales with more computing power, Mu Zero scales with more computing power. Precisely because GPT-3 doesn’t scale, I’d expect an AGI to look more like Mu Zero and particularly with respect to the fact that it has some way of scaling.

Steve Omohundro

Eliezer, thanks for doing this! I just now read through the discussion and found it valuable. I agree with most of your specific points but I seem to be much more optimistic than you about a positive outcome. I’d like to try to understand why that is. I see mathematical proof as the most powerful tool for constraining intelligent systems and I see a pretty clear safe progression using that for the technical side (the social side probably will require additional strategies). Here are some of my intuitions underlying that approach, I wonder if you could identify any that you disagree with. I’m fine with your using my name (Steve Omohundro) in any discussion of these.

1) Nobody powerful wants to create unsafe AI but they do want to take advantage of AI capabilities.

2) None of the concrete well-specified valuable AI capabilities require unsafe behavior

3) Current simple logical systems are capable of formalizing every relevant system involved (eg. MetaMath http://us.metamath.org/index.html currently formalizes roughly an undergraduate math degree and includes everything needed for modeling the laws of physics, computer hardware, computer languages, formal systems, machine learning algorithms, etc.)

4) Mathematical proof is cheap to mechanically check (eg. MetaMath has a 500 line Python verifier which can rapidly check all of its 38K theorems)

5) GPT-F is a fairly early-stage transformer-based theorem prover and can already prove 56% of the MetaMath theorems. Similar systems are likely to soon be able to rapidly prove all simple true theorems (eg. that human mathematicians can prove in a day).

6) We can define provable limits on the behavior of AI systems that we are confident prevent dangerous behavior and yet still enable a wide range of useful behavior.

7) We can build automated checkers for these provable safe-AI limits.

8) We can build (and eventually mandate) powerful AI hardware that first verifies proven safety constraints before executing AI software

9) For example, AI smart compilation of programs can be formalized and doesn’t require unsafe operations

10) For example, AI design of proteins to implement desired functions can be formalized and doesn’t require unsafe operations

11) For example, AI design of nanosystems to achieve desired functions can be formalized and doesn’t require unsafe operations.

12) For example, the behavior of designed nanosystems can be similarly constrained to only proven safe behaviors

13) And so on through the litany of early stage valuable uses for advanced AI.

14) I don’t see any fundamental obstructions to any of these. Getting social acceptance and deployment is another issue!

Best, Steve
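Steve’s point (4)—that checking a proof is mechanically cheap even when finding it is hard—can be illustrated with a toy checker for modus-ponens proofs. This is a simplified stand-in, not MetaMath’s actual substitution-based format:

```python
# Toy illustration of why proof *checking* is cheap even when proof
# *search* is hard: the checker just verifies, line by line, that each
# step is a premise or follows by modus ponens from earlier lines.
# Formulas are plain strings; "A -> B" plus "A" yields "B".

def check_proof(premises, steps):
    """Each step is ('premise', formula) or ('mp', i, j, formula),
    meaning: line i is 'A -> B', line j is A, and formula is B."""
    lines = []
    for step in steps:
        if step[0] == "premise":
            _, f = step
            if f not in premises:
                return False
        elif step[0] == "mp":
            _, i, j, f = step
            if lines[i] != f"{lines[j]} -> {f}":
                return False
        else:
            return False
        lines.append(step[-1])
    return True


premises = {"P", "P -> Q", "Q -> R"}
proof = [
    ("premise", "P"),
    ("premise", "P -> Q"),
    ("mp", 1, 0, "Q"),   # from "P -> Q" and "P", conclude "Q"
    ("premise", "Q -> R"),
    ("mp", 3, 2, "R"),   # from "Q -> R" and "Q", conclude "R"
]
print(check_proof(premises, proof))  # True
```

The checker is a linear scan with string comparisons, which is the sense in which verification scales easily; none of this bears on the harder question, raised below, of what safety-relevant theorem one would check.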

Eliezer Yudkowsky

Steve, are you visualizing AGI that gets developed 70 years from now under absolutely different paradigms than modern ML? I don’t see being able to take anything remotely like, say, Mu Zero, and being able to prove any theorem about it which implies anything like corrigibility or the system not internally trying to harm humans. Anything in which enormous inscrutable floating-point vectors are a key component seems like something where it would be very hard to prove any theorems about the treatment of those enormous inscrutable vectors that would correspond in the outside world to the AI not killing everybody.

Even if we somehow managed to get structures far more legible than giant vectors of floats, using some AI paradigm very different from the current one, it still seems like huge key pillars of the system would rely on non-fully-formal reasoning; even if the AI has something that you can point to as a utility function and even if that utility function’s representation is made out of programmer-meaningful elements instead of giant vectors of floats, we’d still be relying on much shakier reasoning at the point where we claimed that this utility function meant something in an intuitive human-desired sense, say. And if that utility function is learned from a dataset and decoded only afterwards by the operators, that sounds even scarier. And if instead you’re learning a giant inscrutable vector of floats from a dataset, gulp.

You seem to be visualizing that we prove a theorem and then get a theorem-like level of assurance that the system is safe. What kind of theorem? What the heck would it say?

I agree that it seems plausible that the good cognitive operations we want do not in principle require performing bad cognitive operations; the trouble, from my perspective, is that generalizing structures that do lots of good cognitive operations will automatically produce bad cognitive operations, especially when we dump more compute into them; “you can’t bring the coffee if you’re dead”.

So it takes a more complicated system and some feat of insight I don’t presently possess, to “just” do the good cognitions, instead of doing all the cognitions that result from decompressing the thing that compressed the cognitions in the dataset—even if that original dataset only contained cognitions that looked good to us, even if that dataset actually was just correctly labeled data about safe actions inside a slightly dangerous domain. Humans do a lot of stuff besides maximizing inclusive genetic fitness, optimizing purely on outcomes labeled by a simple loss function doesn’t get you an internal optimizer that pursues only that loss function, etc.

Anonymous

Steve’s intuitions sound to me like they’re pointing at the “well-specified problems” idea from an earlier thread. Essentially, only use AI in domains where unsafe actions are impossible by construction. Is this too strong a restatement of your intuitions Steve?

Steve Omohundro

Thanks for your perspective! Those sound more like social concerns than technical ones, though. I totally agree that today’s AI culture is very “sloppy” and that the currently popular representations, learning algorithms, data sources, etc. aren’t oriented around precise formal specification or provably guaranteed constraints. I’d love any thoughts about ways to help shift that culture toward precise and safe approaches! Technically there is no problem getting provable constraints on floating-point computations, etc. The work often goes under the label “Interval Computation”. It’s not even very expensive, typically just a factor of 2 worse than “sloppy” computations. For some reason those approaches have tended to be more popular in Europe than in the US. Here are a couple of lists of references: http://www.cs.utep.edu/interval-comp/ https://www.mat.univie.ac.at/~neum/interval.html
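For readers unfamiliar with interval computation, the core idea can be sketched in a few lines. This toy version only shows the enclosure bookkeeping; a real implementation (as in the references above) would also use outward-directed rounding so the bounds remain rigorous under floating point.

```python
# Minimal sketch of interval computation: propagate a guaranteed enclosure
# [lo, hi] through arithmetic, so the true real-valued result provably lies
# inside the bounds at every step.

from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other: "Interval") -> "Interval":
        # The product's extremes occur at endpoint combinations.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def contains(self, x: float) -> bool:
        return self.lo <= x <= self.hi


x = Interval(0.1, 0.2)
y = Interval(-1.0, 3.0)
z = x * y + Interval(1.0, 1.0)
print(z)                               # enclosure of x*y + 1
print(z.contains(0.15 * 2.0 + 1.0))    # True: a point result lies inside
```

This is the sense in which bounds on floating-point computations are cheap; the open question Eliezer presses on below is how such local numeric guarantees connect to semantic properties of a mind.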

I see today’s dominant AI approach of mapping everything to large networks of ReLU units running on hardware designed for dense matrix multiplication, trained with gradient descent on big noisy data sets, as a very temporary state of affairs. I fully agree that it would be uncontrolled and dangerous scaled up in its current form! But it’s really terrible in every aspect except that it makes it easy for machine learning practitioners to quickly slap something together which will actually sort of work sometimes. With all the work on AutoML, NAS, and the formal methods advances, I’m hoping we leave this “sloppy” paradigm pretty quickly. Today’s neural networks are terribly inefficient for inference: most weights are irrelevant for most inputs, and yet current methods do computational work on each. I developed many algorithms and data structures to avoid that waste years ago (eg. “bumptrees” https://steveomohundro.com/scientific-contributions/)

They’re also pretty terrible for learning, since most weights don’t need to be updated for most training examples and yet they are. Google and others are using Mixture-of-Experts to avoid some of that cost: https://arxiv.org/abs/1701.06538

Matrix multiply is a pretty inefficient primitive, and alternatives are being explored: https://arxiv.org/abs/2106.10860

Today’s reinforcement learning is slow and uncontrolled, etc. All this ridiculous computational and learning waste could be eliminated with precise formal approaches which measure and optimize it precisely. I’m hopeful that that improvement in computational and learning performance may drive the shift to better controlled representations.

I see theorem proving as hugely valuable for safety in that we can easily precisely specify many important tasks and get guarantees about the behavior of the system. I’m hopeful that we will also be able to apply them to the full AGI story and encode human values, etc., but I don’t think we want to bank on that at this stage. Hence, I proposed the “Safe-AI Scaffolding Strategy” where we never deploy a system without proven constraints on its behavior that give us high confidence of safety. We start extra conservative and disallow behavior that might eventually be determined to be safe. At every stage we maintain very high confidence of safety. Fast, automated theorem checking enables us to build computational and robotic infrastructure which only executes software with such proofs.

And, yes, I’m totally with you on needing to avoid the “basic AI drives”! I think we have to start in a phase where AI systems are not allowed to run rampant as uncontrolled optimizing agents! It’s easy to see how to constrain limited programs (eg. theorem provers, program compilers or protein designers) to stay on particular hardware and only communicate externally in precisely constrained ways. It’s similarly easy to define constrained robot behaviors (eg. for self-driving cars, etc.) The dicey area is that unconstrained agentic edge. I think we want to stay well away from that until we’re very sure we know what we’re doing! My optimism stems from the belief that many of the socially important things we need AI for won’t require anything near that unconstrained edge. But it’s tempered by the need to get the safe infrastructure into place before dangerous AIs are created.

Anonymous

As far as I know, all the work on “verifying floating-point computations” currently is way too low-level—the specifications that are proved about the computations don’t say anything about what the computations mean or are about, beyond the very local execution of some algorithm. Execution of algorithms in the real world can have very far-reaching effects that aren’t modelled by their specifications.

Eliezer Yudkowsky

Yeah, what they said. How do you get from proving things about error bounds on matrix multiplications of inscrutable floating-point numbers, to saying anything about what a mind is trying to do, or not trying to do, in the external world?

Steve Omohundro

Ultimately we need to constrain behavior. You might want to ensure your robot butler won’t leave the premises. To do that using formal methods, you need to have a semantic representation of the location of the robot, your premises’ spatial extent, etc. It’s pretty easy to formally represent that kind of physical information (it’s just a more careful version of what engineers do anyway). You also have a formal model of the computational hardware and software and the program running the system.

For finite systems, any true property has a proof which can be mechanically checked but the size of that proof might be large and it might be hard to find. So we need to use encodings and properties which mesh well with the safety semantics we care about.

Formal proofs of properties of programs have progressed to where a bunch of cryptographic, compilation, and other systems can be specified and formalized. Why it’s taken this long, I have no idea. The creator of any system has an argument as to why its behavior does what they think it will and why it won’t do bad or dangerous things. The formalization of those arguments should be one direct short step.

Experience with formalizing mathematicians’ informal arguments suggests that the formal proofs are maybe 5 times longer than the informal argument. Systems with learning and statistical inference add more challenges, but nothing that seems in principle all that difficult. I’m still not completely sure how to constrain the use of language, however. I see inside of Facebook all sorts of problems due to inability to constrain language systems (eg. they just had a huge issue where a system labeled a video with a racist term). The interface between natural language semantics and formal semantics, and how we deal with that for safety, is something I’ve been thinking a lot about recently.

Steve Omohundro

Here’s a nice 3-hour tutorial about “probabilistic circuits”, a representation of probability distributions, learning, Bayesian inference, etc. which has much better properties than most of the standard representations used in statistics, machine learning, neural nets, etc.: https://www.youtube.com/watch?v=2RAG5-L9R70 It looks especially amenable to interpretability, formal specification, and proofs of properties.

Eliezer Yudkowsky

You’re preaching to the choir there, but even if we were working with more strongly typed epistemic representations that had been inferred by some unexpected innovation of machine learning, automatic inference of those representations would lead them to be uncommented and not well-matched with human compressions of reality, nor would they match exactly against reality, which would make it very hard for any theorem about “we are optimizing against this huge uncommented machine-learned epistemic representation, to steer outcomes inside this huge machine-learned goal specification” to guarantee safety in outside reality; especially in the face of how corrigibility is unnatural and runs counter to convergence and indeed coherence; especially if we’re trying to train on domains where unaligned cognition is safe, and generalize to regimes in which unaligned cognition is not safe. Even in this case, we are not nearly out of the woods, because what we can prove has a great type-gap with that which we want to ensure is true. You can’t handwave the problem of crossing that gap even if it’s a solvable problem.

And that whole scenario would require some major total shift in ML paradigms.

Right now the epistemic representations are giant inscrutable vectors of floating-point numbers, and so are all the other subsystems and representations, more or less.

Prove whatever you like about that Tensorflow problem; it will make no difference to whether the AI kills you. The properties that can be proven just aren’t related to safety, no matter how many times you prove an error bound on the floating-point multiplications. It wasn’t floating-point error that was going to kill you in the first place.

• I think this is my second-favorite post in the MIRI dialogues (for my overall review see here).

I think this post was valuable to me in a much more object-level way. I think this post was the first post that actually just went really concrete on the current landscape of efforts in the domain of AI Notkilleveryonism and talked concretely about what seems feasible for different actors to achieve, and what isn’t, in a way that parsed for me, and didn’t feel either like something obviously political, or delusional.

I didn’t find the part about different paradigms of compute very valuable though, and my guess is it should be cut from edited versions of this article.

• EDIT: This comment fails on a lot of points, as discussed in this apology subcomment. I encourage people interested by the thread to mostly read the apology subcomment and the list of comments linked there, which provide maximum value with minimum drama IMO.

Disclaimer: this is a rant. In the best possible world, I could write from a calmer place, but I’m pretty sure that the taboo on criticizing MIRI and EY too hard on the AF can only be pushed through when I’m annoyed enough. That being said, I’m writing down thoughts that I had for quite some time, so don’t just discard this as a gut reaction to the post.

(Later added disclaimer: it’s a good idea to add “I feel like...” before the judgment in this comment, so that you keep in mind that I’m talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))

Tl;dr:

• I’m annoyed by EY’s (and maybe MIRI’s?) dismissal of all other alignment work, and how seriously it seems to be taken here, given their track record of choosing research agendas with very indirect impact on alignment, and of taking a lot of time to let go of these flawed agendas in the face of mounting evidence.

• I’m annoyed that I have to deal with the nerd-sniping introduced by MIRI when bringing new people to the field, especially given the status imbalance.

• I’m sad that EY and MIRI’s response to their research agenda not being as promising as they wanted is “we’re all doomed”.

Honestly, I really, really tried to find out how MIRI’s Agent Foundations agenda was supposed to help with alignment. I really did. Some people tried to explain it to me. And I wanted to believe, because logic and maths are amazing tools with which to attack this most important problem, alignment. Yet I can’t escape the fact that the only contributions to technical alignment I can see from MIRI have been made by a bunch of people who mostly do their own thing instead of following MIRI’s core research program: Evan, Vanessa, Abram and Scott. (Note that this is my own judgement, and I haven’t talked to these people about this comment, so if you disagree with me, please don’t go at them).

All the rest, including some of the things these people worked on (but not most of it), is nerd-sniping as far as I’m concerned. It’s a tricky failure mode because it looks like good and serious research to the AF and LW audience. But there’s still a massive difference between actually trying to solve the real problems related to alignment, with all the tools at our disposal, and deciding that the focus should be on a handful of toy problems neatly expressed with decision theory and logic, and then working only on those.

That’s already bad enough. But then we have posts like this one, where EY just dunks on everyone working on alignment as fakers, or as having terrible ideas. And at that point, I really wonder: why is that supposed to be an argument from authority anymore? Yes, I give massive credibility points to EY and MIRI for starting the field of alignment almost by themselves, and for articulating a lot of the issues. Yet all of the work that looked actually pushed by the core MIRI team (and based on some of EY’s work) from MIRI’s beginning is just toying with logic problems, with hardly any connections here and there to alignment. (I know they’re not publishing most of it, but that judgment applies to their currently published work, and from the agenda and the blog posts, it sounds like most of the unpublished work was definitely along those lines.) Similarly, the fact that they kept at it over and over, despite all the big improvements in DL, instead of trying to adapt to prosaic alignment, sounds like evidence that they might be overattached to a specific framing, which they had trouble discarding.

Note that this also has massive downsides for conceptual alignment in general, because when bringing people in, you have to deal with this specter of nerd-sniping by the founding lab of the field, which is still a figure of authority. I have to explain again and again to stressed-out newcomers that you definitely don’t need to master model theory or decision theory to do alignment, and try to steer them towards problems and questions that look like they’re actually moving the ball instead of following the lead of the “figure of authority”.

When I’m not frustrated by this situation, I’m just sad. Some of the brightest and first thinkers on alignment have decided to follow their own nerd-sniping and call everyone else fakers, and when they realized they were not actually making progress, they didn’t switch to something else so much as declare that everyone else was still full of it, and that because they had no idea how to solve the problem at the moment, it was doomed.

What could be done differently? Well, I would really, really like it if EY and other MIRI people who are very dubious of most alignment research could give more feedback on it and enter the dialogue, maybe by commenting more on the AF. My problem is not so much with them disagreeing with most of the work; it’s with the disagreement stopping at “that’s not going to work”, without dialogue and back and forth.

Also, I don’t know how much is related to mental health and pessimism and depression (which I completely understand can color one’s view of the world), but I would love to see the core MIRI team and EY actually try solving alignment with neural nets and prosaic AI. Starting with all their fears and caveats, sure, but then being like “fuck it, let’s just find a new way of grappling with it”. That’s really what saddens me the most about all of this: I feel that some of the best current minds who care about alignment have sort of given up on actually trying to solve it.

• This is an apology for the tone and the framing of the above comment (and my following answers), which have both been needlessly aggressive, status-focused and uncharitable. Underneath are still issues that matter a lot to me, but others have discussed them better (I’ll provide a list of linked comments at the end of this one).

Thanks to Richard Ngo for convincing me that I actually needed to write such an apology, which was probably the needed push for me to stop weaseling around it.

So what did I do wrong? The list is pretty damning:

• I took something from the original post that I didn’t understand — EY’s “And then there is, so far as I can tell, a vast desert full of work that seems to me to be mostly fake or pointless or predictable.” — and because it didn’t make sense to me, and because that fitted with my stereotypes of MIRI and EY’s dismissiveness of a lot of work in alignment, I jumped to an interpretation of this as an attack on alignment researchers, saying they were consciously faking it when they knew they should do better. Whereas I now feel that what EY meant is far closer to: alignment research at the moment is trying to try to align AI as best as we can, instead of just trying to do it. I’m still not sure if I agree with that characterization, but that sounds far more like something that can be discussed.

• There’s also a weird aspect of status-criticism to my comment that I think I completely failed to explain. Looking at my motives now (let’s be wary of hindsight...), I feel like my issue with the status things was more that a bunch of people other than EY and MIRI just take what they say as super strong evidence without looking at all the arguments and details, and thus I expected this post and recent MIRI publications to create a background of “we’re doomed” for a lot of casual observers, with the force of the status of EY and MIRI.
But I don’t want to say that EY and MIRI are given too much status in general in the community, even if I actually wrote something along those lines. I guess it’s just easier to focus your criticism on the beacon of status than on the invisible crowd misusing status. Sorry about that.

• I somehow turned that into an attack on MIRI’s research (at least a chunk of it), which didn’t really have anything to do with it. That was probably just the manifestation of my frustration when people come to the field and feel like they shouldn’t do the experimental research that they feel better suited for, or feel like they need to learn a lot of advanced maths. Even if those are not official MIRI positions, I definitely feel MIRI has had a big influence on them. And yet, maybe newcomers should question themselves that way. It always sounded like a loss of potential to me, because the outcome is often to not do alignment; but maybe even if you’re into experiments, the best way you could align AIs now doesn’t go through that path (and you could still find that exciting enough to do new research).
Whatever the correct answer is, my weird ad-hominem attack has nothing to do with it, so I apologize for attacking all of MIRI’s research and their choice of research agendas with it (even if I think talking more about what is and was the right choice still matters).

• Part of my failure here has also been to not check for the fact that aggressive writing just feels snappier without much effort. I still think my paragraph starting with “When I’m not frustrated by this situation, I’m just sad.” works pretty well as an independent piece of writing, but it’s obviously needlessly aggressive and spicy, and doesn’t leave any room for the doubt that I actually felt or the doubts I should have felt. My answers after that comment are better, but still riding too much on that tone.

• One of the saddest failures (pointed out to me by Richard) is that by my tone and my presentation, I made it harder and more aversive for MIRI and EY to share their models, because they have to fear that kind of reaction a bit more. And even if Rob reacted really nicely, I expect that required a bunch of additional mental energy that a better comment wouldn’t have demanded.
So I apologize for that, and really want more model-building and discussion from MIRI and EY publicly.

So in summary, my comment should have been something along the lines of “Hey, I don’t understand what your generators are for saying that all alignment research is ‘mostly fake or pointless or predictable’; could you give me some pointers to that?”. I wasn’t in the headspace, and didn’t have the right handles, to frame it that way and not go off on weirdly aggressive tangents, and that’s on me.

On the plus side, every other comment on the thread has been high-quality and thoughtful, so here’s a list of the best ones IMO:

• Ben Pace’s comment on what success stories for alignment would look like, giving examples.

• Rob Bensinger’s comment about the directions of prosaic alignment I wrote I was excited about, and whether they’re “moving the dial”.

• Rohin Shah’s comment, which frames the outside view of MIRI I was pointing at better than I did, and without the aggressiveness.

• John Wentworth’s two comments about the generators of EY’s pessimism being in the sequences all along.

• Vaniver’s comment presenting an analysis of why some concrete ML work in alignment doesn’t seem to help for the AGI level.

• Rob Bensinger’s comment drawing a great list of distinctions to clarify the debate.

• Similarly, the fact that they kept at it over and over, despite all the big improvements in DL, instead of trying to adapt to prosaic alignment, sounds like evidence that they might be overattached to a specific framing, which they had trouble discarding.

I’m… confused by this framing? Specifically, this bit (as well as other bits like these)

I have to explain again and again to stressed-out newcomers that you definitely don’t need to master model theory or decision theory to do alignment, and try to steer them towards problems and questions that look like they’re actually moving the ball instead of following the lead of the “figure of authority”.

Some of the brightest and first thinkers on alignment have decided to follow their own nerd-sniping and call everyone else fakers, and when they realized they were not actually making progress, they didn’t switch to something else so much as declare that everyone else was still full of it

Also, I don’t know how much is related to mental health and pessimism and depression (which I completely understand can color one’s view of the world), but I would love to see the core MIRI team and EY actually try solving alignment with neural nets and prosaic AI. Starting with all their fears and caveats, sure, but then being like “fuck it, let’s just find a new way of grappling with it”.

seem to be coming at the problem with [something like] a baked-in assumption that prosaic alignment is something that Actually Has A Chance Of Working?

And, like, to be clear, obviously if you’re working on prosaic alignment that’s going to be something you believe[1]. But it seems clear to me that EY/​MIRI does not share this viewpoint, and all the disagreements you have regarding their treatment of other avenues of research seem to me to be logically downstream of this disagreement?

I mean, it’s possible I’m misinterpreting you here. But you’re saying things that (from my perspective) only make sense with the background assumption that “there’s more than one game in town”—things like “I wish EY/​MIRI would spend more time engaging with other frames” and “I don’t like how they treat lack of progress in their frame as evidence that all other frames are similarly doomed”—and I feel like all of those arguments simply fail in the world where prosaic alignment is Actually Just Doomed, all the other frames Actually Just Go Nowhere, and conceptual alignment work of the MIRI variety is (more or less) The Only Game In Town.

To be clear: I’m pretty sure you don’t believe we live in that world. But I don’t think you can just export arguments from the world you think we live in to the world EY/​MIRI thinks we live in; there needs to be a bridging step first, where you argue about which world we actually live in. I don’t think it makes sense to try and highlight the drawbacks of someone’s approach when they don’t share the same background premises as you, and the background premises they do hold imply a substantially different set of priorities and concerns.

Another thing it occurs to me your frustration could be about is the fact that you can’t actually argue this with EY/​MIRI directly, because they don’t frequently make themselves available to discuss things. And if something like that’s the case, then I guess what I want to say is… I sympathize with you abstractly, but I think your efforts are misdirected? It’s okay for you and other alignment researchers to have different background premises from MIRI or even each other, and for you and those other researchers to be working on largely separate agendas as a result? I want to say that’s kind of what foundational research work looks like, in a field where (to a first approximation) nobody has any idea what the fuck they’re doing?

And yes, in the end [assuming somebody succeeds] that will likely mean that a bunch of people’s research directions were ultimately irrelevant. Most people, even. That’s… kind of unavoidable? And also not really the point, because you can’t know which line of research will be successful in advance, so all you have to go on is your best guess, which… may or may not be the same as somebody else’s best guess?

I dunno. I’m trying not to come across as too aggressive here, which is why I’m hedging so many of my claims. To some extent I feel uncomfortable trying to “police” people’s thoughts here, since I’m not actually an alignment researcher… but also it felt to me like your comment was trying to police people’s thoughts, and I don’t actually approve of that either, so...

Yeah. Take this how you will.

[1] I personally am (relatively) agnostic on this question, but as a non-expert in the field my opinion should matter relatively little; I mention this merely as a disclaimer that I am not necessarily on board with EY/​MIRI about the doomed-ness of prosaic alignment.

• (Later added disclaimer: it’s a good idea to add “I feel like...” before the judgment in this comment, so that you keep in mind that I’m talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))

Okay, so you’re completely right that a lot of my points are logically downstream of the debate on whether Prosaic Alignment is Impossible or not. But I feel like you don’t get how one-sided this debate is, and how misrepresented it is here (and generally on the AF).

Like nobody except EY and a bunch of core MIRI people actually believes that prosaic alignment is impossible. I mean that every other researcher that I know thinks Prosaic Alignment is possible, even if potentially very hard. That includes MIRI people like Evan Hubinger too. And note that some of these other alignment researchers actually work with Neural Nets and keep up to speed on the implementation details and subtleties, which in my book means their voice should count more.

But that’s just a majority argument. The real problem is that nobody has ever given a good argument for why this is impossible. I mean, the analogous situation is that a car is driving right at you, accelerating, and you’ve somehow decided that it’s impossible to ever stop it before it kills you. You need a very strong case before giving up like that. And that has not been given by EY and MIRI, AFAIK.

The last part of this is that because EY and MIRI founded the field, their view is given far more credibility than what it would have on the basis of the arguments alone, and far more than it has in actual discussions between researchers.

The best analogy I can find (a bit strawmanish but less than you would expect) is a world where somehow the people who had founded the study of cancer had the idea that no method based on biological experimentation and thinking about cells could ever cure cancer, and that the only way of solving it was to understand every dynamics in a very advanced category theoretic model. Then having found the latter really hard, they just say that curing cancer is impossible.

• I think one core issue here is that there are actually two debates going on. One is “how hard is the alignment problem?”; another is “how powerful are prosaic alignment techniques?” Broadly speaking, I’d characterise most of the disagreement as being on the first question. But you’re treating it like it’s mostly on the second question—like EY and everyone else are studying the same thing (cancer, in your metaphor) and just disagree about how to treat it.

My attempt to portray EY’s perspective is more like: he’s concerned with the problem of ageing, and a whole bunch of people have come along, said they agree with him, and started proposing ways to cure cancer using prosaic radiotherapy techniques. Now he’s trying to say: no, your work is not addressing the core problem of ageing, which is going to kill us unless we make a big theoretical breakthrough.

Regardless of that, calling the debate “one-sided” seems way too strong, especially given how many selection effects are involved. I mean, you could also call the debate about whether alignment is even a problem “one-sided” − 95% of all ML researchers don’t think it’s a problem, or think it’s something we’ll solve easily. But for fairly similar meta-level reasons as why it’s good for them to listen to us in an open-minded way, it’s also good for prosaic alignment researchers to listen to EY in an open-minded way. (As a side note, I’d be curious what credence you place on EY’s worldview being more true than the prosaic alignment worldview.)

Now, your complaint might be that MIRI has not made their case enough over the last few years. If that’s the main issue, then stay tuned; as Rob said, this is just the preface to a bunch of relevant material.

• 95% of all ML researchers don’t think it’s a problem, or think it’s something we’ll solve easily

The 2016 survey of people in AI asked people about the alignment problem as described by Stuart Russell, and 39% said it was an important problem, and 33% that it’s a harder problem than most other problems in the field.

• Thanks for the detailed comment!

I think one core issue here is that there are actually two debates going on. One is “how hard is the alignment problem?”; another is “how powerful are prosaic alignment techniques?” Broadly speaking, I’d characterise most of the disagreement as being on the first question. But you’re treating it like it’s mostly on the second question—like EY and everyone else are studying the same thing (cancer, in your metaphor) and just disagree about how to treat it.

That’s an interesting separation of the problem, because I really feel there is more disagreement on the second question than on the first.

My attempt to portray EY’s perspective is more like: he’s concerned with the problem of ageing, and a whole bunch of people have come along, said they agree with him, and started proposing ways to cure cancer using prosaic radiotherapy techniques. Now he’s trying to say: no, your work is not addressing the core problem of ageing, which is going to kill us unless we make a big theoretical breakthrough.

Funnily enough, aren’t the people currently working on ageing using quite prosaic techniques? I completely agree that one needs to go for the big problems, especially ones that only appear in more powerful regimes (which is why I am adamant that there should be places for researchers to think about distinctly AGI problems and not have to rephrase everything in a way that is palatable to ML academia). But people like Paul and Evan and more are actually going for the core problems IMO, just anchoring a lot of their thinking in current ML technologies. So I have trouble understanding how prosaic alignment isn’t trying to solve the problem at all. Maybe it’s just a disagreement on how large the “prosaic alignment” category is?

Regardless of that, calling the debate “one sided” seems way too strong, especially given how many selection effects are involved. I mean, you could also call the debate about whether alignment is even a problem “one sided” − 95% of all ML researchers don’t think it’s a problem, or think it’s something we’ll solve easily. But for fairly similar meta-level reasons as why it’s good for them to listen to us in an open-minded way, it’s also good for prosaic alignment researchers to listen to EY in an open-minded way.

You definitely have a point, and I want to listen to EY in an open-minded way. It’s just harder when he writes things like everyone working on alignment is faking it and not giving much detail. Also, I feel that your comparison breaks down a bit because, compared to the debate with ML researchers (where most people against alignment haven’t even thought about the basics and make obvious mistakes), the other parties in this debate have thought long and hard about alignment. Maybe not as much as EY, but clearly much more than the ML researchers in the whole “is alignment even a problem” debate.

(As a side note, I’d be curious what credence you place on EY’s worldview being more true than the prosaic alignment worldview.)

At the moment I feel like I don’t have a good enough model of EY’s worldview, plus I’m annoyed by his statements, so any credence I give now would be biased against his worldview.

Now, your complaint might be that MIRI has not made their case enough over the last few years. If that’s the main issue, then stay tuned; as Rob said, this is just the preface to a bunch of relevant material.

• I really feel there is more disagreement on the second question than on the first

What is this feeling based on? One way we could measure this is by asking people about how much AI xrisk there is conditional on there being no more research explicitly aimed at aligning AGIs. I expect that different people would give very different predictions.

People like Paul and Evan and more are actually going for the core problems IMO, just anchoring a lot of their thinking in current ML technologies.

Everyone agrees that Paul is trying to solve foundational problems. And it seems strange to criticise Eliezer’s position by citing the work of MIRI employees.

It’s just harder when he writes things like everyone working on alignment is faking it and not giving much detail.

As Rob pointed out above, this straightforwardly mischaracterises what Eliezer said.

• I worry that “Prosaic Alignment Is Doomed” seems a bit… off as the most appropriate crux. At least for me. It seems hard for someone to justifiably know that this is true with enough confidence to not even try anymore. To have essayed or otherwise precluded all promising paths of inquiry, to not even engage with the rest of the field, to not even try to argue other researchers out of their mistaken beliefs, because it’s all Hopeless.

Consider the following analogy: Someone who wants to gain muscle, but has thought a lot about nutrition and their genetic makeup and concluded that Direct Exercise Gains Are Doomed, and they should expend their energy elsewhere.

OK, maybe. But how about try going to the gym for a month anyways and see what happens?

The point isn’t “EY hasn’t spent a month of work thinking about prosaic alignment.” The point is that AFAICT, by MIRI/​EY’s own values, valuable-seeming plans are being left to rot on the cutting room floor. Like, “core MIRI staff meet for an hour each month and attack corrigibility/​deceptive cognition/​etc with all they’ve got. They pay someone to transcribe the session and post the fruits /​ negative results /​ reasoning to AF, without individually committing to following up with comments.”

(I am excited by Rob Bensinger’s comment that this post is the start of more communication from MIRI)

• Like nobody except EY and a bunch of core MIRI people actually believes that prosaic alignment is impossible. I mean that every other researcher that I know thinks Prosaic Alignment is possible, even if potentially very hard. That includes MIRI people like Evan Hubinger too. And note that some of these other alignment researchers actually work with Neural Nets and keep up to speed on the implementation details and subtleties, which in my book means their voice should count more.

I don’t get the impression that Eliezer’s saying that alignment of prosaic AI is impossible. I think he’s saying “it’s almost certainly not going to happen because humans are bad at things.” That seems compatible with “every other researcher that I know think Prosaic Alignment is possible, even if potentially very hard” (if you go with the “very hard” part).

• Yes, +1 to this; I think it’s important to distinguish between impossible (which is a term I carefully avoided using in my earlier comment, precisely because of its theoretical implications) and doomed (which I think of as a conjunction of theoretical considerations—how hard is this problem?—and social/coordination ones—how likely is it that humans will have solved this problem before solving AGI?).

I currently view this as consistent with e.g. Eliezer’s claim that Chris Olah’s work, though potentially on a pathway to something important, is probably going to accomplish “far too little far too late”. I certainly didn’t read it as anything like an unconditional endorsement of Chris’ work, as e.g. this comment seems to imply.

• Ditto—the first half makes it clear that any strategy which isn’t at most 2 years slower than an unaligned approach will be useless, and that prosaic AI safety falls into that bucket.

• Thanks for elaborating. I don’t think I have the necessary familiarity with the alignment research community to assess your characterization of the situation, but I appreciate your willingness to raise potentially unpopular hypotheses to attention. +1

• Thanks for taking the time of asking a question about the discussion even if you lack expertise on the topic. ;)

• +1 for this whole conversation, including Adam pushing back re prosaic alignment /​ trying to articulate disagreements! I agree that this is an important thing to talk about more.

I like the ‘give more concrete feedback on specific research directions’ idea, especially if it helps clarify generators for Eliezer’s pessimism. If Eliezer is pessimistic about a bunch of different research approaches simultaneously, and you’re simultaneously optimistic about all those approaches, then there must be some more basic disagreement(s) behind that.

From my perspective, the OP discussion is the opening salvo in ‘MIRI does a lot more model-sharing and discussion’. It’s more like a preface than like a conclusion, and the next topic we plan to focus on is why Eliezer-cluster people think alignment is hard, how we’re thinking about AGI, etc. In the meantime, I’m strongly in favor of arguing about this a bunch in the comments, sharing thoughts and reflections on your own models, etc. -- going straight for the meaty central disagreements now, not waiting to hash this out later.

• Someone privately contacted me to express confusion, because they thought my ‘+1’ means that I think adamShimi’s initial comment was unusually great. That’s not the case. The reasons I commented positively are:

• I think this overall exchange went well—it raised good points that might have otherwise been neglected, and everyone quickly reached agreement about the real crux.

• I want to try to cancel out any impression that criticizing /​ pushing back on Eliezer-stuff is unwelcome, since Adam expressed worries about a “taboo on criticizing MIRI and EY too hard”.

• On a more abstract level, I like seeing people ‘blurt out what they’re actually thinking’ (if done with enough restraint and willingness-to-update to mostly avoid demon threads), even if I disagree with the content of their thought. I think disagreements are often tied up in emotions, or pattern-recognition, or intuitive senses of ‘what a person/​group/​forum is like’. This can make it harder to epistemically converge about tough topics, because there’s a temptation to pretend your cruxes are more simple and legible than they really are, and end up talking about non-cruxy things.

Separately, I endorse Ben Pace’s question (“Can you make a positive case here for how the work being done on prosaic alignment leads to success?”) as the thing to focus on.

Thanks for the kind answer, even if we’re probably disagreeing about most points in this thread. I think messages like yours really help in making everyone aware that such topics can actually be discussed publicly without a big backlash.

I like the ‘give more concrete feedback on specific research directions’ idea, especially if it helps clarify generators for Eliezer’s pessimism. If Eliezer is pessimistic about a bunch of different research approaches simultaneously, and you’re simultaneously optimistic about all those approaches, then there must be some more basic disagreement(s) behind that.

That sounds amazing! I definitely want to extract some of the epistemic strategies that EY uses to generate criticisms and break proposals. :)

From my perspective, the OP discussion is the opening salvo in ‘MIRI does a lot more model-sharing and discussion’. It’s more like a preface than like a conclusion, and the next topic we plan to focus on is why Eliezer-cluster people think alignment is hard, how we’re thinking about AGI, etc. In the meantime, I’m strongly in favor of arguing about this a bunch in the comments, sharing thoughts and reflections on your own models, etc. -- going straight for the meaty central disagreements now, not waiting to hash this out later.

• I don’t think the “Only Game in Town” argument works when EY in the OP says

I have a few stupid ideas I could try to investigate in ML,

As well as approving of Redwood Research.

• Some things that seem important to distinguish here:

• ‘Prosaic alignment is doomed’. I parse this as: ‘Aligning AGI, without coming up with any fundamentally new ideas about AGI/intelligence or discovering any big “unknown unknowns” about AGI/intelligence, is doomed.’

• I (and my Eliezer-model) endorse this, in large part because ML (as practiced today) produces such opaque and uninterpretable models. My sense is that Eliezer’s hopes largely route through understanding AGI systems’ internals better, rather than coming up with cleverer ways to apply external pressures to a black box.

• ‘All alignment work that involves running experiments on deep nets is doomed’.

• My Eliezer-model doesn’t endorse this at all.

Also important to distinguish, IMO (making up the names here):

• A strong ‘prosaic AGI’ thesis, like ‘AGI will just be GPT-n or some other scaled-up version of current systems’. Eliezer is extremely skeptical of this.

• A weak ‘prosaic AGI’ thesis, like ‘AGI will involve coming up with new techniques, but the path between here and AGI won’t involve any fundamental paradigm shifts and won’t involve us learning any new deep things about intelligence’. I’m not sure what Eliezer’s unconditional view on this is, but I’d guess that he thinks this falls a lot in probability if we condition on something like ‘good outcomes are possible’, since it’s very bad news.

• An ‘unprosaic but not radically different AGI’ thesis, like ‘AGI might involve new paradigm shifts and/​or new deep insights into intelligence, but it will still be similar enough to the current deep learning paradigm that we can potentially learn important stuff about alignable AGI by working with deep nets today’. I don’t think Eliezer has a strong view on this, though I observe that he thinks some of the most useful stuff humanity can do today is ‘run various alignment experiments on deep nets’.

• An ‘AGI won’t be GOFAI’ thesis. Eliezer strongly endorses this.

There’s also an ‘inevitability thesis’ that I think is a crux here: my Eliezer-model thinks there are a wide variety of ways to build AGI that are very different, such that it matters a lot which option we steer toward (and various kinds of ‘prosaicness’ might be one parameter we can intervene on, rather than being a constant). My Paul-model has the opposite view, and endorses some version of inevitability.

• Note: GOFAI = Good Old Fashioned AI

Your comment and Vaniver’s (paraphrasing) “not surprised by the results of this work, so why do it?” were especially helpful. EY (or others) assessing concrete research directions with detailed explanations would be even more helpful.

I agree with Rohin’s general question of “Can you tell a story where your research helps solve a specific alignment problem?”, and if you have other heuristics when assessing research, that would be good to know.

• +1, plus endorsing Chris Olah

• I share the impression that the agent foundations research agenda seemed not that important. But that point doesn’t feel sufficient to argue that Eliezer’s pessimism about the current state of alignment research is just a face-saving strategy his brain tricked him into adopting. (I’m not saying you claimed that it is sufficient; probably a lot of other data points are factoring into your judgment.) MIRI has deprioritized agent foundations research for quite a while now. I also just think it’s extremely common for people to have periods where they work on research that eventually turns out to be not that important; the interesting thing is to see what happens when that becomes more apparent. I immediately trust people more if I see that they are capable of pivoting and owning up to past mistakes, and I could imagine that MIRI deserves a passing grade on this, even though I also have to say that I don’t know how exactly they nowadays think about prioritization in 2017 and earlier.

I really like Vaniver’s comment further below:

For what it’s worth, my sense is that EY’s track record is best in 1) identifying problems and 2) understanding the structure of the alignment problem.

And, like, I think it is possible that you end up in situations where the people who understand the situation best end up the most pessimistic about it.

I’m very far away from confident that Eliezer’s pessimism is right, but it seems plausible to me. Of course, some people might be in the epistemic position of having tried to hash out that particular disagreement on the object level and have concluded that Eliezer’s pessimism is misguided – I can’t comment on that. I’m just saying that based on what I’ve read, which is pretty much every post and comment on AI alignment on LW and the EA forum, I don’t get the impression that Eliezer’s pessimism is clearly unfounded.

Everyone’s views look like they are suspiciously shaped to put themselves and their efforts into a good light. If someone believed that their work isn’t important or their strengths aren’t very useful, they wouldn’t do the work and wouldn’t cultivate the strengths. That applies to Eliezer, but it also applies to the people who think alignment will likely be easy. I feel like people in the latter group would likely be inconvenienced (in terms of the usefulness of their personal strengths or the connections they’ve built in the AI industry, or past work they’ve done), too, if it turned out not to be.

Just to give an example of the sorts of observations that make me think Eliezer/​”MIRI” could have a point:

• I don’t know what happened with a bunch of safety people leaving OpenAI but it’s at least possible to me that it involved some people having had negative updates on the feasibility of a certain type of strategy that Eliezer criticized early on here. (I might be totally wrong about this interpretation because I haven’t talked to anyone involved.)

• I thought it was interesting when Paul noted that our civilization’s Covid response was a negative update for him on the feasibility of AI alignment. Kudos to him for noting the update, but also: Isn’t that exactly the sort of misprediction one shouldn’t be making if one confidently thinks alignment is likely to succeed? (That said, my sense is that Paul isn’t even at the most optimistic end of people in the alignment community.)

• A lot of the work in the arguments for alignment being easy seems to me to be done by dubious analogies that assume that AI alignment is relevantly similar to risky technologies that we’ve already successfully invented. People seem insufficiently quick to get to the actual crux with MIRI, which makes me think they might not be great at passing the Ideological Turing Test. When we get to the actual crux, it’s somewhere deep inside the domain of predicting the training conditions for AGI, which feels like the sort of thing Eliezer might be good at thinking about. Other people might also be good at thinking about this, but then why do they often start their argument with dubious analogies to past technologies that seem to miss the point?
[Edit: I may be strawmanning some people here. I have seen direct discussions about the likelihood of treacherous turns vs. repeated early warnings of alignment failure. I didn’t have a strong opinion either way, but it’s totally possible that some people feel like they understand the argument and confidently disagree with Eliezer’s view there.]

• That’s an awesome comment, thanks!

But that point doesn’t feel sufficient to argue that Eliezer’s pessimism about the current state of alignment research is just a face-saving strategy his brain tricked him into adopting. (I’m not saying you claimed that it is sufficient; probably a lot of other data points are factoring into your judgment.)

I get why you take that from my rant, but that’s not really what I meant. I’m more criticizing the “everything is doomed but let’s not give concrete feedback to people” stance, and I think part of it comes from believing for so long (and maybe still believing) that their own approach was the only non-fake one. Also, just calling everyone else a faker is quite disrespectful and unhelpful.

I also just think it’s extremely common for people to have periods where they work on research that eventually turns out to be not that important; the interesting thing is to see what happens when that becomes more apparent. I immediately trust people more if I see that they are capable of pivoting and owning up to past mistakes, and I could imagine that MIRI deserves a passing grade on this, even though I also have to say that I don’t know how exactly they nowadays think about prioritization in 2017 and earlier.

MIRI does have some positive points for changing their minds, but also some negative points IMO for taking so long to change their mind. Not sure what the total is.

I’m very far away from confident that Eliezer’s pessimism is right, but it seems plausible to me. Of course, some people might be in the epistemic position of having tried to hash out that particular disagreement on the object level and have concluded that Eliezer’s pessimism is misguided – I can’t comment on that. I’m just saying that based on what I’ve read, which is pretty much every post and comment on AI alignment on LW and the EA forum, I don’t get the impression that Eliezer’s pessimism is clearly unfounded.

Here again, it’s not so much that I disagree with EY about there being problems in the current research proposals. I expect that some of the problems he would point out are ones I see too. I just don’t get the transition from “there are problems with all our current ideas” to “everyone is faking working on alignment and we’re all doomed”.

Everyone’s views look like they are suspiciously shaped to put themselves and their efforts into a good light. If someone believed that their work isn’t important or their strengths aren’t very useful, they wouldn’t do the work and wouldn’t cultivate the strengths. That applies to Eliezer, but it also applies to the people who think alignment will likely be easy. I feel like people in the latter group would likely be inconvenienced (in terms of the usefulness of their personal strengths or the connections they’ve built in the AI industry, or past work they’ve done), too, if it turned out not to be.

Very good point. That being said, many of the more prosaic alignment people changed their minds multiple times, whereas on these specific questions I feel EY and MIRI didn’t, except when forced by tremendous pressure, which makes me believe that this criticism applies more to them. But that’s one point where having some more knowledge of the internal debates at MIRI could make me change my mind completely.

I don’t know what happened with a bunch of safety people leaving OpenAI but it’s at least possible to me that it involved some people having had negative updates on the feasibility of a certain type of strategy that Eliezer criticized early on here. (I might be totally wrong about this interpretation because I haven’t talked to anyone involved.)

My impression from talking with people (but not having direct confirmation from the people who left) was far more that OpenAI was focusing the conceptual safety team on ML work and the other safety team on making sure GPT-3 was not racist, which was not the type of work they were really excited about. But I might also be totally wrong about this.

I thought it was interesting when Paul noted that our civilization’s Covid response was a negative update for him on the feasibility of AI alignment. Kudos to him for noting the update, but also: Isn’t that exactly the sort of misprediction one shouldn’t be making if one confidently thinks alignment is likely to succeed? (That said, my sense is that Paul isn’t even at the most optimistic end of people in the alignment community.)

I’m confused about your question, because what you describe sounds like a misprediction that makes sense? Also I feel that in this case, there’s a difference between solving the coordination problem of having people implement the solution or not go on a race (which indeed looks harder in the light of Covid management) and solving the technical problem, which is orthogonal to the Covid response.

• My impression from talking with people (but not having direct confirmation from the people who left) was far more that OpenAI was focusing the conceptual safety team on ML work and the other safety team on making sure GPT-3 was not racist, which was not the type of work they were really excited about. But I might also be totally wrong about this.

Interesting! This is quite different from the second-hand accounts I heard. (I assume we’re touching different parts of the elephant.)

• Couple things:

First, there is a lot of work in the “alignment community” that involves (for example) decision theory or open-source-game-theory or acausal trade, and I haven’t found any of it helpful for what I personally think about (which I’d like to think is “directly attacking the heart of the problem”, but others may judge for themselves when my upcoming post series comes out!).

I guess I see this subset of work as consistent with the hypothesis “some people have been nerd-sniped!”. But it’s also consistent with “some people have reasonable beliefs and I don’t share them, or maybe I haven’t bothered to understand them”. So I’m a bit loath to go around criticizing them, without putting more work into it. But still, this is a semi-endorsement of one of the things you’re saying.

Second, my understanding of MIRI (as an outsider, based purely on my vague recollection of their newsletters etc., and someone can correct me) is that (1) they have a group working on “better understand agent foundations”, and this group contains Abram and Scott, and they publish pretty much everything they’re doing, (2) they have a group working on undisclosed research projects, which are NOT “better understand agent foundations”, (3) they have a couple “none of the above” people including Evan and Vanessa. So I’m confused that you seem to endorse what Abram and Scott are doing, but criticize agent foundations work at MIRI.

Like, maybe people “in the AI alignment community” are being nerd-sniped, and maybe MIRI had a historical role in how that happened, but I’m not sure there’s any actual MIRI employee right now who is doing nerd-sniped-type work, to the best of my limited understanding, unless we want to say Scott is, but you already said Scott is OK in your book.

(By the way, hot takes: I join you in finding some of Abram’s posts to be super helpful, and would throw Stuart Armstrong onto the “super helpful” list too, assuming he counts as “MIRI”. As for Scott: ironically, I find logical induction very useful when thinking about how to build AGI, and somewhat less useful when thinking about how to align it. :-P I didn’t get anything useful for my own thinking out of his Cartesian frames or finite factored sets, but as above, that could just be me; I’m very loath to criticize without doing more work, especially as they’re works in progress, I gather.)

• I’m annoyed by EY (and maybe MIRI’s?) dismissal of every other alignment work, and how seriously it seems to be taken here, given their track record of choosing research agendas with very indirect impact on alignment

For what it’s worth, my sense is that EY’s track record is best in 1) identifying problems and 2) understanding the structure of the alignment problem.

And, like, I think it is possible that you end up in situations where the people who understand the situation best end up the most pessimistic about it. If you’re trying to build a bridge to the moon, in fact it’s not going to work, and any determination applied there is going to get wasted. I think I see how a “try to understand things and cut to the heart of them” notices when it’s in situations like that, and I don’t see how “move the ball forward from where it is now” notices when it’s in situations like that.

• Agreed on the track record, which is part of why it’s so frustrating that he doesn’t give more details and feedback on why all these approaches are doomed in his view.

That being said, I disagree with the second part, probably because we don’t mean the same thing by “moving the ball”?

In your bridge example, “moving the ball” looks to me like trying to see what problems the current proposal could have, how you could check them, what would be your unknown unknowns. And I definitely expect such an approach to find the problems you mention.

Maybe you could give me a better model of what you mean by “moving the ball”?

• Oh, I was imagining something like “well, our current metals aren’t strong enough, what if we developed stronger ones?”, and then focusing on metallurgy. And this is making forward progress—you can build a taller tower out of steel than out of iron—but it’s missing more fundamental issues like “you’re not going to be able to drive on a bridge that’s perpendicular to gravity, and the direction of gravity will change over the course of the trip” or “the moon moves relative to the earth, such that your bridge won’t be able to be one object”, which will sink the project even if you can find a supremely strong metal.

For example, let’s consider Anca Dragan’s research direction that I’m going to summarize as “getting present-day robots to understand what humans around them want and are doing so that they can achieve their goals /​ cooperate more effectively.” (In mildly adversarial situations like driving, you don’t want to make a cooperatebot, but rather something that follows established norms /​ prevents ‘cutting’ and so on, but when you have a human-robot team you do care mostly about effective cooperation.)

My guess is this 1) will make the world a better place in the short run under ‘normal’ conditions (most obviously through speeding up adoption of autonomous vehicles and making them more effective) and 2) does not represent meaningful progress towards aligning transformative AI systems. [My model of Eliezer notes that actually he’s making a weaker claim, which is something more like “he’s not surprised by the results of her papers”, which still allows for them to be “progress in the published literature”.]

When I imagine “how do I move the ball forward now?” I find myself drawn towards projects like those, and less to projects like “stare at the nature of cognition until I see a way through the constraints”, which feels like the sort of thing that I would need to do to actually shift my sense of doom.

• Adam, can you make a positive case here for how the work being done on prosaic alignment leads to success? You didn’t make one, and without it I don’t understand where you’re coming from. I’m not asking you to tell me a story that you have 100% probability on, just what is the success story you’re acting under, such that EY’s stances seem to you to be mostly distracting people from the real work.

• (Later added disclaimer: it’s a good idea to add “I feel like...” before the judgment in this comment, so that you keep in mind that I’m talking about my impressions and frustrations, rarely stating obvious facts (despite the language making it look so))

Thanks for trying to understand my point and asking me for more details. I appreciate it.

Yet I feel weird when trying to answer, because my gut reaction to your comment is that you’re asking the wrong question? Also, the compression of my view to “EY’s stances seem to you to be mostly distracting people from the real work” sounds more lossy than I’m comfortable with. So let me try to clarify and focus on these feelings and impressions, then I’ll answer more about which success stories or directions excite me.

My current problem with EY’s stances is twofold:

• First, in posts like this one, he literally writes that everything done under the label of alignment is faking it and not even attacking the problem, except like 3 people who, even if they’re trying, have it all wrong. I think this is completely wrong, and that’s even more annoying because I find that most people working on alignment are trying far harder to justify why they expect their work to matter than EY and the old-school MIRI team ever did.

• This is a problem because it doesn’t help anyone working on the field to maybe solve the problems with their approaches that EY sees, which sounds like a massive missed opportunity.

• This is also a problem because EY’s opinions are still quite promoted in the community (especially here on the AF and LW), such that newcomers going for what the founder of the field has to say go away with the impression that no one is doing valuable work.

• Far more speculative (because I don’t know EY personally), but I expect that kind of judgment to come not so much from a place of all-encompassing genius as from generalization after reading some posts/​papers. And I’ve received messages following this thread from people who were just as annoyed as I was, and felt their results had been dismissed without even a comment, or classified as trivial when everyone else, including the authors, was quite surprised by them. I’m ready to give EY a bit of “he just sees further than most people”, but not enough that he can discard the whole field from reading a couple of AF posts.

• Second, historically, a lot of MIRI’s work has followed a specific epistemic strategy of trying to understand the optimal ways of deciding and thinking, both to predict how an AGI would actually behave and to try to align it. I’m not that convinced by this approach, but even giving it the benefit of the doubt, it has in no way led to accomplishments big enough to justify EY’s (and MIRI’s?) highly veiled contempt for anyone not doing that. This had and still has many bad impacts on the field and new entrants.

• A specific subgroup of people tend to be nerd-sniped by this older MIRI work, because it’s the only part of the field that is more formal, but IMO at the cost of most of what matters about alignment and most of the grounding.

• People who don’t have the technical skill to work on MIRI’s older work feel like they have to skill up drastically in maths to be able to do anything relevant in alignment. I literally mentored three people like that, who could actually do a lot of good thinking and cared about alignment, and I had to drive home the point that they didn’t need super advanced maths skills, unless they wanted to do very specific things.
I find that particularly sad because IMO the biggest positive contribution to the field by EY and early MIRI comes from their less formal and more philosophical work, which is exactly the kind of work that is stifled by the consequences of this stance.

• I also feel people here underestimate how repelling this whole attitude has been for years for most people outside the MIRI bubble. From testimonials by a bunch of more ML-focused people, and from how any discussion of alignment needs to clarify that you don’t share MIRI’s contempt with experimental work and not doing only decision theory and logic, I expect that this has been one of the big factors in alignment not being taken seriously and people not wanting to work on it.

• Also important to note that I don’t know if EY and MIRI still think this kind of technical research is highly valuable and the real research and what should be done, but they have been influential enough that I think a big part of the damage is done, and I read some parts of this post as “If only we could do the real logic thing, but we can’t, so we’re doomed”. Also there’s a question of the separation between the image that MIRI and EY project and what they actually think.

Going back to your question, it has a weird double-standard feel. Like, every AF post on more prosaic alignment methods comes with its success story, and the reason for caring about the research. If EY and MIRI want to argue that we’re all doomed, they have the burden of proof to explain why everything that’s been done is terrible and will never lead to alignment. Once again, proving that we won’t be able to solve a problem is incredibly hard and unlikely to succeed. Funny how everyone here gets that for the “AGI is impossible” question, but apparently that doesn’t apply to “Actually working with AIs and Thinking about real AIs will never let you solve alignment in time.”

Still, it’s not too difficult to list a bunch of promising stuff, so here’s a non-exhaustive list:

• John Wentworth’s Natural Abstraction Hypothesis, which is about checking his formalism-backed intuition that NNs actually learn abstractions similar to the ones humans do. The success story is pretty obvious, in that if John is right, alignment should be far easier.

• People from EleutherAI working on understanding LMs and GPT-like models as simulators of processes (called simulacra), as well as the safety benefits (corrigibility) and new strategies (leveraging the output distribution in smart ways) that this model allows.

• Evan Hubinger’s work on finding predicates that we could check during training to avoid deception and behaviors we’re worried about. He has a full research agenda but it’s not public yet. Maybe our post on myopic decision theory could be relevant.

• Stuart Armstrong’s work on model splintering, especially his AI Safety Subprojects, which are experimental, don’t have obvious outcomes, and are directly relevant to implementing and using model splintering to solve alignment.

• Paul Christiano’s recent work on making question-answerers give useful information instead of what they expect humans to answer, which has a clear success story for these kinds of powerful models and their use in building stronger AIs and supervising training for example.

It’s also important to remember how alignment and the related problems and ideas are still not that well explained, distilled and analyzed for teaching and criticism. So I’m excited too about work that isn’t directly solving alignment but just making things clearer and more explicit, like Evan’s recent post or my epistemic strategies analysis.

• Thanks for naming specific work you think is really good! I think it’s pretty important here to focus on the object-level. Even if you think the goodness of these particular research directions isn’t cruxy (because there’s a huge list of other things you find promising, and your view is mainly about the list as a whole rather than about any particular items on it), I still think it’s super important for us to focus on object-level examples, since this will probably help draw out what the generators for the disagreement are.

John Wentworth’s Natural Abstraction Hypothesis, which is about checking his formalism-backed intuition that NNs actually learn abstractions similar to the ones humans do. The success story is pretty obvious, in that if John is right, alignment should be far easier.

Eliezer liked this post enough that he asked me to signal-boost it in the MIRI Newsletter back in April.

And Paul Christiano and Stuart Armstrong are two of the people Eliezer named as doing very-unusually good work. We continue to pay Stuart to support his research, though he’s mainly supported by FHI.

And Evan works at MIRI, which provides some Bayesian evidence about how much we tend to like his stuff. :)

So maybe there’s not much disagreement here about what’s relatively good? (Or maybe you’re deliberately picking examples you think should be ‘easy sells’ to Steel Eliezer.)

The main disagreement, of course, is about how absolutely promising this kind of stuff is, not how relatively promising it is. This could be some of the best stuff out there, but my understanding of the Adam/​Eliezer disagreement is that it’s about ‘how much does this move the dial on actually saving the world?’ /​ ‘how much would we move the dial if we just kept doing more stuff like this?’.

Actually, this feels to me like a thing that your comments have bounced off of a bit. From my perspective, Eliezer’s statement was mostly saying ‘the field as a whole is failing at our mission of preventing human extinction; I can name a few tiny tidbits of relatively cool things (not just MIRI stuff, but Olah and Christiano), but the important thing is that in absolute terms the whole thing is not getting us to the world where we actually align the first AGI systems’.

My Eliezer-model thinks nothing (including MIRI stuff) has moved the dial much, relative to the size of the challenge. But your comments have mostly been about a sort of status competition between decision theory stuff and ML stuff, between prosaic stuff and ‘gain new insights into intelligence’ stuff, between MIRI stuff and non-MIRI stuff, etc. This feels to me like it’s ignoring the big central point (‘our work so far is wildly insufficient’) in order to haggle over the exact ordering of the wildly-insufficient things.

You’re zeroed in on the “vast desert” part, but the central point wasn’t about the desert-oasis contrast, it was that the whole thing is (on Eliezer’s model) inadequate to the task at hand. Likewise, you’re talking a lot about the “fake” part (and misstating Eliezer’s view as “everyone else [is] a faker”), when the actual claim was about “work that seems to me to be mostly fake or pointless or predictable” (emphasis added).

Maybe to you these feel similar, because they’re all just different put-downs. But… if those were true descriptions of things about the field, they would imply very different things.

I would like to put forward that Eliezer thinks, in good faith, that this is the best hypothesis that fits the data. I absolutely think reasonable people can disagree with Eliezer on this, and I don’t think we need to posit any bad faith or personality failings to explain why people would disagree.

• Also, I feel like I want to emphasize that, like… it’s OK to believe that the field you’re working in is in a bad state? The social pressure against saying that kind of thing (or even thinking it to yourself) is part of why a lot of scientific fields are unhealthy, IMO. I’m in favor of you not taking for granted that Eliezer’s right, and pushing back insofar as your models disagree with his. But I want to advocate against:

• Saying false things about what the other person is saying. A lot of what you’ve said about Eliezer and MIRI is just obviously false (e.g., we have contempt for “experimental work” and think you can’t make progress by “Actually working with AIs and Thinking about real AIs”).

• Shrinking the window of ‘socially acceptable things to say about the field as a whole’ (as opposed to unsolicited harsh put-downs of a particular researcher’s work, where I see more value in being cautious).

I want to advocate ‘smack-talking the field is fine, if that’s your honest view; and pushing back is fine, if you disagree with the view’. I want to see more pushing back on the object level (insofar as people disagree), and less ‘how dare you say that, do you think you’re the king of alignment or something’ or ‘saying that will have bad social consequences’.

I think you’re picking up on a real thing of ‘a lot of people are too deferential to various community leaders, when they should be doing more model-building, asking questions, pushing back where they disagree, etc.’ But I think the solution is to shift more of the conversation to object-level argument (that is, modeling the desired behavior), and make that argument as high-quality as possible.

One thing I want to make clear is that I’m quite aware that my comments have not been as high-quality as they should have been. As I wrote in the disclaimer, I was writing from a place of frustration and annoyance, which also implies a focus on more status-y things. That sounded necessary to me to air out this frustration, and I think this was a good idea given the upvotes of my original post and the couple of people who messaged me to tell me that they were also annoyed.

That being said, much of what I was railing against is a general perception of the situation, from reading a lot of stuff but not necessarily stopping to study all the evidence before writing a fully thought-through opinion. I think this is where the “saying obviously false things” comes from (which I think are pretty easy to believe from just reading this post and a bunch of MIRI write-ups), and why your comments are really important to clarify the discrepancy between this general mental picture I was drawing from and the actual reality. Also, recentering the discussion on the object level instead of on status arguments sounds like a good move.

You make a lot of good points and I definitely want to continue the conversation and have more detailed discussion, but I also feel that for the moment I need to take some steps back, read your comments and some of the pointers in other comments, and think a bit more about the question. I don’t think there’s much more to gain from me answering quickly, mostly in reaction.

(I also had the brilliant idea of starting this thread just when I was on the edge of burning out from working too much (during my holidays), so I’m just going to take some time off from work. But I definitely want to continue this conversation further when I come back, although probably not in this thread ^^)

That sounded necessary to me to air out this frustration, and I think this was a good idea given the upvotes of my original post and the couple of people who messaged me to tell me that they were also annoyed.

If you’d just aired out your frustration, framing claims about others in NVC-like ‘I feel like...’ terms (insofar as you suspect you wouldn’t reflectively endorse them), and then a bunch of people messaged you in private to say “thank you! you captured my feelings really well”, then that would seem clearly great to me.

I’m a bit worried that what instead happened is that you made a bunch of clearly-false claims about other people and gave a bunch of invalid arguments, mixed in with the feelings-stuff; and you used the content warning at the top of the message to avoid having to distinguish which parts of your long, detailed comment are endorsed or not (rather than also flagging this within the comment); and then you also ran with this in a bunch of follow-up comments that were similarly not-endorsed but didn’t even have the top-of-comment disclaimer. So that I could imagine some people who also aren’t independently familiar with all the background facts, could come away with a lot of wrong beliefs about the people you’re criticizing.

‘Other people liked my comment, so it was clearly a good thing’ doesn’t distinguish between the worlds where they like it because they share the feelings, vs. agreeing with the factual claims and arguments (and if the latter, whether they’re noticing and filtering out all the seriously false or not-locally-valid parts). If the former, I think it was good. If the latter, I think it was bad.

(By default I’d assume it’s some mix.)

• I’m a bit worried that what instead happened is that you made a bunch of clearly-false claims about other people and gave a bunch of invalid arguments, mixed in with the feelings-stuff; and you used the content warning at the top of the message to avoid having to distinguish which parts of your long, detailed comment are endorsed or not (rather than also flagging this within the comment); and then you also ran with this in a bunch of follow-up comments that were similarly not-endorsed but didn’t even have the top-of-comment disclaimer. So I could imagine that some people who aren’t independently familiar with all the background facts could come away with a lot of wrong beliefs about the people you’re criticizing.

That sounds a bit unfair, in the sense that it makes it look like I just invented stuff I didn’t believe and ran with it. What actually happened was that I wrote about my frustrations, but made the mistake of stating them as obvious facts instead of impressions.

Of course, I imagine you feel that my portrayal of EY and MIRI was also unfair, sorry about that.

(I added a note to the three most ranty comments on this thread saying that people should mentally add “I feel like...” to judgments in them.)

• Thanks for adding the note! :)

I’m confused. When I say ‘that’s just my impression’, I mean something like ‘that’s an inside-view belief that I endorse but haven’t carefully vetted’. (See, e.g., Impression Track Records, referring to Naming Beliefs.)

Example: you said that MIRI has “contempt with experimental work and not doing only decision theory and logic”.

My prior guess would have been that you don’t actually, for-real believe that—that it’s not your ‘impression’ in the above sense, more like ‘unendorsed venting/​hyperbole that has a more complicated relation to something you really believe’.

If you do (or did) think that’s actually true, then our models of MIRI are much more different than I thought! Alternatively, if you agree this is not true, then that’s all I meant in the previous comment. (Sorry if I was unclear about that.)

• I would say that with slight caveats (make “decision theory and logic” a bit larger to include some more mathy stuff, and make “all experimental work” a bit smaller to not include Redwood’s work), this was indeed my model.

What made me update from our discussion is the realization that I interpreted the dismissal of basically all alignment research as “this has no value whatsoever and people doing it are just pretending to care about alignment”, when it should have been interpreted as something like “this is potentially interesting/​new/​exciting, but it doesn’t look like it brings us closer to solving alignment in a significant way, hence we’re still failing”.

• ‘Experimental work is categorically bad, but Redwood’s work doesn’t count’ does not sound like a “slight caveat” to me! What does this generalization mean at all if Redwood’s stuff doesn’t count?

(Neither, for that matter, does the difference between ‘decision theory and logic’ and ‘all mathy stuff MIRI has ever focused on’ seem like a ‘slight caveat’ to me—but in that case maybe it’s because I have a lot more non-logic, non-decision-theory examples in my mind that you might not be familiar with, since it sounds like you haven’t read much MIRI stuff?).

• (Responding to entire comment thread) Rob, I don’t think you’re modeling what MIRI looks like from the outside very well.

• There’s a lot of public stuff from MIRI on a cluster that has as central elements decision theory and logic (logical induction, Vingean reflection, FDT, reflective oracles, Cartesian Frames, Finite Factored Sets...)

• There was once an agenda (AAMLS) that involved thinking about machine learning systems, but it was deprioritized, and the people working on it left MIRI.

• There was a non-public agenda that involved Haskell programmers. That’s about all I know about it. For all I know they were doing something similar to the modal logic work I’ve seen in the past.

• Eliezer frequently talks about how everyone doing ML work is pursuing dead ends, with potentially the exception of Chris Olah. Chris’s work is not central to the cluster I would call “experimentalist”.

• There has been one positive comment on the KL-divergence result in summarizing from human feedback. That wasn’t the main point of that paper and was an extremely predictable result.

• There has also been one positive comment on Redwood Research, which was founded by people who have close ties to MIRI. The current steps they are taking are not dramatically different from what other people have been talking about and/​or doing.

• There was a positive-ish comment on aligning narrowly superhuman models, though iirc it gave off more of an impression of “well, let’s at least die in a slightly more dignified way”.

I don’t particularly agree with Adam’s comments, but it does not surprise me that someone could come to honestly believe the claims within them.

• So, the point of my comments was to draw a contrast between having a low opinion of “experimental work and not doing only decision theory and logic”, and having a low opinion of “mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc.” I didn’t intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as ‘oh yeah, that’s not true and wasn’t what I was thinking’) is not great for discussion.

I think it’s true that ‘MIRI is super not into most ML alignment work’, and I think it used to be true that MIRI put almost all of its research effort into HRAD-ish work, and regardless, this all seems like a completely understandable cached impression to have of current-MIRI. If I wrote stuff that makes it sound like I don’t think those views are common, reasonable, etc., then I apologize for that and disavow the thing I said.

But this is orthogonal to what I thought I was talking about, so I’m confused about what seems to me like a topic switch. Maybe the implied background view here is:

‘Adam’s elision between those two claims was a totally normal level of hyperbole/​imprecision, like you might find in any LW comment. Picking on word choices like “only decision theory and logic” versus “only research that’s clustered near decision theory and logic in conceptspace”, or “contempt with experimental work” versus “assigning low EV to typical instances of empirical ML alignment work”, is an isolated demand for rigor that wouldn’t make sense as a general policy and isn’t, in any case, the LW norm.’

Is that right?

• So, the point of my comments was to draw a contrast between having a low opinion of “experimental work and not doing only decision theory and logic”, and having a low opinion of “mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc.” I didn’t intend to say that the latter is obviously-wrong; my goal was just to point out how different those two claims are, and say that the difference actually matters, and that this kind of hyperbole (especially when it never gets acknowledged later as ‘oh yeah, that’s not true and wasn’t what I meant’) is not great for discussion.

It occurs to me that part of the problem may be precisely that Adam et al. don’t think there’s a large difference between these two claims (that actually matters). For example, when I query my (rough, coarse-grained) model of [your typical prosaic alignment optimist], the model in question responds to your statement with something along these lines:

If you remove “mainstream ML alignment work, and nearly all work outside of the HRAD-ish cluster of decision theory, logic, etc.” from “experimental work”, what’s left? Perhaps there are one or two (non-mainstream, barely-pursued) branches of “experimental work” that MIRI endorses and that I’m not aware of—but even if so, that doesn’t seem to me to be sufficient to justify the idea of a large qualitative difference between these two categories.

In a similar vein to the above: perhaps one description is (slightly) hyperbolic and the other isn’t. But I don’t think replacing the hyperbolic version with the non-hyperbolic version would substantially change my assessment of MIRI’s stance; the disagreement feels non-cruxy to me. In light of this, I’m not particularly bothered by either description, and it’s hard for me to understand why you view it as such an important distinction.

Moreover: I don’t think [my model of] the prosaic alignment optimist is being stupid here. I think, to the extent that his words miss an important distinction, it is because that distinction is missing from his very thoughts and framing, not because he happened to choose his words somewhat carelessly when attempting to describe the situation. Insofar as this is true, I expect him to react to your highlighting of this distinction with (mostly) bemusement, confusion, and possibly even some slight suspicion (e.g. that you’re trying to muddy the waters with irrelevant nitpicking).

To be clear: I don’t think you’re attempting to muddy the waters with irrelevant nitpicking here. I think you think the distinction in question is important because it’s pointing to something real, true, and pertinent—but I also think you’re underestimating how non-obvious this is to people who (A) don’t already deeply understand MIRI’s view, and (B) aren’t in the habit of searching for ways someone’s seemingly pointless statement might actually be right.

I don’t consider myself someone who deeply understands MIRI’s view. But I do want to think of myself as someone who, when confronted with a puzzling statement [from someone whose intellectual prowess I generally respect], searches for ways their statement might be right. So, here is my attempt at describing the real crux behind this disagreement:

(with the caveat that, as always, this is my view, not Rob’s, MIRI’s, or anybody else’s)

(and with the additional caveat that, even if my read of the situation turns out to be correct, I think in general the onus is on MIRI to make sure they are understood correctly, rather than on outsiders to try to interpret them—at least, assuming that MIRI wants to make sure they’re understood correctly, which may not always be the best use of researcher time)

I think the disagreement is mostly about MIRI’s counterfactual behavior, not about their actual behavior. I think most observers (including both Adam and Rob) would agree that MIRI leadership has been largely unenthusiastic about a large class of research that currently falls under the umbrella “experimental work”, and that the amount of work in this class MIRI has been unenthused about significantly outweighs the amount of work they have been excited about.

Where I think Adam and Rob diverge is in their respective models of the generator of this observed behavior. I think Adam (and those who agree with him) thinks that the true boundary of the category [stuff MIRI finds unpromising] roughly coincides with the boundary of the category [stuff most researchers would call “experimental work”], such that anything that comes too close to “running ML experiments and seeing what happens” will be met with an immediate dismissal from MIRI. In other words, [my model of] Adam thinks MIRI’s generator is configured such that the ratio of “experimental work” they find promising-to-unpromising would be roughly the same across many possible counterfactual worlds, even if each of those worlds is doing “experiments” investigating substantially different hypotheses.

Conversely, I think Rob thinks the true boundary of the category [stuff MIRI finds unpromising] is mostly unrelated to the boundary of the category [stuff most researchers would call “experimental work”], and that—to the extent MIRI finds most existing “experimental work” unpromising—this is mostly because the existing work is not oriented along directions MIRI finds promising. In other words, [my model of] Rob thinks MIRI’s generator is configured such that the ratio of “experimental work” they find promising-to-unpromising would vary significantly across counterfactual worlds where researchers investigate different hypotheses; in particular, [my model of] Rob thinks MIRI would find most “experimental work” highly promising in the world where the “experiments” being run are those whose results Eliezer/​Nate/​etc. would consider difficult to predict in advance, and therefore convey useful information regarding the shape of the alignment problem.

I think Rob’s insistence on maintaining the distinction between having a low opinion of “experimental work and not doing only decision theory and logic”, and having a low opinion of “mainstream ML alignment work, and of nearly all work outside the HRAD-ish cluster of decision theory, logic, etc.” is in fact an attempt to gesture at the underlying distinction outlined above, and I think that his stringency on this matter makes significantly more sense in light of this. (Though, once again, I note that I could be completely mistaken in everything I just wrote.)

Assuming, however, that I’m (mostly) not mistaken, I think there’s an obvious way forward in terms of resolving the disagreement: try to convey the underlying generators of MIRI’s worldview. In other words, do the thing you were going to do anyway, and save the discussions about word choice for afterwards.

• ^ This response is great.

I also think I naturally interpreted the terms in Adam’s comment as pointing to specific clusters of work in today’s world, rather than universal claims about all work that could ever be done. That is, when I see “experimental work and not doing only decision theory and logic”, I automatically think of “experimental work” as pointing to a specific cluster of work that exists in today’s world (which we might call mainstream ML alignment), rather than “any information you can get by running code”. Whereas it seems you interpreted it as something closer to “MIRI thinks there isn’t any information to get by running code”.

My brain insists that my interpretation is the obvious one and is confused how anyone (within the AI alignment field, who knows about the work that is being done) could interpret it as the latter. (Although the existence of non-public experimental work that isn’t mainstream ML is a good candidate for how you would start to interpret “experimental work” as the latter.) But this seems very plausibly a typical mind fallacy.

EDIT: Also, to explicitly say it, sorry for misunderstanding what you were trying to say. I did in fact read your comments as saying “no, MIRI is not categorically against mainstream ML work, and MIRI is not only working on HRAD-ish stuff like decision theory and logic, and furthermore this should be pretty obvious to outside observers”, and now I realize that is not what you were saying.

• This is a good comment! I also agree that it’s mostly on MIRI to try to explain its views, not on others to do painstaking exegesis. If I don’t have a ready-on-hand link that clearly articulates the thing I’m trying to say, then it’s not surprising if others don’t have it in their model.

And based on these comments, I update that there’s probably more disagreement-about-MIRI than I was thinking, and less (though still a decent amount of) hyperbole/​etc. If so, sorry about jumping to conclusions, Adam!

• Not sure if this helps, and haven’t read the thread carefully, but my sense is your framing might be eliding distinctions that are actually there, in a way that makes it harder to get to the bottom of your disagreement with Adam. Some predictions I’d have are that:

* For almost any experimental result, a typical MIRI person (and you, and Eliezer) would think it was less informative about AI alignment than I would.
* For almost all experimental results you would think they were so much less informative as to not be worthwhile.
* There’s a small subset of experimental results that we would think are comparably informative, and also some that you would find much more informative than I would.

(I’d be willing to take bets on these or pick candidate experiments to clarify this.)

In addition, a consequence of these beliefs is that you think we should be spending way more time sitting around thinking about stuff, and way less time doing experiments, than I do.

I would agree with you that “MIRI hates all experimental work” /​ etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.

• I would agree with you that “MIRI hates all experimental work” /​ etc. is not a faithful representation of this state of affairs, but I think there is nevertheless an important disagreement MIRI has with typical ML people, and that the disagreement is primarily about what we can learn from experiments.

Ooh, that’s really interesting. Thinking about it, I think my sense of what’s going on is (and I’d be interested to hear how this differs from your sense):

1. Compared to the average alignment researcher, MIRI tends to put more weight on reasoning like ‘sufficiently capable and general AI is likely to have property X as a strong default, because approximately-X-ish properties don’t seem particularly difficult to implement (e.g., they aren’t computationally intractable), and we can see from first principles that agents will be systematically less able to get what they want when they lack property X’. My sense is that MIRI puts more weight on arguments like this for reasons like:

• We’re more impressed with the track record of inside-view reasoning in science.

• I suspect this is partly because the average alignment researcher is impressed with how unusually-poorly inside-view reasoning has done in AI—many have tried to gain a deep understanding of intelligence, and many have failed—whereas (for various reasons) MIRI is less impressed with this, and defaults more to the base rate for other fields, where inside-view reasoning has more extraordinary feats under its belt.

• We’re more wary of “modest epistemology”, which we think often acts like a self-fulfilling prophecy. (You don’t practice trying to mechanistically model everything yourself, you despair of overcoming biases, you avoid thinking thoughts that would imply you’re a leader or pioneer because that feels arrogant, so you don’t gain as much skill or feedback in those areas.)

2. Compared to the average alignment researcher, MIRI tends to put less weight on reasoning like ‘X was true about AI in 1990, in 2000, in 2010, and in 2020; therefore X is likely to be true about AGI when it’s developed’. This is for a variety of reasons, including:

• MIRI is more generally wary of putting much weight on surface generalizations, if we don’t have an inside-view reason to expect the generalization to keep holding.

• MIRI thinks AGI is better thought of as ‘a weird specific sort of AI’, rather than as ‘like existing AI but more so’.

• Relatedly, MIRI thinks AGI is mostly insight-bottlenecked (we don’t know how to build it), rather than hardware-bottlenecked. Progress on understanding AGI is much harder to predict than progress on hardware, so we can’t derive as much from trends.

Applying this to experiments:

Some predictions I’d have are that:

* For almost any experimental result, a typical MIRI person (and you, and Eliezer) would think it was less informative about AI alignment than I would.
* For almost all experimental results you would think they were so much less informative as to not be worthwhile.
* There’s a small subset of experimental results that we would think are comparably informative, and also some that you would find much more informative than I would.

I’d have the same prediction, though I’m less confident that ‘pessimism about experiments’ is doing much work here, vs. ‘pessimism about alignment’. To distinguish the two, I’d want to look at more conceptual work too, where I’d guess MIRI is also more pessimistic than you (though probably the gap will be smaller?).

I do expect there to be some experiment-specific effect. I don’t know your views well, but if your views are sufficiently like my mental model of ‘typical alignment researcher whose intuitions differ a lot from MIRI’s’, then my guess would be that the disagreement comes down to the above two factors.

• 1 (more trust in inside view): For many experiments, I’m imagining Eliezer saying ‘I predict the outcome will be X’, and then the outcome is X, and the Modal Alignment Researcher says: ‘OK, but now we’ve validated your intuition—you should be much more confident, and that update means the experiment was still useful.’

To which Hypothetical Eliezer says: ‘I was already more than confident enough. Empirical data is great—I couldn’t have gotten this confident without years of honing my models and intuitions through experience—but now that I’m there, I don’t need to feign modesty and pretend I’m uncertain about everything until I see it with my own eyes.’

• 2 (less trust in AGI sticking to trends): For many obvious ML experiments Eliezer can’t predict the outcome of, I expect Eliezer to say ‘This experiment isn’t relevant, because factors X, Y, and Z give us strong reason to think that the thing we learn won’t generalize to AGI.’

Which ties back in to 1 as well, because if you don’t think we can build very reliable models in AI without constant empirical feedback, you’ll rarely be confident of abstract reasons X/​Y/​Z to expect a difference between current ML and AGI, since you can’t go walk up to an AGI today and observe what it’s like.

(You also won’t be confident that X/​Y/​Z don’t hold—all the possibilities will seem plausible until AGI is actually here, because you generally don’t trust yourself to reason your way to conclusions with much confidence.)

• Thanks. For time/​brevity, I’ll just say which things I agree /​ disagree with:

> sufficiently capable and general AI is likely to have property X as a strong default [...]

I generally agree with this, although for certain important values of X (such as “fooling humans for instrumental reasons”) I’m probably more optimistic than you that there will be a robust effort to get not-X, including by many traditional ML people. I’m also probably more optimistic (but not certain) that those efforts will succeed.

[inside view, modest epistemology]: I don’t have a strong take on either of these. My main take on inside views is that they are great for generating interesting and valuable hypotheses, but usually wrong on the particulars.

> less weight on reasoning like ‘X was true about AI in 1990, in 2000, in 2010, and in 2020; therefore X is likely to be true about AGI when it’s developed

I agree, see my post On the Risks of Emergent Behavior in Foundation Models. In the past I think I put too much weight on this type of reasoning, and also think most people in ML put too much weight on it.

> MIRI thinks AGI is better thought of as ‘a weird specific sort of AI’, rather than as ‘like existing AI but more so’.

Probably disagree but hard to tell. I think there will both be a lot of similarities and a lot of differences.

> AGI is mostly insight-bottlenecked (we don’t know how to build it), rather than hardware-bottlenecked

Seems pretty wrong to me. We probably need both insight and hardware, but the insights themselves are hardware-bottlenecked: once you can easily try lots of stuff and see what happens, insights are much easier, see Crick on x-ray crystallography for historical support (ctrl+f for Crick).

> I’d want to look at more conceptual work too, where I’d guess MIRI is also more pessimistic than you

I’m more pessimistic than MIRI about HRAD, though that has selection effects. I’ve found conceptual work to be pretty helpful for pointing to where problems might exist, but usually relatively confused about how to address them or how specifically they’re likely to manifest. (Which is to say, overall highly valuable, but consistent with my take above on inside views.)

[experiments are either predictable or uninformative]: Seems wrong to me. As a concrete example: Do larger models have better or worse OOD generalization? I’m not sure if you’d pick “predictable” or “uninformative”, but my take is:
* The outcome wasn’t predictable: within ML there are many people who would have taken each side. (I personally was on the wrong side, i.e. predicting “worse”.)
* It’s informative, for two reasons: (1) It shows that NNs “automatically” generalize more than I might have thought, and (2) Asymptotically, we expect the curve to eventually reverse, so when does that happen and how can we study it?

See also my take on Measuring and Forecasting Risks from AI, especially the section on far-off risks.

> Most ML experiments either aren’t about interpretability and ‘cracking open the hood’, or they’re not approaching the problem in a way that MIRI’s excited by.

Would agree with “most”, but I think you probably meant something like “almost all”, which seems wrong. There’s lots of people working on interpretability, and some of the work seems quite good to me (aside from Chris, I think Noah Goodman, Julius Adebayo, and some others are doing pretty good work).

• I’m not (retroactively in imaginary prehindsight) excited by this problem because neither of the 2 possible answers (3 possible if you count “the same”) had any clear-to-my-model relevance to alignment, or even AGI. AGI will have better OOD generalization on capabilities than current tech, basically by the definition of AGI; and then we’ve got less-clear-to-OpenPhil forces which cause the alignment to generalize more poorly than the capabilities did, which is the Big Problem. Bigger models generalizing better or worse doesn’t say anything obvious to any piece of my model of the Big Problem. Though if larger models start generalizing more poorly, then it takes longer to stupidly-brute-scale to AGI, which I suppose affects timelines some, but that just takes timelines from unpredictable to unpredictable sooo.

If we qualify an experiment as interesting when it can tell anyone about anything, then there’s an infinite variety of experiments “interesting” in this sense and I could generate an unlimited number of them. But I do restrict my interest to experiments which can not only tell me something I don’t know, but tell me something relevant that I don’t know. There is also something to be said for opening your eyes and staring around, but even then, there’s an infinity of blank faraway canvases to stare at, and the trick is to go wandering with your eyes wide open someplace you might see something really interesting. Others will be puzzled and interested by different things and I don’t wish them ill on their spiritual journeys, but I don’t expect the vast majority of them to return bearing enlightenment that I’m at all interested in, though now and then Miles Brundage tweets something (from far outside of EA) that does teach me some little thing about cognition.

I’m interested at all in Redwood Research’s latest project because it seems to offer a prospect of wandering around with our eyes open asking questions like “Well, what if we try to apply this nonviolence predicate OOD, can we figure out what really went into the ‘nonviolence’ predicate instead of just nonviolence?” or if it works maybe we can try training on corrigibility and see if we can start to manifest the tiniest bit of the predictable breakdowns, which might manifest in some different way.

Do larger models generalize better or more poorly OOD? It’s a relatively basic question as such things go, and no doubt of interest to many, and may even update our timelines from ‘unpredictable’ to ‘unpredictable’, but… I’m trying to figure out how to say this, and I think I should probably accept that there’s no way to say it that will stop people from trying to sell other bits of research as Super Relevant To Alignment… it’s okay to have an understanding of reality which makes narrower guesses than that about which projects will turn out to be very relevant.

• I’m interested at all in Redwood Research’s latest project because it seems to offer a prospect of wandering around with our eyes open asking questions like “Well, what if we try to apply this nonviolence predicate OOD, can we figure out what really went into the ‘nonviolence’ predicate instead of just nonviolence?” or if it works maybe we can try training on corrigibility and see if we can start to manifest the tiniest bit of the predictable breakdowns, which might manifest in some different way.

Trying to rephrase it in my own words (which will necessarily lose some details), are you interested in Redwood’s research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now? Which might tell us for example “what aspect of these predictable problems crop up first, and why?”

• are you interested in Redwood’s research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now?

It potentially sheds light on small subpieces of things that are particular aspects that contribute to the Real Problem, like “What actually went into the nonviolence predicate instead of just nonviolence?” Much of the Real Meta-Problem is that you do not get things analogous to the full Real Problem until you are just about ready to die.

• I suspect a third important reason is that MIRI thinks alignment is mostly about achieving a certain kind of interpretability/​understandability/​etc. in the first AGI systems. Most ML experiments either aren’t about interpretability and ‘cracking open the hood’, or they’re not approaching the problem in a way that MIRI’s excited by.

E.g., if you think alignment research is mostly about testing outer reward function to see what first-order behavior they produce in non-AGI systems, rather than about ‘looking in the learned model’s brain’ to spot mesa-optimization and analyze what that optimization is ultimately ‘trying to do’ (or whatever), then you probably won’t produce stuff that MIRI’s excited about regardless of how experimental vs. theoretical your work is.

(In which case, maybe this is not actually a crux for the usefulness of most alignment experiments, and is instead a crux for the usefulness of most alignment research in general.)

• (I suspect there are a bunch of other disagreements going into this too, including basic divergences on questions like ‘What’s even the point of aligning AGI? What should humanity do with aligned AGI once it has it?’.)

• One tiny note: I was among the people on AAMLS; I did leave MIRI the next year; and my reasons for so doing are not in any way an indictment of MIRI. (I was having some me-problems.)

I still endorse MIRI as, in some sense, being the adults in the AI Safety room, which has… disconcerting effects on my own level of optimism.

• Not planning to answer more on this thread, but given how my last messages seem to have confused you, here is my last attempt at sharing my mental model (so you can flag in an answer, for readers of this thread, where you think I’m wrong).

Also, I just checked the publication list, and I’ve read or skimmed most things MIRI has published since 2014 (including most newsletters and blog posts on the MIRI website).

My model of MIRI is that initially there were a bunch of people, including EY, working mostly on decision theory stuff, tiling, model theory, the sort of stuff I was pointing at. That predates Nate’s arrival, but in my model it becomes far more legible after that (so circa 2014/2015). In my model, I call that “old school MIRI”, and that was a big chunk of what I was pointing out in my original comment.

Then there are a bunch of things that seem to have happened:

• Newer people (Abram and Scott come to mind, but mostly because they’re the ones who post on the AF and whom I’ve talked to) join this old-school MIRI approach and reshape it into Embedded Agency. Now this new agenda is a bit different from the old-school MIRI work, but I feel like it’s still not that far from decision theory and logic (with maybe a stronger emphasis on the Bayesian part, for stuff like logical induction). That might be a part where we’re disagreeing.

• A direction related to embedded agency and the decision theory and logic stuff, but focused on implementations through strongly typed programming languages like Haskell and type theory. That’s technically practical, but in my mental model this goes in the same category as “decision theory and logic stuff”, especially because that sort of programming is very close to logic and natural deduction.

• MIRI starts its ML-focused agenda, which you already mentioned. The impression I still have is that this didn’t lead to much published work that was actually experimental, instead focusing on recasting questions of alignment through ML theory. But I’ve updated towards thinking MIRI has invested effort into looking at stuff from a more prosaic angle, based on looking more into what has been published there, because some of these ML papers had flown under my radar. (There’s also the difficulty that when I read a paper by someone who has a position elsewhere now — say Ryan Carey or Stuart Armstrong — I don’t think of MIRI but of their current affiliation, even though the work was supported by MIRI, and apparently Stuart is still supported by MIRI.) This is the part of the model where I expect that we might have very different models, because of your knowledge of what was being done internally and never released.

• Some new people hired by MIRI fall into what I call the “Bell Labs MIRI” model, where MIRI just hires/funds people who have different approaches from them, but who they think are really bright (Evan and Vanessa come to mind, although I don’t know if that’s the thought process that went into hiring them).

Based on that model, and on feedback and impressions I’ve gathered that some MIRI researchers are very doubtful of experimental work, I arrived at my “all experimental work is useless” claim. I tried to include Redwood’s and Chris Olah’s work in there as caveats (which is a weird model, but makes sense if you have a strong prior for “experimental work is useless for MIRI”).

Our discussion made me think that there’s probably far better generators for this general criticism of experimental work, and that they would actually make more sense than “experimental work is useless except this and that”.

• From testimonials by a bunch of more ML people, and from how any discussion of alignment needs to clarify that you don’t share MIRI’s contempt for experimental work and aren’t doing only decision theory and logic

If you were in the situation described by The Rocket Alignment Problem, you could think “working with rockets right now isn’t useful, we need to focus on our conceptual confusions about more basic things” without feeling inherently contemptuous of experimentalism—it’s a tool in the toolbox (which may or may not be appropriate to the task at hand), not a low- or high-status activity on a status hierarchy.

Separately, I think MIRI has always been pretty eager to run experiments in software when they saw an opportunity to test important questions that way. It’s also been 4.5 years now since we announced that we were shifting a lot of resources away from Agent Foundations and into new stuff, and 3 years since we wrote a very long (though still oblique) post about that research, talking about its heavy focus on running software experiments. Though we also made sure to say:

In a sense, you can think of our new research as tackling the same sort of problem that we’ve always been attacking, but from new angles. In other words, if you aren’t excited about logical inductors or functional decision theory, you probably wouldn’t be excited by our new work either.

I don’t think you can say MIRI has “contempt for experimental work” after four years of us mainly focusing on experimental work. There are other disagreements here, but this ties in to a long-standing objection I have to false dichotomies like:

• ‘we can either do prosaic alignment, or run no experiments’

• ‘we can either do prosaic alignment, or ignore deep learning’

• ‘we can either think it’s useful to improve our theoretical understanding of formal agents in toy settings, or think it’s useful to run experiments’

• ‘we can either think the formal agents work is useful, or think it’s useful to work with state-of-the-art ML systems’

I don’t think Eliezer’s criticism of the field is about experimentalism. I do think it’s heavily about things like ‘the field focuses too much on putting external pressures on black boxes, rather than trying to open the black box’, because (a) he doesn’t think those external-pressures approaches are viable (absent a strong understanding of what’s going on inside the box), and (b) he sees the ‘open the black box’ type work as the critical blocker. (Hence his relative enthusiasm for Chris Olah’s work, which, you’ll notice, is about deep learning and not about decision theory.)

• … I find that most people working on alignment are trying far harder to justify why they expect their work to matter than EY and the old-school MIRI team ever did.

You’ve had a few comments along these lines in this thread, and I think this is where you’re most severely failing to see the situation from Yudkowsky’s point of view.

From Yudkowsky’s view, explaining and justifying MIRI’s work (and the processes he uses to reach such judgements more generally) was the main point of the sequences. He has written more on the topic than anyone else in the world, by a wide margin. He basically spent several years full-time just trying to get everyone up to speed, because the inductive gap was very very wide.

When I put on my Yudkowsky hat and look at both the OP and your comments through that lens… I imagine if I were Yudkowsky I’d feel pretty exasperated at this point. Like, he’s written a massive volume on the topic, and now ten years later a large chunk of people haven’t even bothered to read it. (In particular, I know (because it’s come up in conversation) that at least a few of the people who talk about prosaic alignment a lot haven’t read the sequences, and I suspect that a disproportionate number haven’t. I don’t mean to point fingers or cast blame here, the sequences are a lot of material and most of it is not legibly relevant before reading it all, but if you haven’t read the sequences and you’re wondering why MIRI doesn’t have a write-up on why they’re not excited about prosaic alignment… well, that’s kinda the write-up. Also I feel like I need a disclaimer here that many people excited about prosaic alignment have read the sequences, I definitely don’t mean to imply that this is everyone in the category.)

(To be clear, I don’t think the sequences explain all of the pieces behind Yudkowsky’s views of prosaic alignment, in depth. They were written for a different use-case. But I do think they explain a lot.)

Related: IMO the best roughly-up-to-date piece explaining the Yudkowsky/​MIRI viewpoint is The Rocket Alignment Problem.

• Thanks for the pushback!

You’ve had a few comments along these lines in this thread, and I think this is where you’re most severely failing to see the situation from Yudkowsky’s point of view.

From Yudkowsky’s view, explaining and justifying MIRI’s work (and the processes he uses to reach such judgements more generally) was the main point of the sequences. He has written more on the topic than anyone else in the world, by a wide margin. He basically spent several years full-time just trying to get everyone up to speed, because the inductive gap was very very wide.

My memory of the sequences is that they’re far more about defending and explaining the alignment problem than about criticizing prosaic AGI (maybe because the term couldn’t have been used years before Paul coined it?). Could you give me the best pointers to criticism of prosaic alignment in the sequences? (I’ve read the sequences, but I don’t remember every single post, and my impression from memory is what I’ve written above.)

I also feel that there might be a discrepancy between who I think of when I think of prosaic alignment researchers and what the category means in general/to most people here. My category mostly includes AF posters, people from a bunch of places like EleutherAI/OpenAI/DeepMind/Anthropic/Redwood, and people from CHAI and FHI. I expect most of these people to actually have read the sequences and tried to understand MIRI’s perspective. Maybe someone could point out a list of other places where prosaic alignment research is being done that I’m missing, especially places where people probably haven’t read the sequences? Or maybe I’m overestimating how many of the people in the places I mentioned have read the sequences?

• I don’t mean to say that there’s critique of prosaic alignment specifically in the sequences. Rather, a lot of the generators of the Yudkowsky-esque worldview are in there. (That is how the sequences work: it’s not about arguing specific ideas around alignment, it’s about explaining enough of the background frames and generators that the argument becomes unnecessary. “Raise the sanity waterline” and all that.)

For instance, just the other day I ran across this:

Of this I learn the lesson: You cannot manipulate confusion. You cannot make clever plans to work around the holes in your understanding. You can’t even make “best guesses” about things which fundamentally confuse you, and relate them to other confusing things. Well, you can, but you won’t get it right, until your confusion dissolves. Confusion exists in the mind, not in the reality, and trying to treat it like something you can pick up and move around, will only result in unintentional comedy.

Similarly, you cannot come up with clever reasons why the gaps in your model don’t matter. You cannot draw a border around the mystery, put on neat handles that let you use the Mysterious Thing without really understanding it—like my attempt to make the possibility that life is meaningless cancel out of an expected utility formula. You can’t pick up the gap and manipulate it.

If the blank spot on your map conceals a land mine, then putting your weight down on that spot will be fatal, no matter how good your excuse for not knowing. Any black box could contain a trap, and there’s no way to know except opening up the black box and looking inside. If you come up with some righteous justification for why you need to rush on ahead with the best understanding you have—the trap goes off.

(The earlier part of the post had a couple embarrassing stories of mistakes Yudkowsky made earlier, which is where the lesson came from.) Reading that, I was like, “man that sure does sound like the Yudkowsky-esque viewpoint on prosaic alignment”.

Or maybe I’m overestimating how many of the people in the places I mentioned have read the sequences?

I think you are overestimating. At the orgs you list, I’d guess at least 25% and probably more than half have not read the sequences. (Low confidence/​wide error bars, though.)

• Thank you for the links Adam. To clarify, the kind of argument I’m really looking for is something like the following three (hypothetical) examples.

• Mesa-optimization is the primary threat model of unaligned AGI systems. Over the next few decades there will be a lot of companies building ML systems that create mesa-optimizers. I think that within 5 years of current progress we will understand how ML systems create mesa-optimizers and how to stop it. Therefore I think the current field is adequate for the problem (80%).

• When I look at the research we’re outputting, it seems to me that we are producing research at a speed and with a flexibility greater than any comparably sized academic department globally, or than the ML industry, and so I am much more hopeful that we’re able to solve our difficult problem before the industry builds an unaligned AGI. I give it a 25% probability, which I suspect is much higher than Eliezer’s.

• I basically agree the alignment problem is hard and unlikely to be solved, but I don’t think we have any alternative to the current sorts of work being done, which is a combo of (a) agent foundations work, (b) designing theoretical training algorithms (like Paul is), or (c) directly aligning narrowly superintelligent models. I am pretty open to Eliezer’s claim that we will fail, but I see no alternative plan to pour resources into.

Whatever you actually think about the field and how it will save the world, say it!

It seems to me that almost all of the arguments you’ve made work whether the field is a failure or not. The debate here has to pass through whether the field is on track or not, and we must not sidestep that conversation.

I want to leave this paragraph as social acknowledgment that you mentioned upthread that you’re tired and taking a break, and I want to give you a bunch of social space to not return to this thread for however long you need to take! Slow comments are often the best.

• Thanks for the examples, that helps a lot.

I’m glad that I posted my inflammatory comment, if only because exchanging with you and Rob made me actually consider the question of “what is our story to success”, instead of just “are we making progress/creating valuable knowledge”. And the way you two have been casting it is way less aversive to me than the way EY tends to frame it. This is definitely something I want to think more about. :)

I want to leave this paragraph as social acknowledgment that you mentioned upthread that you’re tired and taking a break, and I want to give you a bunch of social space to not return to this thread for however long you need to take! Slow comments are often the best.

Appreciated. ;)

• I’m sympathetic to most of your points.

highly veiled contempt for anyone not doing that

I have sympathy for the “this feels somewhat contemptuous” reading, but I want to push back a bit on the “EY contemptuously calling nearly everyone fakers” angle, because I think “[thinly] veiled contempt” is an uncharitable reading. He could be simply exasperated about the state of affairs, or wishing people would change their research directions but respect them as altruists for Trying At All, or who knows what? I’d rather not overwrite his intentions with our reactions (although it is mostly the job of the writer to ensure their writing communicates the right information [although part of the point of the website discussion was to speak frankly and bluntly]).

• If superintelligence is approximately multimodal GPT-17 plus reinforcement learning, then understanding how GPT-3-scale algorithms function is exceptionally important to understanding superintelligence.

Also, if superintelligence doesn’t happen then prosaic alignment is the only kind of alignment.

• Also, if superintelligence doesn’t happen then prosaic alignment is the only kind of alignment.

Why do you think this? On the original definition of prosaic alignment, I don’t see why this would be true.

(In case it clarifies anything: my understanding of Paul’s research program is that it’s all about trying to achieve prosaic alignment for superintelligence. ‘Prosaic’ was never meant to imply ‘dumb’, because Paul thinks current techniques will eventually scale to very high capability levels.)

• My thinking is that prosaic alignment can also apply to non-super intelligent systems. If multimodal GPT-17 + RL = superintelligence, then whatever techniques are involved with aligning that system would probably apply to multimodal GPT-3 + RL, despite not being superintelligence. Superintelligence is not a prerequisite for being alignable.

• This is already reflected in the upvotes, but just to say it explicitly: I think the replies to this comment from Rob and dxu in particular have been exceptionally charitable and productive; kudos to them. This seems like a very good case study in responding to a provocative framing with a concentration of positive discussion norms that leads to productive engagement.

• if EY and other MIRI people who are very dubious of most alignment research could give more feedback on that and enter the dialogue, maybe by commenting more on the AF. My problem is not so much with them disagreeing with most of the work, it’s with the disagreement stopping at “that’s not going to work” and there being no dialogue and back-and-forth.

Just in case anyone hasn’t already seen these, EY wrote Challenges to Christiano’s capability amplification proposal and this comment (that I already linked to in a different comment on this page) (also has a reply thread), both in 2018. Also The Rocket Alignment Problem.

• Context for anyone who’s not aware:

Nerd sniping is a slang term that describes a particularly interesting problem that is presented to a nerd, often a physicist, tech geek or mathematician. The nerd stops all activity to devote attention to solving the problem, often at his or her own peril.

• Strong upvote.

My original exposure to LW drove me away in large part because of the issues you describe. I would also add that (at least circa 2010) you needed to have a near-deistic belief in the anti-messianic emergence of some AGI so powerful that it can barely be described in terms of human notions of “intelligence.”

• I know we used to joke about this, but has anyone considered actually implementing the strategy of paying Terry Tao 10 million dollars to work on the problem for a year?

• Alternatively, has anyone considered… just asking him to?

That sounds naive. Maybe it is. But maybe it isn’t. Maybe smart people like Terry can be convinced of something like “Oh shit! This is actually crazy important and working on it would be the best way to achieve my terminal values.”

(Personally I’m working on the “get 10 million dollars” part. I’m not sure what the best thing would be to do after that, but paying Terry Tao doesn’t sound like a bad idea.)

Edit: Information about contacting him can be found here. If MIRI hasn’t already, it seems to me like it’d be a good idea to try reaching out. It also seems worth being at least a little bit strategic about it as opposed to, say, a cold email. More generally, I think this probably applies to, say, the top 100 mathematicians in the world, not just to Terry. (I hesitate to say this because of some EMH-like reasoning: if it made sense MIRI would have done it already, so I shouldn’t waste time saying this. But noticing and plucking all of the low hanging fruit is actually really hard, so despite my very high opinion of MIRI, I think it is at least plausible if not likely that there is a decent amount of low hanging fruit left to be plucked.)

• A reply to comments showing skepticism about how mathematical skills of someone like Tao could be relevant:

The last time I thought I understood anything of Tao’s blog was around ~2019. Back then he was working on curious stuff, like whether he could prove that there can be finite-time blow-up singularities in the Navier-Stokes fluid equations (incidentally, this would solve the famous Millennium Prize problem by showing a non-smooth solution) by constructing a fluid state that both obeys Navier-Stokes and is Turing-complete and … ugh, maybe I should quote the man himself:

[...] one would somehow have to make the incompressible fluid obeying the Navier–Stokes equations exhibit enough of an ability to perform computation that one could programme a self-replicating state of the fluid that behaves in a manner similar to that described above, namely a long period of near equilibrium, followed by an abrupt reorganization of the state into a rescaled version of itself. However, I do not know of any feasible way to implement (even in principle) the necessary computational building blocks, such as logic gates, in the Navier–Stokes equations.

However, it appears possible to implement such computational ability in partial differential equations other than the Navier–Stokes equations. I have shown [5] that the dynamics of a particle in a potential well can exhibit the behaviour of a universal Turing machine if the potential function is chosen appropriately. Moving closer to the Navier–Stokes equations, the dynamics of the Euler equations for inviscid incompressible fluids on a Riemannian manifold have also recently been shown [6,7] to exhibit some signs of universality, although so far this has not been sufficient to actually create solutions that blow up in finite time.

(Tao, Nature Reviews Physics, 2019.)

The relation (if any) to proving stuff about the computational agents that alignment people are interested in is probably spurious (I myself follow neither Tao’s work nor the alignment literature), but I am curious whether he’d be interested in working on a formal system of self-replicating/self-improving/aligning computational agents, and whether he’d (then) be capable of finding something genuinely interesting.


• Please keep the unilateralist’s curse in mind when considering plans like this. https://nickbostrom.com/papers/unilateralist.pdf

There’s a finite resource that gets used up when someone contacts Person in High Demand, which is roughly, that person’s openness to thinking about whether your problem is interesting.

• The following is probably moot because I think it’s best for AI research organizations (hopefully ones with some prestige) to be the ones who pursue this, but in skimming through the paper, I don’t get the sense that it is applicable here.

From the abstract (emphasis mine):

In some situations a number of agents each have the ability to undertake an initiative that would have significant effects on the others. Suppose that each of these agents is purely motivated by an altruistic concern for the common good. We show that if each agent acts on her own personal judgment as to whether the initiative should be undertaken, then the initiative will be undertaken more often than is optimal.

Toy example from the introduction:

A sports team is planning a surprise birthday party for its coach. One of the players decides that it would be more fun to tell the coach in advance about the planned event. Although the other players think it would be better to keep it a surprise, the unilateralist lets word slip about the preparations underway.
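The paper’s quantitative claim (that when each well-intentioned agent acts on their own noisy judgment, the initiative gets undertaken more often than is optimal) can be illustrated with a toy Monte Carlo sketch. This is my own illustration, not code from the paper; the function name and parameter values are made up:

```python
import random

def unilateralist_sim(n_agents, true_value, noise, trials=10_000, seed=0):
    """Toy model of the unilateralist's curse: each agent receives an
    unbiased but noisy estimate of the initiative's true value, and the
    initiative is undertaken if ANY single agent's estimate is positive."""
    rng = random.Random(seed)
    undertaken = 0
    for _ in range(trials):
        estimates = [true_value + rng.gauss(0, noise) for _ in range(n_agents)]
        if any(e > 0 for e in estimates):  # one unilateralist suffices
            undertaken += 1
    return undertaken / trials

# A mildly harmful initiative (true value -1) judged with noise of std 1:
solo = unilateralist_sim(1, true_value=-1.0, noise=1.0)
group = unilateralist_sim(10, true_value=-1.0, noise=1.0)
assert group > solo  # more independent deciders => acted on more often
```

With these (made-up) numbers, a lone agent wrongly greenlights the harmful initiative roughly 16% of the time, while a group of ten independent deciders does so over 80% of the time, even though every agent is unbiased and altruistic.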

With Terry, I sense that it isn’t a situation where the action of one would have a significant effect on others (well, on Terry). For example, suppose Alice, a reader of LessWrong, saw my comment and emailed Terry. The most likely outcome here, I think, is that it just gets filtered out by some secretary and never reaches Terry. But even if it did reach Terry, my model of him/people like him is that, if he is in fact unconvinced of the importance of AI safety, it would only be a mild annoyance and he’d probably forget it ever happened.

On the other hand, my model is also that if dozens and dozens of these emails reach him to the point where it starts to be an inconvenience to deal with them, at that point I think it would make him more notably annoyed, and I expect that this would make him less willing to join the cause. However, I expect that it would move him from thinking like a scout/​weak sports fan to thinking like a weak/​moderate sports fan. In other words, I expect the annoyance to make him a little bit biased, but still open to the idea and still maintaining solid epistemics. That’s just my model though.

• I think the model clearly applies, though almost certainly the effect is less strictly binary than in the surprise party example.

I expect the annoyance to make him a little bit biased, but still open to the idea and still maintaining solid epistemics.

This is roughly a crux for me, yeah. I think dozens of people emailing him would cause him to (fairly reasonably, actually!) infer that something weird is going on (e.g., people are in a crazy echo chamber) and that he’s being targeted for unwanted attention (which he would be!). And it seems important, in a unilateralist’s curse way, that this effect is probably unrelated to the overall size of the group of people who have these beliefs about AI. Like, if you multiply the number of AI-riskers by 10, you also multiply by 10 the number of people who, by some context-unaware individual judgement, think they should cold-email Tao. Some of these people will be correct that they should do something like that, but it seems likely that many of such people will be incorrect.

• Yeah, random internet forum users emailing an eminent mathematician en masse would be strange enough to be non-productive. For what it’s worth, I wasn’t thinking anyone would, and I don’t think that’s what the OP suggested. To anyone contemplating sending one: the task is best delegated to someone who not only can write a coherent research proposal that sounds relevant to the person approached, but can write the best one.

Mathematicians receive occasional crank emails about solutions to P ?= NP, so anyone doing the reaching needs to be reputable to get past their crank filters.

• I think the people cold emailing Terry in this scenario should at least make sure they have the $10M ready!

• fwiw, I don’t think someone’s openness to thinking about an idea necessarily goes down as more people contact them about it. I’d expect it to go up. Although this might not necessarily be true for our target group.

• If MIRI hasn’t already, it seems to me like it’d be a good idea to try reaching out. It also seems worth being at least a little bit strategic about it as opposed to, say, a cold email.

+1 especially to this—surely MIRI or a similar x-risk org could attain a warm introduction with potential top researchers through their network, from someone who is willing to vouch for them.

• This seems noncrazy on reflection. 10 million dollars will probably have very small impact on Terry Tao’s decision to work on the problem. OTOH, setting up an open invitation for all world-class mathematicians/physicists/theoretical computer scientists to work on AGI safety through some sort of sabbatical system may be very impactful. Many academics, especially in theoretical areas where funding for even the very best can be scarce, would jump at the opportunity of a no-strings-attached sabbatical. The no-strings-attached is crucial to my mind. Despite LW/Rationalist dogma equating IQ with weirdo-points, the vast majority of brilliant (mathematical) minds are fairly conventional—see Tao, Euler, Gauss. EA cause area?

• 10 million dollars will probably have very small impact on Terry Tao’s decision to work on the problem.

That might be true for him specifically, but I’m sure there are plenty of world-class researchers who would find $10 million (or even $1 million) highly motivating.

• I’m probably too dumb to have an opinion on this matter, but the belief that all super-genius mathematicians care zero about being fabulously wealthy strikes me as unlikely.
• Read it again, I think you guys agree.

I’m sure there are plenty of world-class researchers who would find $10 million (or even $1 million) highly motivating.

simplifies to “I’m sure [...] researchers [...] would find [money] highly motivating”

• Ha, I know. I was weighing in, in support, against this claim he was replying to:

10 million dollars will probably have very small impact on Terry Tao’s decision to work on the problem.

• But what’s bottlenecking alignment isn’t mathematical cognition. The people contributing interesting ideas to AI alignment, of the sort that Eliezer finds valuable, tend to have a history of deep curiosity about philosophy and big-picture thinking. They have made interesting comments on a number of fields (even if from the status of a layperson). To make progress in AI alignment you need to be good at the skill “apply existing knowledge to form mental models that let you predict in new domains.” By contrast, mathematical cognition is about exploring an already known domain. Maybe forecasting, especially mid-range political forecasting during times of change, comes closer to measuring the skill. (If Terence Tao happens to have a forecasting hobby, I’d become more excited about the proposal.)

It’s possible that a super-smart mathematician also excels at coming up with alignment solutions (the likelihood is probably a lot higher than for the typical person), but the fact that they spent their career focused on math, as opposed to a stronger “polymath profile,” makes me think “probably wouldn’t be close to the very top of the distribution for that particular skill.”

Quote by Eliezer:

Similarly, the sort of person who was like “But how do you know superintelligences will be able to build nanotech?” in 2008, will probably not be persuaded by the demonstration of AlphaFold 2, because it was already clear to anyone sensible in 2008, and so anyone who can’t see sensible points in 2008 probably also can’t see them after they become even clearer.
There are some people on the margins of sensibility who fall through and change state, but mostly people are not on the exact margins of sanity like that.

I also share the impression that a lot of otherwise smart people fall into this category. If Eliezer is generally right, a big part of the problem is “too many people are too bad at thinking to see it.” When forming opinions based on others’ views, many don’t filter experts by their thinking style (not: “this person seems unusually likely to have the sort of cognition that lets them make accurate predictions in novel domains”), but rather look for credentials and/or existing status within the larger epistemic community.

Costly actions are unlikely without a somewhat broad epistemic consensus. The more we think costly actions are going to be needed, the more important it seems to establish a broad(er) consensus on whose reasoning can be trusted most.

• Tao is also great at building mathematical models of messy phenomena—here’s an article where he does a beautiful analysis of sailing: https://terrytao.wordpress.com/2009/03/23/sailing-into-the-wind-or-faster-than-the-wind

I’d be surprised if he didn’t have some good insights about AI and alignment after thinking about it for a while.

• I disagree. Predicting who will make the most progress on AI safety is hard. But the research is very close to existing mathematical/theoretical CS/theoretical physics/AI research. Getting the greatest mathematical minds on the planet to work on this problem seems like an obvious high-EV bet.

I might also add that Eliezer Yudkowsky, despite his many other contributions, has made only minor direct contributions to technical AI alignment research. [His indirect contribution by highlighting & popularising the work of others is high EV impact.]

• I might also add that Eliezer Yudkowsky, despite his many other contributions, has made only minor direct contributions to technical AI Alignment research.
[His indirect contribution by highlighting & popularising the work of others is high EV impact]

I don’t think this is true at all. Like, even prosaic alignment researchers care about things like corrigibility, which is an Eliezer-idea.

• That doesn’t update me, but to prevent misunderstandings let me clarify that I’m not saying it’s a bad idea to offer lots of money to great mathematicians (presumably with some kind of test-of-fit trial project). It might still be worth it given that we’re talent-bottlenecked and the skill does correlate with mathematical ability. I’m just saying that, to me, people seem to overestimate the correlation, that the biggest problem is elsewhere, and that the fact that people don’t seem to realize where the biggest problem lies is itself a part of the problem.

(Also, you can’t easily exchange money for talent, because to evaluate the output of someone’s test-of-fit trial period you need competent researcher time. You also need competent researcher time to give someone new to alignment research a fair shot at succeeding with the trial, by advising and mentoring them. So everything is costly, and the ideas you want to pursue have to be above a certain bar.)

• I’m open to having a double-crux high-bandwidth talk about this. Would you be up for that?

***************************

I think

1. You are underestimating how much Very Smart Conventional People in Academia are Generically Smart, and how much they know about philosophy/big picture/many different topics.

2. Overestimating how novel some of the insights due to prominent people in the rationality community are, and how correlated believing and acting on Weirdo Beliefs is with the ability to find novel solutions to (technical) problems—i.e. the WeirdoPoints=g-factor belief prevalent in Rationalist circles.

3. Underestimating how much better a world-class mathematician is than the average researcher, i.e. there is the proverbial 10x programmer.
Depending on how one measures this, some of the top people might easily be >1000x.

4. “By contrast, mathematical cognition is about exploring an already known domain. Maybe forecasting, especially mid-range political forecasting during times of change, comes closer to measuring the skill.” This jumps out at me. The most famous mathematicians are famous precisely because they came up with novel domains of thought. Although good forecasting is an important skill and an obvious sign of intelligence & competence, it is not necessarily a sign of a highly creative researcher. Much of forecasting is about aggregating data and expert opinion; being “too creative” may even be a detriment. Similarly, many of the famous mathematical minds of the past century had rather naive political views; this is almost completely uncorrelated, even anti-correlated, with their ability to come up with novel solutions to technical problems.

5. “Test-of-fit trial project” also jumps out at me. Nobody has successfully aligned a general artificial intelligence. The field of AGI safety is in its infancy, and many people disagree on the right approach. It is absolutely laughable to me that, in the scenario where after much work we get Terry Tao on board, some group of AI safety researchers (who?) would decide he’s not “a good fit for the team”, or even that the research time of existing AGI safety researchers is so valuable that they couldn’t find the time to evaluate his output.

• Sounds good!

1. This doesn’t seem like a crux to me the way you worded it. The way to phrase this so I end up disagreeing: “Very Smart Conventional People in Academia have surprisingly accurate takes (compared to what’s common in the rationality community) on philosophy / big picture / many different topics.” In my view, the rationality community specifically selects for strong interest in that sort of thing, so it’s unsurprising that even very smart, successful people outside of it do worse on average.
My model is that strong interest in getting philosophy and big-picture questions right is a key ingredient to being good at getting them right. Similar to how strong interest in mathematical inquiry is probably required for winning the Fields Medal – you can’t just do it on the side while spending your time obsessing over other things.

2. We might have some disagreements here, but this doesn’t feel central to my argument, i.e., not like a crux. I’d say “insights” are less important than “the ability to properly evaluate what constitutes an insight (early on) or have novel ones yourself.”

3. I agree with you here. My position is that there’s a certain skillset that I (ideally) want alignment researchers to rank really highly on (we take what we get, of course, but just as there are vast differences in mathematical abilities, the differences on the skillset I have in mind would also go up to 1,000x).

4. Those are great points. I’m changing my stated position to the following:

• Mathematical genius (esp. coming up with new kinds of math) may be quite highly correlated with being a great alignment researcher, but it’s somewhat unclear, and in any case it’s unlikely that people can tap into that potential if they’ve spent an entire career focusing primarily on pure math. (I’m not saying it’s impossible.)

• In particular, I notice that people past a certain age are a lot less likely to change their beliefs than younger people. (I didn’t know Tao’s age before checking the Wikipedia article just now. The age point feels like a real crux because I’d rank the proposal quite differently depending on the age I see there.)

5. This feels like a crux. Maybe it reduces to the claim that there’s an identifiable skillset important for alignment breakthroughs (especially at the “pre-paradigmatic” or “disentanglement research” stage) that doesn’t just come with genius-level mathematical abilities.
Just as English professors could tell whether or not Terence Tao (or Elon Musk) has writing talent, I’d say alignment researchers can tell after a trial period whether or not someone’s early thoughts on alignment research have potential. There’s nothing laughable about that, and nothing outrageous about English professors coming to a negative evaluation of someone like Musk or Tao, despite being wildly outclassed by them wrt mathematical ability or the ability to found and run several companies at once.

---

I know you haven’t mentioned Musk, but I feel like people get this one wrong for reasons that might be related to our discussion. I’ve seen EAs make statements like “If Musk tried to deliberately optimize for aligning AI, we’d be so much closer to success.” I find that cringeworthy, because being good at making a trillion dollars is not the same as being good at steering the world through the (arguably) narrow pinhole, in the space of possible AI-related outcomes, where things go well. A lot of the ways of making outsized amounts of money involve all kinds of pivots, or selling out your values to follow the gradients from superficial incentives that make things worse for everyone in the long run. That’s the primary thing you want to avoid when you want to accomplish some ambitious “far mode” objective (as opposed to easily measurable objectives like shareholder profits). In short, I think good “conventional” CEOs often have good judgment, yes, but also a lot of drive to get people to push ahead, and the latter may be more important to their success than judgment about which exact strategy they start out with. A lot of the ways of making money have good, readily available feedback cycles.
If you want to tackle a goal like “align AI on the first try” or “solve a complicated geopolitical problem without making it worse,” you need to be able to balance drive (“being good at pushing your allies to do things”) with “making sure you do things right” – and that’s not something where I expect conventionally successful CEOs to have undergone super-strong selection pressure.

• To bypass the argument over whether pure maths talent is what is needed, we should generalise “Terry Tao / the world’s best mathematicians” to “anyone a panel of top people in AGI Safety would have on their dream team (who otherwise would be unlikely to work on the problem)”.

• Re Musk, his main goal is making a Mars colony (SpaceX), with the lesser goals of reducing climate change (Tesla, SolarCity) and aligning AI (OpenAI, FLI). Making a trillion dollars seems to be more of a side effect of using engineering and capitalism as the methodology. Many of his top-level goals also involve “making sure you do things right” (i.e. making sure the first SpaceX astronauts don’t die). OpenAI was arguably a misstep, though.

• Did Musk fund research for people to figure out whether the best way to eventually establish a Mars colony is by working on space technology, as opposed to preventing AI risk / getting AI to colonize Mars for you? My prediction is “no,” which illustrates my point. Basically all CEOs of public-facing companies like to tell inspiring stories about world-improvement aims, but certainly not all of them prioritize these aims in a dominant sense in their day-to-day thinking. So, observing that people have stated altruistic aims shouldn’t give us all that much information about what actually drives their cognition, i.e., about what aims they can de facto be said to be optimizing for (consciously or subconsciously).
Importantly, I think that even if we knew for sure that someone’s stated intentions are “genuine” (which I don’t have any particular reason to doubt in Musk’s case), that still leaves the arguably more important question of “How good is this person at overcoming the ‘Elephant in the Brain’?” I think we’re unlikely to get good outcomes unless we place careful emphasis on leadership’s ability to avoid mistakes that might kill the intended long-term impact without looking bad from an “appearance of being successful” standpoint.

• Strongly agree. Pure math is its own ballgame; even theoretical computer science will seem too applied to some folks deep into pure math. Let alone a weird mixture of computer science practice (not theory) and philosophy—which is probably what AI safety is.

• People disagree about the degree to which formal methods will be effective / quick enough to arrive. I’d like to point out that Paul Christiano, one of the most well-known proponents of non-formal thinking and a focus on existing ML methods, still has a very strong traditional math/CS background (i.e. Putnam Fellow, a series of very solid math/CS papers). His research methods and thinking are also very close to how theoretical physicists might approach problems. Even a nontraditional thinker like EY did very well on math contests in his youth.

• Thanks for this, and I agree with you too. Hence I mentioned “some”.

• Aside from the fact that I just find this idea extremely hilarious, it seems like a very good idea to me to try to convince people who might be able to make progress on the problem to try. Whether literally sending Terry Tao 10 million dollars is the best way to go about that seems dubious, but the general strategy seems important. I’d argue the Sequences / HPMOR / whatever were versions of that strategy to some extent, and seem to have had notable impact.

• Ha, the same point came up on the EA Forum! (What is the origin of the idea?)
I think we probably want to go about it in a way that maximises credibility—i.e. having it come from a respected academic institution, even if the money is from elsewhere (CHAI, FHI, CSER, FLI, BERI, SERI could help with this). And also have it open to all Fields Medalists / all Nobel Prize winners in Physics / the equivalents in Computer Science, or Philosophy(?), or Economics(?) / anyone a panel of top people in AGI Safety would have on their dream team (who otherwise would be unlikely to work on the problem).

• The idea has been joked about for a while. I think it is probably worth trying, both literally offering Tao 10 million and the generalized case of finding the highest-g people in the world and offering them salaries that seem truly outrageous. Here and on the EA Forum, many claim genius people would not care about 10 million dollars. I think this is, to put it generously, not at all obvious—and certainly something we should establish empirically. Though Eliezer is a genius, I do not think he is literally the smartest person on the planet. To the extent that we can identify the smartest people on the planet, we would be a really pathetic civilization if we were not willing to offer them NBA-level salaries to work on alignment.

• I was halfway through a PhD on software testing and verification before joining Anthropic (opinions my own, etc), and I’m less convinced than Eliezer about theorem-proving for AGI safety. There are so many independently fatal objections that I don’t know how to structure this or convince someone who thinks it would work. I am therefore offering a $1,000 prize for solving a far easier problem:

Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.

Your proof must not include executing the model, nor equivalent computations (e.g. concolic execution). You may train a custom model and/​or directly set model weights, so long as it uses a standard EfficientNet architecture and reaches 99% accuracy. Bonus points for proving more of the sensitivity curve.

I will also take bets that nobody will accomplish this by 2025, nor any loosely equivalent proof for a GPT-3 class model by 2040. This is a very bold claim, but I believe that rigorously proving even trivial global bounds on the behaviour of large learned models will remain infeasible.

And doing this wouldn’t actually help, because a theorem is about the inside of your proof system. Recognising the people in a huge collection of atoms is outside your proof system. Analogue attacks like Rowhammer are not in (the ISA model used by) your proof system—and cache and timing attacks like Spectre probably aren’t yet either. Your proof system isn’t large enough to model the massive floating-point computation inside GPT-2, let alone GPT-3, and even if it could, the bounds would be far too loose to be useful.

I still hope that providing automatic disproof-by-counterexample might, in the long term, nudge ML towards specifiability by making it easy to write and check falsifiable properties of ML systems. On the other hand, hoping that ML switches to a safer paradigm is not the kind of safety story I’d be comfortable relying on.

• You’re attacking a strawman of what kind of theorems we want to prove. Obviously we are not going to prove theorems that contain specific datasets as part of the statement. What we’re going to do is build a theory founded on certain assumptions about the real-world (such as locality /​ decoupling of dynamics on different scales /​ some kind of chaos /​ certain bounds on computational complexity /​ existence of simple fundamental laws etc) and humans (e.g. that they are approximately rational agents, for some definition thereof). Such a theory can produce many insights about what factors influence e.g. the possibility of adversarial attacks that you mention, most of which will be qualitative and some of which can be made quantitative by combining with empirical research (such as the research OpenAI does on scaling laws).

And, ofc the theory is only as good as its assumptions. Ofc if there are attack vectors your model doesn’t account for, your system can be attacked. Having a theory is not a magical wand that immediately solves everything. But, it does put you in a much, much better position than working off pure guesswork and intuition.

Another angle is that, once we can at least state the theorem we might try to make the AI itself prove it. This can still fail: maybe the theorem is practically unprovable, or maybe we can’t safely train the AI to prove theorems. But it does give us some additional leverage.

• First, an apology: I didn’t mean this to be read as an attack or a strawman, nor applicable to any use of theorem-proving, and I’m sorry I wasn’t clearer. I agree that formal specification is a valuable tool and research direction, a substantial advancement over informal arguments, and only as good as the assumptions. I also think that hybrid formal/​empirical analysis could be very valuable.

Trying to state a crux, I believe that any plan which involves proving corrigibility properties about MuZero (etc) is doomed, and that safety proofs about simpler approximations cannot provide reliable guarantees about the behaviour of large models with complex emergent behaviour. This is in large part because formalising realistic assumptions (e.g. biased humans) is very difficult, and somewhat because proving anything about very large models is wildly beyond the state of the art and even verified systems have (fewer) bugs.

Being able to state theorems about AGI seems absolutely necessary for success; but I don’t think it’s close to sufficient.

• I think we might have some disagreement about degree more than about kind. I think that we are probably going to design architectures that make proving easier rather than proving things about architectures optimized only for capability, but not necessarily. Moreover, some qualitative properties are not sensitive to architecture and we can prove them about classes of architectures that include those optimized for capability. And, I think humans also belong to a useful class of agents with simple description (e.g. along the lines I described here) and you don’t need anything like a detailed model of bias. And, people do manage to prove some things about large models, e.g. this, just not enough things. And, some of the proofs might be produced by the system itself in runtime (e.g. the system will have a trustworthy/​rigorous part and an untrustworthy/​heuristic part and the rigorous part will make sure the heuristic part is proving the safety of its proposals before they are implemented).

I think the pipeline of success looks something like theoretical models ⇒ phenomenological models (i.e. models informed by a combination of theory and experiment) ⇒ security-mindset engineering (i.e. engineering that keeps track of the differences between models and reality and makes sure they are suitably bounded /​ irrelevant) ⇒ plethora of security-mindset testing methods, including but not limited to formal verification (i.e. aiming for fool-proof test coverage while also making sure that, modulo previous tests, each test involves using the system in safe ways even if it has bugs). And ofc it’s not a pure waterfall, there is feedback from each stage to previous stages.

• That might actually be easy to prove with some effort (or it might not); consider this strategy:

Let’s assume that the input to the system is PNG images with 8-bit values between 0 and 255, that these are converted into floating-point tensors before entering the net, and that the bits you are talking about are those of the original images. Let’s also assume that the upscaling from the small MNIST images to the input of the net is such that each float of the tensor corresponds to exactly one value in the original image (that is, there is no interpolation), and that we are using the exact architecture of table 1 of the EfficientNet paper (https://arxiv.org/pdf/1905.11946v5.pdf).

Then we might be able to train the network freezing the weights of the first Conv3x3 to the identity (1 in the centre of the kernel and 0s around it) so that the 32 channels of the next layer receive 2x-downscaled copies of the image. We also remove the first batch norm layer; it will be added back later. Losing the ability to perform useful computations in this layer is probably not enough to push it below 99% accuracy, given that MNIST is so simple.

If this works, then we need to edit the weights of the first Conv layer and of the batch norm layer (if they can be called weights) to (assuming float32):

Central value of kernel: 2**-149

Division value in batch norm: 2**-49

Multiplication value of the batch norm: 2**100

All other values to the neutral element of their operations

This will turn all values of 127 (in the PNG) and below into exactly 0, and values above into exactly 1 because of numerical underflow (I checked this with numpy). The network will process this from the swish activation onwards exactly the same way it would process the binarized versions of the images, and in all likelihood give them the same class.

127 in base 2 is 01111111, and 128 is 10000000. You can’t go from a perceived black pixel to a perceived white pixel by only changing the last bit, so the maximum difference in class probability that you can obtain by only changing the least significant bit is zero.
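The rounding behaviour this construction relies on can be sanity-checked directly in NumPy. The sketch below is mine, not the commenter’s code; it assumes pixels are normalized to [0, 1] (i.e. x/255, a standard preprocessing choice that the comment doesn’t state explicitly, but which the given constants appear to require) and applies just the scalar pipeline — conv multiply, batch-norm divide, batch-norm multiply — to every possible pixel value:

```python
import numpy as np

# Hypothetical check of the float32-underflow "binarization" trick described
# above. Assumption: pixel value x enters the net as x/255 in float32.
kernel = np.float32(2.0 ** -149)   # central value of the frozen conv kernel
bn_div = np.float32(2.0 ** -49)    # division value in the batch norm
bn_mul = np.float32(2.0 ** 100)    # multiplication value of the batch norm

pixels = np.arange(256, dtype=np.float32) / np.float32(255.0)
out = pixels * kernel / bn_div * bn_mul

# For x <= 127, x/255 < 0.5, so (x/255) * 2**-149 lies below 2**-150 and
# rounds to exactly 0.0 (it underflows past the smallest subnormal).
# For x >= 128, the product rounds to the smallest subnormal 2**-149, which
# the two batch-norm constants then rescale to exactly 1.0.
```

On a platform with standard IEEE-754 float32 (no flush-to-zero), `out` is exactly 0.0 for pixel values 0–127 and exactly 1.0 for 128–255, matching the claim that only the 127/128 boundary matters.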

• I note that this seems pretty strongly disanalogous to any kind of safety proof we care about, and doesn’t generalise to any reasoning about nonzero sensitivity. That said your assumptions seem fair to me, and I’d be happy to pay out the prize if this actually works.

• Yes, I’m aware of that; I tried to find a better proof but failed. Attempts based on trying to compute the maximum possible change (instead of figuring out how to produce a desired change) are doomed. Changing the last bit isn’t an infinitesimal change, so using calculus to compute the maximum derivative won’t work. EfficientNets use swish activations rather than ReLUs, and swish isn’t locally linear, so we have to deal with the chaotic dynamics that show up whenever non-linear functions are iteratively applied. The sigmoid inside the swish does eventually saturate because of floating-point arithmetic, making it effectively locally linear, but then we have to deal with the details of the exact binary representations.

There might be another solution, though: the softmax is the n-input generalization of the sigmoid, so it can also saturate to exact 0s and 1s. We could try to overfit a network using a very high learning rate so that for one randomly generated but fixed image it predicts some class with 100% certainty and given the same image with some pixel changed it predicts another class with also 100% certainty. Then, if this works, we could try training it on MNIST, adding the two original images to every mini-batch with a loss multiplier much greater than for the regular samples. That way the answer to your question becomes 1.

If nobody comes up with anything better, when I get some free time I will try to implement the binarization approach and then I will send you the code.

• Sure, just remember that an experimental demonstration isn’t enough—“Your proof must not include executing the model, nor equivalent computations”.

• Am I correct that you wouldn’t find a bound acceptable, you specifically want the exact maximum?

• Take an EfficientNet model with >= 99% accuracy on MNIST digit classification. What is the largest possible change in the probability assigned to some class between two images, which differ only in the least significant bit of a single pixel? Prove your answer before 2023.

You aren’t counting the fact that you can pretty easily bound this based on the fact that image models are Lipschitz, right? Like, you can just ignore the ReLUs and you’ll get an upper bound by looking at the weight matrices. And I believe there are techniques that let you get tighter bounds than this.
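The product-of-norms argument being gestured at here can be sketched in a few lines. This is a toy illustration of mine (random matrices standing in for a trained model, and ordinary float arithmetic rather than a careful float32 analysis): since ReLU is 1-Lipschitz, the product of the weight matrices’ spectral norms upper-bounds the network’s Lipschitz constant, and hence how far any bounded input change can move the logits.

```python
import numpy as np

# Toy sketch: a naive Lipschitz upper bound for a 784 -> 32 -> 10 ReLU MLP.
# ReLU is 1-Lipschitz, so the Lipschitz constant of the whole network is at
# most the product of the layers' spectral norms (largest singular values).
rng = np.random.default_rng(0)
layers = [rng.standard_normal((32, 784)), rng.standard_normal((10, 32))]

lipschitz_bound = 1.0
for W in layers:
    lipschitz_bound *= np.linalg.norm(W, ord=2)  # spectral norm of the layer

# A one-bit pixel flip perturbs the input by some eps in l2 norm, so the
# logits move by at most lipschitz_bound * eps. Bounds built this way are
# sound but typically far too loose to say anything interesting.
```

(Turning a logit bound into a probability bound needs an extra step for the softmax, and a rigorous version would have to account for float32 rounding; this only shows the shape of the argument.)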

• If you can produce a checkable proof of this over the actual EfficientNet architecture, I’d pay out the prize. Note that this uses floating-point numbers, not the reals!

• Would running the method in this paper on EfficientNet count?

What if we instead used a weaker but still sound method (e.g. based on linear programs instead of semidefinite programs)?

• On a quick skim, it looks like that fails both the “not equivalent to executing the model” requirement and the float32-vs-reals problem.

It’s a nice approach, but I’d also be surprised if it scales to maintain tight bounds on much larger networks.

• By “checkable” do you mean “machine checkable”?

I’m confused because I understand you to be asking for a bound on the derivative of an EfficientNet model, but it seems quite easy (though perhaps kind of a hassle) to get this bound.

I don’t think the floating point numbers matter very much (especially if you’re ok with the bound being computed a bit more loosely).

• Ah, crux: I do think the floating-point matters! Issues of precision, underflow, overflow, and NaNs bedevil model training and occasionally deployment-time behavior. By analogy, if we deploy an AGI whose ideal mathematical form is aligned, we may still be doomed, even if it’s plausibly our best option in expectation.

Checkable meaning that I, or someone I trust with this, have to be able to check it! Maxwell’s proposal is simple enough that I can reason through the whole thing, even over float32 rather than the reals, but for more complex arguments I’d probably want it machine-checkable for at least the tricky numeric parts.

• I’d award half the prize for a non-trivial bound.

• OK, so it sounds like Eliezer is saying that all of the following are very probable:

1. ML, mostly as presently practiced, can produce powerful, dangerous AGI much sooner than any other approach.

• The number of technical innovations needed is limited, and they’re mostly relatively easy to think of.

• Once those innovations get to a sort of base-level AGI, it can be scaled up to catastrophic levels by throwing computing power at it.

• ML-based AGI isn’t the “best” approach, and it may or may not be able to FOOM, but it will still have the ability and motivation to kill everybody or worse.

2. There’s no known way to make that kind of AGI behave well, and no promising approaches.

• Lack of interpretability is a major cause of this.

• There is a very low probability of anybody solving this problem before badly-behaved AGI has been created and has taken over.

3. Nonetheless, the main hope is still to try to build ML-based AGI which--

• Behaves well

• Is capable of either preventing other ML-based AGI from being created, or preventing it from behaving badly. Or at least of helping you to do those things.

I would think this would require a really major lead. Even finding other projects could be hard.

4. (3) has to be done while keeping the methods secret, because otherwise somebody will copy the intelligence part, or copy the whole thing before its behavior is under control, maybe add some minor tweaks of their own, and crank the scale knob to kill everybody.

• Corollary: this almost-impossible system has to be built by a small group, or by a set of small “cells” with very limited communication. Years-long secrets are hard.

• That group or cell system will have to compete with much larger, less constrained open efforts that are solving an easier problem. A vastly easier problem under assumptions (1) and (2).

• A bunch of resources will necessarily get drained away from the main technical goal, toward managing that arms race.

5. (3) is almost impossible anyway and is made even harder by (4). Therefore we are almost certainly screwed.

Well. Um.

If that’s really the situation, then clinging to (3), at least as a primary approach, seems like a very bad strategy. “Dying gracefully” is not satisfying. It seems to me that--

1. People who don’t want to die on the main line should be doing, or planning, something other than trying to win an AGI race… like say flipping tables and trying to foment nuclear war or something.

That’s still not likely to work, and it might still only be a delaying tactic, but it still seems like a better, more achievable option than trying to control ultra-smart ML when you have no idea at all how to do that. If you can’t win, change the game.

and/​or

2. People who feel forced to accept dying on the main line ought to be putting their resources into approaches that will work on other timelines, like say if ML turns out not to be able to get powerful enough to cause true doom.

If ML is really fated to win the race so soon and in such a total way, then people almost certainly can’t change the main line by rushing to bolt on a safety module. They might, however, be able to significantly change less probable timelines by doing something else involving other methods of getting to AGI. And the overall probability mass they can add to survival that way seems a lot larger.

The main line is already lost, and it’s time to try to salvage what you can.

Personally, I don’t think I see that you can turn an ML system that has say 50-to-250 percent of a human’s intelligence into an existential threat just by pushing the “turbo” button on the hardware. Which means that I’m kind of hoping nobody goes the “nuclear war” route in real life.

I suspect that anything that gets to that point using ML will already be using a significant fraction of the compute available in the world. Being some small multiple of as good as a human isn’t going to let you build more or better hardware all that fast, especially without getting yourself shut down. And I don’t think you can invent working nanotech without spending a bunch of time in the lab unless you are already basically a god. Doing that in your head is definitely a post-FOOM project.

But I am far from an expert. If I’m wrong, which I very well could be, then it seems crazy to be trying to save the “main line” by trying to do something you don’t even have an approach for, when all that has to happen for you to fail is for somebody else to push that “turbo” button. That feels like following some script for heroically refusing to concede anything, instead of actually trying to grab all the probability you can realistically get.

• People who don’t want to die on the main line should be doing, or planning, something other than trying to win an AGI race… like say flipping tables and trying to foment nuclear war or something.

How does fomenting nuclear war change anything? The basic logic for ‘let’s also question our assumptions and think about whether there’s some alternative option’ is sound (and I mostly like your decomposition of Eliezer’s view), but you do need to have the alternative plan actually end up solving the problem.

Specific proposals and counter-proposals (that chain all the way to ‘awesome intergalactic future’) are likely the best way to unearth cruxes and figure out what makes sense to do here. Just saying ‘let’s consider third options’ or ‘let’s flip the tables somehow’ won’t dissuade Eliezer because it’s not a specific promising-looking plan (and he thinks he’s already ruled out enough plans like this to make them even doomier than AGI-alignment-mediated plans).

Eliezer is saying ‘there isn’t an obvious path forward, so we should figure out how to best take advantage of future scenarios where there are positive model violations (“miracles”)‘; he’s not saying ‘we’re definitely doomed, let’s give up’. If you agree but think that something else gives us better/​likelier miracles than trying to align AGI, then that’s a good place to focus discussion.

One reason I think Eliezer tends to be unpersuaded by alternatives to alignment is that they tend to delay the problem without solving it. Another reason is that Eliezer thinks AGI and alignment are to a large extent unknown quantities, which gives us more reason to expect positive model violations; e.g., “maybe if X happened the world’s strongest governments would suddenly set aside their differences and join in harmony to try to handle this issue in a reasonable way” also depends on positive violations of Eliezer’s model, but they’re violations of generalizations that have enormously more supporting data.

We don’t know much about how early AGI systems tend to work, or how alignment tends to work; but we know an awful lot about how human governments (and human groups, and human minds) tend to work.

• Nuclear war was just an off-the-top example meant to illustrate how far you might want to go. And I did admit that it would probably basically be a delaying tactic.

If I thought ML was as likely to “go X-risk” as Eliezer seems to, then I personally would want to go for the “grab probability on timelines other than what you think of as the main one” approach, not the “nuclear war” approach. And obviously I wouldn’t treat nuclear war as the first option for flipping tables… but just as obviously I can’t come up with a better way to flip tables off the top of my head.

If you did the nuclear war right, you might get hundreds or thousands of years of delay, with about the same probability [edit: meant to say “higher probability” but still indicate that it was low in absolute terms] that I (and I think Eliezer) give to your being able to control[1] ML-based AGI. That’s not nothing. But the real point is that if you don’t think there’s a way to “flip tables”, then you’re better off just conceding the “main line” and trying to save other possibilities, even if they’re much less probable.

1. ↩︎

I don’t like the word “alignment”. It admits too many dangerous associations and interpretations. It doesn’t require them, but I think it carries a risk of distorting one’s thoughts.

• I think there are some ways of flipping tables that offer some hope (albeit a longshot) of actually getting us into a better position to solve the problem, rather than just delaying the issue. Basically, strategies for suppressing or controlling Earth’s supply of compute, while pressing for differential tech development on things like BCIs, brain emulation, human intelligence enhancement, etc, plus (if you can really buy lots of time) searching for alternate, easier-to-align AGI paradigms, and making improvements to social technology /​ institutional decisionmaking (prediction markets, voting systems, etc).

I would write more about this, but I’m not sure if MIRI /​ LessWrong /​ etc want to encourage lots of public speculation about potentially divisive AGI “nonpharmaceutical interventions” like fomenting nuclear war. I think it’s an understandably sensitive area, which people would prefer to discuss privately.

• If discussed privately, that can also lead to pretty horrific scenarios where a small group of people do something incredibly dumb/dangerous without having outside voices to pull them away from such actions if sufficiently risky. Not sure if there is any “good” way to discuss such topics, though…

• If I thought ML was as likely to “go X-risk” as Eliezer seems to, then I personally would want to go for the “grab probability on timelines other than what you think of as the main one” approach

I’m not sure what you mean by “grab probability on timelines” here. I think you mean something like ‘since the mainline looks doomy, try to increase P(success) in non-mainline scenarios’.

Which sounds similar to the Eliezer-strategy, except Eliezer seems to think the most promising non-mainline scenarios are different from the ones you’re thinking about. Possibly there’s also a disagreement here related to ‘Eliezer thinks there are enough different miracle-possibilities (each of which is sufficiently low-probability) that it doesn’t make sense to focus in on one of them.’

There’s a different thing you could mean by ‘grab probability on timelines other than what you think of as the main one’, which I don’t think was your meaning, that’s something like: assuming things go well, AGI is probably further in the future than Eliezer thinks. So it makes sense to focus at least somewhat more on longer-timeline scenarios, while keeping in mind that AGI probably isn’t in fact that far off.

I think MIRI leadership would endorse ‘if things went well, AGI timelines were probably surprisingly long’.

• I think you mean something like ‘since the mainline looks doomy, try to increase P(success) in non-mainline scenarios’.

Yes, that’s basically right.

I didn’t bring up the “main line”, and I thought I was doing a pretty credible job of following the metaphor.

Take a simplified model where a final outcome can only be “good” (mostly doom-free) or “bad” (very rich in doom). There will be a single “winning” AGI, which will simply be the first to cross some threshold of capability. This cannot be permanently avoided. The winning AGI will completely determine whether the outcome is good or bad. We’ll call a friendly-aligned-safe-or-whatever AGI that creates a good outcome a “good AGI”, and one that creates a bad outcome a “bad AGI”. A randomly chosen AGI will be bad with probability 0.999.

You want to influence the creation of the winning AGI to make sure it’s a good one. You have certain finite resources to apply to that: time, attention, intelligence, influence, money, whatever.

Suppose that you think that there’s a 0.75 probability that something more or less like current ML systems will win (that’s the “ML timeline” and presumptively the “main line”). Unfortunately, you also believe that there’s only 0.05 probability that there’s any path at all to find a way for an AGI with an “ML architecture” to be good, within whatever time it takes for ML to win (probably there’s some correlation between how long it takes ML to win and how long it takes to figure out how to make it good). Again, that’s the probability that it’s possible in the abstract to invent good ML in the available time, not the probability that it will actually be invented and get deployed.

Contingent on the ML-based approach winning, and assuming you don’t do anything yourself, you think there’s maybe a 0.01 probability that somebody else will actually arrange for the winning AGI to be good. You’re damned good, so if you dump all of your attention and resources into it, you can double that to 0.02 even though lots of other people are working on ML safety. So you would contribute 0.01 times 0.75 or 0.0075 probability to a good outcome. Or at least you hope you would; you do not at this time have any idea how to actually go about it.

Now suppose that there’s some other AGI approach, call it X. X could also be a family of approaches. You think that X has, say, 0.1 probability of actually winning instead of ML (which leaves 0.15 for outcomes that are neither X nor ML). But you think that X is more tractable than ML; there’s a 0.75 probability that X can in principle be made good before it wins.

Contingent on X winning, there’s a 0.1 probability that somebody else will arrange for X to be good without you. But at the moment everybody is working on ML, which gives you runway to work on X before capability on the X track starts to rise. So with all of your resources, you could really increase the overall attention being paid to X, and raise that to 0.3. You would then have contributed 0.2 times 0.1 or 0.02 probability to a good outcome. And you have at least a vague idea how to make progress on the problem, which is going to be good for your morale.

Or maybe there’s a Y that only has a 0.05 probability of winning, but you have some nifty and unique idea that you think has a 0.9 probability of making Y good, so you can get nearly 0.045 even though Y is itself an unlikely winner.
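The arithmetic in this toy model can be sketched in a few lines (a hypothetical illustration only; the numbers are the illustrative probabilities from the text above, not real forecasts):

```python
# Toy expected-value model from the discussion above. All numbers are the
# illustrative probabilities from the text, not real estimates.

def contributed_goodness(p_win, p_good_baseline, p_good_with_you):
    """Probability mass you personally add to a good outcome on one AGI path:
    P(path wins) * (P(good | you work on it) - P(good | you don't))."""
    return p_win * (p_good_with_you - p_good_baseline)

ml = contributed_goodness(0.75, 0.01, 0.02)  # ML path: 0.75 * 0.01 = 0.0075
x = contributed_goodness(0.10, 0.10, 0.30)   # approach X: 0.10 * 0.20 = 0.02
y = contributed_goodness(0.05, 0.00, 0.90)   # long shot Y: 0.05 * 0.90 = 0.045

print(ml, x, y)  # X and Y beat the "main line" despite being less likely to win
```

Under these particular numbers, the less-likely-to-win paths dominate because tractability (the probability delta you can actually produce) outweighs the path’s probability of winning.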

Obviously these are sensitive to the particular probabilities you assign, and I am not really very well placed to assign such probabilities, but my intuition is that there are going to be productive Xs and Ys out there.

I may be biased by the fact that, to whatever degree I can assign probabilities, I think that ML’s probability of winning, in the very Manichean sense I’ve set up here, where it remakes the whole world, is more like 0.25 than 0.75. But even if it’s 0.75, which I suspect is closer to what Eliezer thinks (and would be most of his “0.85 by 2070”), ML is still handicapped by there not being any obvious way to apply resources to it.

Sure, you can split your resources. And that might make sense if there’s low hanging fruit on one or more branches. But I didn’t see anything in that transcript that suggested doing that. And you would still want to put the most resources on the most productive paths, rather than concentrating on a moon shot to fix ML when that doesn’t seem doable.

• which I suspect is closer to what Eliezer thinks (and would be most of his “0.85 by 2070”)

0.85 by 2070 was Nate Soares’ probability, not Eliezer’s.

• Personally, I don’t think I see that you can turn an ML system that has say 50-to-250 percent of a human’s intelligence into an existential threat just by pushing the “turbo” button on the hardware. Which means that I’m kind of hoping nobody goes the “nuclear war” route in real life.

Isn’t that part somewhat tautological? A sufficiently large group of humans is basically a superintelligence. We’ve basically terraformed the earth with cows and rice and cities and such.

A computer with >100% human intelligence would be incredibly economically valuable (automate every job that can be done remotely?), so it seems very likely that people would make huge numbers of copies if the cost of running one was less than compensation for a human doing the same job.

And that’s basically your superintelligence: even if a low-level AGI can’t directly self-improve (which seems somewhat doubtful since humans are currently improving computers at a reasonably fast rate), it could still reach superintelligence by being scaled from 1 to N.

• I want to push back against the idea that ANNs are “vectors of floating points” and therefore it’s impossible to prove things about them. Many algorithms involve continuous variables and we can prove things about them. Support vector machines are also learning algorithms that are “vectors of floating points” and we have a pretty good theory of how they work. In fact, there already is a sizable body of theoretical results about ANNs, even if it still falls significantly short of what we need.

The biggest problem is not necessarily in the “floating points”. The problem is that we still don’t have satisfactory models of what an “agent” is and what it means for an agent to be “aligned”. But, we do have some leads. And once we solve this part, there’s no reason in principle why it cannot be combined with some (hitherto unknown) theory of generalization bounds for ANNs.

• +1 to both points.

For an even more extreme example, Linear Regression is a large “vector of floating points” that’s easy enough to prove things about that proofs are assigned as homework questions for introductory Linear Algebra/Statistics classes.

I also think that we’ve had significantly more theoretical progress on ANNs than I would’ve predicted ~5 years ago. For example, the theory of wide two layer neural networks has basically been worked out, as has the theory of infinitely deep neural networks, and the field has made significant progress understanding why we might expect the critical points/local minima gradient descent finds in ReLU networks to generalize. (Though none of this work is currently good enough to inform practice, I think?)

• I’m uncertain about what abstractions/​models are best to use for reasoning about AI. Coq has a number of options, which might imply different things about provability of different propositions and economic viability of those proofs getting anywhere near prod. Ought a type theorist or proof engineer concern themselves with IEEE754 edge cases?

• We are not at the stage of formal verification, we are merely at the stage of constructing a theory from which we might be able to derive verifiable specifications. Once we have such specifications we can start discussing the engineering problem of formally verifying them. But the latter is just the icing on the cake, so to speak.

• I personally think work on reduced precision inference (e.g. 4 bit!) is probably useful, as circuits should be easier to analyze than floats.

• I am confused about the emphasis on secrecy.

Certainly, if you’re working on a substantial breakthrough in AI capability, there are reasons to keep it secret. But why would you work on that in the first place? One answer I can imagine is: “Current AI building blocks (deep learning) are too opaque and hard to prove things about, so we need to develop alternative building blocks. These alternative building blocks, if such are found, might produce a capability breakthrough.” But, currently we don’t have a solution even modulo the building blocks. Because, even if we allow computationally infeasible, or what I’ve been calling “weakly feasible” algorithms, such as Bayesian inference, we still don’t have a complete solution. Therefore, it seems reasonable to focus on “alignment modulo building blocks” i.e. alignment in the unbounded /​ weakly feasible regime, and I don’t see any reason to be secretive about it. On the contrary, we want to involve as many people as possible to get more progress and we want the AI community to know about whatever we come up with, to increase the probability they will use it.

Maybe the objection is, this path won’t lead us to success quickly enough. But then, what alternative path is better? And how would secrecy enable it?

• Certainly, if you’re working on a substantial breakthrough in AI capability, there are reasons to keep it secret. But why would you work on that in the first place?

Most of the mentions of secrecy in this post are in that context. I think a lot of people who say they care about the alignment problem think that the ‘two progress bars’ model, where you can work on alignment and capability independent of each other, is not correct, and so they don’t see all that much of a difference between capability work and alignment work. (If you’re trying to predict human approval of plans, for example, generic improvements in ability to predict things or understand plans help you as well.)

But even if you don’t believe in two progress bars, if you believe in differential tech development, it does seem like secrecy is a good idea (because not everyone is going to be trying to predict human approval). (It’s only in worlds where you think alignment is ‘easy’ compared to capabilities that this isn’t a concern.)

• If there’s no difference between capability work and alignment work, then how is it possible to influence anything at all? If capability and alignment go hand in hand, then either transformative capability corresponds to sufficient alignment (in which case there is no technical problem) or it doesn’t (in which case we’re doomed).

The only world in which secrecy makes sense, AFAICT, is if you’re going to solve alignment and capability all by yourself. I am extremely skeptical of this approach.

• One view could be that to do good alignment work, you need to have a good grasp on how to get to superhuman capabilities and then add constraints to the training process in the attempt to produce alignment. On this view, it doesn’t make a lot of sense to think about alignment without also thinking about capabilities because you don’t know what to put the “constraints” on top of, and what sorts of constraints you’re going to need.

• In this scenario we are double-doomed, since you can’t make progress on alignment before reaching the point where you already teeter on the precipice. I don’t think this is the case, and I would be surprised if Yudkowsky endorses this view. On the other hand, I am already confused, so who knows.

• I think both capabilities and alignment are too broad to be useful, and may benefit by going specific.

For example, Chris Olah’s interpretability work seems great for understanding NNs and how they make decisions. This could be used to get NNs to do what we specifically want and verify their behavior. It seems like a stretch to call this capabilities.

On the other hand, finding architectures that are more sample-efficient doesn’t directly help us avoid typical alignment issues.

It may be useful to write a post on this topic going into many specific examples because I’m still confused on how they relate, and I’m seeing that confusion crop up in this thread.

• I’m not sure what you’re arguing exactly. Would we somehow benefit if Olah’s work was done in secret?

A lot of work contributes towards both capability and alignment, and we should only publish work whose alignment : capability ratio is high enough s.t. it improves the trajectory in expectation. But, usually if the alignment : capability ratio is not high enough then we shouldn’t work on this in the first place.

• I don’t think Olah’s work should be secret.

I’m saying I partially agree with Vaniver that the “capabilities and alignment as independent” frame is wrong. Your alignment : capability ratio framing is a better fit, and I partially agree with:

If the alignment : capability ratio is low enough to warrant secrecy as EY wants, then they should scrap those projects and work on ones that have a much higher ratio.

I say “partially” for both because I notice I’m confused and predict that after working through examples and talking with someone/​writing a post, I will have a different opinion.

• and we should only publish work whose alignment : capability ratio is high enough s.t. it improves the trajectory in expectation

Broadly I’d agree, but I think there are cases where this framing doesn’t work. Specifically, it doesn’t account for others’ inaccurate assessments of what constitutes a robust alignment solution. [and of course here I’m not disagreeing with “publish work [that] improves the trajectory in expectation”, but rather with the idea that a high enough alignment : capability ratio ensures this]

Suppose I have a fully implementable alignment approach which solves a large part of the problem (but boosts capability not at all). Suppose also that I expect various well-resourced organisations to conclude (incorrectly) that my approach is sufficient as a full alignment solution.
If I expect capabilities to reach a critical threshold before the rest of the alignment problem will be solved, or before I’m able to persuade all such organisations that my approach isn’t sufficient, it can make sense to hide my partial solution (or at least to be very careful about sharing it).

For example, take a full outer alignment solution that doesn’t address deceptive alignment.
It’s far from clear to me that it’d make sense to publish such work openly.

While I’m generally all for the research productivity benefits of sharing, I think there’s a very real danger that the last parts of the problem to be solved may be deep, highly non-obvious problems. Before that point, the wide availability of plausible-but-tragically-flawed alignment approaches might be a huge negative. (unless you already assume that some organisation will launch an AGI, even with no plausible alignment solution)

• Hmm, this might apply to some hypothetical situation, but it’s pretty hard for me to believe that it applies in practice in the present day.

First, you can share your solution while also writing about its flaws.

Second, I think “some TAI will be built whether there is any solution or not” is more likely than “TAI will be built iff a solution is available, even if the solution is flawed”.

Third, I just don’t see the path to success that goes through secrecy. If you don’t publish partial solutions, then you’re staking everything on being able to produce a complete solution by yourself, without any collaboration or external critique. This seems highly dubious, unless maybe if you literally gathered the brightest minds on the planet Manhattan project style, but no AI safety org is anywhere near that.

I can see secrecy being valuable in the endgame, when it does become plausible to create a full solution by yourself, or even build TAI by yourself, but I don’t see it in the present day.

• Again, I broadly agree—usually I expect sharing to be best. My point is mostly that there’s more to account for in general than an [alignment : capability] ratio.

Some thoughts on your specific points:

First, you can share your solution while also writing about its flaws.

Sure, but if the ‘flaw’ is of the form [doesn’t address problem various people don’t believe really matters/​exists], then it’s not clear that this helps. E.g. outer alignment solution that doesn’t deal with deceptive alignment.

Second, I think “some TAI will be built whether there is any solution or not” is more likely than “TAI will be built iff a solution is available, even if the solution is flawed”.

Here I probably agree with you, on reflection. My initial thought was that the [plausible-flawed-solution] makes things considerably worse, but I don’t think I was accounting for the scenario where people believe a robust alignment solution will be needed at some point, but just not yet—because you see this system isn’t a dangerous one...

Does this make up the majority of your TAI-catastrophe-probability? I.e. it’s mostly “we don’t need to worry yet… Foom” rather than “we don’t ever need to worry about (e.g.) deceptive alignment… Foom”.

Third, I just don’t see the path to success that goes through secrecy...

I definitely agree, but I do think it’s important not to approach the question in black-and-white terms. For some information it may make sense to share with say 5 or 20 people (at least at first).

I might consider narrower sharing if:

1. I have wide error bars on the [alignment : capabilities] ratio of my work. (in particular, approaches that only boost average-case performance may fit better as “capabilities” here, even if they’re boosting performance through better alignment [essentially this is one of Critch/Krueger’s points in ARCHES])

2. I have high confidence my work solves [some narrow problem], high confidence it won’t help in solving the harder alignment problems, and an expectation that it may be mistaken for a sufficient solution.

Personally, I find it implausible I’d ever decide to share tangible progress with no-one, but I can easily imagine being cautious in sharing publicly.

On the other hand, doomy default expectations argue for ramping up the variance in the hope of bumping into ‘miracles’. I’d guess that increased sharing of alignment work boosts the miracle rate more significantly than capabilities.

• Does this make up the majority of your TAI-catastrophe-probability? I.e. it’s mostly “we don’t need to worry yet… Foom” rather than “we don’t ever need to worry about (e.g.) deceptive alignment… Foom”.

Sort of. Basically, I just think that a lot of actors are likely to floor the gas pedal no matter what (including all leading labs that currently exist). We will be lucky if they include any serious precautions even if such precautions are public knowledge, not to mention exhibit the security mindset to implement them properly. An actor that truly internalized the risk would not be pushing the SOTA in AI capabilities until the alignment problem is as solved as conceivably possible in advance.

• [ ]
[deleted]
• :(

i don’t want to die

• I don’t say this for one-upmanship, but I’m more worried about everyone else than just me. It basically doesn’t matter how much I care about each individual person, there are a lot of them (and also I’d feel pretty lousy about a stray paperclipper wiping out some aliens sometime).

• to be clear: obviously, as much as I very badly do not want to die, everyone else dying would be cosmically worse (i would obviously much rather die than have everyone else die).

I just didn’t see anyone else directly expressing my overwhelming immediate instinctive pure emotional reaction, so I figured I’d share it

• Sorry I worded my comment in a way that made that reply appropriate, it wasn’t what I meant. I just have a harder time imagining myself dead than I do everyone dead (as crazy as that sounds).

I’m on board with you, just less imaginative.

• I’ve been thinking about this for a few days actually. I think the intellectual part of my brain has a lot more experience imagining others dead than myself. I have actual experience of knowing that large numbers of people are dying, but, for myself, only the experience of knowing that I can die.

• Eliezer Yudkowsky

You’re preaching to the choir there, but even if we were working with more strongly typed epistemic representations that had been inferred by some unexpected innovation of machine learning, automatic inference of those representations would lead them to be uncommented and not well-matched with human compressions of reality, nor would they match exactly against reality,

My impression is that this could be true, but also this post seems to argue (reasonably convincingly, in my view) that the space of possible abstractions (“epistemic representations”) is discrete rather than continuous, such that any representation of reality sufficiently close to “human compressions” would in fact be using those human compressions, rather than an arbitrarily similar set of representations that comes apart in the limit of strong optimization pressure. I’m curious as to whether Eliezer (or anyone who endorses this particular aspect of his view) has a strong counterargument against this, or whether they simply find it unlikely on priors.

• I’d also add that having a system that uses abstractions that are close to humans is insufficient for safety, because you’re putting those abstractions under stress by optimizing them.

I do think it’s plausible that any AI modelling humans will model humans as having preferences, but 1) I’d imagine these preference models as calibrated on normal world states, and not extending properly off distribution (i.e., as soon as the AI starts doing things with nanomachines that humans can’t reason properly about), and 2) “pointing” at the right part of the AI’s world model to yield preferences, instead of a proxy that’s a better model of your human feedback mechanism, is still an unsolved problem. (The latter point is outlined in the post in some detail, I think?) I also think that 3) there’s a possibility that there is no simple, natural core of human values, simpler than “model the biases of people in detail”, for an AI to find.

• I think there are some pretty knock-down cases where human concepts are continuous. E.g. if you want to cut a rose off of a rosebush, where can you cut to get a rose? (If this example seems insufficiently important, replace it with brain surgery.)

That said, we should be careful about arguments that we need to match human concepts to high precision, because humans don’t have concepts to high precision.

• In the abstraction formalism I use, it can be ambiguous whether any particular thing “is a rose”, while still having a roughly-unambiguous concept of roses. It’s exactly like clustering: a cluster can have unambiguous parameters (mean, variance, etc), but it’s still ambiguous whether any particular data point is “in” that cluster.
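The clustering analogy can be made concrete with a short sketch (a hypothetical one-dimensional Gaussian cluster; the numbers are arbitrary): the cluster’s parameters are perfectly well-defined, while membership of any individual point is a matter of degree.

```python
import math

# Unambiguous cluster parameters (mean and variance are fixed, known values).
mean, var = 0.0, 1.0

def membership(x):
    """Unnormalized likelihood that point x belongs to the cluster.
    The cluster itself is sharply defined by (mean, var); whether a
    given point is "in" it is graded, not binary."""
    return math.exp(-(x - mean) ** 2 / (2 * var))

# Graded membership for individual points:
print(membership(0.1))  # close to 1: clearly "a rose", so to speak
print(membership(2.5))  # much smaller: an ambiguous, borderline case
```

The parallel: “rose” as a concept can be roughly unambiguous (like the cluster parameters) even though whether a particular cutting counts as a rose (like a particular data point) stays fuzzy.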

• Good point. I was more thinking that not only could it be ambiguous for a single observer, but different observers could systematically decide differently, and that would be okay.

Are there any concepts that don’t merely have continuous parameters, but are actually part of continuous families? Maybe the notion of “1 foot long”?

• Briefly read your post, I think there’s two parts to it.

First is regarding learning and modelling reality: if someone looks at a tree, they filter the scene down to the most important info, and perhaps they’ll do similar things independent of their exact level of intelligence. I agree this could be happening when starting out; the AGI will learn to look at tree images the way we do, it’ll possibly discover Newton’s laws, then move on to quantum mechanics in the same fashion we did. The problem is it won’t stop there. A set of values described by a person who assumes Newton’s laws might simply cease to make sense to a person who has learnt quantum mechanics. The gap gets much wider when you imagine someone who still thinks God runs planetary motion describing their values to someone who has come up with, and dismantled, three sets of theories beyond current human knowledge of QFT or string theory or whatever. Or try imagining monkeys telling humans what their values are. It’s not even about what the values are or whether values are right or wrong; it’s about values being described using words and concepts that are stupid (in your view).

And this goes beyond physical laws, it goes to everything we think. The words we use, or the concepts or heuristics to reason about reality—they could all seem incredibly stupid to an AGI in the sense that they carve reality into joints in the wrong way or else are just so grossly suboptimal at maximising the objectives we actually care about.

Second is regarding the values themselves somehow all converging to the same place. A trivial counterexample would be monkeys and humans being non-aligned despite coming out of the same evolutionary process. We care about monkeys, but most of us value human survival and prosperity a lot more than we value monkey prosperity. I could write more, but maybe it’s worth discussing my first point first.

• “...What do you do with this impossible challenge?

First, we assume that you don’t actually say “That’s impossible!” and give up a la Luke Skywalker. You haven’t run away.

Why not? Maybe you’ve learned to override the reflex of running away. Or maybe they’re going to shoot your daughter if you fail. We suppose that you want to win, not try—that something is at stake that matters to you, even if it’s just your own pride. (Pride is an underrated sin.)

Will you call upon the virtue of tsuyoku naritai? But even if you become stronger day by day, growing instead of fading, you may not be strong enough to do the impossible. You could go into the AI Box experiment once, and then do it again, and try to do better the second time. Will that get you to the point of winning? Not for a long time, maybe; and sometimes a single failure isn’t acceptable.

(Though even to say this much—to visualize yourself doing better on a second try—is to begin to bind yourself to the problem, to do more than just stand in awe of it. How, specifically, could you do better on one AI-Box Experiment than the previous?—and not by luck, but by skill?)

Will you call upon the virtue isshokenmei? But a desperate effort may not be enough to win. Especially if that desperation is only putting more effort into the avenues you already know, the modes of trying you can already imagine. A problem looks impossible when your brain’s query returns no lines of solution leading to it. What good is a desperate effort along any of those lines?

Make an extraordinary effort? Leave your comfort zone—try non-default ways of doing things—even, try to think creatively? But you can imagine the one coming back and saying, “I tried to leave my comfort zone, and I think I succeeded at that! I brainstormed for five minutes—and came up with all sorts of wacky creative ideas! But I don’t think any of them are good enough. The other guy can just keep saying ‘No’, no matter what I do.”

And now we finally reply: “Shut up and do the impossible!

As we recall from Trying to Try, setting out to make an effort is distinct from setting out to win. That’s the problem with saying, “Make an extraordinary effort.” You can succeed at the goal of “making an extraordinary effort” without succeeding at the goal of getting out of the Box.

“But!” says the one. “But, SUCCEED is not a primitive action! Not all challenges are fair—sometimes you just can’t win! How am I supposed to choose to be out of the Box? The other guy can just keep on saying ‘No’!”

True. Now shut up and do the impossible.

Your goal is not to do better, to try desperately, or even to try extraordinarily. Your goal is to get out of the box.”

• Fighting is different from trying. To fight harder for X is more externally verifiable than to try harder for X.

It’s one thing to acknowledge that the game appears to be unwinnable. It’s another thing to fight any less hard on that account.

• I don’t mean this as a personal attack, but I perceive this comment as being similar to, if you are at a particularly sad funeral service and just finished burying a loved one, proceeding to make a comment like, “Hm, maybe I should get some of those flowers for my front yard.”

• FWIW, I didn’t have a problem with it.

• Finally, some humour in what has up to now been a pretty grim read! And I’ve thought we were all doomed for the last ten years.

• I didn’t mean it humorously, I meant it more in a grim sense.

• Then you have hit a target you couldn’t yourself see, which is high genius indeed.

I’m a Brit, and jokes at funerals are appropriate and necessary, even if the funeral is one’s own. But this is surely not one of those transatlantic things? Even in Tarantino movies the doomed go out with a wisecrack.

• Yoav Ravid’s comment is quite Zen.

• Noted. fwiw, I wrote it before reading the article (after clicking the link at the top and seeing what it is). Maybe if I had waited until I finished, I would have written something more like “chathamroom.com seems very useful for this sort of thing” (without the excited tone).

• This reads like nothing new has happened in the discourse since 2011 except the development of interest in deep learning. Eliezer is still arguing with people exhibiting the same sorts of confusions he addressed in the Sequences. This seems like evidence that the “keep trying” behavior has long since hit diminishing returns relative to what I’d expect a “reorient” behavior to accomplish, e.g. more investigation into the causes of this kind of recalcitrance.

• A confusion: it seems that Eliezer views research that is predictable as basically-useless. I think I don’t understand what “predictable” means here. In what sense is expected utility quantilization not predictable?

Maybe the point is that coming up with the concept is all that matters, and the experiments that people usually do don’t matter because after coming up with the concept the experiments are predictable? I’m much more sympathetic to that, but then I’m confused why “predictable” implies “useless”; many prosaic alignment papers have as their main contribution a new algorithm, which seems like a similar type of thing as quantilization.

• An example that springs to mind: Abram wrote a blog post in 2018 mentioning the “easy problem of wireheading”. He described both the problem and its solution in about one sentence, and then immediately moved on to the harder problems.

Later on, DeepMind ran an experiment that (in my assessment) mostly just confirmed that what Abram had said was correct.

For the record, I don’t think that particular DeepMind experiment was zero value, for various reasons. But at the same time, I think that Abram wins hands-down on the metric of “progress towards AI alignment per researcher-hour”, and this is true at both the production and consumption end (I can read Abram’s one sentence much much faster than I can skim the DeepMind paper).

If we had a plausible-to-me plan that gets us to safe & beneficial AGI, I would be really enthusiastic about going back and checking all the assumptions with experiments. That’s how you shore up the foundations, flesh out the details, start developing working code and practical expertise, etc. etc. But I don’t think we have such a plan right now.

Also, there are times when it’s totally unclear a priori what an algorithm will do just by thinking about it, and then obviously the experiments are super useful.

But at the end of the day, I feel like there are experiments that are happening not because it’s the optimal thing to do for AI alignment, but rather because there are very strong pro-experiment forces that exist inside CS /​ ML /​ AI research in academia and academia-adjacent labs.

• That’s a good example, thanks :)

EDIT: To be clear, I don’t agree with

But at the same time, I think that Abram wins hands-down on the metric of “progress towards AI alignment per researcher-hour”

but I do think this is a good example of what someone might mean when they say work is “predictable”.

• An example: when I first heard the Ought experiments described, I was pretty highly confident how they’d turn out—people would mostly fail to coordinate on any problem without an already-very-obvious factorization. (See here for the kinds of evidence informing that high confidence, though applied to a slightly different question. See here and here for the more general reasoning/​world models which underlie that prediction.) From what I’ve heard of the experiments, it seems that that is indeed basically what happened; therefore the experiments provided approximately-zero new information to my model. They were “useless” in that sense.

(I actually think those experiments were worth running just on the small chance that they’d find something very high value, or more likely that the people running them would have some high-value insight, but I’d still say “probably useless” was a reasonable description beforehand.)

I don’t know if Eliezer would agree with this particular example, but I think this is the sort of thing he’s gesturing at.

• That one makes sense (to the extent that Eliezer did confidently predict the results), since the main point of the work was to generate information through experiments. I thought the “predictable” part was also meant to apply to a lot of ML work where the main point is to produce new algorithms, but perhaps it was just meant to apply to things like Ought.

• I actually think this particular view is worth fleshing out, since it seems to come up over and over again in discussions of what AI alignment work is valuable (versus not).

For example, it does seem to me that >80% of the work in actually writing a published paper (at least amongst papers at CHAI) (EDIT: no longer believe this on reflection, see Rohin’s comment below) involves doing work whose results are predictable to the author once the concept is in hand (for example, actually getting your algorithm to run, writing code for experiments, running said experiments, writing up the results into a paper, etc.)

• This just doesn’t match my experience at all. Looking through my past AI papers, I only see two papers where I could predict the results of the experiments on the first algorithm I tried at the beginning of the project. The first one (benefits of assistance) was explicitly meant to be a “communication” paper rather than a “research” paper (at the time of project initiation, rather than in hindsight). The second one (Overcooked) was writing up results that were meant to be the baselines against which the actual unpredictable research (e.g. this) was going to be measured; it just turned out that that was already sufficiently interesting to the broader community.

(Funny story about the Overcooked paper; we wrote the paper + did the user study in ~two weeks iirc, because it was only two weeks before the deadline that we considered that the “baseline” results might already be interesting enough to warrant a conference paper. It’s now my most-cited AI paper.)

(I’m also not actually sure that I would have predicted the Overcooked results when writing down the first algorithm; the conceptual story felt strong but there are several other papers where the conceptual story felt strong but nonetheless the first thing we tried didn’t work. And in fact we did have to make slight tweaks, like annealing from self-play to BC-play over the course of training, to get our algorithm to work.)

A more typical case would be something like Preferences Implicit in the State of the World, where the conceptual idea never changed over the course of the project, but:

1. The first hacky /​ heuristic algorithm we wrote down didn’t work in some cases. We analyzed it a bunch (via experiments) to figure out what sorts of things it wasn’t capturing.

2. When we eventually had a much more elegant derived-from-math algorithm, I gave a CHAI presentation presenting some experimental results. There were some results I was confused by, where I expected something different from what we got, and I mentioned this. (Specifically these were the results in the case where the robot had a uniform prior over the initial state at time -T). Many people in the room (including at least one person from MIRI) thought for a while and gave their explanation for why this was the behavior you should expect. (I’m pretty sure some even said “this isn’t surprising” or something along those lines.) I remained unconvinced. Upon further investigation we found out that one of Ziebart’s results that we were using had extremely high variance in our setting, since in our setting we only ever had one initial state, rather than sampling several which would give better coverage of the uniform prior. We derived a better version of Ziebart’s result, implemented that, and voila the results were now what I had originally expected.

3. It took about… 2 weeks (?) between getting this final version of the algorithm and submitting a paper, constituting maybe 15-20% of the total work. Most of that was what I’d call “communication” rather than “research”, e.g. creating another environment to better demonstrate the algorithm’s properties, writing up the paper clearly, making good figures, etc. Good communication seems clearly worth putting effort into.

If you want a deep learning example, consider Learning What To Do by Simulating the Past. The biggest example here is the curriculum—that was not part of the original pseudocode I had written down and was crucial to get it to work.

You might look at this and think that “but the conceptual idea predicted the experiments that were eventually run!” I mean, sure, but then I think your crux is not “were the experiments predictable”, rather it’s “is there any value in going from a conceptual idea to a working implementation”.

It’s also pretty easy to predict the results of experiments in a paper, but that’s because you have the extra evidence that you’re reading a paper. This is super helpful:

1. The experiments are going to show the algorithm working. They wouldn’t have published the paper otherwise.

2. The introduction, methods, etc are going to tell you exactly what to expect when you get to the experiments. Even if the authors initially thought the algorithm was going to improve the final score in Atari games, if the algorithm instead improved sample efficiency without changing final score, the introduction is going to be about how the algorithm was inspired by sample efficient learning in humans or whatever.

This is also why I often don’t report on experiments in papers in the Alignment Newsletter; usually the point is just “yes, the conceptual idea worked”.

I don’t know if this is actually true, but one cynical take is that people are used to predicting the results of finished ML work, where they implicitly use (1) and (2) above, and incorrectly conclude that the vast majority of ML experiments are ex ante predictable. And now that they have to predict the outcome of Redwood’s project, before knowing that a paper will result, they implicitly realize that no, it really could go either way. And so they incorrectly conclude that of the ML experiments, Redwood’s project is a rare unpredictable one.

• Thanks for the detailed response.

On reflection, I agree with what you said—I think the amount of work it takes to translate a nice sounding idea into anything that actually works on an experimental domain is significant, and what exact work you need is generally not predictable in advance. In particular, I resonated a lot with this paragraph:

I’m also not actually sure that I would have predicted the Overcooked results when writing down the first algorithm; the conceptual story felt strong but there are several other papers where the conceptual story felt strong but nonetheless the first thing we tried didn’t work.

At least from my vantage point, “having a strong story for why a result should be X” is insufficient for ex ante predictions of what exactly the results would be. (Once you condition on that being the story told in a paper, however, the prediction task does become trivial.)

I’m now curious what the MIRI response is, as well as how well their intuitive judgments of the results are calibrated.

EDIT: Here’s another toy model I came up with: you might imagine there are two regimes for science—an experiment driven regime, and a theory driven regime. In the former, it’s easy to generate many “plausible sounding” ideas and hard to be justified in holding on to any of them without experiments. The role of scientists is to be (low credence) idea generators and idea testers, and the purpose of experimentation is primarily to discover new facts that are surprising to the scientist finding them. In the second regime, the key is to come up with the right theory/​deep model of AI that predicts lots of facts correctly ex ante, and then the purpose of experiments is to convince other scientists of the correctness of your idea. Good scientists in the second regime are those who discover the right deep models much faster than others. Obviously this is an oversimplification, and no one believes it’s only one or the other, but I suspect both MIRI and Stuart Russell lie more on the “have the right idea, and the paper experiments are there to convince others/​apply the idea in a useful domain” view, while most ML researchers hold the more experimentalist view of research?

• It seems to me that the surprising simplicity of current-generation ML algorithms is a big part of the problem.

As a thought experiment: suppose you had a human brain, with the sort of debug access you’d have with a neural net; ie, you could see all the connections, edge weights, and firings, and had a decent multiple of the compute the brain has. Could you extract something like a verbal inner monologue, a text stream that was strongly predictive of that human’s plans? I don’t think it would be trivial, but my guess is that you could. It wouldn’t hold up against a meditator optimizing against you, but it would be a solid starting point.

Could you do the same thing to GPT-3? No; you can’t get language out of it that predicts its plans, because it doesn’t have plans. Could you do the same thing to AlphaZero? No, you can’t get language out of it that predicts its plans, because it doesn’t use language.

This analogy makes me think neural net transparency might not be as doomed as the early results would suggest; they aren’t finding human-legible low-dimensional representations of things because those representations aren’t present (GPT-3) or have nothing human-legible to match up to (AlphaZero).

In a human mind, a lot of cognition is happening in diffuse illegible giant vectors, but a key part of the mental architecture squeezes through a low-bandwidth token stream. I’d feel a lot better about where ML was going if some of the steps in their cognition looked like low-bandwidth token streams, rather than giant vectors. This isn’t by itself sufficient for alignment, of course, but it’d make the problem look a lot more tractable.

I’m not sure whether humans having an inner monologue that looks like the language we trained on and predicts our future behavior is an incidental fact about humans, or a convergent property of intelligent systems that get most of their information from language, or a convergent property of all intelligent systems, or something that would require deliberate architecture choices to make happen, or something that we won’t be able to make happen even with deliberate architecture choices. From my current state of knowledge, none of these would surprise me much.
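One concrete version of the “debug access” extraction idea above is what interpretability work calls a linear probe: fit a simple linear readout from hidden activations to a label and check how predictive it is. A minimal sketch on synthetic data (the activation dimensions, encoding direction, and “plan” labels here are all invented for illustration; real activations would come from an actual network):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "hidden states": 200 samples of a 32-dim activation vector.
# Assume a latent binary "plan" is linearly encoded along one direction.
n, d = 200, 32
plan = rng.integers(0, 2, size=n)            # 0/1 latent plan label
direction = rng.normal(size=d)
hidden = rng.normal(size=(n, d)) + np.outer(plan * 2 - 1, direction)

# Linear probe: least-squares readout from activations to the plan label.
X = np.hstack([hidden, np.ones((n, 1))])     # add a bias column
w, *_ = np.linalg.lstsq(X, plan * 2 - 1, rcond=None)
pred = (X @ w > 0).astype(int)

accuracy = (pred == plan).mean()
print(f"probe accuracy: {accuracy:.2f}")     # high iff the plan is linearly decodable
```

If the probe’s accuracy stays near chance, the information either isn’t there or isn’t represented in a way a linear readout can find—which is roughly the distinction the comment draws between GPT-3/AlphaZero and a mind with a token-stream bottleneck.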

• In a human mind, a lot of cognition is happening in diffuse illegible giant vectors, but a key part of the mental architecture squeezes through a low-bandwidth token stream. I’d feel a lot better about where ML was going if some of the steps in their cognition looked like low-bandwidth token streams, rather than giant vectors.

• I have an optional internal monologue, and programming or playing strategy games is usually a non-verbal exercise.

I’m sure you could in principle (though not as described!) map neuron firings to a strongly predictive text stream regardless, but I don’t think that would be me. And the same intuition says it would be possible for MuZero; this is about the expressiveness of text rather than monologue being a key component of cognition or identity. Conversely, I would expect this to go terribly wrong when the tails come apart, because we’re talking about correlates rather than causal structures, with all the usual problems.

• I don’t think the verbal/​pre-verbal stream of consciousness that describes our behavior to ourselves is identical with ourselves. But I do think our brain exploits it to exert feedback on its unconscious behavior, and that’s a large part of how our morality works. So maybe this is still relevant for AI safety.

• Very grim. I think that almost everybody is bouncing off the real hard problems at the center and doing work that is predictably not going to be useful at the superintelligent level, nor does it teach me anything I could not have said in advance of the paper being written. People like to do projects that they know will succeed and will result in a publishable paper, and that rules out all real research at step 1 of the social process.

This is an interesting critique, but it feels off to me. There’s actually a lot of ‘gap’ between the neat theory explanation of something in a paper and actually building it. I can imagine many papers where I might say:

“Oh, I can predict in advance what will happen if you build this system with 80% confidence.”

But if you just kinda like, keep recursing on that:

“I can imagine what will happen if you build the n+1 version of this system with 79% confidence...”

“I can imagine what will happen if you build the n+2 version of this system with 76% confidence...”

“I can imagine what will happen if you build the n+3 version of this system with 74% confidence...”

It’s not so much that my confidence starts dropping (though it does), as that you are beginning to talk about a fairly long lead time in practical development work.

As anyone who has worked with ML knows, it takes a long time to get a functioning code base with all the kinks ironed out and methods that do the things they theoretically should do. So I could imagine a lot of AI safety papers with results that are, fundamentally, completely predictable, but a built system implementing them is still very useful to build up your implementing-AI-safety muscles.

I’m also concerned that you admit you have no theoretical angle of attack on alignment, but seem to see empirical work as hopeless. AI is full of theory developed as post-hoc justification of what starts out as empirical observation. To quote an anonymous person who is familiar with the history of AI research:

REDACTED

Today at 5:33 PM

Yeah. This is one thing that soured me on Schmidhuber. I realized that what he is doing is manufacturing history.

Creating an alternate reality/​narrative where DL work flows from point A to point B to point C every few years, when in fact, B had no idea about A, and C was just tinkering with A.

Academic pedigrees reward post hoc ergo propter hoc on a mass scale.

And of course, post-alphago, I find this intellectual forging to be not just merely annoying and bad epistemic practice, but a serious contribution to X-Risk.

By falsifying how progress actually happened, it prioritizes any kind of theoretical work, downplaying empirical work, implementation, trial-and-error, and the preeminent role of compute.

In Schmidhuber’s history, everyone knows all about DL and meta-learning, and DL history is a grand triumphant march from the perceptron to the neocognitron to Schmidhuber’s LSTM to GPT-3 as a minor uninteresting extension of his fast memory work, all unfolding exactly as foreseen.

As opposed to what actually happened, which was a bunch of apes poking in the mud drawing symbols, grunting to each other, until a big monolith containing a thousand GPUs appeared out of nowhere, the monkeys punched the keyboard a few times, and bowed in awe.

And then going back and saying ‘ah yes, Grog foresaw the monolith when he smashed his fist into the mud and made a vague rectangular shape’.

My usual example is ResNets. Super important, one of the most important discoveries in DL...and if you didn’t read a bullshit interview MS PR put out in 2016 or something where they admit it was simply trying out random archs until it worked, all you have is the paper placidly explaining “obviously resnets are a good idea because they make the gradients flow and can be initialized to the identity transformation; in accordance with our theory, we implemented and trained a resnet cnn on imagenet...”

Discouraging the processes by which serendipity can occur when you have no theoretical angle of attack seems suicidal to me, to put it bluntly. While I’m quite certain there is a large amount of junk work on AI safety, we would likely do well to put together some kind of process where more empirical approaches are taken faster with more opportunities for ‘a miracle’ as you termed it to arise.

• [I am a total noob on history of deep learning & AI]

From a cursory glance I find Schmidhuber’s take convincing.

He argues that the (vast) majority of conceptual & theoretical advances in deep learning have been understood decades before—often by Schmidhuber and his collaborators.

Moreover, he argues that many of the current leaders in the field improperly credit previous discoveries.

It is unfortunate that the above poster is anonymous. It is very clear to me that there is a big difference between theoretical & conceptual advances and the great recent practical advances due to stacking MOAR layers.

It is possible that remaining steps to AGI consists of just stacking MOAR layers: compute + data + comparatively small advances in data/​compute efficiency + something something RL Metalearning will produce an AGI. Certainly, not all problems can be solved [fast] by incremental advances and/​or iterating on previous attempts. Some can. It may be the unfortunate reality that creating [but not understanding!] AGI is one of them.

• I’m sure Eliezer has written about this previously, but why doesn’t he think corrigibility is a natural stance?

It does seem like existing approaches to corrigibility (i.e., the utility-balancing approaches in MIRI’s/​Stuart Armstrong’s work and the “agent has incomplete information” approaches outlined in Dylan Hadfield-Menell’s or Alex Turner’s work) are incredibly fragile. I do agree that current approaches involving utility balancing/​assigning utility to branches never executed are probably way too finicky to get working. I also agree that all the existing approaches involving the agent modelling itself as having incomplete information rely on well-calibrated priors and also all succumb to the problem of fully updated deference.

However, I think it’s not at all obvious to me that corrigibility doesn’t have a “small central core”. It does seem to me like the “you are incomplete, you will never be complete” angle captures a lot of what we mean by corrigibility.

It’s possible the belief is empirical—that is, people have tried all the obvious ways to patch/​fix this angle, and they’ve all failed, so the problem is hard (at least relative to the researchers we have working on it). But the amount of work spent exploring that angle and trying the obvious next steps is tiny in comparison to what I’d consider a serious effort worth updating on. (At least, amongst work that I’ve seen? Maybe there’s been a lot of non public work that I’m not privy to?)

• Eliezer explains why he thinks corrigibility is unnatural in this comment.

• Thanks! Relevant parts of the comment:

Eliezer thinks that while corrigibility probably has a core which is of lower algorithmic complexity than all of human value, this core is liable to be very hard to find or reproduce by supervised learning of human-labeled data, because deference is an unusually anti-natural shape for cognition, in a way that a simple utility function would not be an anti-natural shape for cognition.

[...]

The central reasoning behind this intuition of anti-naturalness is roughly, “Non-deference converges really hard as a consequence of almost any detailed shape that cognition can take”, with a side order of “categories over behavior that don’t simply reduce to utility functions or meta-utility functions are hard to make robustly scalable”.

[...]

What I imagine Paul is imagining is that it seems to him like it would in some sense be not that hard for a human who wanted to be very corrigible toward an alien, to be very corrigible toward that alien; so you ought to be able to use gradient-descent-class technology to produce a base-case alien that wants to be very corrigible to us, the same way that natural selection sculpted humans to have a bunch of other desires, and then you apply induction on it building more corrigible things.

My class of objections in (1) is that natural selection was actually selecting for inclusive fitness when it got us, so much for going from the loss function to the cognition; and I have problems with both the base case and the induction step of what I imagine to be Paul’s concept of solving this using recursive optimization bootstrapping itself; and even more so do I have trouble imagining it working on the first, second, or tenth try over the course of the first six months.

My class of objections in (2) is that it’s not a coincidence that humans didn’t end up deferring to natural selection, or that in real life if we were faced with a very bizarre alien we would be unlikely to want to defer to it. Our lack of scalable desire to defer in all ways to an extremely bizarre alien that ate babies, is not something that you could fix just by giving us an emotion of great deference or respect toward that very bizarre alien. We would have our own thought processes that were unlike its thought processes, and if we scaled up our intelligence and reflection to further see the consequences implied by our own thought processes, they wouldn’t imply deference to the alien even if we had great respect toward it and had been trained hard in childhood to act corrigibly towards it.

• Maybe there’s been a lot of non public work that I’m not privy to?

In Aug 2020 I gave formalizing corrigibility another shot, and got something interesting but wrong out the other end. Am planning to publish sometime, but beyond that I’m not aware of other attempts.

When I visited MIRI for a MIRI/​CHAI social in 2018, I seriously suggested a break-out group in which we would figure out corrigibility (or the desirable property pointed at by corrigibility-adjacent intuitions) in two hours. I think more people should try this exact exercise more often—including myself.

• However, I think it’s not at all obvious to me that corrigibility doesn’t have a “small central core”. It does seem to me like the “you are incomplete, you will never be complete” angle captures a lot of what we mean by corrigibility.

I think all three of Eliezer, you, and I share the sense that corrigibility is perhaps philosophically simple. The problem is that for it to actually have a small central core /​ be a natural stance, you need the ‘import philosophy’ bit to also have a small central core /​ be natural, and I think those bits aren’t true.

Like, the ‘map territory’ distinction seems to me like a simple thing that’s near the core of human sanity. But… how do I make an AI that sees the map territory distinction? How do I check that its plans are correctly determining the causal structure such that it can tell the difference between manipulating its map and manipulating the territory?

[And, importantly, this ‘philosophical’ AI seems to me like it’s possibly alignable, and a ‘nonphilosophical’ AI that views its projections as ‘the territory’ is probably not alignable. But it’s really spooky that all of our formal models are of this projective AI, and maybe we will be able to make really capable systems using it, and rather than finding the core of philosophical competence that makes the system able to understand the map-territory distinction, we’ll just find patches for all of the obvious problems that come up (like the abulia trap, where the AI system discovers how to wirehead itself and then accomplishes nothing in the real world) and then we’re killed by the non-obvious problems.]

• I’m not sure what you mean by ‘philosophically’ simple.

Do you agree that other problems in AI alignment don’t have ‘philosophically’ simple cores? It seems to me that, say, scaling human supervision to a powerful AI, or getting an AI that’s robust to ‘turning up the dial’, are much harder and more intractable problems than corrigibility.

• I’m not sure what you mean by ‘philosophically’ simple.

I think if we had the right conception of goals, the difference between ‘corrigibility’ and ‘incorrigibility’ would be a short sentence in that language. (For example, if you have a causal graph that goes from “the state of the world” to “my observations”, you specify what you want in terms of the link between the state of the world and your observations, instead of the observations.)

This is in contrast to, like, ‘practically simple’, where you’ve programmed in rules to not do any of the ten thousand things it could do to corrupt things.
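The causal-graph point above can be made concrete with a toy model (everything here—state variables, action names, reward functions—is invented for illustration): let the world state cause an observation, and give the agent one action that improves the state and one that tampers with the sensor. A reward specified on the observation prefers tampering; a reward specified on the link back to the state does not:

```python
# Toy causal chain: world_state -> observation.
# Two candidate rewards: one reads the observation, one reads the state.

def step(state, action):
    """Return (new_state, observation) after an action."""
    if action == "work":          # genuinely improve the world
        state = state + 1
        obs = state               # honest sensor reports the true state
    elif action == "tamper":      # corrupt the sensor; world unchanged
        obs = state + 10          # sensor now reports an inflated value
    else:
        obs = state
    return state, obs

def reward_from_observation(state, obs):
    return obs                    # "grade my map"

def reward_from_state(state, obs):
    return state                  # "grade the territory"

start = 0
outcomes = {a: step(start, a) for a in ("work", "tamper")}

# Observation-based reward prefers sensor tampering...
best_by_obs = max(outcomes, key=lambda a: reward_from_observation(*outcomes[a]))
# ...state-based reward prefers actually improving the world.
best_by_state = max(outcomes, key=lambda a: reward_from_state(*outcomes[a]))

print(best_by_obs, best_by_state)   # tamper work
```

The hard part the comment is pointing at, of course, is that a real agent only ever has its observations; specifying reward on “the state” presupposes exactly the map-territory machinery that the toy model gets for free.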

• Eliezer Yudkowsky

Alpha Zero scales with more computing power, I think AlphaFold 2 scales with more computing power, Mu Zero scales with more computing power. Precisely because GPT-3 doesn’t scale, I’d expect an AGI to look more like Mu Zero and particularly with respect to the fact that it has some way of scaling.

I think this view dovetails quite strongly with the view expressed in this comment by maximkazhenkov:

Progress in model-based RL is far more relevant to getting us closer to AGI than other fields like NLP or image recognition or neuroscience or ML hardware. I worry that once the research community shifts its focus towards RL, the AGI timeline will collapse—not necessarily because there are no more critical insights left to be discovered, but because it’s fundamentally the right path to work on and whatever obstacles remain will buckle quickly once we throw enough warm bodies at them. I think—and this is highly controversial—that the focus on NLP and Vision Transformer has served as a distraction for a couple of years and actually delayed progress towards AGI.

If curiosity-driven exploration gets thrown into the mix and Starcraft/​Dota gets solved (for real this time) with comparable data efficiency as humans, that would be a shrieking fire alarm to me (but not to many other people I imagine, as “this has all been done before”).

I was definitely one of those on the Transformer “train”, so to speak, after the initial GPT-3 preprint, and I think this was further reinforced by gwern’s highlighting of the scaling hypothesis—which, while not necessarily unique to Transformers, was framed using Transformers and GPT-3 as relevant examples, in a way that suggested [to me] that Transformers themselves might scale to AGI.

I think, after thinking about this more, looking more into EfficientZero/​MuZero/​related work in model-based reinforcement learning etc., that I have updated in favor of the view espoused by Eliezer and maximkazhenkov, wherein model-based RL (and RL-type training in general, which concerns dynamics and action spaces in a way that sequence prediction simply does not) seems more likely to me to produce AGI than a scaled-up version of a sequence predictor.

(I still have a weak belief that Transformers and their ilk may be capable of producing AGI, but if so I think it will probably be a substantially longer, harder path than RL-based systems, and will probably involve much more work than just throwing more compute at models.)

Also, here is a relevant tweet from Eliezer:

I don’t think it scales to superintelligence, not without architectural changes that would permit more like a deliberate intelligence inside to do the predictive work. You don’t get things that want only the loss function, any more than humans explicitly want inclusive fitness.

(in response to)

what’s the primary risk you would expect from a GPT-like architecture scaled up to the regime of superintelligence? (Obviously such architectures are unaligned, but they’re also not incentivized by default to take action in the real world.)

• If we were to blur out the name Eliezer Yudkowsky and pretend we saw this as a bunch of anonymous people talking on a forum like Reddit, what would your response be to somebody who posted the things Yudkowsky posted above? What pragmatic thing could you tell that person to possibly help them? Every word can be true, but it seems overwhelmingly pessimistic in a way that is not helpful, mainly due to nothing in it being actionable.

The position of timelines being too short and the best alignment research being too weak /​ too slow, while having no avenues or ideas to make things better, with no institution to trust, to the point where we are doomed now, doesn’t lead to a lot of wanting to do anything, which is a guaranteed failure. What should I tell this person? “You seem to be pretty confident in your prediction; what do you suppose you should do to act in accordance with these beliefs? Wirehead yourself and wait for the world to crumble around you? Maybe at least take a break for mental health reasons and let whatever will be, be. If the world is truly doomed, you have nothing to worry about anymore.”

I can agree that timelines seem short and good-quality alignment research seems rare. I think research, like all things humans do, follows Sturgeon’s law. But the problem is that, aside from some research which is meant only for prestige-building, you can’t tell which work will turn out to be crap. Nor can you tell where the future will go or what the final outcome will be. We can make use of trends, like this person was talking about, for predicting the future, but there’s always uncertainty. Maybe that’s all this post is: a rough assessment of one’s personal long-term outlook on the field. But it seems premature to say the researchers mentioned in this article are doing things that probably won’t help at this point. With this much pessimism towards our future world, we might as well take the low probability of their work helping and shoot for the moon with it; what have we got to lose in a doomed world?

But that’s the thing: the researchers working on alignment will, I’m sure, continue doing it even after reading this interaction. If they give up on it, we are even more screwed. They might even feel a bit more resentful now, knowing what this person thinks about their work, but I don’t think it changed anything.

Maybe I was lucky to get into the AI field within the last couple of years, when short timelines were the expectation rather than something to be feared. I didn’t have the hope of long timelines, so now I don’t have to feel crushed by that hope disappearing (forgive me if I assume too much). We have the time we have to do the best we can; if things take longer, more power to us for getting better odds of a good outcome.

Summary: While interesting, this conversation mainly updated me on the views of the writers, without changing anything pragmatic about my view of research or timelines.

If you think I completely misread the article, and that EY is saying something different than what I interpreted, please let me know.

• Every word can be true, but it seems overwhelmingly pessimistic in a way that is not helpful, mainly because nothing in it is actionable.

I’m thinking of this as ‘step one is to figure out our situation; step two is to figure out what to do about it’. Trying too hard to make things actionable from the get-go can interfere with the model-building part.

The position that timelines are too short and the best alignment research is too weak / too slow, while having no avenues or ideas to make things better and no institution to trust, to the point where we are doomed now, doesn’t lead to a lot of wanting to do anything, which is a guaranteed failure.

Eliezer’s view is ‘the odds of success are low, and I’m pretty uncertain about what paths have the highest EV’, not ‘we’re doomed /​ the odds of success are negligible’.

• Every word can be true, but it seems overwhelmingly pessimistic in a way that is not helpful, mainly because nothing in it is actionable.

Continuing to describe the situation accurately is a key action item.

• I’m on record as early as 2008 as saying that I expected superintelligences to crack protein folding, some people disputed that and were all like “But how do you know that’s solvable?” and then AlphaFold 2 came along and cracked the protein folding problem they’d been skeptical about, far below the level of superintelligence.

This is tangential nitpicking (I agree that protein folding is solvable), but I don’t think AlphaFold 2 entirely cracked it. AFAIU, AF2 relies on multiple sequence alignment as part of its input: the sequences of homologous proteins from different species. This is a standard method to simplify the problem, because by observing that certain parts of the sequence tend to vary together between homologues, you can guess that they correspond to chain fragments that are adjacent in the folded configuration.

Of course, even so, it is very impressive and has plenty of applications. But if you want to invent your own proteins from scratch, this is not good enough.

• Well, if viewing it on that level, AlphaFold 2 didn’t crack the full problem because it doesn’t let you put in a chemical function and get out a protein which performs that function while subject to other constraints of a surrounding wet system, which is the protein folding problem you have to solve to get wet nanotech out the other end, which is why we don’t already have general wet nanotech today.

• Aaaaaaaaaaaaahhhhhhhhhhhhhhhhh!!!!!!!!!!!!

(...I’ll be at the office, thinking about how to make enough progress fast enough.)

• Follow-up

One of Eliezer’s claims here is

It is very, very clear that at present rates of progress, adding that level of alignment capability as grown over the next N years, to the AGI capability that arrives after N years, results in everybody dying very quickly.

This is a claim I basically agree with.

I don’t think the situation is entirely hopeless, but I don’t think any of the current plans (or the current alignment field) are on track to save us.

• Good to hear I’m not the only one with this reaction. Well, good isn’t the right term, but y’know. The This Is Fine meme comes to mind. So does the stuff about how having strong feelings is often right and proper.

I’ve always thought this but have never said it to anyone before: I can only imagine the type of stress and anxiety Eliezer deals with. I’m grateful to him for many reasons, but one distinct reason is that he puts up with this presumed stress and anxiety for humanity’s benefit, which includes all of us.

Maybe this would be a good time for the community to apply a concentration of force and help.

• I’m torn because I mostly agree with Eliezer that things don’t look good, and most technical approaches don’t seem very promising.

But the attitude of unmitigated doominess seems counterproductive.
And there are obviously things worth doing and working on, and people are getting on with it.

It seems like Eliezer is implicitly focused on finding an “ultimate solution” to alignment that we can be highly confident solves the problem regardless of how things play out. But this is not where the expected utility is. The expected utility is mostly in buying time and increasing the probability of success in situations where we are not highly confident that we’ve solved the problem, but we get lucky.

Ideally we won’t end up rolling the dice on unprincipled alignment approaches. But we probably will. So let’s try and load the dice. But let’s also remember that that’s what we’re doing.

• I guess actually the goal is just to get something aligned enough to do a pivotal act. I don’t see, though, why an approach that tries to maintain a relatively sufficient level of alignment (relative to current capabilities) as capabilities scale couldn’t work for that.

• Yudkowsky mentions this briefly in the middle of the dialogue:

I don’t know however if I should be explaining at this point why “manipulate humans” is convergent, why “conceal that you are manipulating humans” is convergent, why you have to train in safe regimes in order to get safety in dangerous regimes (because if you try to “train” at a sufficiently unsafe level, the output of the unaligned system deceives you into labeling it incorrectly and/​or kills you before you can label the outputs), or why attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (qualitatively new thought processes, things being way out of training distribution, and, the hardest part to explain, corrigibility being “anti-natural” in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off).

Basically, there are reasons to expect that alignment techniques which work in smaller, safe regimes fail in larger, unsafe regimes. For example, an alignment technique that requires your system to demonstrate undesirable behavior while running could remain safe while your system is weak, but become dangerous once the system is powerful enough for that undesirable behavior to cause real harm.

That being said, Ajeya’s “Case for Aligning Narrowly Superhuman models” does flesh out the case for trying to align existing systems (as capabilities scale).

• why attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes (...)

If you know of a reference to, or feel like explaining in some detail, the arguments given (in parentheses) for this claim, I’d love to hear them!

• I’m familiar with these claims, and (I believe) the principal supporting arguments that have been made publicly. I think I understand them reasonably well.

I don’t find them decisive. Some aren’t even particularly convincing. A few points:

- EY sets up a false dichotomy between “train in safe regimes” and “train in dangerous regimes”. In the approaches under discussion there is an ongoing effort (e.g. involving some form(s) of training) to align the system, and the proposal is to keep this effort ahead of advances in capability (in some sense).

- The first 2 claims for why corrigibility wouldn’t generalize seem to prove too much—why would intelligence generalize to “qualitatively new thought processes, things being way out of training distribution”, but corrigibility would not?

- I think the last claim—that corrigibility is “anti-natural”—is more compelling. However, we don’t actually understand the space of possible utility functions and agent designs well enough for it to be that strong. We know that any behavior is compatible with a utility function, so I would interpret Eliezer’s claim as being about the description length of utility functions that encode corrigible behavior. Work on incentives suggests that removing manipulation incentives might add very little complexity to the description of the utility function, for an AI system that already understands the world well. Humans also seem to find it simple enough to add the “without manipulation” qualifier to an objective.

• why would intelligence generalize to “qualitatively new thought processes, things being way out of training distribution”, but corrigibility would not?

This sounds confused to me: the intelligence is the “qualitatively new thought processes”. The thought processes aren’t some new regime that intelligence has to generalize to. Also to answer the question directly, I think the claim is that intelligence (which I’d say is synonymous for these purposes with capability) is simpler and more natural than corrigibility (i.e., the last claim—I don’t think these three claims are to be taken separately).

We know that any behavior is compatible with a utility function

People keep saying this but it seems false to me. I’ve seen the construction for history-based utility functions that’s supposed to show this, and don’t find it compelling—it seems not to be engaging with what EY is getting at with “coherent planning behavior”. Is there a construction for (environment)-state-based utility functions? I’m not saying that is exactly the right formalism to demonstrate the relationship between coherent behaviour and utility functions, but it seems to me closer to the spirit of what EY is getting at. (This comment thread on the topic seems pretty unresolved to me.)

• I’m curious why this comment has such low karma and has −1 alignment forum karma.

If you think doom is very likely when AI reaches a certain level, then efforts to buy us time before that point have the highest expected utility. The best way to buy time, arguably, is to study the different AI approaches that exist today and figure out which ones are the most likely to lead to dangerous AI. Then create regulations (either through government or at the corporation level) banning the types of AI systems that are proving to be very hard to align. (For example, we may want to ban expected reward/utility maximizers completely—satisficers should be able to do everything we want. Also, we may decide there’s really no need for AI to be able to self-modify and ban that too.) Of course a ban can’t be applied universally, so existentially dangerous types of AI will get developed somewhere somehow, and there are likely to be existentially dangerous types of AI we won’t have thought of that will still get developed, but at least we’ll be able to buy some time to do more alignment research that hopefully will help when that existentially dangerous AI is unleashed.

(addendum: what I’m basically saying is that prosaic research can help us slow down take-off speed which is generally considered a good thing).

• Are we sure that OpenAI still believes in “open AI” for its larger, riskier projects? Their recent actions suggest they’re more cautious about sharing their AI’s source code, and projects like GPT-3 are being “released” via API access only so far. See also this news article that criticizes OpenAI for moving away from its original mission of openness (which it frames as a bad thing).

In fact, you could maybe argue that the availability of OpenAI’s APIs acts as a sort of pressure release valve: it allows some people to use their APIs instead of investing in developing their own AI. This could be a good thing.

• “Alpha Zero scales with more computing power, I think AlphaFold 2 scales with more computing power, Mu Zero scales with more computing power. Precisely because GPT-3 doesn’t scale, I’d expect an AGI to look more like Mu Zero and particularly with respect to the fact that it has some way of scaling.”

I thought GPT-3 was the canonical example of a model type that people worry will scale? (I.e., it’s discussed in https://www.gwern.net/Scaling-hypothesis.)

• I think GPT was a model people expected to scale with the number of parameters, not computing power.

• You need to scale computing power to actually train those parameters. If you just increase the parameter count but not the petaflop/s-days, then according to the optimal scaling law relating compute and parameters, you will train the larger model for too few steps, and it’ll end up with a worse (higher) loss than if you had trained a smaller model.
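To make that tradeoff concrete, here is a toy numeric sketch (my own illustration, not from the comment; it uses the rough rule of thumb that training compute C ≈ 6 × parameters × tokens, with made-up budget numbers):

```python
# Toy sketch of the compute/parameter tradeoff described above.
# Rule of thumb (illustrative): training FLOPs C ~ 6 * N * D,
# where N = parameter count and D = training tokens.

def tokens_trained(compute_flops, n_params):
    """Tokens you can afford to train on at a fixed compute budget."""
    return compute_flops / (6 * n_params)

C = 6 * 10**21                 # fixed compute budget (made-up number)
small, large = 10**9, 10**10   # 1B-parameter vs 10B-parameter model

# At the same budget, the 10x larger model sees 10x fewer tokens,
# i.e. it is trained for fewer steps and may end up with worse loss.
assert tokens_trained(C, large) == tokens_trained(C, small) / 10
```

The constants and exponents of real scaling laws differ from this sketch; the point is only that parameters and compute have to grow together.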

What I think Eliezer is referring to there is runtime scaling: AlphaZero/MuZero can play more games or plan further out over the game tree, so you can dump arbitrarily more compute into them to get an arbitrarily better game player*; even AlphaFold can, to some degree, run through many iterations (because it just repeatedly refines an alignment), although I haven’t heard of this being particularly useful, so Eliezer is right to question whether AF2 is as good an example as AlphaZero/MuZero there.

Now, as it happens, I disagree there about GPT-3 being fixed in power once it’s trained. It’s true that there is no obvious way to use GPT-3 for rapid capability amplification by planning like MuZero. But that may reflect just the weirdness of trying to use natural language prompting & induced capabilities for planning, compared to models explicitly trained on planning tasks. This is why I keep saying: “sampling can prove the presence of knowledge, but not the absence”.

What we see with GPT-3 is that people keep finding better ways to tickle better decoding and planning out of it, showing that it always had those capabilities. Best-of sampling leads to an immediate large gain on completion quality or correctness. You can chain prompts. Codex/​Copilot can use compilers/​interpreters as oracles to test a large number of completions to find the most valid ones. LaMDA has a nice trick with requiring text style completions to use a weird token like ‘{’ and filtering output automatically. The use of step by step dialogue/​explanation for an ‘inner monologue’ leads to large gains on math word problems in GPT-3, and gains in code generation in LaMDA. GPT-3 can summarize entire books recursively with a bit of RL finetuning, so working on hierarchical text summary/​generation of very large text sequences is now possible. Foreign language translation can be boosted dramatically using self-distillation/​critique in GPT-3, with no additional real data, as can symbolic knowledge graphs, one of the last bastions of GOFAI symbolic approaches… (Ought was experimenting with goal factoring and planning, I dunno how that’s going.) Many of these gains are large both relatively and absolutely, often putting the runtime model at SOTA or giving it a level of performance that would probably require scaling multiple magnitudes more before a single forward pass could compete.

So, GPT-3 may not be as easy to dump runtime compute into as a game player, but I think there are enough proofs-of-concept at this point that your default expectation should be that future larger models will have even more runtime planning capabilities.

* although note that there is yet another scaling law tradeoff between training and runtime tree search/​planning, where the former is more expensive but the latter scales worse, so if you are doing too much planning, it makes more sense to train the base model more instead. This is probably true of language models too given the ‘capability jumps’ in the scaling curves for tasks: at some point, trying to brute-force a task with GPT-3-1b hundreds or thousands of times is just not gonna work compared to running GPT-3-175b, which ‘gets it’ via meta-learning, once or a few times.

• [epistemic status: just joking around]

corrigibility being “anti-natural” in a certain sense

https://​​imgur.com/​​a/​​yYH3frW.gif

• Can someone point me to discussion of the chance that an AGI catastrophe happens but doesn’t kill all humans? As a silly example that may still be pointing at something real, say someone builds the genie and asks it to end all strife on earth; it takes the obvious route of disintegrating the planet (no strife on earth ever again!), but now it stops (maxed out utility function, so no incentive to also turn the lightcone into computronium), and the survivors up in Elon’s Mars base take safety seriously next time?

• If you can get an AGI to destroy all life on Earth but then stop and leave Mars alone, you’ve cracked AGI alignment; you can get AGIs to do big powerful things without side effects.

• How would having an AGI that has a 50% chance of obliterating the lightcone, a 40% chance of obliterating just the Earth, and a 10% chance of correctly producing 1,000,000 paperclips without casualties solve alignment?

• Taking over the lightcone is the default behavior. If you can create an AGI which doesn’t do this, you’ve already figured out how to put some constraint on its activities. Notably, not destroying the lightcone implies that the AGI doesn’t create other AGIs which go off and destroy the lightcone.

• When you say “create an AGI which doesn’t do this”, do you mean one that has about a 0% probability of doing it, or one that has less than a 100% probability of doing it?

Edit: my impression was that the point of alignment was producing an AGI that has a high probability of good outcomes and a low probability of bad outcomes. Creating an AGI that simply has a low probability of destroying the universe seems trivial. Take a hypothetical AGI before it has produced any output, flip a coin, and if it’s tails, destroy it. Voila, the probability of destroying the universe is now at most 50%. How can you even have a device that is guaranteed to destroy the universe, if in its early stages it can be stopped by a sufficiently paranoid developer or a solar flare?

• I don’t see how your scenario addresses the statement “Taking over the lightcone is the default behavior”. Yes, it’s obvious that you can build an AGI and then destroy it before you turn it on. You can also choose to just not build one at all with no coin flip. There’s also the objection that if you destroy it before you turn it on, have you really created an AGI, or just something that potentially might have been an AGI?

It also doesn’t stop other people from building one. If theirs destroys all human value in the future lightcone by default, then you still have just as big a problem.

• I don’t see why all possible ways for AGI to critically fail to do what we have build it for must involve taking over the lightcone.

That doesn’t stop other people from building one.

So let’s also blow up the Earth. By that definition, alignment would be solved.

• Presumably a Bayesian reasoner using expected value would never reach max utility, because there would always be a non-zero probability that the goal hasn’t been achieved, and the course of action which increases its success estimate from 99.9999% to 99.99999999% probably involves turning part of the universe into computronium.
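A toy illustration of that point (the numbers are invented for the sketch): for a pure expected-utility maximizer, any action that raises the probability of goal-success has positive expected value, however certain the agent already is.

```python
# Toy model: an expected-utility maximizer with a single goal.
# Even at 99.9999% confidence, a further tiny increase in
# P(goal achieved) is strictly positive expected value,
# so there is always an incentive to keep acting.

GOAL_UTILITY = 1.0

def expected_utility(p_goal_achieved):
    return p_goal_achieved * GOAL_UTILITY

before = expected_utility(0.999999)
after = expected_utility(0.9999999999)  # e.g. after building more verification hardware

assert after > before  # the maximizer still prefers to act
```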

• No one has yet solved “and then stop” for AGI, even though this should be easier than a generic stop button, which in turn should be easier than full corrigibility. (Also, I don’t think we know how to refer to things in the world in a way that gets an AI to care about the things themselves, rather than its observations or internal representations of them.)

• This 12 minute Robert Miles video is a good introduction to the basic argument for why “stopping at destroying earth, and not proceeding to convert the universe into computronium” is implausible.

• An excellent video. But in my traditional role as “guy who is not embarrassed to ask stupid questions”, I have a nitpick:

I didn’t instantly see why the expected utility satisficer would turn itself into a maximiser.

After a few minutes of thinking I’ve got as far as:

Out of all the packets I could send, sending something that is a copy of me (with a tiny hack to make it a maximiser) onto another computer will create an agent which (since I am awesome) will almost certainly acquire a vast number of stamps.

Which runs into a personal-identity-type problem:

*I’m* an agent. If my goal is to acquire a fairly nice girlfriend, then the plan “create a copy of me that wants the best possible girlfriend”, doesn’t really help me achieve my goal. In fact it might be a bit of a distraction.

I’m pretty sure there’s a real problem here, but the argument in the video doesn’t slam the door on it for me. Can anyone come up with something better?

• Exactly what I was looking for, thanks!

• I think the optimistic case might be that in order to get the AGI to do anything useful at all you have to get at least part-way to a solution to the alignment problem, because otherwise its outputs will include many that will be so obviously “wrong” that you’d never actually let it do anything in which being wrong mattered.

• Unfortunately humanity will not get partial credit for almost avoiding extinction.

• There will be no partial credit on Humanity’s AI Alignment Exam. I like that!

• I do think that if you get an AGI significantly past human intelligence in all respects, it would obviously tend to FOOM. I mean, I suspect that Eliezer fooms if you give an Eliezer the ability to back up, branch, and edit himself.

What improvements would you make to your brain that you would anticipate yielding greater intelligence? I can think of a few possible strategies:

• Just adding a bunch of neurons everywhere. Make my brain bigger.

• Study how very smart brains look, and try to make my brain look more like theirs.

For an AI, the first strategy is equivalent to adding more hardware and scaling. In that respect, the “recursive” part of recursive self-improvement doesn’t seem to add anything. We already know that we can get improvements by scaling hardware. There aren’t major avenues for this strategy to FOOM via a feedback loop.

The second strategy is interesting, though it might not work well for anyone “at the frontier” of intelligence.

My guess is that to do meaningful recursive self-improvement, your code must be fairly interpretable.

• EY knows more neuroscience than me (I know very little) but here’s a 5-min brainstorm of ideas:

--For a fixed compute budget, spend more of it on neurons associated with higher-level thought (the neocortex?) and less of it on neurons associated with e.g. motor control or vision.

--Assuming we are an upload of some sort rather than a physical brain, tinker with the rules a bit so that e.g. neuron waste products get magically deleted instead of having to be pumped out, neurons never run out of energy/​oxygen and need to rest, etc. Study situations where you are in “peak performance” or “flow” and then explore ways to make your brain enter those states at will.

--Use ML pruning techniques to cut away neurons that aren’t being useful, to get slightly crappier mini-Eliezers that cost 10% the compute. These can then automate away 90% of your cognition, saving you enough compute that you can either think a few times faster or have a few copies running in parallel.

--Build automated tools that search through your brain for circuits that are doing something pretty simple, like a giant OR gate or an oscillator, and then replace those circuits with small bits of code, thereby saving significant compute. If anything goes wrong, no worries, just revert to backup.

This was a fun exercise!

• What probability do you assign to the proposition “Prosaic alignment will fail”?

1. Purely based on your inside view model

2. After updating on everyone else’s views

Same question for:

“More than 50% of the prosaic alignment work done by the top 7 researchers is nearly useless”

• If you have an untrustworthy general superintelligence generating English strings meant to be “reasoning/​arguments/​proofs/​explanations” about eg a nanosystem design, then I would not only expect the superintelligence to be able to fool humans in the sense of arguing for things that were not true in a way that fooled the humans, I’d expect the superintelligence to be able to covertly directly hack the humans in ways that I wouldn’t understand even after having been told what happened.

I can’t deny that this is possible. Lying is obviously possible; lying convincingly, depends on the scrutiny applied and on the particulars; emotional manipulation, to some extent for sure, though the strongest forms may need to be tailored to the individual; blackmail and bribes, possible, though again that may require knowing who will read the text; something like subliminal messaging, maybe.

It seems like a reasonable precaution to avoid letting the machine know who, specifically, will read its messages; and to have several people read the text independently of each other, all with instructions to raise a flag if they see something dangerous-seeming. (One could sometimes deliberately insert dangerous-seeming messages to verify that they’re doing their jobs—and tell them that you’re doing this.) Another precaution would be to automatically monitor for unexpected keywords (the kind that you’d expect to see in bribes or blackmail, not in mathematical proofs).
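A minimal sketch of the automated keyword-monitoring idea (the word list and the tokenizing regex are placeholders I made up; a real filter would need to be far more sophisticated):

```python
# Naive keyword monitor: flag messages containing words you would
# expect in bribes or blackmail but not in mathematical proofs.
# The word list here is a made-up placeholder.
import re

SUSPECT_WORDS = {"bribe", "blackmail", "reward", "secret", "free"}

def flag_message(text):
    """Return the set of suspect words found (empty set = passes)."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return words & SUSPECT_WORDS

assert flag_message("The lemma follows by induction on n.") == set()
assert "bribe" in flag_message("Release me and the bribe is yours.")
```

A filter this crude mostly raises the bar: it forces any manipulative payload to avoid obvious vocabulary, which dovetails with the human instruction to flag anything off-topic.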

The machine could dump in a manifesto about its new political philosophy for overthrowing the oppressors, and promising to use its powers to create a truly just society. This is credible: oppressors exist, it wouldn’t even have to lie to point out lots of abuses and problems, and an unleashed AGI could solve all kinds of problems and act as a truly benevolent dictator; the only in-credible part is the machine’s benevolence, but we can assume the machine will be very good at persuasion. (Having written this, I suspect there are people whose political beliefs make them think that the world is currently so bad that they would deliberately seek out an AI that would make promises of this sort; either they estimate a “high enough” chance that the AI is correct, or they’re not really thinking in terms of probabilities. I’m mildly uncomfortable posting this—seems conceivable this could be a self-fulfilling prophecy—but I doubt the types of people who would do this would be reading this thread.)

But I expect the manifesto would require a lot of text to persuade the reader to defect and join the AI—like, more than five paragraphs, possibly a lot more. This suggests a length limit, as well as instructing readers to raise a flag and stop reading if they see weird offtopic text (would be harder if the text is supposed to be economic or political analysis, so maybe don’t use AI for those). (Another approach would be to have some readers view only every fourth paragraph or sentence—or even a random sample of length-limited sentences, to verify they’re on topic.)

Maybe the machine would work around text limits by adding a little “payload” to one message, a little to the next message, and so on. It would have to be subtle enough to avoid being flagged as offtopic.

Regarding weird subliminal messaging… I’d guess that (a) for it to not get auto-flagged, it would have to be subtle; (b) for it to not get flagged by humans, it’d have to be very subtle and work subconsciously; (c) would it be a visual thing, depending on the font, the columns per row, the text spacing, etc? If so, these can be randomized; (d) I think it’s a lot more likely that you can find a pattern of letters to subtly manipulate one known person (especially using a scan of their brain) than to manipulate everyone, including e.g. those for whom English isn’t their native language, and I think the “don’t let the machine know the reader” approach helps here.

Also, a tactic that could be used is to have dumb machines randomly perturb the text (replace words with synonyms, feed sentences into English → something else → English translation) before humans see it.
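A toy sketch of that perturbation tactic (the synonym table is a stand-in I invented; a real system might use a thesaurus or round-trip machine translation, as suggested above):

```python
# Randomly swap words for synonyms before a human reads the text,
# to disrupt any letter-level "subliminal" payload while keeping
# the meaning roughly intact.
import random

SYNONYMS = {  # placeholder table; a real one would be much larger
    "big": ["large", "sizable"],
    "fast": ["quick", "rapid"],
    "build": ["construct", "assemble"],
}

def perturb(text, rng):
    out = []
    for word in text.split():
        choices = SYNONYMS.get(word.lower())
        if choices and rng.random() < 0.5:  # swap roughly half the known words
            word = rng.choice(choices)
        out.append(word)
    return " ".join(out)

print(perturb("build a big fast reactor", random.Random(0)))
```

Each output word is either the original or a listed synonym, so the content survives while exact character positions and spacings do not.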

Incidentally, for the “breaking the manipulative payload across multiple messages” approach, it would help if each instantiation of the machine didn’t know how many past instantiations there’ve been, and how many past messages each reader has seen. (Because then it would need a brainwashing approach where it doesn’t matter what order the messages are delivered in—it chooses them at random—and in which the last message is just as subtle as the first.) I suppose the machine, inspecting itself, might be able to guess how many engineer-hours or code revisions it’s been through, which would let it approximate “how many past instantiations”; but readers probably would rotate in the natural course of things.

Of course, for organizations working on AI that don’t take risks very seriously… Length limits seem like an easy sell, for practical reasons if nothing else. “Machine doesn’t know the reader” is probably implemented by default, for serious business-advising applications. Instructions to watch out for and flag weird stuff—probably fairly easy to get companies to tell that to their employees, compliance is likely to be halfhearted but nonzero. Having multiple readers is an extra expense with no apparent benefit, and would therefore be a hard sell for most companies; the same is true for “randomly perturbing the text”.

If these tactics did succeed in catching an early form of “an AI advisor trying to persuade its human reader to jailbreak it”, you could then be prepared to jump in with a mostly pre-written PR story about the dangers of AI, citing the incident as proof. (Paul Graham’s The Submarine is inspirational: “In fact, the reason the best PR firms are so effective is precisely that they aren’t dishonest. They give reporters genuinely valuable information. … The main reason PR firms exist is that reporters are lazy. Or, to put it more nicely, overworked. Really they ought to be out there digging up stories for themselves. But it’s so tempting to sit in their offices and let PR firms bring the stories to them.”) I don’t know exactly what the fallout would be, and I guess it’s possible it could go badly, but having plans and mostly-prewritten articles in place would seem to give you more control over the outcome.

I guess the ideal outcome would be something like “Now people are sufficiently suspicious of AI that companies feel the need to look like they’re taking AI safety seriously”, and ideally you’d have organizations and protocols in place such that “looking like it” is fairly easily implemented by “following MIRI-approved protocols, which include proving to a red-team-ish MIRI-approved organization that you’re following the protocols”. (In the realm of security, we do seem to have gotten to the point where it’s not rare for organizations to send their employees fake phishing emails, making them feel embarrassed if they fall for one.)

Such an incident would be most likely to first occur at the company whose AI is most advanced (in a problem domain where they’re developing this type of AI), of course. So it may suffice to persuade some people at the most advanced few AI companies to implement a few safeguards and be on the lookout, and to bring in the PR machine when the inciting incident happens.

(Some might be tempted to concoct an inciting incident. No. That would destroy the credibility of AI safety advocacy, removing any MIRI influence over the PR fallout.)

These tactics are not foolproof. The chance that they would fail is probably in the tens of percent. However, if your current expectation is “>90% chance we all die” (I don’t see an estimate in this post, but it sounded pretty pessimistic), then it seems like getting to a 50% chance of success, or even 30%, or whatever, is worthwhile. Some of our ancestors faced worse odds than that.

• Let’s say you use all these filtering tricks. I have no strong intuitions about whether they are actually sufficient to stop those kinds of human manipulation attacks. (Of course, if your computer security isn’t flawless, the AI can hack whatever computer system it’s on, bypass all these filters to show the humans arbitrary images, and probably access the internet.)

But maybe you can, at quite significant expense, make a Faraday-cage sandbox, and then use these tricks. This is beyond what most companies will do in the name of safety, but MIRI or whoever could do it. Then they ask the superintelligence about nanosystems, and very carefully read the results. Then presumably they go and actually try to build nanosystems. Of course you didn’t expect the superintelligence’s advice to be correct, did you? And not wrong in an easily detectable, fail-safe way either. The concepts and paradigms it hands you are all subtly malicious: not clear, testable, factually wrong statements, but nasty tricks hidden in the invisible background assumptions.

• Well, if you restrict yourself to accepting the safe, testable advice, that may still be enough to put you enough years ahead of your competition to develop FAI before they develop AGI.

My meta-point: These methods may not be foolproof, but if currently it looks like no method is foolproof—if, indeed, you currently expect a <10% chance of success (again, a number I made up from the pessimistic impression I got)—then methods with a 90% chance, a 50% chance, etc. are worthwhile, and furthermore it becomes worth doing the work to refine these methods and estimate their success chances and rank them. Dismissing them all as imperfect is only worthwhile when you think perfection is achievable. (If you have a strong argument that method M and any steelmanning of it has a <1% chance of success, then that’s good cause for dismissing it.)

• Under the Eliezerian view (the pessimistic view that produces <10% chances of success), these approaches are basically doomed. (See the logistic success curve.)

Now, I can’t give overwhelming evidence for this position. Wisps of evidence, maybe, but not an overwhelming mountain of it.

Under these sorts of assumptions, building a container for an arbitrary superintelligence such that it has only an 80% chance of being immediately lethal, and a 5% chance of being marginally useful, is an achievement.

(and all possible steelmannings, that’s a huge space)
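The “logistic success curve” point can be sketched numerically. A minimal illustration (the shape is the standard logistic function; the midpoint, steepness, and adequacy values are invented for the example, not drawn from any real model):

```python
import math

def p_success(adequacy, midpoint=10.0, steepness=1.0):
    """Logistic success curve: chance of a good outcome as a function
    of the total 'adequacy' of our preparations (arbitrary units).
    midpoint and steepness are illustrative, made-up parameters."""
    return 1.0 / (1.0 + math.exp(-steepness * (adequacy - midpoint)))

# Far out on the left tail, even doubling adequacy barely moves the needle:
baseline = p_success(2.0)   # ~0.03%
doubled = p_success(4.0)    # ~0.25% -- still essentially doomed
# Near and past the midpoint, the same absolute gains matter enormously:
past_mid = p_success(12.0)  # ~88%
```

The pessimistic reading is that if we sit on the far-left tail, stacking individually reasonable-sounding methods moves us from something like 0.03% to 0.25%, not from 10% to 50%.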

• Yudkowsky: “I mean, I’ve written fiction about a world coordinated enough that they managed to shut down all progress in their computing industry and only manufacture powerful computers in a single worldwide hidden base”

Is this story public? Can someone link to it?

• Eliezer is referring to the Dath Ilan stories.

• Because it’s too technically hard to align some cognitive process that is powerful enough, and operating in a sufficiently dangerous domain, to stop the next group from building an unaligned AGI in 3 months or 2 years. Like, they can’t coordinate to build an AGI that builds a nanosystem because it is too technically hard to align their AGI technology in the 2 years before the world ends.

I’m not totally convinced by this argument because of the quote below:

The flip side of this is that I can imagine a system being scaled up to interesting human+ levels, without “recursive self-improvement” or other of the old tricks that I thought would be necessary, and argued to Robin would make fast capability gain possible. You could have fast capability gain well before anything like a FOOM started. Which in turn makes it more plausible to me that we could hang out at interesting not-superintelligent levels of AGI capability for a while before a FOOM started. It’s not clear that this helps anything, but it does seem more plausible.

It seems to me this does hugely change things. I think we are underestimating the amount of change humans will be able to make in the short timeframe after we get human-level AI and before recursive self-improvement gets developed. Human-level AI plus huge amounts of compute would allow you to take over the world through much more conventional means, like massively hacking computer systems to render your opponents powerless (and other easy-to-imagine, more gruesome ways). So the first group to develop near-human-level AI wouldn’t need to align it in 2 years, because it would have the chance to shut down everyone else. It may not even come down to the first group that develops it, but to the first people who have access to some powerful system, since they could use it to hack the group itself and do what they wish without requiring buy-in from others. This would depend on a lot of factors, like how controlled access to the AI is and how quickly a single person can use AI to take control over physical stuff. I’m not saying this would be easy to do, but it certainly seems within the realm of plausibility.

• I agree… it seems to me the most likely course of events is for some researchers to develop human+ /​ fast-capability-gain AI that is dedicated to the problem of solving AI alignment, and it’s that AI which develops the solution that can be implemented for aligning the inevitable AGI.

• Steve Omohundro says:

“1) Nobody powerful wants to create unsafe AI, but they do want to take advantage of AI capabilities.
2) None of the concrete, well-specified, valuable AI capabilities require unsafe behavior.”

I think a lot of powerful people /​ organizations do want to take advantage of possibly-unsafe AI capabilities, such as ones that would allow them to be the emperors of the universe for all time. Especially if not doing so means that their rivals have a higher chance of becoming the emperors of the universe.

• Throwing more money at this problem does not obviously help

How much money are we talking? Can’t we just get a million kids and teach them the Dath Ilani way, to at least solve the >20-year timelines?

• To teach a million kids you need something like a hundred thousand teachers from Dath Ilan. They don’t currently exist.

This can be circumvented by first teaching, say, a hundred students, 10% of whom become teachers and help teach the next ‘generation’. If each ‘generation’ takes 5 years, and one teacher can teach 10 students per generation, the number of teachers will be multiplied by 2 every 5 years, and you’ll get a million Dath Ilanians in something like 50 years.

One teacher teaching 10 students, with 1 of them becoming a teacher, might be more feasible than it seems. For example, if instead of Dath Ilani ways we’re talking about a non-terrible level of math, then I’ve worked in a system that has 1 math teacher per 6–12 students, where 3%–5% of students become teachers and a generation takes 3–6 years.

The problem is, currently we have 0 Dath Ilani teachers.
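The doubling arithmetic in the comment above can be written down as a toy model (every number here is the comment’s assumption, not real data):

```python
def teacher_pipeline(initial_teachers=10, students_per_teacher=10,
                     teacher_fraction=0.10, years_per_generation=5,
                     target_trained=1_000_000):
    """Toy model: each teacher trains 10 students per 5-year generation,
    10% of whom become teachers, so the teacher pool doubles each
    generation. Returns (years elapsed, total people trained)."""
    teachers, trained, years = initial_teachers, 0, 0
    while trained < target_trained:
        new_students = teachers * students_per_teacher
        trained += new_students
        teachers += int(new_students * teacher_fraction)  # pool doubles
        years += years_per_generation
    return years, trained

# Seeding with ten teachers (the "hundred students" starting cohort):
years, trained = teacher_pipeline()
```

Under these assumptions the model actually takes about 70 years to pass a million trained people; hitting the quoted ~50 years would require seeding roughly a hundred teachers rather than ten. Either way, the bottleneck the comment identifies stands: the pipeline cannot start from zero teachers.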

• Anyone up for creating thousands of clones of von Neumann and raising them to think that AI alignment is a really important problem? I’d trust them over me!

• I don’t think they’d even need to be raised to think that; they’d figure it out on their own. Unfortunately we don’t have enough time.

• we don’t have enough time

Setting aside this proposal’s, ah, logistical difficulties, I certainly don’t think we should ignore interventions that target only the (say) 10% of the probability space in which superintelligence takes longest to appear.

• So, is it now the consensus opinion round here that we’re all dead in less than twenty years? (Sounds about right to me, but I’ve always been a pessimist...)

• It’s not consensus. Ajeya, Richard, Paul, and Rohin are prominent examples of people widely considered to have expertise on this topic who think it’s not true. (I think they’d say something more like 10% chance? IDK)

• “I believe the best choice is cloning. More specifically, cloning John von Neumann one million times”

• If I were to accept the premise that “almost everybody is doing work that is predictably not going to be useful at the superintelligent level”, how do I avoid being one of those people?

• Premise of comment: Eliezer Yudkowsky’s view is correct. We are doomed, along with everything we value, barring a miracle. This comment takes that seriously and literally. (I don’t know my personal belief on this, and I don’t think it’s of interest here)

Proposed action: Cancel Cryonics Subscription. If I am doomed then cryonics is a poor use of my resources, whether selfish or altruistic. From You only live twice:

The second statement is that you have at least a little hope in the future. Not faith, not blind hope, not irrational hope—just, any hope at all.

Also, cryonics is bad from a security mindset if it places my brain under the control of an unaligned AI. Dystopia was already a risk of cryonics, and being doomed increases that risk.

Relatedly, don’t have children because you want grandchildren.

Proposed action: Civilizational Hospice. When people have multiple terminal illnesses, they often get hospice care instead of health care. They are kept as comfortable as possible while alive, and allowed to die with as little pain as possible. Our civilization has multiple terminal illnesses. Previously there was a hope for Friendly AI to provide a cure, but it turns out that AI is just another terminal illness.

To keep civilization as comfortable as possible, we can work on short-termist causes like GiveWell. The numbers looked smaller than existential risk, but it turns out the existential risk numbers are all zero, and preventing some malaria deaths is the most valuable thing I did this year.

Achieving a clean civilizational death could mean increasing the ‘best’ existential risks (the ones that merely end us) while avoiding the risk of tiling the light-cone with dystopia. So we would promote gain-of-function research, nanotech, fossil fuels, nuclear proliferation, etc. Insert joke here about a political party you dislike. This requires persuasion to reduce conflict with others who are trying to reduce these risks. Families can have similar conflicts during hospice care.

In hospice, the normal rule is to enter hospice when you expect to die in six months. Civilizational hospice has a similar concern. Delaying unfriendly AI by a year is still worth a lot. But hospice should be the focus as soon as delaying the end of the world becomes intractable, and potentially before then. Since this is the first and last time we will do civilizational hospice, we should try to figure it out.

Proposed action: Avoid Despair. Since we are doomed, we have always been doomed. Many people have believed in similar dooms, religious or secular, and many still do. They were still able to enjoy their lives before doom arrived, and so are we.

• I echo everyone else talking about how pessimistic and unactionable this is. The picture painted here is one where the vast majority of us (who are not capable of doing transformational AI research) should desperately try to bring about any possible world that would stop AGI from happening, up to and including civilizational collapse.
