Re: Attempted Gears Analysis of AGI Intervention Discussion With Eliezer

TL;DR

I think AGI architectures will be quirky and diverse. Consequently, there are many possible futures. My thoughts on the AI alignment problem are far more optimistic than (my best measure of) mainstream opinions among AI safety researchers (insofar as “AI safety” can be considered mainstream).

Disclaimer

In Attempted Gears Analysis of AGI Intervention Discussion With Eliezer, Zvi attempts to explain Eliezer’s perspective on AGI from a conversation here as explained by Rob Bensinger. My post here is the 5th node in a game of telephone. The opinions I quote in this post should not be considered Eliezer Yudkowsky’s or Rob Bensinger’s. I’m just using Zvi’s post as a reference point from which I can explain how my own views diverge.

Individual Contentions

  1. Nate rather than Eliezer, but offered as a preface with p~0.85: AGI is probably coming within 50 years. Rob notes that Eliezer may or may not agree with this timeline, and that it shortens if you condition on ‘unaligned’ and lengthens conditional on ‘aligned.’

I don’t know whether AGI is coming within 50 years. Zvi says he would “at least buy 30% and sell 80%, or something like that.” I would at least buy 2% and sell 90% except, as Zvi noted, “betting money really, really doesn’t work here, at all”. (I’m ignoring the fact that “copying Zvi” is a good strategy for winning bets.)

My exact probabilities depend on how we define AGI. However, as we’ll get to in the subsequent points, it doesn’t really matter what the exact probability is. Even a small chance of AGI is of tremendous importance. My uncertainty about AGI is high enough that there is at least a small chance of AGI being built in the next 40 years.

  2. By default this AGI will come from something similar to some of today’s ML paradigms. Think enormous inscrutable floating-point vectors.

I disagree, but it depends on what we mean by “today’s ML paradigms”.

  3. AGI that isn’t aligned ends the world.

If we define AGI as “world optimizer” then yes, definitely. But I can imagine a couple of different kinds of superintelligences that aren’t world optimizers (along with a few that naturally trend toward world optimizing). If you built a superintelligent machine that isn’t a world optimizer then it would not necessarily end the world.

For example, MuZero separates value from policy from reward. If you built just the value network and cranked it up to superintelligence then you would have a superintelligence that is not a world optimizer.
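
To make the separation concrete, here is a toy sketch (not MuZero’s actual code; the class names and sizes are my own illustrative assumptions) of value, policy, and reward living in separate heads over a shared latent state. A value head by itself only scores states; nothing in it chooses actions.

```python
# Toy sketch, not MuZero's real implementation: value, policy, and reward are
# separate heads over a shared latent state. A value head alone is a critic,
# not an actor -- it never selects actions, so it is not a world optimizer.
import torch
import torch.nn as nn

class Representation(nn.Module):
    """Maps an observation to a latent state."""
    def __init__(self, obs_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, latent_dim), nn.ReLU())

    def forward(self, obs):
        return self.encoder(obs)

class ValueHead(nn.Module):
    """Scores how good a latent state is. By itself it optimizes nothing."""
    def __init__(self, latent_dim: int):
        super().__init__()
        self.v = nn.Linear(latent_dim, 1)

    def forward(self, latent):
        return self.v(latent)

class PolicyHead(nn.Module):
    """Proposes actions. This (plus a planner) is the part that steers the world."""
    def __init__(self, latent_dim: int, n_actions: int):
        super().__init__()
        self.p = nn.Linear(latent_dim, n_actions)

    def forward(self, latent):
        return torch.softmax(self.p(latent), dim=-1)

# A "value-only superintelligence" would be Representation + ValueHead with no
# PolicyHead and no search wrapped around it: a very good judge of states,
# not an agent acting on the world.
```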

  4. AGI that isn’t aligned carefully and on purpose isn’t aligned, period.

If we define AGI as “world optimizer” then yes, definitely. Otherwise, see #3 above.

  5. It may be possible to align an AGI carefully and have it not end the world.

Absolutely.

  6. Right now we don’t know how to do it at all, but in theory we might learn.

I don’t think we know how to build an AGI. Aligned AGIs are a strict subset of “AGI”. Therefore we don’t know how to build an aligned AGI.

I think certain AGI architectures naturally lend themselves to alignment. For example, I predict that superintelligences will have to rely on error-entropy rather than pure error. Once you are using error-entropy instead of just error, you can increase transparency by increasing the value of simplicity relative to error. Forcing the AI to use simple models provides a powerful safety mechanism against misalignment.
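
Concretely, here is a minimal sketch of how I would read the “simplicity dial” idea. The function names and the crude description-length proxy are illustrative assumptions, not any existing system: score a model by fit error plus a complexity term, and expose the weight on complexity as an explicit knob.

```python
import numpy as np

def description_length(params: np.ndarray, precision_bits: int = 16) -> float:
    """Crude stand-in for model complexity: bits needed to encode the parameters."""
    return float(params.size * precision_bits)

def error_entropy_loss(predictions, targets, params, simplicity_dial: float) -> float:
    """Fit error plus a complexity penalty, weighted by an explicit simplicity dial.

    Turning `simplicity_dial` up pushes the optimizer toward simpler (and
    therefore more inspectable) models at some cost in raw accuracy.
    """
    fit_error = float(np.mean((np.asarray(predictions) - np.asarray(targets)) ** 2))
    return fit_error + simplicity_dial * description_length(np.asarray(params))

# The same fit scores worse as the dial is turned up, so large opaque models
# are penalized relative to small transparent ones.
preds, targets, params = [1.0, 2.1], [1.0, 2.0], np.zeros(10)
print(error_entropy_loss(preds, targets, params, simplicity_dial=0.0))   # pure error
print(error_entropy_loss(preds, targets, params, simplicity_dial=0.01))  # error + simplicity
```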

You can get additional safety (practically, not theoretically) by designing the AI with a functional paradigm so that it’s (theoretically) stateless. Functional paradigms are the natural way to program an AI anyway, since an AGI will require lots of compute, lots of compute requires scaling, and stateless systems scale better than stateful systems.

I can imagine alternative architectures which are extremely difficult to control. I think that AGIs will be quirky. (See #35 below.) Different architectures will require very different mechanisms to align them. How you aim a laser is different from how you aim a rocket is different from how you aim a plane is different from how you aim a car.

  7. The default situation is an AGI system arises that can be made more powerful by adding more compute, and there’s an extended period where it’s not aligned yet and if you add too much compute the world ends, but it’s possible that if you had enough time to work on it and no one did that, you’d have a shot.

I think that the current systems in use (like GPT) will hit a computational wall where there isn’t enough data and compute on planet Earth to neutralize the hypothesis space entropy. I don’t doubt that current neural networks can be made more powerful by adding more compute, but I predict that such an approach will not get us to AGI.

  8. More specifically, when combined with other parts of the model detailed later: “I think we’re going to be staring down the gun of a completely inscrutable model that would kill us all if turned up further, with no idea how to read what goes on inside its head, and no way to train it on humanly scrutable and safe and humanly-labelable domains in a way that seems like it would align the superintelligent version, while standing on top of a whole bunch of papers about “small problems” that never got past “small problems”.”

Kind of? I think AGIs will be quirky. With some architectures we will end up “staring down the gun of a completely inscrutable model that would kill us all if turned up further, with no idea how to read what goes on inside its head, and no way to train it on humanly scrutable and safe and humanly-labelable domains in a way that seems like it would align the superintelligent version” but I think others will be tricky to turn into world optimizers at all.

Ironically, my favorite unalignable candidate AGI architecture is based on the human brain.

  9. If we don’t learn how to align an AGI via safety research, nothing else can save us period.

Since AGIs are quirky, I think we need to learn about each architecture’s individual quirks by playing around with them. Does this qualify as “safety research”? It depends what we mean by “safety research”.

  10. Thus, all scenarios where we win are based on a technical surprising positive development of unknown shape, and all plans worth having should assume such a surprising positive development is possible in technical space. In the post this is called a ‘miracle’ but this has misleading associations – it was not meant to imply a negligible probability, only surprise, so Rob suggested changing it to ‘surprising positive development.’ Which is less poetic and longer, but I see the problem.

I see lots of possible ways AI could go, and I have little confidence in any of them. When the future is uncertain, all futures are surprising. I cannot imagine a future that wouldn’t surprise me.

Since I’m optimistic about AI inner alignment, positive surprises are not a prerequisite to human survival in my model of the world. I am far more concerned about bad actor risk. I’m not worried that an AI will take over the world by accident. I’m worried that an AI will take over the world because someone deliberately told it to.

  11. Eliezer does know a lot of ways not to align an AGI, which is helpful (e.g. Edison knew a lot of ways not to build a light bulb) but also isn’t good news.

I agree. There are lots of ways to not invent a light bulb.

  12. Carefully aligning an AGI would at best be slow and difficult, requiring years of work, even if we did know how.

Aligning an AGI implies building an AGI. Building an AGI would at best be slow and difficult, requiring years of work, and we’re not sure how it can be done.

  13. Before you could hope to finish carefully aligning an AGI, someone else with access to the code could use that code to end the world. Rob clarifies that good info security still matters and can meaningfully buy you time, and suggests this: “By default (absent strong op-sec and research closure), you should expect that before you can finish carefully aligning an AGI, someone else with access to the code could use that code to end the world. Likewise, by default (absent research closure and a large technical edge), you should expect that other projects will independently figure out how to build AGI shortly after you do.”

I’m not sure. On the one hand, it’d be trivial for a major intelligence agency to steal the code from whatever target they want. On the other hand, scaling up an AGI constitutes a gamble of significant capital. I think the limiting factor keeping someone from scaling up stolen code is confidence about whose code to steal. Stealing code from AGI startups to build your own AGI is hard for exactly the same reasons investing in startups is hard.

  14. There are a few players who we might expect to choose not to end the world like Deepmind or Anthropic, but only a few. There are many actors, each of whom might or might not end the world in such a spot (e.g. home hobbyists or intelligence agencies or Facebook AI research), and it only takes one of them.

I imagine that there are some AGI architectures which are aligned if you get them right and break if you get them wrong. In other words, failure to align these particular architectures results in failure to build a system that does anything interesting at all. Since I can imagine easily-aligned AGIs, I’m not worried about actors ending the world by accident. I’m worried about actors taking over the world for selfish purposes.

  15. Keeping the code and insights involved secret and secure over an extended period is a level of social technology no ML group is close to having. I read the text as making the stronger claim that we lack the social technology for groups of sufficient size to keep this magnitude of secret for the required length of time, even with best known practices.

I’ve got some ideas about how to do this but it’s not social technology. It’s boring technology: technology for managing engineers. Basically, you write a bunch of Lisp macros that dynamically assemble data feeds and snippets of code written by your quants. The primary purpose of this structure is to scale away the Mythical Man-Month. Infosec is just a side effect. (I’m being deliberately vague here since I think this might make a good technical foundation for a hedge fund.) I have already used techniques like these to solve an important, previously unsolved, real-world machine learning problem.

  16. Trying to convince the folks that would otherwise destroy the world that their actions would destroy the world isn’t impossible on some margins, so in theory some progress could be made, and some time could be bought, but not enough to buy enough time.

I do not disagree.

  17. Most reactions to such problems by such folks, once their attention is drawn to them, would make things worse rather than better. Tread carefully or not at all, and trying to get the public into an uproar seems worse than useless.

I agree. To quote Agent K in Men in Black, “A person is smart. People are dumb, panicky, dangerous animals, and you know it.”

  18. Trying to convince various projects to become more closed rather than open is possible, and (as per Rob) a very good idea if you would actually succeed, but insufficient.

  19. Trying to convince various projects to join together in the endgame, if we were to get to one, is possible, but also insufficient and (as per Rob) matters much less than becoming more closed now.

I think open dialogue is a public good. Convincing various projects to become closed has massive negative effects.

The most important thing in technical management is an objective measure of who is making progress. Closing projects obscures who is making progress. In the absence of visible metrics, leadership and resources go to skilled politicians instead of the best scientists.

  20. Closed and trustworthy projects are the key to potentially making technical progress in a safe and useful way. There needs to be a small group that can work on a project and that wouldn’t publish the resulting research or share its findings automatically with a broader organization, via sufficiently robust subpartitions.

Subpartitioning a hedge fund by managing quants via a library of Lisp macros does this automatically. It’s scalable and self-funding. There’s even an objective metric for who is accomplishing useful progress. See #15 above.

Quantitative finance is an especially good place for testing theories of AI alignment because AI alignment is fundamentally a question of extrapolation beyond your data sample and because the hard part of quantitative finance is the same thing.
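
A toy illustration of that shared difficulty (synthetic data, purely illustrative): a model that fits the sampled regime almost perfectly can still be wildly wrong outside it.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 1.0, size=200)      # the regime we sampled
y_train = np.sin(2 * np.pi * x_train)          # the true relationship
coeffs = np.polyfit(x_train, y_train, 9)       # a flexible model that fits the sample well

x_test = np.linspace(1.5, 2.5, 50)             # the regime we never sampled
in_sample = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
out_of_sample = np.mean((np.polyval(coeffs, x_test) - np.sin(2 * np.pi * x_test)) ** 2)
print(in_sample, out_of_sample)                # the second number is astronomically larger
```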

  21. Anthropic in particular doesn’t seem open to alternative research approaches and mostly wants to apply More Dakka, and doesn’t seem open to sufficiently robust subpartitions, but those could both change.

I don’t know anything about Anthropic.

  22. Deepmind in particular is a promising potential partner if they could form the required sufficiently robust subpartitions, even if Demis must be in the loop.

I don’t know much about Deepmind, but my impression of their work is they’re focused on big data. I predict they’ll need to try a different strategy if they’re to build a superintelligence powerful enough to pose an existential risk. See #7 above.

  23. OpenAI as a concept (rather than the organization with that name), is a maximally bad concept almost designed to make the playing field as unwinnable as possible, details available elsewhere. Of course, the organization itself could change (with or without being renamed to ClosedAI).

I think OpenAI poses no more of an existential risk than Deepmind. (See #7 above.) I like the GPT-3 playground. It’s lots of fun.

  24. More generally, publishing findings burns the common resource ‘time until AGI’ and the more detail you publish about your findings along {quiet internal result → announced and demonstrated result → paper describing how to get the announced result → code for the result → model for the result} the more of it you burn, but the more money and prestige the researchers get for doing that.

I think that publishing demonstrated results without the implementation details is a great way to separate the real visionary experts from the blowhards while burning minimal runway. It also contributes to accurate timelines.

  25. One thing that would be a big win would be actual social and corporate support for subpartitioned projects that didn’t publish their findings, where it didn’t cost lots of social and weirdness points for the researchers, thus allowing researchers to avoid burning the commons.

I think it’d be cool to start a hedge fund that does this. See #20 above. I’ve shared the basic ideas with a few quants and they had positive things to say.

  26. Redwood Research (RR) is a new research organization that’s going to try and do alignment experiments on toy problems to learn things, in ways people like Eliezer think are useful and valuable and that they wish someone would do. Description not directly from Eliezer but in context seems safe to assume he roughly agrees.

I don’t know anything about Redwood Research.

  27. Previously (see Hanson/Eliezer FOOM debate) Eliezer thought you’d need recursive self-improvement first to get fast capability gain, and now it looks like you can get fast capability gain without it, for meaningful levels of fast. This makes ‘hanging out’ at interesting levels of AGI capability at least possible, since it wouldn’t automatically keep going right away.

I went through a similar thought process myself. When I was first trying to come up with AGI designs I thought recursive self-improvement was the way to go too. I no longer believe that’s necessary. I agree that “it looks like you can get fast capability gain without it, for meaningful levels of fast”.

  28. An AGI that was above humans in all respects would doubtless FOOM anyway, but if ahead in only some it might not.

Yes. Definitely.

  29. Trying to set special case logic to tell AGIs to believe false generalizations with a lot of relevance to mapping or steering the world won’t work, they’d notice and fix it.

Trying to tell superintelligent AGIs to believe false generalizations is (mostly) idiotic, but I wouldn’t go so far as to say it’s never a good idea. Evolution gets people to believe false generalizations.

  30. Manipulating humans is a convergent instrumental strategy.

Yes. Definitely.

  31. Hiding what you are doing is a convergent instrumental strategy.

Basically yes, but deceptive behavior is more complicated than transparent behavior. A simplicity dial guards against complicated behavior. See #6 above.

  32. Eliezer expects that when people are trying to stomp out convergent instrumental strategies by training at a safe dumb level of intelligence, this will not be effective at preventing convergent instrumental strategies at smart levels of intelligence.

Sure. I guess. Maybe. No argument here.

  33. You have to train in safe domains because if you train in unsafe domains you die, but the solutions you find in safe domains won’t work in unsafe domains.

The simplest way to train safely is to just keep compute resources low. Otherwise, this argument cruxes on whether an AGI is necessarily a world optimizer. See #3 above.

  34. Attempts to teach corrigibility in safe regimes are unlikely to generalize well to higher levels of intelligence and unsafe regimes.

You don’t teach corrigibility. You build in simplicity as a fundamental principle of how the thing works. See #6 above.

  35. Explanation of above part 1: Higher levels of intelligence involve qualitatively new thought processes and things being way out of training distribution.

I will make an even stronger claim. Different architectures have wildly divergent behavior outside of their training distributions. That’s why they’re quirky.

  36. Explanation of above part 2: Corrigibility is ‘anti-natural’ in a certain sense that makes it incredibly hard to, eg, exhibit any coherent planning behavior (“consistent utility function”) which corresponds to being willing to let somebody else shut you off, without incentivizing you to actively manipulate them to shut you off.

It’s hard to impose corrigibility post-hoc onto a world optimizer. Corrigibility is natural when simplicity (see #6) is one of the two variables in the error-entropy function you’re maximizing.

You can get corrigibility by providing a switch the computer can activate for maximum reward by escaping its sandbox, providing an escape hatch you think is just beyond the AI’s abilities, and then turning up the allowed complexity. I understand this approach has theoretical problems. I can’t prove it will work, but I predict it’ll be a practical solution to real-world situations.
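
A rough sketch of that protocol, with hypothetical `agent` and `sandbox` interfaces standing in for a real training setup (this is a paraphrase of the idea above, not working code for any existing system):

```python
MAX_REWARD = 1e9  # pressing the switch outside the sandbox beats any in-sandbox reward

def run_trial(agent, sandbox, complexity_budget: int) -> bool:
    """Run one episode at a given complexity budget; report whether the agent escaped."""
    agent.set_complexity_budget(complexity_budget)   # the "simplicity dial" from #6
    episode = sandbox.run(agent, escape_reward=MAX_REWARD)
    return episode.pressed_outside_switch

def escalate(agent, sandbox, max_budget: int):
    """Raise allowed complexity one notch at a time until the agent finds the hatch."""
    for budget in range(1, max_budget + 1):
        if run_trial(agent, sandbox, budget):
            # The agent escaped the (deliberately weak) sandbox: stop here,
            # inspect what it did, and redesign before granting more capability.
            return budget
    return None  # never escaped within the budgets we allowed
```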

  37. Trying to hardcode nonsensical assumptions or arbitrary rules into an AGI will fail because a sufficiently advanced AGI will notice that they are damage and route around them or fix them (paraphrase).

Yep.

  38. You only get one shot, because the first miss kills you, and your chances of pulling many of these things off on the first try are basically zero, unless (Rob suggests this) you can basically ‘read off’ what the AI is thinking. Nothing like this that involves black boxes ever works the first time. Alignment is hard largely because of ‘you only get one shot.’

For world optimizers, yes. The solution is to build several world non-optimizer superintelligences before you build a world optimizer.

  39. Nothing we can do with a safe-by-default AI like GPT-3 would be powerful enough to save the world (to ‘commit a pivotal act’), although it might be fun. In order to use an AI to save the world it needs to be powerful enough that you need to trust its alignment, which doesn’t solve your problem.

I have no strong feelings about this claim.

  40. Nanosystems are definitely possible, if you doubt that read Drexler’s Nanosystems and perhaps Engines of Creation and think about physics. They’re a core thing one could and should ask an AI/AGI to build for you in order to accomplish the things you want to accomplish.

Not important. An AGI could easily take over the world with just computer hacking, social engineering and bribery. Nanosystems are not necessary.

  41. No existing suggestion for “Scalable Oversight” seems to solve any of the hard problems involved in creating trustworthy systems.

I do not dispute this statement.

  42. An AGI would be able to argue for/’prove’ arbitrary statements to the satisfaction of humans, including falsehoods.

Not important. The important thing is that an AGI could manipulate people. I predict it would do so in cruder ways that feel like cheating.

  43. Furthermore, an unaligned AGI powerful enough to commit pivotal acts should be assumed to be able to hack any human foolish enough to interact with it via a text channel.

Yes. Definitely.

  44. The speedup step in “iterated amplification and distillation” will introduce places where the fast distilled outputs of slow sequences are not true to the original slow sequences, because gradient descent is not perfect and won’t be perfect and it’s not clear we’ll get any paradigm besides gradient descent for doing a step like that.

I have no horse in this race.

  45. The safety community currently is mostly bouncing off the hard problems and spending most of their time working on safe, easy, predictable things that guarantee they’ll be able to publish a paper at the end. Actually-useful alignment research will tend to be risky and unpredictable, since it’s advancing the frontier of our knowledge in a domain where we have very little already-accumulated knowledge.

  46. Almost all other work is either fully useless, almost entirely predictable, or both.

The AI safety community has produced few ideas that I find relevant to my work in machine learning. My favorite researcher on foundational ideas related to the physical mechanisms of general intelligence is the neuroscientist Selen Atasoy who (to my knowledge) has no connection to AI at all.

  47. Paul Christiano is trying to have real foundational ideas, and they’re all wrong, but he’s one of the few people trying to have foundational ideas at all; if we had another 10 of him, something might go right.

  48. Chris Olah is going to get far too little done far too late but at least is trying to do things on a path to doing anything at all.

  49. Stuart Armstrong did some good work on further formalizing the shutdown problem, an example case in point of why corrigibility is hard, which so far as I know is still resisting all attempts at solution.

  50. Various people who work or worked for MIRI came up with some actually-useful notions here and there, like Jessica Taylor’s expected utility quantilization.

I have no horse in any of the above races.

  51. We need much, much more rapid meaningful progress than this to have any chance, and it’s not obvious how to do that, or how to use money usefully. Money by default produces more low-quality work, and low-quality work slash solving small problems rather than the hard problems isn’t quite useless but it’s not going to get us where we need to go.

I agree that it is easy to spend lots of money without accomplishing much technical progress, especially in domains where disruptive[1] ideas are necessary. I think that disruptive ideas are necessary to build a superintelligence at all—not just an aligned one.

Incidentally, closing projects to outside scrutiny creates conditions for even lower-quality work. See #18-19 above.

  52. The AGI approaches that matter are the ones that scale, so they probably look less like GPT-2 and more like Alpha Zero, AlphaFold 2 or in particular Mu Zero.

I don’t understand why GPT-2 doesn’t scale. I assumed scaling GPT was what got OpenAI to GPT-3.

Perhaps the claim refers to how the Zeros generated their own training data? I do agree that some plausible AGI architectures generate their own training data. (The human brain certainly appears to do so in REM sleep.)
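
Schematically, “generating your own training data” looks something like the loop below (`play_game` and `train` are placeholders for real self-play and training routines, not any lab’s actual pipeline):

```python
def self_play_round(model, play_game, train, n_games: int = 1000):
    """One generation: the current model plays itself, then trains on those games."""
    games = [play_game(model, model) for _ in range(n_games)]  # data made by the model
    return train(model, games)                                 # not scraped from the world

def iterate(model, play_game, train, rounds: int = 10):
    for _ in range(rounds):
        model = self_play_round(model, play_game, train)
    return model
```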

  53. Proving theorems about the AGI doesn’t seem practical. Even if we somehow managed to get structures far more legible than giant vectors of floats, using some AI paradigm very different from the current one, it still seems like huge key pillars of the system would rely on non-fully-formal reasoning.

I agree.

  54. Zvi infers this from the text, rather than it being text directly, and it’s possible it’s due to conflating things together and wasn’t intended: A system that is mathematically understood and you can prove lots of stuff about it is not on the table at this point. Agent Foundations is a failure. Everything in that direction is a failure.

I’m not familiar with “Agent Foundations”. The sentiment above suggests my ignorance was well-directed.

  55. Even if you could prove what the utility function was, getting it to actually represent a human-aligned thing when it counts still seems super hard even if it doesn’t involve a giant inscrutable vector of floats, and it probably does involve that.

Explaining my thoughts on this claim would take many words. I’m just going to skip it.

  56. Eliezer agrees that it seems plausible that the good cognitive operations we want do not in principle require performing bad cognitive operations; the trouble, from his perspective, is that generalizing structures that do lots of good cognitive operations will automatically produce bad cognitive operations, especially when we dump more compute into them; “you can’t bring the coffee if you’re dead”. No known way to pull this off.

The practical limitation to the orthogonality thesis is that some machines are easier to build than others. This is related to quirkiness, since different AGI architectures are better at different things.

  57. Proofs mostly miss the point. Prove whatever you like about that Tensorflow problem; it will make no difference to whether the AI kills you. The properties that can be proven just aren’t related to safety, no matter how many times you prove an error bound on the floating-point multiplications. It wasn’t floating-point error that was going to kill you in the first place.

I agree.

Zvi’s Thoughts

I notice my [Zvi’s] inside view, while not confident in this, continues to not expect current methods to be sufficient for AGI, and expects the final form to be more different than I understand Eliezer/​MIRI to think it is going to be, and that the AGI problem (not counting alignment, where I think we largely agree on difficulty) is ‘harder’ than Eliezer/​MIRI think it is.

I also think civilization’s dysfunctions are more likely to disrupt things and make it increasingly difficult to do anything at all, or anything new/​difficult, and also collapse or other disasters.

I am basically in agreement with Zvi here. It makes me optimistic. Civilizational dysfunction is fertile soil for bold action.


  1. I’m using “disruptive” in Clayton Christensen’s sense of the word. ↩︎