# What’s Up With Confusingly Pervasive Consequentialism?

Fictionalized/​Paraphrased version of a real dialog between me and John Wentworth.

Fictionalized Me: So, in the Eliezer/​Richard dialogs, Eliezer is trying to get across this idea that consequentialism deeply permeates optimization, and this is important, and that’s one[1] reason why Alignment is Hard. But something about it is confusing and slippery, and he keeps trying to explain it and it keeps not-quite-landing.

I think I get it, but I’m not sure I could explain it. Or, I’m not sure who to explain it to. I don’t think I could tell who was making a mistake, where “consequentialism is secretly everywhere” is a useful concept for realizing-the-mistake.

Fictionalized John: [stares at me]

Me: Okay, I guess I’m probably supposed to try and explain this and see what happens.

...

Me: Okay, so the part that’s confusing here is that this is supposed to be something that Eliezer thinks thoughtful, attentive people like Richard (and Paul?) aren’t getting, despite them having read lots of relevant material and paying attention and being generally on board with “alignment is hard.”

...so, what is a sort of mistake I could imagine a smart, thoughtful person who read the sequences making here?

My Eliezer-model imagines someone building what they think is an aligned ML system. They’ve trained it carefully to do things they reflectively approve of, they’ve put a lot of work into making it interpretable and honest. This Smart Thoughtful Researcher has read the sequences and believes that alignment is hard and whatnot. Nonetheless, they’ll have failed to really grok this “consequentialism-is-more-pervasive-and-important-than-you-think” concept. And this will cause doom when they try to scale up their project to accomplish something actually hard.

I… guess what I think Eliezer thinks is that Thoughtful Researcher isn’t respecting inner optimizers enough. They’ll have built their system to be carefully aligned, but to do anything hard, it’ll end up generating inner-optimizers that aren’t aligned, and the inner-optimizers will kill everyone.

...

John: Nod. But not quite. I think you’re still missing something.

You’re familiar with the arguments of convergent instrumental goals?

Me: i.e. most agents will end up wanting power/​resources/​self-preservation/​etc?

John: Yeah.

But not only is “wanting power and self preservation” convergently instrumental. Consequentialism is convergently instrumental. Consequentialism is a (relatively) simple, effective process for accomplishing goals, so things that efficiently optimize for goals tend to approximate it.

Now, say there’s something hard you want to do, like build a moon base, or cure cancer or whatever. If there were a list of all the possible plans that cure cancer, ranked by “likely to work”, most of the plans that might work route through “consequentialism”, and “acquire resources.”

Not only that, most of the plans route through “acquire resources in a way that is unfriendly to human values.” Because in the space of all possible plans, while consequentialism doesn’t take that many bits to specify, human values are highly complex and take a lot of bits to specify.
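The “bits to specify” framing above can be made concrete with a toy counting sketch (my illustration, not from the dialog; the numbers are made up). If satisfying human values pins down an extra v bits beyond “the plan works at all”, then among working plans only a 2^-v fraction are also friendly:

```python
def fraction_friendly(value_bits: int) -> float:
    """Toy model: fraction of working plans that also satisfy a value
    specification requiring `value_bits` additional bits to pin down.
    Each extra bit halves the surviving fraction."""
    return 2.0 ** -value_bits

# Even a crude 100-bit value spec makes friendly plans astronomically
# rare among working plans:
print(fraction_friendly(10))   # roughly 1 in a thousand
print(fraction_friendly(100))  # roughly 1 in 10^30
```

The point of the sketch is just that rarity compounds exponentially in the complexity of the constraint, which is why “human values are highly complex” translates into “almost all working plans are unfriendly.”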

Notice that I just said “in the space of all possible plans, here are the most common plans.” I didn’t say anything about agents choosing plans or acting in the world. Just listing the plans. And this is important because the hard part lives in the choosing of the plans.

Now, say you build an oracle AI. You’ve done all the things to try and make it interpretable and honest and such. If you ask it for a plan to cure cancer, what happens?

Me: I guess it gives you a plan, and… the plan probably routes through consequentialist agents acquiring power in an unfriendly way.

Okay, but if I imagine a researcher who is thoughtful but a bit too optimistic, what they might counterargue with is: “Sure, but I’ll just inspect the plans for whether they’re unfriendly, and not do those plans.”

And what I might then counterargue their counterargument with is:

1) Are you sure you can actually tell which plans are unfriendly and which are not?

and,

2) If you’re reading very carefully, and paying lots of attention to each plan… you’ll still have to read through a lot of plans before you get to one that’s actually good.

John: Bingo. I think a lot of people imagine asking an oracle to generate 100 plans, and they think that maybe half the plans will be pretty reasonable. But, the space of plans is huge. Exponentially huge. Most plans just don’t work. Most plans that work route through consequentialist optimizers who convergently seek power because you need power to do stuff. But then the space of consequentialist power-seeking plans is still exponentially huge, and most ways of seeking power are unfriendly to human values. The hard part is locating a good plan that cures cancer that isn’t hostile to human values in the first place.

Me: And it’s not obvious to me whether this problem gets better or worse if you’ve tried to train the oracle to only output “reasonable seeming plans”, since that might output plans that are deceptively unaligned.

John: Do you understand why I brought up this plan/​oracle example, when you originally were talking about inner optimizers?

Me: Hmm. Um, kinda. I guess it’s important that there was a second example.

John: …and?

Me: Okay, so partly you’re pointing out that hardness of the problem isn’t just about getting the AI to do what I want, it’s that doing what I want is actually just really hard. Or rather, the part where alignment is hard is precisely when the thing I’m trying to accomplish is hard. Because then I need a powerful plan, and it’s hard to specify a search for powerful plans that don’t kill everyone.

John: Yeah. One mistake I think people end up making here is that they think the problem lives in the AI-who’s-deciding/​doing things, as opposed to in the actual raw difficulty of the search.

Me: Gotcha. And it’s important that this comes up in at least two places – inner optimizers with an agenty AI, and an oracle that just outputs plans that would work. And the fact that it shows up in two fairly different places, one of which I hadn’t thought of until just now, is suggestive that it could show up in even more places I haven’t thought of at all.

And this is confusing enough that it wasn’t initially obvious to Richard Ngo, who’s thought a ton about alignment. Which bodes ill for the majority of alignment researchers who probably are less on-the-ball.

1. ^

I’m tempted to say “the main reason” why Alignment Is Hard, but then remembered Eliezer specifically reminded everyone not to summarize him as saying things like “the key reason for X” when he didn’t actually say that, and often is tailoring his arguments to a particular confusion with his interlocutor.

• Suppose we took this whole post and substituted every instance of “cure cancer” with the following:
Version A: “win a chess game against a grandmaster”
Version B: “write a Shakespeare-level poem”
Version C: “solve the Riemann hypothesis”
Version D: “found a billion-dollar company”
Version E: “cure cancer”
Version F: “found a ten-trillion-dollar company”
Version G: “take over the USA”
Version H: “solve the alignment problem”
Version I: “take over the galaxy”

And so on. Now, the argument made in version A of the post clearly doesn’t work, the argument in version B very likely doesn’t work, and I’d guess that the argument in version C doesn’t work either. Suppose I concede, though, that the argument in version I works: that searching for an oracle smart enough to give us a successful plan for taking over the galaxy will very likely lead us to develop an agentic, misaligned AGI. Then that still leaves us with the question: what about versions D, E, F, G and H? The argument is structurally identical in each case—so what is it about “curing cancer” that is so hard that, unlike winning chess or (possibly) solving the Riemann hypothesis, when we train for that we’ll get misaligned agents instead?

We might say: well, for humans, curing cancer requires high levels of agency. But humans are really badly optimised for many types of abstract thinking—hence why we can be beaten at chess so easily. So why can’t we also be beaten at curing cancer by systems less agentic than us?

Eliezer has a bunch of intuitions which tell him where the line of “things we can’t do with non-dangerous systems” should be drawn, which I freely agree I don’t understand (although I will note that it’s suspicious how most people can’t do things on the far side of his line, but Einstein can). But insofar as this post doesn’t consider which side of the line curing cancer is actually on, then I don’t think it’s correctly diagnosed the place where Eliezer and I are bouncing off each other.

• For all tasks A-I, most programs that we can imagine writing to do that task will need to search through various actions and evaluate the consequences. The only one of those tasks we currently know how to solve with a program is A, and chess programs do indeed use search and evaluation.

I’d guess that whether something can be done safely is mostly a function of how easy it is, and how isolated it is from the real world. The Riemann hypothesis seems pretty tricky, but it’s isolated from the real world, so it can probably be solved by a safe system. Chess is isolated and easy. Starting a billion dollar company is very entangled with the real world, and very tricky. So we probably couldn’t do it without a dangerous system.

• This all makes sense, except for the bit where you draw the line at a certain level of “tricky and entangled with world”. Why isn’t it the case that danger only arises for the first AIs that can do tasks half as tricky? Twice as tricky? Ten times as tricky?

• Consider what happens if you had to solve your list of problems and didn’t inherently care about human values? To what extent would you do ‘unfriendly’ things via consequentialism? How hard would you need to be constrained to stop doing that? Would it matter if you could also do far trickier things by using consequentialism and general power-seeking actions?

The reason, as I understand it, that a chess-playing AI does things the way we want it to is that we constrain the search space it can use because we can fully describe that space, rather than having to give it any means of using any other approaches, and for now that box is robust.

But if someone gave you or me the same task, we wouldn’t learn chess, we would buy a copy of Stockfish, or if it was a harder task (e.g. be better than AlphaZero) we’d go acquire resources using consequentialism. And it’s reasonable to think that if we gave a fully generic but powerful future AI the task of being the best at chess, at some point it’s going to figure out that the way to do that is acquire resources via consequentialism, and potentially to kill or destroy all its potential opponents. Winner.

Same with the poem or the hypothesis, I’m not going to be so foolish as to attack the problem directly unless it’s already pretty easy for me. And in order to get an AI to write a poem that good, I find it plausible that the path to doing that is less monkeys on a typewriter and more resource acquisition so I can understand the world well enough to do that. As a programmer of an AI, right now, the path is exactly that—it’s ‘build an AI that gets me enough more funding to potentially get something good enough to write that kind of poem,’ etc.

Another approach, and more directly a response to your question here, is to ask, which is easier for you/​the-AI: Solving the problem head-on using only known-safe tactics and existing resources, or seeking power via consequentialism?

Yes, at some amount of endowment, I already have enough resources relative to the problem at hand and see a path to a solution, so I don’t bother looking elsewhere and just solve it, same as a human. But mostly no for anything really worth doing, which is the issue?

• I agree with basically your whole comment. But it doesn’t seem like you’re engaging with the frame I’m using. I’m trying to figure out how agentic the first AI that can do task X is, for a range of X (with the hope that the first AI that can do X is not very agentic, for some X that is a pivotal task). The claim that a highly agentic highly intelligent AI will likely do undesirable things when presented with task X is very little evidence about this, because a highly agentic highly intelligent AI will likely do undesirable things when presented with almost any task.

• Thank you, that is clarifying, together with your note to Scott on ACX about wanting it to ‘lack a motivational system.’ I want to see if I have this right before I give another shot at answering your actual question.

So as I understand your question now, what you’re asking is, will the first AI that can do (ideally pivotal) task X be of Type A (general, planning, motivational, agentic, models world, intelligent, etc) or Type B (basic, pattern matching, narrow, dumb, domain specific, constrained, boxed, etc).

I almost accidentally labeled A/​B as G/​N there, and I’m not sure if that’s a fair labeling system and want to see how close the mapping is? (e.g. narrow AI and general AI as usually understood). If not, is there a key difference?

• Instead of “dumb” or “narrow” I’d say “having a strong comparative advantage in X (versus humans)”. E.g. imagine watching evolution and asking “will the first animals that take over the world have already solved the Riemann hypothesis”, and the answer is no because human intelligence, while general, is still pointed more at civilisation-building-style tasks than mathematics.

Similarly, I don’t expect any AI which can do a bunch of groundbreaking science to be “narrow” by our current standards, but I do hope that it has a strong comparative disadvantage at taking-over-world-style tasks, compared with doing-science-style tasks.

And that’s related to agency, because what we mean by agency is not far off “having a comparative advantage in taking-over-world style tasks”.

Now, I expect that at some point, this line of reasoning stops being useful, because your systems are general enough and agentic enough that, even if their comparative advantage isn’t taking over the world, they can pretty easily do that anyway. But the question is whether this line of reasoning is still useful for the first systems which can do pivotal task X. Eliezer thinks no, because he considers intelligence and agency to be very strongly linked. I’m less sure, because humans have been evolved really hard to be agentic, so I’d be surprised if you couldn’t beat us at a bunch of intellectual tasks while being much less agentic than us.

Side note: I meant “pattern-matching” as a gesture towards “the bit of general intelligence that doesn’t require agency” (although in hindsight I can see how this is confusing, I’ve just made an edit on the ACX comment).

• “will the first animals that take over the world be able to solve the Riemann hypothesis”, and the answer is no because human intelligence, while general, is still pointed more at civilisation-building-style tasks than mathematics.

Pardon the semantics, but I think the question you want to use here is “will the first animals that take over the world have already solved the Riemann hypothesis”. IMO humans do have the ability (“can”) to solve the Riemann hypothesis, and the point you’re making is just about the ordering in which we’ve done things.

• Yes, sorry, you’re right; edited.

• No one actually knows the exact task-difficulty threshold, but the intuition is that once a task is hard enough, any AI capable of completing the task is also capable of thinking of strategies that involve betraying its human creators. However, even if I don’t know the exact threshold, I can think of examples that should definitely be above the line. Starting a billion dollar company seems pretty difficult, but it could maybe be achieved by a special-purpose algorithm that just plays the stock market really well. But if we add a few more stipulations, like that the company has to make money by building an actual product, in an established industry with lots of competition, then probably that can only be done by a dangerous algorithm. It’s not a very big step from “figuring out how to outwit your competitors” to “realizing that you could outwit humans in general”.

An implicit assumption here is that I’m drawing the line between “safe” and “dangerous” at the point where the algorithm realizes that it could potentially achieve higher utility by betraying us. It’s possible that an algorithm could realize this, but still not be strong enough to “win” against humanity. Hopefully that seems reasonable to you, if not I can give some reasons why I think so.

• The easiest way is probably to build a modestly-sized company doing software and then find a way to destabilize the government and cause hyperinflation.

I think the rule of thumb should be: if your AI could be intentionally deployed to take over the world, it’s highly likely to do so unintentionally.

• Yes, it definitely doesn’t work with A or C. It might work with B, because judging whether a poem is Shakespeare-level or not is heavily entangled with human society and culture and it may turn out that manipulating humans to rave about whatever you wrote (whether it’s actually Shakespeare-level poetry or not) might be easier. I expect not, but it’s hard to be sure. I would certainly put C as safer than B.

Everything else is obviously much more dangerous.

• That was my intuition as well. A and C are just not entangled with the physical world at all. B is a maybe; it’s a big leap from poetry to taking over the world, but humans are something that has to be modelled and that’s where trouble starts.

• My understanding is that you can’t safely do even A with an arbitrarily powerful optimizer. An arbitrarily powerful optimizer whose reward function is solely “beat the grandmaster” would do everything possible to ensure its reward function is maximised with the highest probability. For instance, it might amass as much compute as possible to ensure that it’s made no errors at all, it might armor its servers to ensure no one switches it off, and of course, it might pharmacologically mess with the grandmaster to inhibit their performance.

The fact that it can be done safely by a weak AI isn’t to say that it’s safe to do with a powerful AI.

• For the purposes of this argument, I’m interested in what can be done safely by some AI we can build. If you can solve alignment safely with some AI, then you’re in a good situation. What an arbitrarily powerful optimiser will do isn’t the crux, we all agree that’s dangerous.

• Looking at the A substitution, why doesn’t this argument work?

I think by “win a chess game against a grandmaster” you are specifically asking about the game itself. In real life we also have to arrange the game, stay alive until the game, etc. Let’s take all that out of scope, it’s obviously unsafe.

If there were a list of all the possible plans that win a chess game against a grandmaster, ranked by “likely to work”, most of the plans that might work route through “consequentialism”, and “acquire resources.” Now, say you build an oracle AI. You’ve done all the things to try and make it interpretable and honest and such. If you ask it for a plan to win a chess game against a grandmaster, what happens?

Well it definitely doesn’t give you a plan like “If the grandmaster plays e4, you play e5, and then if they play d4, you play f5, …” because that plan is too large. I think the desired outcome is a plan like “open with pawn to d4, observe the board position, then ask for another plan”. Are Oracle AIs allowed to provide self-referential plans?

Regardless, if I’m an Oracle AI looking for the most likely plan, I’m now very concerned that you’ll have a heart attack, or an attack of arrogance, or otherwise mess up my perfect plan. Unlikely, sure, but I’m searching for the most “likely to work” here. So the actual plan I give you is “ask the grandmaster how his trip to Madrid went, then ask me for another plan”. Then the grandmaster realizes that I know about his affair (for example) and will reveal it if he wins, and he attempts to lose as gracefully as possible. So now the outcome is much more robust to events.

• I agree that highly agentic versions of the system will complete the tasks better. My claim is just that they’re not necessary to complete the task very well, and so we shouldn’t be confident that selection for completing that task very well will end up producing the highly agentic versions.

• That helps, thanks. Raemon says:

The part where alignment is hard is precisely when the thing I’m trying to accomplish is hard. Because then I need a powerful plan, and it’s hard to specify a search for powerful plans that don’t kill everyone.

I now read you as pointing to chess as:

• It is “hard to accomplish” from the perspective of human cognition.

• It does not require a “powerful”/​”agentic” plan.

• It’s “easy” to specify a search for a good plan, we already did it.

So maybe alignment is like that.

• Yepp. And clearly alignment is much harder than chess, but it seems like an open question whether it’s harder than “kill everyone” (and even if it is, there’s an open question of how much of an advantage we get from doing our best to point the system at the former not the latter).

• whether it’s harder than “kill everyone”

“Kill everyone” seems like it should be “easy”, because there are so many ways to do it: humans only survive in environments with a specific range of temperatures, pressures, atmospheric contents, availability of human-digestible food, &c.

• I agree the argument doesn’t work for A, B, and C, but I think the way it doesn’t work should make you pessimistic about how much we can trust the outputted plans in more complex task domains.

For A, it doesn’t seem certain to me that the AI will generate only plans which involve making a chess move. It has no reason to prefer simpler plans over more complex ones, and it may gain a lot by suggesting that the player, for instance, lobotomize their opponent, or hook it up to some agent AI in a way such that most worlds lead to it playing against a grandmaster with a lobotomy.

If you penalize the AI significantly for thought cycles, it will just output 100 different ways of connecting itself to an agent (or otherwise producing an optimization process which achieves its goal). If you don’t penalize it very much for thought cycles it will come up with a way to win against its opponent, then add on a bunch of ways to ensure they’re lobotomized before the match.

Most ways of defining these goals seem to lead to most or all action sequences being bad, or having bad components, given too little or too much penalization for thought cycles. So as your ability to foresee the consequences of the actions taken decreases, you should also dramatically decrease the expected value of any particular generated plan. This means that in domains where the agent is actually useful, action plans which are easy to verify without actually executing them are the only ones which can be used. This means D, E, F, G, I, and possibly H (depending on the form the solution takes) all pose astronomical risks.

Another possible solution would be to estimate how many thought cycles it should take to solve the problem, as well as how accurate that estimate needs to be in order to not result in optimisers or lobotomies, then only use solutions in that range.

Edit: the point is that the simpler cases work because it’s very easy to verify that the actions lie in an action space which won’t lead to catastrophic failure. For A you can just make sure the action space is that of chess moves; for C, that of mathematical proofs.

• But I think Richard’s point is ‘but we totally built AIs that defeated chess grandmasters without destroying the world. So, clearly it’s possible to use tool AI to do this sort of thing. So… why do you think various domains will reliably output horrible outcomes? If you need to cure cancer, maybe there is an analogous way to cure cancer that just… isn’t trying that hard?’

Richard is that what you were aiming at?

• The reason why we can easily make AIs which solve chess without destroying the world is because we can make specialized AIs such that they can only operate in the theoretical environment of states of chess boards, and in that environment we can tell it exactly what its goal should be.

If we tell an AGI to generate plans for winning at chess, and it knows about the outside world, then because the state space is astronomically larger, it is astronomically more difficult to tell it what its goal should be, and so any goal we do give it either satisfies corrigibility, and we can tell it “do what I want”, or incompletely captures what we mean by ‘win this chess game’.

For cancer, there may well be a way to solve the problem using a specialized AI, which works in an environment space simple enough that we can completely specify our goal. I assume though that we are using a general AI in all the hypothetical versions of the problem, which has the property ‘it’s working in an environment space large enough that we can’t specify what we want it to do’. Or, if it doesn’t know a priori that its plans can affect the state of a far larger environment space which can affect the environment space it cares about, it may deduce this, and figure out a way to exploit this feature.

• This is what I came here to say! I think you point out a crisp reason why some task settings make alignment harder than others, and why we get catastrophically optimized against by some kinds of smart agents but not others (like Deep Blue).

• I might be conflating Richard, Paul, and my own guesses here. But, I think part of the argument here is about what can happen before AGI, that gives us lines of hope to pursue.

Like, my-model-of-Paul wants various tools for amplifying his own thought to (among other things) help think about solving the long-term alignment problem. And the question is whether there are ways of doing that that actually help when trying to solve the sorts of problems Paul wants to solve. We’ve successfully augmented human arithmetic and chess. Are there tools we actually wish we had, that narrow AI meaningfully helps with?

I’m not sure if Richard has a particular strategy in mind, but I assume he’s exploring the broader question of “what useful things can we build that will help navigate x-risk”.

The original dialogs were exploring the concept of pivotal acts that could change humanity’s strategic position. Are there AIs that can execute pivotal acts that are more like calculators and Deep Blue than like autonomous moon-base-builders? (I don’t know if Richard actually shares the pivotal act /​ acute risk period frame, or was just accepting it for sake of argument)

• The problem is not with whether we call the AI AGI or not, it’s whether we can either 1) fully specify our goals in the environment space it’s able to model (or otherwise not care too deeply about the environment space it’s able to model), or 2) verify the effects of the actions it says to do have no disastrous consequences.

To determine whether a tool AI can be used to solve problems Paul wants to solve, or execute pivotal acts, we need to both 1) determine that the environment is small enough for us to accurately express our goal, and 2) ensure the AI is unable to infer the existence of a broader environment.

(meta note: I’m making a lot of very confident statements, and very few are of the form “<statement>, unless <other statement>, then <statement> may not be true”. This means I am almost certainly overconfident, and my model is incomplete, but I’m making the claims anyway so that they can be developed)

• This dialog was much less painful for me to read than i expected, and I think it manages to capture at least a little of the version-of-this-concept that I possess and struggle to articulate!

(...that sentence is shorter, and more obviously praise, in my native tongue.)

A few things I’d add (epistemic status: some simplification in attempt to get a gist across):

If there were a list of all the possible plans that cure cancer, ranked by “likely to work”, most of the plans that might work route through “consequentialism”, and “acquire resources.”

Part of what’s going on here is that reality is large and chaotic. When you’re dealing with a large and chaotic reality, you don’t get to generate a full plan in advance, because the full plan is too big. Like, imagine a reasoner doing biological experimentation. If you try to “unroll” that reasoner into an advance plan that does not itself contain the reasoner, then you find yourself building this enormous decision-tree, like “if the experiments come up this way, then I’ll follow it up with this experiment, and if instead it comes up that way, then I’ll follow it up with that experiment”, and etc. This decision tree quickly explodes in size. And even if we didn’t have a memory problem, we’d have a time problem—the thing to do in response to surprising experimental evidence is often “conceptually digest the results” and “reorganize my ontology accordingly”. If you’re trying to unroll that reasoner into a decision-tree that you can write down in advance, you’ve got to do the work of digesting not only the real results, but the hypothetical alternative results, and figure out the corresponding alternative physics and alternative ontologies in those branches. This is infeasible, to say the least.

Reasoners are a way of compressing plans, so that you can say “do some science and digest the actual results”, instead of actually calculating in advance how you’d digest all the possible observations. (Note that the reasoner specification comprises instructions for digesting a wide variety of observations, but in practice it mostly only digests the actual observations.)

Like, you can’t make an “oracle chess AI” that tells you at the beginning of the game what moves to play, because even chess is too chaotic for that game tree to be feasibly representable. You’ve gotta keep running your chess AI on each new observation, to have any hope of getting the fragment of the game tree that you consider down to a manageable size.
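To put rough numbers on “too chaotic to feasibly represent” (my back-of-envelope figures, not from the comment; the standard ballpark is ~35 legal moves per position and ~80 plies per game):

```python
# Size of the full chess game tree an advance-plan "oracle" would have
# to cover, using rough standard estimates.
branching_factor = 35   # approximate legal moves per position
game_length_plies = 80  # approximate game length in half-moves

leaf_count = branching_factor ** game_length_plies
print(len(str(leaf_count)))  # → 124 decimal digits, i.e. ~10^123 leaves
```

Around 10^123 leaves, versus roughly 10^80 atoms in the observable universe, so even for chess the fully-unrolled decision tree cannot be written down; the plan has to be compressed into a reasoner that runs as observations come in.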

Like, the outputs you can get out of an oracle AI are “no plan found”, “memory and time exhausted”, “here’s a plan that involves running a reasoner in real-time” or “feed me observations in real-time and ask me only to generate a local and by-default-inscrutable action”. In the first two cases, your oracle is about as useful as a rock; in the third, it’s the realtime reasoner that you need to align; in the fourth, all the word “oracle” is doing is mollifying you unduly, and it’s this “oracle” that you need to align.

(NB: It’s not obvious to me that cancer cures require routing through enough reality-chaos that plans fully named in advance need to route through reasoners; eg it’s plausible that you can cure cancer with a stupid amount of high-speed trial and error. I know of no pivotal act, though, that looks so easy to me that nonrealtime-plans can avoid the above quadlemma.)

And it’s not obvious to me whether this problem gets better or worse if you’ve tried to train the oracle to only output “reasonable seeming plans”

My point above addresses this somewhat, but I’m going to tack another point on for good measure. Suppose you build an oracle and take the “the plan involves a realtime reasoner” fork of the above quadlemma. How does that plan look? Does the oracle say “build the reasoner using this simple and cleanly-factored mind architecture, which is clearly optimizing for thus-and-such objectives?” If that’s so easy, why aren’t we building our minds that way? How did it solve these alignment challenges that we find so difficult, and why do you believe it solved them correctly? Also, AIs that understand clean mind-architectures seem deeper in the tech tree than AIs that can do some crazy stuff; why didn’t the world end five years before reaching this hypothetical?

Like, specifying a working mind is hard. (Effable, transparent, and cleanly-factored minds are harder still, apparently.) You probably aren’t going to get your first sufficiently-good-reasoner from “project oracle” that’s training a non-interactive system to generate plans so hard that it invents its own mind architectures and describes their deployment; you’re going to get it from some much more active system that is itself a capable mind before it knows how to design a capable mind, like (implausible detail for the purpose of concrete visualization) the “lifelong learner” that’s been chewing through loads and loads of toy environments while it slowly accretes the deep structures of cognition.

Maybe once you have that, you can go to your oracle and be like “ok, you’re now allowed to propose plans that involve deploying this here lifelong learner”, but of course your lifelong learner doesn’t have to be a particularly alignable architecture; its goals don’t have to be easily identifiable and cleanly separable from the rest of its mind.

Which is mostly just providing more implausible detail that makes the “if your oracle emits plans that involve reasoners, then it’s the reasoners you need to align” point more concrete. But… well, I’m also trying to gesture at why the “what if we train the oracle to only output reasonable plans?” thought seems, to me, to come at it from a wrong angle, in a manner that I still haven’t managed to precisely articulate.

(I’m also hoping this conveys at least a little more of why the “just build an oracle that does alignment research” looks harder than doing the alignment research our own damn selves, and I’m frustrated by how people give me a pitying look when I suggest that humanity should be looking for more alignable paradigms, and then turn around and suggest that oracles can do that no-problem. But I digress.)

• Also, AIs that understand clean mind-architectures seem deeper in the tech tree than AIs that can do some crazy stuff; why didn’t the world end five years before reaching this hypothetical?

Possible world: Alignment is too hard for a small group of people to cleanly understand, but not too far beyond that. In part because the profitability/​researcher-status gradient doesn’t push AI research towards alignment, building an AI which is cleanly designed and aligned is a natural solution found by a mid-level messy AI, even though that mid-level messy AI is still too dumb to help mainstream researchers gain a ton of power on the tasks they try it on. Because gaining power is hard due to adversarial pressures.

(After I’ve written that, I believe what I’ve written less, one because it involves a few independent details, but two because I don’t see why the mainstream researchers wouldn’t have elicited that capability but alignment researchers did.

I have an intuition that I didn’t fully express with the above, though, and so I’m not totally backing off of my hunch that there’s some gap in your argument which I quoted.)

• Like, you can’t make an “oracle chess AI” that tells you at the beginning of the game what moves to play, because even chess is too chaotic for that game tree to be feasibly representable. You’ve gotta keep running your chess AI on each new observation, to have any hope of getting the fragment of the game tree that you consider down to a manageable size.

It’s not obvious to me how generally true this is. You can’t literally specify every move at the beginning of the game, but it seems like there could be instructions that work for more narrowly specified chess tasks. Like, I imagine a talented human chess coach could generate a set of instructions in English that would work well for defeating me at chess at least once (maybe there already exist “how to beat noobs at chess” instructions that will work for this). I would be unsurprised if there exists a set of human-readable instructions of human-readable length that would give me better-than-even odds of defeating a pre-specified chess expert at least once, that can be generated just by noticing and exploiting as-yet-unnoticed regularities in either that expert’s play in particular or human-expert-level chess in general.

It’s possible my intuition here is related to my complete lack of expertise in chess, and I would not be surprised if Magnus-Carlsen-defeating instructions do not exist (at least, not without routing through a reasoner). Still, I think I assign greater credence to shallow-pattern-finding AI enabling a pivotal act than you do, and I’m wondering if the chess example is probing this difference in intuition.

• As a casual chess player it seems unlikely to me that there are any such instructions that would lead a beginner to beat even a halfway decent player. Chess is very dependent on calculation (mentally stepping through the game tree) and evaluation (recognising if a position is good or bad). Given the slow clock speed of the human brain (compared to computers), our calculations are slow and so we must lean heavily on a good learned evaluation function, which probably can’t be explicitly represented in a way that would be fast enough to execute manually. In other words you’d end up taking hours to make a move or something.

There’s no shortcut like “just move these pawns 3 times in a mysterious pattern, they’ll never expect it”—“computer lines” that bamboozle humans require deep search that you won’t be able to do in realtime.

Edit: the Oracle’s best chance against an ok player would probably be to give you a list of trick openings that lead to “surprise” checkmate and hope that the opponent falls into one, but it’s a low percentage.

• I’m not sure that this is true. (Depends a lot on what rating you define as “halfway decent”.) There are, in fact, rules that generalize over lots of board states, such as

• capture toward the center

• early on, focus on getting knight/​bishop to squares from which they have many moves

• etc.

If I had one day to make such a list, I don’t think a beginner could use it to beat a 1200 player in, say, a 30 minute game. But I’m very uncertain about the upper limit of usefulness of such a list. I wonder about stuff like that a lot, but it’s very hard to tell. (Have you read a book about chess principles?)

I’m not even confident that you couldn’t beat Magnus. It depends on a bunch of factors, but perhaps you could just choose a line that seems forcing for black and try to specify enough branches of the tree to give you > 50% chance that it covers the game with Magnus. You could call this cheating, but it’s unclear how to formalize the challenge to avoid it. If Magnus knows who he’s playing against, this would make it significantly harder.

• I’m very confident that Magnus absolutely crushes a beginner who has been given a personal chess book, of normal book length, written by God. Magnus still has all the advantages.

• Magnus can evaluate moves faster and has a deeper search tree.

• The book of chess can provide optimal opening lines, but the beginner needs to memorize them, and Magnus has a greater capacity for memorizing openings.

• The book of chess can provide optimal principles for evaluating moves, but the beginner has to apply them, and decide what to do when they point in different directions. This comes from practice. A book of normal size can only provide limited practice examples.

• The beginner will have a higher rate of blunders. It is hard to “capture toward the center” when you don’t even see the capture.

Some intuitions from chess books: the book God would give to a beginner is different to the book God would give a 1200 player. After reading a chess book, it is normal for ability to initially go down, until the advice has been distilled and integrated with practice. Reading a chess book helps improve faster, not to be instantly better.

Some intuitions from chess programs: they lose a lot of power if you cut down their search time to simulate the ability of a beginner to calculate variations, and also cut down their opening database to simulate the ability of a beginner to memorize openings, and also give them a random error chance to simulate a beginner’s rate of blunders.

• Sorry for the double response, but a separate point here is that your method of estimating the effectiveness of the best possible book seems dubious to me. It seems to be “let’s take the best book we have; the perfect book won’t be that much better”. But why would this be true, at all? We have applied tons of optimization pressure to chess and probably know that the ceiling isn’t that far above Stockfish, but we haven’t applied tons of optimization pressure to distilling chess. How do you know that the best possible book won’t be superior by some large factor? Why can’t the principles be so simple that applying them is easy? (This is a more general question; how can you e.g. estimate the effectiveness of the best possible text book for some subfield of math?)

I’m a bit more sympathetic to this if we play Blitz, but for the most interesting argument, I think we should assume classical time format, where any beginner can see all possible captures.

• Thanks for the double response. This line seems potentially important. If we could safely create an Oracle that can create a book of chess that massively boosts chess ability, then we could maybe possibly miraculously do the same thing to create a book that massively boosts AI safety research ability.

I agree that my argument above was pretty sketchy, just “intuitions” really. Here’s something a bit more solid, after further reflection.

I’m aware of adversarial examples and security vulnerabilities, so I’m not surprised if a superintelligence is able to severely degrade human performance via carefully selected input. A chess book that can make Magnus lose to a beginner wouldn’t surprise me. Neither would a chess book that degraded a beginner’s priorities such that they obsessed about chess, for however many Elo points that would be worth.

But mostly this problem is in the opposite direction: can we provide carefully curated input that allows an intelligence to learn much faster? In this direction the results seem much less dramatic. My impression is that the speed of learning is limited by both the inputs and the learner. If the book of chess is a perfect input, then the limiting factor is the reader, and an average reader won’t get outsized benefits from perfect inputs.

Possible counter-argument: supervised learning can outperform unsupervised learning by some large factor, data quality can likewise have a big impact. That’s fine, but every chess book I’ve read has been supervised learning, and chess books are already higher data quality than scraping r/​chess. So those optimizations have already been made.

Possible counter-argument: few-shot learning in GPT-3? This seems more like surface knowledge that is already in the language model. So maybe a chess beginner already has the perfect chess algorithm somewhere in their brain, and the chess book just needs to surface that model and suppress all the flawed models that are competing with it? I don’t buy it, that’s not what it feels like learning chess from the inside, but maybe I need to give the idea some weight.

Possible counter-argument: maybe humans are actually really intelligent and really good learners and the reason we’re so flawed is that we have bad inputs? Eg from other flawed humans, random chance hiding things, biases in what we pay attention to, etc. I don’t buy this, but I don’t actually have a clear reason why.

• But mostly this problem is in the opposite direction: can we provide carefully curated input that allows an intelligence to learn much faster? In this direction the results seem much less dramatic. My impression is that the speed of learning is limited by both the inputs and the learner. If the book of chess is a perfect input, then the limiting factor is the reader, and an average reader won’t get outsized benefits from perfect inputs.

Which results did you have in mind? The ‘machine teaching’ results are pretty dramatic and surprising, although one could question whether they have any practical implications.

• I wasn’t aware of them. Thanks. Yes, that’s exactly the sort of thing I’d expect to see if there was a large possible upside in better teaching materials that an Oracle could produce. So I no longer disagree with Rafael & Richard on this.

• But mostly this problem is in the opposite direction: can we provide carefully curated input that allows an intelligence to learn much faster? In this direction the results seem much less dramatic. My impression is that the speed of learning is limited by both the inputs and the learner. If the book of chess is a perfect input, then the limiting factor is the reader, and an average reader won’t get outsized benefits from perfect inputs.

My problem with this is that you’re treating the amount of material as fixed and abstracting it as “speed”; however, what makes me unsure about the power of the best possible book is that it may choose a completely different approach.

E.g., consider the “ontology” of high-level chess principles. We think in terms of “development” and “centralization [of pieces]” and “activity” and “pressure” and “attacking” and “discoveries” and so forth. Presumably, most of these are quite helpful; if you have no concept of discoveries, you will routinely place your queen or king on inconvenient squares and get punished. If you have no concept of pressure, you have no elegant way of pre-emptive reaction if your opponent starts aligning a lot of pieces toward your king, et cetera.

So, at the upper end of my probability distribution for how good a book would be, it may introduce a hundred more such concepts, each one highly useful to elegantly compress various states. It will explain them all in the maximally intuitive and illustrative way, such that they all effortlessly stick, in the same way that sometimes things you hear just make sense and fit your aesthetic, and you recall them effortlessly. After reading this book, a beginner will look at a bunch of moves of a 2000 elo player, and go “ah, these two moves clearly violate principle Y”. Even though this player has far less ability to calculate lines, they know so many elegant compressions that they may compensate in a direct match. Much in the same way that you may beat someone who has practiced twice as long as you but has no concept of pressure; they just can’t figure out how to spot situations from afar where their king is suddenly in trouble.

• Isn’t it trivial for the beginner to beat Magnus using this book? God just needs to predict Magnus perfectly, and write down a single list of moves that the beginner needs to follow to beat him. Half a page is enough.

In general, you ignored this approach, which is the main reason why I’m unsure whether a book from a superintelligence could beat Magnus.

• I read your idea of “a line that seems forcing for black”, and I interpreted it as being forcing for black in general, and responded in terms of memorizing optimal opening lines. It sounds like you meant a line that would cause Magnus in particular to respond in predictable ways? Sorry for missing that.

I can imagine a scenario with an uploaded beginner and an uploaded Magnus in a sealed virtual environment running on error-correcting hardware with a known initial state and a deterministic algorithm, and your argument goes through there, and in sufficiently similar scenarios.

Whereas I had in mind a much more chaotic scenario. For example, I expect Magnus’s moves to depend in part on the previous games he played, so predicting Magnus requires predicting all of those games, and thus the exponential tree of previous games. And I expect his moves to depend in part on his mood, eg how happy he’d be with a draw. So our disagreement could be mostly about the details of the hypothetical, such as how much time passes between creating the book and playing the game?

• I read your idea of “a line that seems forcing for black”, and I interpreted it as being forcing for black in general

So to clarify: this interpretation was correct. I was assuming that a superintelligence cannot perfectly predict Magnus, pretty much for the reasons you mention (dependency on previous games, mood, etc.) But I then changed that standard when you said

I’m very confident that Magnus absolutely crushes a beginner who has been given a personal chess book, of normal book length, written by God.

Unlike a superintelligence, surely God could simulate Magnus perfectly no matter what; this is why I called the problem trivial: if you invoke God.

If you don’t invoke God (and thus can’t simulate Magnus), I remain unsure. There are already games where world champions play the top move recommended by the engine 10 times in a row, and those have not been optimized for forcing lines. You may overestimate how much uncertainty or variance there really is. (Though again, if Magnus knows what you’re doing, it gets much harder since then he could just play a few deliberately bad moves and get you out of preparation.)

• Yes, I used “God” to try to avoid ambiguity about (eg) how smart the superintelligence is, and ended up just introducing ambiguity about (eg) whether God plays dice. Oops. I think the God hypothetical ends up showing the usual thing: Oracles fail[1] at large/​chaotic tasks, and succeed at small/​narrow tasks. Sure, more things are small and narrow if you are God, but that’s not very illuminating.

So, back to an Oracle, not invoking God, writing a book of chess for a beginner, filling it with lines that are forcing for black, trying to get >50% of the tree. Why do we care, why are we discussing this? I think because chess is so much smaller and less chaotic than most domains we care about, so if an Oracle fails at chess, it’s probably going to also fail at AI alignment, theorem proving, pivotal acts, etc.

There’s some simple failure cases we should get out of the way:

• As you said, if Magnus knows or suspects what he’s playing against, he plays a few lower probability moves and gets out of the predicted tree. Eg, 1. e4 d6 is a 1% response from Magnus. Or, if Magnus thinks he’s playing a beginner, then he uses the opportunity to experiment, and becomes less predictable. So assume that he plays normally, predictably.

• If Magnus keeps playing when he’s in a lost position, it’s really hard for a move to be “forced” if all moves lead to a loss with correct play. One chess principle I got from a book: don’t resign before the end game if you don’t know that your opponent can play the end game well. Well, assume that Magnus resigns a lost position.

• What if the beginner misremembers something, and plays the wrong move? How many moves can a beginner remember, working from an Oracle-created book that has everything pre-staged with optimized mnemonics? I assume 1,000 moves, perfect recall. 10 moves per page for a 100 page book.

So we need to optimize for lines that are forcing, short, and winning[2]. Shortness is important because a 40 move line where each move is 98% forced is overall ~45% forcing, and because we can fit more short lines into our beginner’s memory. If you search through all top-level chess games and find ones where the players play the engine-recommended move ten times in a row, that is optimizing for winning (from the players) and forcing (from the search). Ten moves isn’t long enough, we need ~30 moves for a typical game.
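The arithmetic in the paragraph above can be checked directly (the 98%-per-move, 1,000-move, and 30-move figures are just the assumptions already stated):

```python
# A 40-move line where each individual move is 98% forced:
p_line = 0.98 ** 40
print(round(p_line, 2))  # ~0.45, the "overall ~45% forcing" figure

# Memory budget: 1,000 memorized moves split across ~30-move lines
# leaves room for only about 33 distinct lines in the book.
lines_that_fit = 1000 // 30
print(lines_that_fit)  # 33
```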

Terrible estimate: with 500,000 games in chessgames.com, say there are 50 games with forcing lines of ten moves, a 10,000x reduction. An Oracle can search better, for games that haven’t been played yet. So maybe if Oracle searched through 5 trillion games it would find a game with a forcing line of 20 moves? At some point I question whether chess can be both low variance enough to have these long forcing lines, and also high variance enough to have so many potential games to search through. Of course chess has ample variance if you allow white to play bad moves, but then you’re not winning.

Another approach, trying to find a forcing opening, running through the stats on chessgames.com in a greedy way, I get this “Nimzo-Indian, Samisch” variation, which seems to be playable for both sides, but perhaps slightly favors black:

1. d4 Nf6 (73% forced—Magnus games)

2. c4 e6 (72% forced—Magnus games)

3. Nc3 Bb4 (83% forced—all games)

4. a3 Bxc3+ (100% forced—all games)

5. bxc3 c5 (55% forced—all games)

6. f3 d5 (85% forced—all games)

Multiplying that through gets 20% forcing over six moves. So maybe Oracle is amazingly lucky and there are hitherto undiscovered forcing lines directly from this well-known position to lost positions for black, missed by Stockfish, AlphaZero, and all humans. Well, then Oracle still needs to cover another 30% of the tree and get just as lucky a few more times. If that happens, I think I’m in crisis of faith mode where I have to reevaluate whether grandmaster chess was an elaborate hoax. So many positions we thought were even turn out to be winning for white, everyone missed it, what happened?
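Multiplying the quoted per-move percentages through confirms the figure:

```python
# Per-move "forced" percentages for the Nimzo-Indian Samisch
# line quoted above, moves 1 through 6.
move_probs = [0.73, 0.72, 0.83, 1.00, 0.55, 0.85]

p = 1.0
for prob in move_probs:
    p *= prob

print(round(p, 2))  # ~0.20 forcing over six moves
```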

1. Where “fail” means “no plan found”, “memory and time exhausted”, “here’s a plan that involves running a reasoner in real-time” or “feed me observations in real-time and ask me only to generate a local and by-default-inscrutable action”, as listed by so8res above. ↩︎

2. It doesn’t help that chess players also search for lines that are forcing, short, and winning, at least some of the time. ↩︎

• You can consider me convinced that the “find forcing lines” approach isn’t going to work.

(How well the perfect book could “genuinely” teach someone is a different question, but that’s definitely not enough to beat Magnus.)

• Yeah, this is part of what I was getting at. The narrowness of the task “write a set of instructions for a one-off victory against a particular player” is a crucial part of what makes it seem not-obviously-impossible to me. Fully simulating Magnus should be adequate, but then obviously you’re invoking a reasoner. What I’m uncertain about is if you can write such instructions without invoking a reasoner.

• I agree that it’s plausible chess-plans can be compressed without invoking full reasoners (and with a more general point that there are degrees of compression you can do short of full-on ‘reasoner’, and with the more specific point that I was oversimplifying in my comment). My intent with my comment was to highlight how “but my AI only generates plans” is sorta orthogonal to the alignment question, which is pushed, in the oracle framework, over to “how did that plan get compressed, and what sort of cognition is involved in the plan, and why does running that cognition yield good outcomes”.

I have not yet found a pivotal act that seems to me to require only shallow realtime/​reactive cognition, but I endorse the exercise of searching for highly specific and implausibly concrete pivotal acts with that property.

• While writing this, I was reminded of an older (2017) conversation between Eliezer and Paul on FB. I reread it to see whether Paul seemed like he’d be making the set of mistakes this post is outlining.

It seems like Paul acknowledges the issues here, but his argument is that you can amplify humans without routing through “the hard parts” that are articulated in this post. i.e. it seems like you can use current ML to build something that helps a human effectively “think longer”, and he thinks one can do this without routing through the dangerous-plan-searchspace. I don’t know if there’s much counterargument beyond “no, if you’re building an ML system that helps you think longer about anything important, you already need to have solved the hard problem of searching through plan-space for actually helpful plans.”

Eliezer:

I have a new oversimplied straw way to misdescribe the disagreement between myself and Paul Christiano, and I’m interested to hear what Paul thinks of it:

Paul sees “learn this function” or “learn and draw from this probability distribution” as a kind of primitive op that modern machine learning is increasingly good at and can probably give us faithful answers to. He wants to figure out how to compose these primitives into something aligned.

Eliezer thinks that what is inside the black box inexorably kills you when the black box is large enough, like how humans are cognitive daemons of natural selection (the outer optimization process operating on the black box of genes accidentally constructed a (sapient) inner optimization process inside the black box) and this is chicken-and-egg unavoidable whenever the black box is powerful enough to do something like predict complicated human judgments, since in this case the outer optimization was automatically powerful enough to consider and select among multiple hypotheses the size of humans, and the inner process is automatically as powerful as human intelligence.

Paul thinks it may be possible to do something with this anyway given further expenditures of cleverness and computation, like taking a million such black boxes and competing them to produce the best predictions.

Eliezer expects an attempt like this to come with catastrophically exponential sample-complexity costs in theory, and to always fail in practice because you are trying to corral hostile superintelligences which means you’ve already lost. E.g. we can tell stories about how inner predictors could take over AIXI-tl by using bad off-policy predictions to manipulate AIXI-tl into a position where only that predictor (or LDT-symmetrized predictor class) can predict the answer to a one-way-hash problem it set up; and this isn’t an isolated flaw, it faithfully reflects the fact that once you are trying to outwit a hostile superintelligence you’re already screwed. Plus maybe it can do the equivalent of Rowhammering you, or using a bad but “testable” prediction just once that gets somebody to let it out of the box, etcetera. Only it doesn’t do any of those things, it does something cleverer, etcetera. Eliezer thinks that once there’s a hostile superintelligence running anywhere inside the system you are almost surely screwed *in practice*, which means Eliezer thinks Paul never gets to the point of completing one primitive op of the system before the system kills him.

Paul:

I think this is a mostly correct picture of the disagreement. I *would* agree with “what is inside the black box inexorably kills you when the black box is large enough,” if we imagine humans interacting with an arbitrarily large black box. This is a key point of agreement.

I am optimistic anyway because before humanity tries to produce “aligned AI with IQ 200” we can produce “aligned AI with IQ 199.” Indeed, if we train our systems with gradient descent then the intelligence will almost necessarily increase continuously. The goal is to maintain alignment as an inductive invariant, not to abruptly inject it into an extremely powerful system. So the gap between “smartest aligned intelligence we have access to” and “AI we are currently trying to train” is always quite small. This doesn’t make the problem easy, but I do think it’s a central feature of the problem that isn’t well accounted for in your arguments for pessimism.

Buck:

My guess of Eliezer’s reply is:

If we had an IQ 199 aligned AGI, that would indeed be super handy for building the IQ 200 one. But that seems quite unlikely.

Firstly, the black box learner required to build an AI that is aligned at all (eg the first step of capability amplification), even if that learned AI is very dumb, must itself be a really powerful learner, powerful enough that it is susceptible to scary internal unaligned optimizers.

Secondly, building an IQ 200 aligned agent via imitation of a sequence of progressively smarter aligned agents seems quite unlikely to be competitive, so without unlikely amounts of coordination someone will just directly build the IQ 200 agent.

Paul:

The first step of capability amplification is a subhuman AI similar in kind to the AI we have today; so if this is someone’s objection then they ought to be able to stick their neck out today (e.g. by saying that we can’t solve the alignment problem for systems we build today, or by saying that systems we can build today definitely won’t be able to participate in amplification).

The AlphaGo Zero example really seems to take much of the wind out of the concerns about feasibility. It’s the most impressive example of RL to date, and it was literally trained as a sequence of increasingly good go players learning to imitate one another.

I think the worst concerns are daemons (which are part of the unreasonably-innocuous-sounding “reliability” in my breakdown) and the impossibility of alignment-preserving amplification. Setting up imitation /​ making informed oversight work also seems pretty hard, but I think it’s less related to Eliezer’s concerns.

Buck:

My Eliezer-model says that systems we can build today definitely won’t be able to participate that helpfully in amplification. Amplification of human tasks seems like an extremely hard problem—our neural nets don’t seem like they’d be up to it without adding in a bunch of features like attention, memory, hierarchical planning systems, and so on. The daemons come in once you start adding in all of that, and if you don’t add in all that, your neural nets aren’t powerful enough to help you.

Skipping ahead a bit, to the next Paul rejoinder that felt relevant:

Paul

I think the most important general type of amplification is “think longer.” I think breaking a question down into pieces is a relatively easy case for amplification that can probably work with existing models. MCTS is a lot easier to get working than that.

> My Eliezer-model says that systems we can build today definitely won’t be able to participate that helpfully in amplification.

To the extent that this is actually anyone’s view, it would be great if they could be much clearer about what exactly they think can’t be done.

• The first step of capability amplification is a subhuman AI similar in kind to the AI we have today; so if this is someone’s objection then they ought to be able to stick their neck out today (e.g. by saying that we can’t solve the alignment problem for systems we build today, or by saying that systems we can build today definitely won’t be able to participate in amplification).

It seems non-obvious that the systems we have today can be aligned with human values. They certainly aren’t smart enough to model all of human morality, but they may be able to have some corrigibility properties? This presents the research directions of:

• Train a model to have corrigibility properties, as an existence proof. This also provides the opportunity to study the architecture of such a model.

• Develop some theory relating corrigibility properties to the expressiveness of your model.

• Eliezer thinks that what is inside the black box inexorably kills you when the black box is large enough, like how humans are cognitive daemons of natural selection (the outer optimization process operating on the black box of genes accidentally constructed a (sapient) inner optimization process inside the black box) and this is chicken-and-egg unavoidable whenever the black box is powerful enough to do something like predict complicated human judgments, since in this case the outer optimization was automatically powerful enough to consider and select among multiple hypotheses the size of humans, and the inner process is automatically as powerful as human intelligence.

It might be worth pointing out that evolution seems to be doing something different from the oracle in the Original Post.

Evolution:

• building something piece by piece, and testing those pieces (in reality), and then building things from those

Oracle:

• Wandering the space, adrift from that connection to reality, w/​out the checking throughout.

• I don’t know if there’s much counterargument beyond “no, if you’re building an ML system that helps you think longer about anything important, you already need to have solved the hard problem of searching through plan-space for actually helpful plans.”

This is definitely a problem, but I would further say that human amplification isn’t a solution, because humans aren’t aligned.

I don’t really have a good sense of what human values are, even in an abstract English-definition sense, but I’m pretty confident that “human values” are not, and are not easily transformable from, “a human’s values.”

Though maybe that’s just most of the reason why you’d have to have your amplifier already aligned, and not a separate problem itself.

• I’m having some trouble phrasing this comment clearly, and I’m also not sure how relevant it is to the post except that the post inspired the thoughts, so bear with me...

It seems important to distinguish between several things that could vary with time, over the course of a plan or policy:

1. What information is known.

• This is related to Nate’s comment here: it is much more computationally feasible to specify a plan/​policy if it’s allowed to contain terms that say “make an observation, then run this function on it to decide the next step,” rather than writing out a lookup table pairing every sequence of observations to the next action.

2. What objective function is being maximized.

• This is usually assumed (?) to be static in this kind of discussion, but in principle the objective could vary in response to future observations.

In principle, this is equivalent to a static objective function with terms for “how it would respond” to each possible sequence of observations (ignoring subtleties about orders over world-states vs. world-histories). But this has exactly the same structure as the previous point: it’s more feasible to say “make an observation, then run this function to update the objective” than to unroll the same thing into a lookup table known entirely at the start.

3. Object-level features of the actions that are chosen.

• Some of the properties under discussion, like Nate’s “lasing,” are about this stuff changing equivariantly /​ “in the right way” as other things change.
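The feasibility point in (1) and (2) can be made concrete with a toy sketch (all names here are hypothetical): a policy written as “make an observation, then run this function” stays constant-size, while the unrolled lookup table grows exponentially in the horizon T.

```python
# Policy as a function of the observations so far: its description size
# does not grow with the horizon T (it's just this code).
def policy(observations):
    # hypothetical rule: act if most observations so far were 1s
    return int(sum(observations) > len(observations) / 2)

# The same policy unrolled into a lookup table pairing every possible
# observation sequence with an action: 2^T entries for binary observations.
def lookup_table_entries(T, branching=2):
    return branching ** T

print(lookup_table_entries(10))   # 1024
print(lookup_table_entries(50))   # 1125899906842624, ~1.1e15 entries
```

The same exponential blow-up is why unrolling the objective’s update rule into a static objective, while possible in principle, incurs the exp(T) cost mentioned later in this comment.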

The recent discussions about consequentialism seem to be about the case where we have a task that takes a significant amount of real-world time, over which many observations (1) will be made with implications for subsequent decisions—but over which the objective (2) is approximately unchanging. This setup leads to various scary properties of what the policies actually do (3).

But, I don’t understand the rationale for focusing on this case where the objective (2) doesn’t change. (In the sense of “doesn’t change” specified above—that we can specify it simply over long time horizons, rather than incurring an exp(T) cost for unrolling its updates on observations sequences.)

One reason to care about this case is a hope for oracle AI, since oracle AI is something that receives “questions” (objectives simple enough for us to feel we understand) and returns “answers” (plans that may take time). This might produce a good argument that oracle AI is unsafe, but it doesn’t apply to systems with changing objectives.

In the case of human intelligence, it seems to me that (2) evolves not too much more slowly than (1), and becomes importantly non-constant for longer-horizon cases of human planning.

If I set myself a brief and trivial goal like “make the kitchen cleaner over the next five minutes,” I will spend those five minutes acting much like a clean-kitchen-at-all-costs optimizer, with all my subgoals pointing coherently in that direction (“wash this dish,” “pick up the sponge”). If I set myself a longer-term goal like “get a new job,” I may well find my preferences about the outcome have evolved substantially well before the task is complete.

This fact seems orthogonal to the fact that I am “good at search” relative to all known things that aren’t humans. Relative to all non-humans, I’m very good at finding policies that are high-EV for the targets I’m trying to hit. But my targets evolve over time.

Indeed, I imagine this is why the complexity of human value doesn’t create more of a problem for human action than it does. I don’t have a simply-specifiable constant objective with a term for “make people happy” (or whatever); I have an objective with an update rule that reacts to human feedback over time. The update rule may have been optimized for something on an evolutionary timescale, but it’s not obvious its application in an individual human can be modeled as optimizing anything.

(For a case that has the intelligence gap of humans/​AGI, consider human treatment of animals. I’ve heard this brought up as an analogy for misaligned AI, and it’s an interesting one. But the shape of the problem is not “humans are good at search, and have an objective which omits ‘animal values,’ or includes them in the wrong way.” Sometimes people just decide to become vegan for ethical reasons! Sometimes whole cultures do.

This looks like a real case of individual values being updated, i.e. I don’t think the right model of someone who goes vegan at age 31 is “this person is maximizing an objective which gives them points for eating animals, but only until age 31, and negative points thereafter.”)

If we think of humans as a prototype case of an “inner optimizer,” with evolution the outer optimizer, we have to note that the inner optimizer doesn’t have a constant objective, even though the outer one does. The inner optimizer is very powerful, has the lasing property, and all of that, but it gets applied to a changing objective, which seems to produce qualitatively different results in terms of corrigibility, Goodhart, etc. The same thing could be true of an AGI, if it’s the product of something like gradient descent rather than a system with an internal objective we explicitly wrote. This is not strong evidence that it will be true, but it at least motivates asking the question.

(It seems noteworthy, here, that when people talk about the causes of human misery /​ “non-satisfaction of human values,” they typically point to things like scarcity, coordination problems, and society-level optimization systems with constant objectives. If we’re good at search, and human value is complex, why aren’t we constantly harming each other by executing incorrigibly on misaligned plans at an individual level? Something fitting this description no doubt happens, but it causes less damage than a naive application of AI safety theory would lead one to expect.)

• Curated. This is a great instance of someone increasing clarity on a load-bearing concept (at least in some models) through an earnest attempt to improve their own understanding.

• I’m not sure I understand why it’s important that the fraction of good plans is 1% vs .00000001%. If you have any method for distinguishing good from bad plans, you can chain it with an optimizer to find good plans even if they’re rare. The main difficulty is generating enough bits—but in that light, the numbers I gave above are 7 vs 33 bits—not a clear qualitative difference. And in general I’d be kind of surprised if you could get up to say 50 bits but then ran into a fundamental obstacle in scaling up further.
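The bit-counting in this comment is just a base-2 log of the good-plan fraction; a quick sanity check using the fractions from this thread:

```python
import math

def selection_bits(fraction_good):
    """Bits of selection pressure needed to single out a good plan:
    -log2 of the fraction of sampled plans that are good."""
    return -math.log2(fraction_good)

print(round(selection_bits(1e-2)))    # 1% good plans         -> 7 bits
print(round(selection_bits(1e-10)))   # .00000001% good plans -> 33 bits
```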

• Can you be more concrete about how you would do this? If my method for evaluation is “sit down and think about the consequences of doing this for 10 hours”, I have no idea how I would chain it with an optimizer to find good plans even if they are rare.

• Basically the same techniques as in Deep Reinforcement Learning from Human Preferences and the follow-ups—train a neural network model to imitate your judgments, then chain it together with RL.

I think current versions of that technique could easily give you 33 bits of information—although as noted elsewhere, the actual numbers of bits you need might be much larger than that, but the techniques are getting better over time as well.
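A hedged, toy sketch of “train a model to imitate your judgments, then chain it with an optimizer.” Everything here is a stand-in, not the paper’s actual method: `human_approves` plays the expensive human judge, the fitted per-position weights play the neural reward model, and best-of-N search plays RL.

```python
import random
random.seed(0)

# Stand-in for the expensive human judgment (hours per evaluation):
def human_approves(plan):
    return sum(plan) >= 8   # hypothetical approval criterion

# Step 1: label a batch and fit a crude proxy "reward model":
# per-position weight = how much more often approved plans set that bit.
plans = [[random.randint(0, 1) for _ in range(10)] for _ in range(500)]
good = [p for p in plans if human_approves(p)]
bad = [p for p in plans if not human_approves(p)]
weights = [
    sum(p[i] for p in good) / len(good) - sum(p[i] for p in bad) / len(bad)
    for i in range(10)
]

def proxy_score(plan):
    return sum(w * x for w, x in zip(weights, plan))

# Step 2: chain the cheap proxy with an optimizer (best-of-N search),
# which needs no further human labels.
candidates = [[random.randint(0, 1) for _ in range(10)] for _ in range(5000)]
best = max(candidates, key=proxy_score)
print(human_approves(best))   # the proxy steers search toward approved plans
```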

• Hmm, I don’t currently find myself very compelled by this argument. Here are some reasons:

In order to even get a single expected datapoint of approval, I need to sample 10^8 examples, which in our current sampling method would take 10^8 * 10 hours, e.g. approximately 100,000 years. I don’t understand how you could do “Learning from Human Preferences” on something this sparse

I feel even beyond that, this still assumes that the reason it is proposing a “good” plan is pure noise, and not the result of any underlying bias that is actually costly to replace. I am not fully sure how to convey my intuitions here, but here is a (bad) analogy: it seems to me that you can have go-playing algorithms that lose 99.999% of games against an expert AI, but that doesn’t mean you can distill a competitive AI that wins 50% of games, even though it’s “only 33 bits of information”.

Like, the reason why your AI is losing has a structural reason, and the reason why the AI is proposing consequentialist plans also has a structural reason, so even if we get within 33 bits (which I do think seems unlikely), it’s not clear that you can get substantially beyond that, without drastically worsening the performance of the AI. In this case, it feels like maybe an AI maybe gets lucky and stumbles upon a plan that solves the problem without creating a consequentialist reasoner, but it’s doing that out of mostly luck, not because it actually has a good generator for non-consequentialist-reasoner-generating-plans, and there is no reliable way to always output those plans without actually sampling at least something like 10^4 plans.

The intuition of “as soon as I have an oracle for good vs. bad plans I can chain an optimizer to find good plans” feels far too strong to me in generality, and I feel like I can come up with a dozen counterexamples where this isn’t the case. Like, I feel like… this is literally a substantial part of the P vs. NP problem, and I can’t just assume my algorithm finds efficient solutions to arbitrary NP-hard problems.

• Thanks for the push-back and the clear explanation. I still think my points hold and I’ll try to explain why below.

In order to even get a single expected datapoint of approval, I need to sample 10^8 examples, which in our current sampling method would take 10^8 * 10 hours, e.g. approximately 100,000 years. I don’t understand how you could do “Learning from Human Preferences” on something this sparse

This is true if all the other datapoints are entirely indistinguishable, and the only signal is “good” vs. “bad”. But in practice you would compare /​ rank the datapoints, and move towards the ones that are better.

Take the backflip example from the human preferences paper: if your only signal was “is this a successful backflip?”, then your argument would apply and it would be pretty hard to learn. But the signal is “is this more like a successful backflip than this other thing?” and this makes learning feasible.

More generally, I feel that the thing I’m arguing against would imply that ML in general is impossible (and esp. the human preferences work), so I think it would help to say explicitly where the disanalogy occurs.

I should note that comparisons is only one reason why the situation isn’t as bad as you say. Another is that even with only non-approved data points to label, you could do things like label “which part” of the plan is non-approved. And with very sophisticated AI systems, you could ask them to predict which plans would be approved/​non-approved, even if they don’t have explicit examples, simply by modeling the human approvers very accurately in general.
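A minimal toy of the comparisons point: even when no sample ever clears an absolute “good” bar, pairwise preferences still order the samples, so a hill-climber can move in the right direction. The target and the preference function below are hypothetical stand-ins for “is this more like a successful backflip than this other thing?”

```python
import random
random.seed(1)

TARGET = 100   # hidden ground truth the labeler compares against

def prefer(a, b):
    # pairwise label: "a is better than b" -- never an absolute "good"
    return abs(a - TARGET) < abs(b - TARGET)

# Hill-climb using only comparisons, no absolute good/bad signal:
x = 0
for _ in range(300):
    candidate = x + random.choice([-1, 1])
    if prefer(candidate, x):
        x = candidate
print(x)   # ends in the target region despite never seeing "good"
```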

I feel even beyond that, this still assumes that the reason it is proposing a “good” plan is pure noise, and not the result of any underlying bias that is actually costly to replace.

When you say “costly to replace”, this is with respect to what cost function? Do you have in mind the system’s original training objective, or something else?

If you have an original cost function F(x) and an approval cost A(x), you can minimize F(x) + c * A(x), increasing the weight on c until it pays enough attention to A(x). For an appropriate choice of c, this is (approximately) equivalent to asking “Find the most approved policy such that F(x) is below some threshold”—more generally, varying c will trace out the Pareto boundary between F and A.
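A toy illustration of the F(x) + c · A(x) point, with hypothetical quadratic costs: as c grows, the minimizer slides from F’s optimum toward A’s optimum, tracing out the tradeoff curve between the two.

```python
def F(x):   # original training objective, minimized at x = 4
    return (x - 4.0) ** 2

def A(x):   # approval cost, minimized at x = 1
    return (x - 1.0) ** 2

def argmin_combined(c, grid):
    # brute-force minimizer of F(x) + c * A(x) over a 1-D grid
    return min(grid, key=lambda x: F(x) + c * A(x))

grid = [i / 100 for i in range(-200, 801)]
for c in [0.0, 1.0, 10.0, 100.0]:
    print(c, argmin_combined(c, grid))
# minimizer moves 4.0 -> 2.5 -> ~1.27 -> ~1.03 as c increases
```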

so even if we get within 33 bits (which I do think seems unlikely)

Yeah, I agree 33 bits would be way too optimistic. My 50% CI is somewhere between 1,000 and 100,000 bits needed. It just seems unlikely to me that you’d be able to generate, say, 100 bits but then run into a fundamental obstacle after that (as opposed to an engineering /​ cost obstacle).

Like, I feel like… this is literally a substantial part of the P vs. NP problem, and I can’t just assume my algorithm finds efficient solutions to arbitrary NP-hard problems.

I don’t think the P vs. NP analogy is a good one here, for a few reasons:

* The problems you’re talking about above are statistical issues (you’re saying you can’t get any statistical signal), while P vs. NP is a computational question.

* In general, I think P vs. NP is a bad fit for ML. Invoking related intuitions would have led you astray over the past decade—for instance, predicting that neural networks should not perform well because they are solving a problem (non-convex optimization) that is NP-hard in the worst case.

• This is true if all the other datapoints are entirely indistinguishable, and the only signal is “good” vs. “bad”. But in practice you would compare /​ rank the datapoints, and move towards the ones that are better.

Take the backflip example from the human preferences paper: if your only signal was “is this a successful backflip?”, then your argument would apply and it would be pretty hard to learn. But the signal is “is this more like a successful backflip than this other thing?” and this makes learning feasible.

More generally, I feel that the thing I’m arguing against would imply that ML in general is impossible (and esp. the human preferences work), so I think it would help to say explicitly where the disanalogy occurs.

I should note that comparisons is only one reason why the situation isn’t as bad as you say. Another is that even with only non-approved data points to label, you could do things like label “which part” of the plan is non-approved. And with very sophisticated AI systems, you could ask them to predict which plans would be approved/​non-approved, even if they don’t have explicit examples, simply by modeling the human approvers very accurately in general.

Well, sure, but that is changing the problem formulation quite a bit. It’s also not particularly obvious that it helps very much, though I do agree it helps. My guess is that even with a rank-ordering, you won’t get the 33 bits out of the system in any reasonable amount of time at a 10-hour evaluation cost. I do think that if you can somehow give more mechanistic and detailed feedback, I feel more optimistic in situations like this, but I also feel more pessimistic that we will actually figure out how to do that in situations like this.

More generally, I feel that the thing I’m arguing against would imply that ML in general is impossible (and esp. the human preferences work), so I think it would help to say explicitly where the disanalogy occurs.

I feel like you are arguing for a very strong claim here, which is that “as soon as you have an efficient way of determining whether a problem is solved, and any way of generating a correct solution some very small fraction of the time, you can just build an efficient solution that solves it all of the time”.

This sentence can of course be false without implying that the human preferences work is impossible, so there must be some confusion happening. I am not arguing that this is impossible for all problems; indeed, ML has shown that it is quite feasible for a lot of them. But the claim that it works for all of them is quite strong, and it seems obvious enough that this is very hard or impossible for a large class of other problems (like, e.g., reversing hash functions), so we shouldn’t assume that we can just do this for an arbitrary problem.

• I feel like you are arguing for a very strong claim here, which is that “as soon as you have an efficient way of determining whether a problem is solved, and any way of generating a correct solution some very small fraction of the time, you can just build an efficient solution that solves it all of the time”

Hm, this isn’t the claim I intended to make. Both because it overemphasizes on “efficient” and because it adds a lot of “for all” statements.

If I were trying to state my claim more clearly, it would be something like “generically, for the large majority of problems of the sort you would come across in ML, once you can distinguish good answers you can find good answers (modulo some amount of engineering work), because non-convex optimization generally works and there are a large number of techniques for solving the sparse rewards problem, which are also getting better over time”.

• If I were trying to state my claim more clearly, it would be something like “generically, for the large majority of problems of the sort you would come across in ML, once you can distinguish good answers you can find good answers (modulo some amount of engineering work), because non-convex optimization generally works and there are a large number of techniques for solving the sparse rewards problem, which are also getting better over time”.

I am a bit confused by what we mean by “of the sort you would come across in ML”. Like, is this situation, where we are trying to derive an algorithm that solves problems without optimizers from an algorithm that solves problems with optimizers, “the sort of problem you would come across in ML”? It feels pretty different to me from most usual ML problems.

I also feel like in ML it’s quite hard to actually do this in practice. Like, it’s very easy to tell whether a self-driving car AI has an accident, but not very easy to actually get it to not have any accidents. It’s very easy to tell whether an AI can produce a Harry Potter-quality novel, but not very easy to get it to produce one. It’s very easy to tell if an AI has successfully hacked some computer system, but very hard to get it to actually do so. I feel like the vast majority of real-world problems we want to solve do not currently follow the rule of “if you can distinguish good answers you can find good answers”. Of course, success in ML has come on the few subproblems where this turned out to be easy, but clearly our prior should be on this not working out, given the vast majority of problems where it turned out to be hard.

(Also, to be clear, I think you are making a good point here, and I am pretty genuinely confused for which kind of problems the thing you are saying does turn out to be true, and appreciate your thoughts here)

• When you say “costly to replace”, this is with respect to what cost function? Do you have in mind the system’s original training objective, or something else?

If you have an original cost function F(x) and an approval cost A(x), you can minimize F(x) + c * A(x), increasing the weight on c until it pays enough attention to A(x). For an appropriate choice of c, this is (approximately) equivalent to asking “Find the most approved policy such that F(x) is below some threshold”—more generally, varying c will trace out the Pareto boundary between F and A.

I was talking about “costly” in terms of computational resources. Like, of course if I have a system that gets the right answer in 1 in 100,000,000 cases, and I have a way to efficiently tell when it gets the right answer, then I can get it to approximately always give me the right answer by just running it a billion times. But that will also take a billion times longer.

In practice, I expect that in most situations where you have the combination of “in one in a billion cases I get the right answer and it costs me $1 to compute an answer” and “I can tell when it gets the right answer”, you won’t get to a point where you can compute a right answer for anything close to $1.

• I think the problem is not quite so binary as “good/​bad”. It seems to be more effective vs ineffective and beneficial vs harmful.

The problem is that effective plans are more likely to be harmful. We as a species have already done a lot of optimization along a lot of dimensions that are important to us, and the most highly effective plans almost certainly have greater side effects that make things worse along dimensions that we aren’t explicitly telling the optimizer to care about.

It’s not so much that there’s a direct link between sparsity of effective plans and likelihood of bad outcomes, as that more complex problems (especially dealing with the real world) seem more likely to have “spurious” solutions that technically meet all the stated requirements, but aren’t what we actually want. The beneficial effective plans become sparse faster than the harmful effective plans, simply because in a more complex space there are more ways to be unexpectedly harmful than good.

• Yes, I think I understand that more powerful optimizers can find more spurious solutions. But the OP seemed to be hypothesizing that you had some way to pick out the spurious from the good solutions, but saying it won’t scale because you have 10^50, not 100, bad solutions for each good one. That’s the part that seems wrong to me.

• That part does seem wrong to me. It seems wrong because 10^50 is possibly too small. See my post Seeking Power is Convergently Instrumental in a Broad Class of Environments:

If the agent flips the first bit, it’s locked into a single trajectory. None of its actions matter anymore.

But if the agent flips the second bit – this may be suboptimal for a utility function, but the agent still has lots of choices remaining. In fact, it still can induce observation histories. If and , then that’s observation histories. Probably at least one of these yields greater utility than the shutdown-history utility.

And indeed, we can apply the scaling law for instrumental convergence to conclude that for every u-OH, at least of its permuted variants (weakly) prefer flipping the second pixel at , over flipping the first pixel at .

Choose any atom in the universe. Uniformly randomly select another atom in the universe. It’s about times more likely that these atoms are the same, than that a utility function incentivizes “dying” instead of flipping pixel 2 at .

(For objectives over the agent’s full observation history, instrumental convergence strength scales exponentially with the complexity of the underlying environment—the environment in question was extremely simple in this case! For different objective classes, the scaling will be linear, but that’s still going to get you far more than 100:1 difficulty, and I don’t think we should privilege such small numbers.)

• Your “harmfulness” criteria will always have some false negative rate.

If you incorrectly classify a harmful plan as beneficial one time in a million, in the former case you’ll get 10^44 plans that look good but are really harmful for every one that really is good. In the latter case you get 10000 plans that are actually good for each one that is harmful.

• This would imply a fixed upper bound on the number of bits you can produce (for instance, a false negative rate of 1 in 128 implies at most 7 bits). But in practice you can produce many more than 7 bits, by double checking your answer, combining multiple sources of information, etc.
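A quick check of the bits arithmetic in this comment: a single check with a 1-in-128 false-negative rate gives 7 bits, and if k checks failed independently, their error rates would multiply and the bits would add. (The follow-up comment disputes exactly this independence assumption.)

```python
import math

fnr = 1 / 128                     # false-negative rate of one check
bits_per_check = -math.log2(fnr)
print(bits_per_check)             # 7.0

def bits(k):
    # k checks whose failures are assumed independent: errors multiply
    return -math.log2(fnr ** k)

print(bits(3))                    # 21.0: bits add only under independence
```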

• Combining multiple sources of information, double checking, etc. are ways to decrease error probability, certainly. The problem is that they’re not independent. For highly complex spaces, not only does the number of additional checks you need increase super-linearly, but the number of types of checks you need likely also increases super-linearly.

That’s my intuition, at least.

• beneficial effective plans become sparse faster than the harmful effective plans

The constants matter more than the trend here for whether a good plan for a pivotal act that sorts out AI risk in the medium term can be found. Discrimination of good plans only has to improve enough to cross the threshold needed to search for plans effective enough to solve that problem.

• Or rather, the part where alignment is hard is precisely when the thing I’m trying to accomplish is hard. Because then I need a powerful plan, and it’s hard to specify a search for powerful plans that don’t kill everyone.

• This post seems to be using a different meaning of “consequentialism” to what I am familiar with (that of moral philosophy). Subsequently, I’m struggling to follow the narrative from “consequentialism is convergently instrumental” onwards.

Can someone give me some pointers of how I should be interpreting the definition of consequentialism here? If it is just the moral philosophy definition, then I’m getting very confused as to why “judge morality of actions by their consequences” is a useful subgoal for agents to optimize against...

• Thanks, I found this helpful!

Consequentialism is convergently instrumental. Consequentialism is a (relatively) simple, effective process for accomplishing goals, so things that efficiently optimize for goals tend to approximate it.

I think this is the most important premise. I don’t have a solid justification for it yet, but I’m groping towards a non-solid justification at least in my agency sequence. I think John Wentworth’s stuff on the good regulator theorem is another line of attack that could turn into a solid justification. TurnTrout also has relevant work IIRC.

• Is there a book out there on instrumental convergence? I have not come across this idea before and would like to learn more about it.

• I think there’s some confusion going on with “consequentialism” here, and that’s at least a part of what’s at play with “why isn’t everyone seeing the consequentialism all the time”.

One question I asked myself reading this is “does the author distinguish ‘consequentialism’ with ‘thinking and predicting’ in this piece?” and I think it’s uncertain and leaning towards ‘no’.

So, how do other people use ‘consequentialism’?

It’s sometimes put forward as a moral tradition/​ethical theory, as an alternative to both deontology and virtue ethics. I forget which philosopher decided this was the trifecta but these are often compared and contrasted to each other. In particular, the version used here seems to not fit well with this article.

Another might be that consequentialism is an ethical theory that requires prediction (whereas others do not) -- I think this is an important feature of consequentialism, but it seems like ‘the set of all ethical theories which have prediction as a first class component’ is bigger than just consequentialism. I do think that ethical theories that require prediction as a first class component are important for AI alignment, specifically intent alignment (less clear if useful for non-intent-alignment alignment research).

A different angle to this would be “do common criticisms of consequentialism apply to the concept being used here”. Consequentialism has had a ton of philosophical debate over the last century (probably more?) and according to me there’s a bunch of valid criticisms.[1]

Finally, I feel like this is missing a huge step in the recent history of ethical theories, which is the introduction of Moral Uncertainty. I think Moral Uncertainty is a huge step, but the miss (in this article) is a ‘near miss’. I think a similar argument could have been made that AI researchers /​ alignment researchers, using the framing of Moral Uncertainty, should be updating on net in the direction of consequentialism being useful/​relevant for modeling systems (and possibly useful for designing alignment tech).

1. ^

I’m not certain that the criticisms will hold, but I think that proponents of consequentialism have insufficiently engaged with them; my current take is uncertain but leaning in the consequentialists’ favor. (See also: Moral Uncertainty)

• I’m pretty sure “consequentialism” here wasn’t meant to mean anything to do with ethics in this case (which I acknowledge as confusing)

I think consequentialism-as-ethics means “the right/​moral thing to do is to choose actions that have good consequences.”

I think consequentialism as Eliezer/​John meant here is more like “the thing to do is choose actions that have the consequences you want.”

A consequentialist is something that thinks, predicts, and plans (and, if possible, acts) in such a way as to bring about particular consequences.

(I think it’s plausible that we want different words for these things, but I think this use of the word consequentialism is fairly natural, and makes sense to see “moral consequentalism” as a subset of consequentialism.)

• Saying this again separately, if you taboo ‘consequentialism’ and take these as the definitions for a concept:

“the thing to do is choose actions that have the consequences you want.”

A ___ is something that thinks, predicts, and plans (and, if possible, acts) in such a way as to bring about particular consequences.

I think this is what “the majority of alignment researchers who probably are less on-the-ball” are in fact thinking about quite often.

We just don’t call it ‘consequentialism’.

• does it have a name, or just a vaguely amorphous concept blob?

• Goal-directed?

• I like this one. I think it does a lot to capture both the concept and the problem.

The concept is that we expect AI systems to be convergently goal-directed.

The problem is that people in AI research are often uncertain about goal-directedness and its emergence in advanced AI systems. (My attempt to paraphrase the problem of the post in terms of goal-directedness, at least.)

• Nothing comes to mind as a single term, in particular because I usually think of ‘thinking’, ‘predicting’, and ‘planning’ separately.

If you’re okay with multiple terms, ‘thinking, predicting, and planning’.

Aside: now’s a great time to potentially rewrite the LW tag header on consequentialism to match this meaning/​framing. (Would probably help with aligning people on this site, at least.) https://www.lesswrong.com/tag/consequentialism

• Yeah this seems like one way it could resolve the differences in arguments.

My guess (though I don’t know for certain) is that more AI alignment researchers would agree with “the thing to do is choose actions that have the consequences you want” is an important part of AI research, than “the right/​moral thing to do is to choose actions that have good consequences” is an important part of AI research.

I’m curious how much confusion you think is left after taboo-ing the term and communicating the clarification?

• I personally didn’t feel confused, so I think I mostly turn that question around to you? (i.e. it seemed natural to me to use “consequentalist” in this way, and insofar as any confusion came up, specifying ‘oh, no I didn’t mean it as an ethical theory’ seems like it should address it. But, you might disagree)

• I think my personal take is basically “yeah it seems like almost everything routes through a near-consequentialist theory” and “calling this theory ‘consequentialism’ seems fair to me”.

I spend a lot of time with people that are working on AI /​ AI Alignment who aren’t in the rationality community, and I don’t think this is the take for all of them. In particular I imagine from the “words have meaning, dammit” camp a lot of disagreement about ‘consequentialism’ the term, but if you taboo’d it, there’s a lot of broad agreement here.

In particular, I think this belief is super common and super strong in researchers focused on aligning AGI, or otherwise focused on long-term alignment.

I do think there’s a lot of disagreement in the more near-term alignment research field.

This is why this article felt weird to me—it’s not clear that there is a super wide mistake being made, and to the extent Raemon/John think there is, there are also a lot of people who are uncertain (again cf. moral uncertainty) even if updating in the ‘thinking/predicting’ direction.

E.g. for this bit:

I… guess what I think Eliezer thinks is that Thoughtful Researcher isn’t respecting inner optimizers enough.

My take is median Thoughtful Researcher is more uncertain about inner optimizers—instead of being certain that EY is wrong here.

And pointing at another bit:

Consequentialism is a (relatively) simple, effective process for accomplishing goals, so things that efficiently optimize for goals tend to approximate it.

I think people would disagree with this as consequentialism.

It maybe helps to point at another term that’s charged with a nontraditional meaning in this community: rationality.

We mean something closer to skeptical empiricism than the actual term, but if you taboo it I think you end up with a lot more agreement about what we’re talking about.

• I agree that non-agentic AI is a fool’s errand when it comes to alignment, but there’s one point where I sort of want to defend it as not being quite as bad as this post suggests:

Me: Okay, so partly you’re pointing out that hardness of the problem isn’t just about getting the AI to do what I want, it’s that doing what I want is actually just really hard. Or rather, the part where alignment is hard is precisely when the thing I’m trying to accomplish is hard. Because then I need a powerful plan, and it’s hard to specify a search for powerful plans that don’t kill everyone.

This depends heavily on the geometry/​measure of the space you search over, which depends heavily on the “interface” you have for interacting with the world.

Consider the case of getting rid of cancer. If your interface is a chemical mixture that you inject into the cancer patient(s), it would technically be a valid solution for that chemical mixture to contain nanobots that seize broad power and use it for extensive research that eventually generates a cure for cancer. But this is a complex solution, which requires enormously many coordinated pieces. It probably occupies a much smaller part of the search space than genuine solutions, and even unaligned solutions would tend to be stuff like “this kills the cancer but it also kills the patient”. Further, just from a computational point of view, the power-grabbing solution would be much easier to “catch”, because it suddenly requires modelling huge parts of society, which might be orders of magnitude more expensive than anything staying within a single person.

On the other hand, if your interface to the world isn’t a chemical mixture that gets injected into patients, but is instead a computer program that gets uploaded to some server, then a great deal of power seeking is necessary to even get an arrangement where you are able to medically affect the cancer patients. The jump from a direct medical solution to even further power seeking becomes much smaller.

This isn’t just an informal argument; you can take a look at e.g. Alex Turner’s proofs, and see that they are deeply dependent on the measure on the goals (which is formalized by the symmetry group chosen).

That said, this doesn’t necessarily solve things. There are some tasks that can be very satisfactorily solved with a narrow window like this, so that power-seeking isn’t a problem. But there is enormous economic value in more generalized interaction with the world, so there will inevitably be pressure to build genuine agents.
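The “much smaller part of the search space” intuition above can be put into a toy back-of-envelope calculation (all numbers and names here are invented for illustration, not from the post): if a plan needs k independently coordinated components, each hit with probability p under random search, its measure shrinks like p**k, so the many-piece power-grab plan is exponentially rarer than the direct one.

```python
# Purely illustrative toy measure: under random search, a plan needing k
# independently "coordinated" components, each hit with probability p,
# occupies roughly p**k of the search space.
def solution_measure(components, p_per_component=1e-3):
    # Independence assumption: all components must come together at once.
    return p_per_component ** components

direct = solution_measure(components=5)    # a mixture that just kills the tumor
nanobot = solution_measure(components=50)  # nanobots that seize power, then cure

# Under these toy numbers the power-grab plan is ~1e-135 times rarer.
ratio = nanobot / direct
```

The exact numbers are meaningless; the point is only that the gap is exponential in the number of coordinated pieces, which is why the direct-injection interface makes the power-seeking region of the space so thin.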

• Here are a couple of “hard” things you can easily do with hypercompute, without causing dangerous consequentialism.

Given a list of atom coordinates, run a quantum-accuracy simulation of those atoms. (Where the atoms don’t happen to make a computer running a bad program.)

Find the smallest arrangement of atoms that forms a valid OR gate, by brute-forcing over the above simulator.

Brute-forcing over large arrangements of atoms could find a design containing a computer containing an AI. But brute-forcing over arrangements of 100 atoms should be fine, and can do a lot of interesting chemistry. Note that a psychoactive that makes humans care less about AI risk won’t be preferentially selected: it’s not simulating the molecule in the world (that would be dangerous), it’s simulating a simple molecule in a vacuum. (Or a standard-temperature-and-pressure 80% N2 + 20% O2 atmosphere, or some other simple hardcoded test setup.)
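The search pattern being described can be sketched in miniature (everything here is a hypothetical stand-in: the “design” is a boolean formula and the “simulator” just evaluates it, where the atom-level version would swap in a physics simulation of the hardcoded test setup):

```python
# Toy sketch of brute-forcing the smallest design that meets a spec:
# enumerate candidates smallest-first, test each in a sandboxed simulator,
# return the first hit (which is therefore minimal).
from itertools import product

def simulate(formula, a, b):
    # Sandboxed evaluation: the candidate only sees its two inputs,
    # analogous to simulating a molecule in a vacuum rather than in the world.
    return eval(formula, {"__builtins__": {}}, {"a": a, "b": b})

def is_or_gate(formula):
    # Spec check: output matches OR on all four input combinations.
    return all(simulate(formula, a, b) == (a or b)
               for a, b in product([False, True], repeat=2))

def formulas_of_size(size):
    # Enumerate formulas of a given size, built from the inputs via OR and NAND.
    if size == 1:
        yield from ("a", "b")
        return
    for left_size in range(1, size):
        for left in formulas_of_size(left_size):
            for right in formulas_of_size(size - left_size):
                yield f"({left} or {right})"
                yield f"(not ({left} and {right}))"

def smallest_or_gate(max_size=4):
    # Searching smallest-first means the first valid design is a minimal one.
    for size in range(1, max_size + 1):
        for f in formulas_of_size(size):
            if is_or_gate(f):
                return f
    return None
```

Because the candidates are tiny and the simulator only ever sees the candidate plus a fixed test harness, there is no channel through which the search preferentially finds world-modelling designs, which is the comment’s point.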

• Let’s assume for a moment that consequentialism in Eliezer’s sense is the most pervasive thing in the problem space (this is not a claim anyone has made, as far as I can tell). What does leaning into consequentialism super hard look like in terms of approaches? The only line of attack I know of which seems to meet the description is the convergent power-seeking sequence.

• Not only that, most of the plans route through “acquire resources in a way that is unfriendly to human values.” Because in the space of all possible plans, while consequentialism doesn’t take that many bits to specify, human values are highly complex and take a lot of bits to specify.

1) It’s easier to build a moon base with money. And*, it’s easier to steal money than earn it.

*This is a hypothetical

2) Even replacing that plan with one that ‘human values’ says works is tricky. What is an acceptable way to earn money?

Just listing the plans.

One does not enumerate all of possibility.

Okay, but if I imagine a researcher who is thoughtful but a bit too optimistic, what they might counterargue with is: “Sure, but I’ll just inspect the plans for whether they’re unfriendly, and not do those plans.”

And here you swap out ‘a plan’ for ‘plans’.

Me: Okay, so partly you’re pointing out that hardness of the problem isn’t just about getting the AI to do what I want, it’s that doing what I want is actually just really hard. Or rather, the part where alignment is hard is precisely when the thing I’m trying to accomplish is hard. Because then I need a powerful plan, and it’s hard to specify a search for powerful plans that don’t kill everyone.

The fact that this is being used as a metaphor disconnects it from the problem.

Suppose, tomorrow, a ‘cure for cancer’ was created. And the solution was surprisingly simple.

It seems clear that, say, ‘beating you at chess’ isn’t that hard to plan. Why would ‘cure cancer’ be so very, very hard?

It seems like the tricky bit about a plan is that...maybe a plan wouldn’t work?

You might have to do experiments, and learn from them, and come up with new ideas...you are not sailing somewhere that is on a map, or doing something that has been done before.

• I wonder if the confusion isn’t about implications of consequentialism, but about the implications of independent agents. Related to the (often mentioned, but never really addressed) problem that humans don’t have a CEV, and we have competition built-in to our (inconsistent) utility functions.

I have yet to see a model of multiple agents WRT “alignment”. The ONLY reason that power/​resources/​self-preservation is instrumental is if there are unaligned agents in competition. If multiple agents agree on the best outcomes and the best way to achieve them, then it doesn’t matter which agent does what, or even which agent(s) exist.

Fully-aligned agents are really just multiple processing cores of one agent.

It’s when we talk about partial-alignment that we go off the rails. In this case, we should address competition and tradeoffs as actual things the agent(s) have to consider.

• Rather than saying that most likely-to-work plans for curing cancer route through consequentialism, I think it would be more precise to say that most simple likely-to-work plans route through consequentialism.

For every plan that can be summarized as “build a powerful consequentialist and then delegate the problem to it”, it seems like there should be a corresponding (perhaps very complicated) plan that can be summarized as “directly execute the plan that that consequentialist would have used if you had built it.”

The size of that complexity penalty varies depending on the nature of the problem. There’s maybe a useful sense in which the “hard problems” are exactly the ones where the complexity penalty for avoiding the consequentialist is large.

• If your planning space is finite, it doesn’t contain sufficiently complex plans for your argument to go through. If your planning space is infinite, you need to select some measure to be able to talk about “most”, and then that measure will capture the “simple” aspect.
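One way to make “that measure will capture the ‘simple’ aspect” concrete (the particular prior below is my own toy choice, not anything from the comment): encode plans as bitstrings and pick any normalizable length prior. The total mass of plans of length ≥ n then shrinks geometrically, so under such a measure “most plans” automatically means “mostly short, simple plans”.

```python
# Toy measure over an infinite plan space: a plan is a bitstring. Choose its
# length L with probability 2**-(L+1) (L = 0, 1, 2, ...), then a uniform
# string of that length. This normalizes to 1, and the mass of all plans of
# length >= n is exactly 2**-n: a normalizable measure has to thin out
# somewhere, and this one thins out by length, i.e. by complexity.

def mass_of_length(L):
    # Total mass shared by the 2**L plans of length exactly L.
    return 2.0 ** -(L + 1)

def tail_mass(n):
    # Closed form of the geometric tail: sum over L >= n of 2**-(L+1) = 2**-n.
    return 2.0 ** -n

# Numerically check the closed form by summing a long prefix of the tail.
approx = sum(mass_of_length(L) for L in range(10, 80))
```

So the finite/infinite dichotomy in the comment cashes out as: in the infinite case, whatever measure you pick plays the role that the plan-length cutoff played in the finite case.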