Arguments for optimism on AI Alignment (I don’t endorse this version, will reupload a new version soon.)

Or, why we probably don’t need to worry about AI.

This post is partly a response to Amalthea’s comment pointing out that I had simply claimed my side was right; at the time, I responded that I was going for a short comment rather than another very long one on the issue.


This is the post where I won’t simply claim that my side is right; instead I’ll give evidence, so I can properly collect my thoughts here. This will be a link-heavy post, and I’ll reference a lot of concepts and conversations, so it will help to have some light background on these ideas, but I will try to make everything intelligible to the lay/non-technical reader.

This will be a long post, so get a drink and a snack.

The Sharp Left Turn probably won’t happen, because AI training is very different from evolution

Nate Soares suggests that a critical problem in AI safety is the sharp left turn: essentially, that capabilities generalize much further than goals do, i.e., it is basically goal misgeneralization plus a fast takeoff:

My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world. Probably without needing explicit training for its most skilled feats, any more than humans needed many generations of killing off the least-successful rocket engineers to refine our brains towards rocket-engineering before humanity managed to achieve a moon landing.

And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn’t make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it’s not like they could be eating/​fornicating due to explicit reasoning about how those activities lead to more IGF. They can’t yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don’t suddenly start eating/​fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities.

So essentially, the analogy is that the AI looks aligned on the training data, but due to the limitations of the alignment method, that alignment fails to generalize to the test set.

Here’s the problem: We actually know why the sharp left turn happened, and the circumstances that led to the sharp left turn in humans won’t reappear in AI training and AI progress.

Basically, the sharp left turn happened because the outer optimizer, evolution, was billions of times less powerful than the inner search process, human lifetime learning, and the inner learners (us humans) die after basically a single step, or at best 2-3 steps, of the outer optimizer. Evolution mostly can’t transmit as many bits from one generation to the next via its tools as cultural evolution can, and the difference in their ability to transmit bits over a given time-scale is massive.

Once we had the ability to transmit some information via culture, our ability to optimize billions of times more efficiently meant we could essentially undergo a sharp left turn where capabilities spiked. But the only reason this happened is, to quote Quintin Pope:

Once the inner learning processes become capable enough to pass their knowledge along to their successors, you get what looks like a sharp left turn. But that sharp left turn only happens because the inner learners have found a kludgy workaround past the crippling flaw where they all get deleted shortly after initialization.

This does not exist for AIs trained with SGD: there is a much smaller gap between the outer optimizer, SGD, and the inner optimizer, with the difference being roughly 0-40x.

Here’s the source for it below, and I’ll explicitly quote it:


See also: Model Agnostic Meta Learning proposed a bi-level optimization process that used between 10 and 40 times more compute in the inner loop, only for Rapid Learning or Feature Reuse? to show they could get about the same performance while removing almost all the compute from the inner loop, or even by getting rid of the inner loop entirely.

Also, we can set the ratio of outer to inner optimization steps to basically whatever we want, which means we can control the inner learners’ rate of learning far better than evolution could, and thus prevent a sharp left turn from happening.
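To make the controllable step-ratio point concrete, here is a minimal first-order bi-level optimization sketch in the spirit of MAML/Reptile (pure numpy; the task, hyperparameters, and function names are all illustrative, not from any cited paper). The `inner_steps` argument is exactly the outer-to-inner ratio that evolution could not set:

```python
import numpy as np

def inner_grad(w, task):
    # Gradient of a toy quadratic task loss 0.5 * ||w - task||^2.
    return w - task

def train(outer_steps=100, inner_steps=5, outer_lr=0.1, inner_lr=0.3, seed=0):
    """First-order bi-level sketch: the outer loop optimizes an
    initialization (the "genome"); each inner learner runs
    `inner_steps` updates of "lifetime learning". Unlike evolution,
    the outer/inner step ratio here is a knob we choose."""
    rng = np.random.default_rng(seed)
    meta_w = np.zeros(3)                                # outer parameters
    for _ in range(outer_steps):
        task = rng.normal(loc=1.0, scale=0.1, size=3)   # sample a task optimum
        w = meta_w.copy()
        for _ in range(inner_steps):                    # inner "lifetime"
            w -= inner_lr * inner_grad(w, task)
        # Reptile-style first-order meta-update: move the init toward
        # the adapted weights.
        meta_w += outer_lr * (w - meta_w)
    return meta_w
```

With these toy numbers the learned initialization converges near the mean task optimum; shrinking `inner_steps` relative to `outer_steps` directly throttles how much the inner learner can run ahead of the outer loop.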

A crux I have with Jan Kulveit is that, to the extent animals do have culture, it is much more limited than human culture, and evolution largely has little ability to pass on traits non-culturally. Very critically, this was a one-time inefficiency: there is no reason to assume a second source of massive inefficiency leading to a sharp left turn:

A comment by X4vier in particular illustrates this, and I’ll show it below:



I don’t believe that Nate’s example actually shows the misgeneralization we’re concerned about

This is because the alleged misgeneralization was not a situation where a single AI was trained in one environment to maximize the correlates of IGF, then encountered inputs in a new environment that shifted its goals such that it misgeneralized and stopped pursuing IGF.

What happened is that evolution trained humans in one environment to optimize the correlates of IGF, then basically trained new humans in another environment, and the two diverged.

Very critically, there were thousands of different systems/humans being trained in drastically different environments, not one AI being trained across different environments as in modern AI training, so it’s not a valid example of misgeneralization.

Some posts and quotes from Quintin Pope will help:

(Part 2, how this matters for analogies from evolution) Many of the most fundamental questions of alignment are about how AIs will generalize from their training data. E.g., “If we train the AI to act nicely in situations where we can provide oversight, will it continue to act nicely in situations where we can’t provide oversight?”

When people try to use human evolutionary history to make predictions about AI generalizations, they often make arguments like “In the ancestral environment, evolution trained humans to do X, but in the modern environment, they do Y instead.” Then they try to infer something about AI generalizations by pointing to how X and Y differ.

However, such arguments make a critical misstep: evolution optimizes over the human genome, which is the top level of the human learning process. Evolution applies very little direct optimization power to the middle level. E.g., evolution does not transfer the skills, knowledge, values, or behaviors learned by one generation to their descendants. The descendants must re-learn those things from information present in the environment (which may include demonstrations and instructions from the previous generation).

This distinction matters because the entire point of a learning system being trained on environmental data is to insert useful information and behavioral patterns into the middle level stuff. But this (mostly) doesn’t happen with evolution, so the transition from ancestral environment to modern environment is not an example of a learning system generalizing from its training data. It’s not an example of:

We trained the system in environment A. Then, the trained system processed a different distribution of inputs from environment B, and now the system behaves differently.

It’s an example of:

We trained a system in environment A. Then, we trained a fresh version of the same system on a different distribution of inputs from environment B, and now the two different systems behave differently.

These are completely different kinds of transitions, and trying to reason from an instance of the second kind of transition (humans in ancestral versus modern environments), to an instance of the first kind of transition (future AIs in training versus deployment), will very easily lead you astray.

Two different learning systems, trained on data from two different distributions, will usually have greater divergence between their behaviors, as compared to a single system which is being evaluated on the data from the two different distributions. Treating our evolutionary history like humanity’s “training” will thus lead to overly pessimistic expectations regarding the stability and predictability of an AI’s generalizations from its training data.

Drawing correct lessons about AI from human evolutionary history requires tracking how evolution influenced the different levels of the human learning process. I generally find that such corrected evolutionary analogies carry implications that are far less interesting or concerning than their uncorrected counterparts. E.g., here are two ways of thinking about how humans came to like ice cream:

If we assume that humans were “trained” in the ancestral environment to pursue gazelle meat and such, and then “deployed” into the modern environment where we pursued ice cream instead, then that’s an example where behavior in training completely fails to predict behavior in deployment.

If there are actually two different sets of training “runs”, one set trained in the ancestral environment where the humans were rewarded for pursuing gazelles, and one set trained in the modern environment where the humans were rewarded for pursuing ice cream, then the fact that humans from the latter set tend to like ice cream is no surprise at all.

In particular, this outcome doesn’t tell us anything new or concerning from an alignment perspective. The only lesson applicable to a single training process is the fact that, if you reward a learner for doing something, they’ll tend to do similar stuff in the future, which is pretty much the common understanding of what rewards do.

A comment by Quintin on why humans didn’t actually misgeneralize to liking ice cream:


AIs are white boxes, and we are the innate reward system

Edit from comments due to Steven Byrnes: The white-box definition I’m using in this post does not correspond to the intuitive definition of a white box, and instead refers to the computer analysis/​security sense of the term.

These links will be the definitions of white box AI going forward for this post:




The above arguments, that the sharp left turn probably won’t reappear in modern AI development and that humans didn’t actually misgeneralize, are enough to land us outside the most doomy views like Eliezer Yudkowsky’s. In particular, removing the reasons to expect extreme misgeneralization lands us outside MIRI-sphere views, and arguably outside a 50% p(doom). But I want to argue that the chance of doom is far lower than that, so low that we mostly shouldn’t be concerned about AI. That requires a positive story for why AIs are very likely aligned, and the story I’ll argue for here is that AIs are white boxes and we are the innate reward system.

The key advantage we have over evolution is that, unlike with brains, we have full read-write access to AIs’ internals: they’re essentially a special type of computer program, and we already have ways to manipulate computer programs at essentially no cost to us. Indeed, this is why SGD and backpropagation work at all to optimize neural networks. If the AI were a black box, SGD and backpropagation wouldn’t work.
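As a toy illustration of that read-write access (a minimal numpy sketch; the model and numbers are made up), every internal weight and its exact gradient are directly visible and editable, which is precisely what backpropagation exploits:

```python
import numpy as np

# A minimal linear model, written out so every internal quantity is visible.
# "White box" in the computer-security sense: we can read and overwrite any
# weight or gradient directly, which is exactly what SGD/backprop relies on.

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 2))          # full READ access: just inspect W
x = np.array([1.0, 2.0])
target = np.array([0.0, 1.0])

y = W @ x                            # forward pass
grad_W = np.outer(y - target, x)     # exact per-parameter gradient (backprop)

W -= 0.1 * grad_W                    # SGD update: WRITE access to every weight
W[0, 1] = 0.0                        # ...or arbitrary direct edits, at no cost
```

Nothing analogous is possible with a brain: we cannot read out a synapse’s exact contribution to behavior, let alone set it to a chosen value in one step.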

The innate reward system aligns us via white-box methods, and the values the reward system imprints on us are ridiculously reliable: almost every human has empathy for friends and acquaintances, parental instincts, a desire for revenge, etc.

This is shown in the link below:


(Here, we must take a detour and note that our reward system is ridiculously good at aligning us to survive, and flaws like obesity in the modern world are usually surprisingly mild failures, in the sense that the human isn’t as capable as we thought rather than pursuing alien goals. This arguably implies that alignment failures in practice will look much more like capabilities failures, and passing the analogy back to the AI case, I basically don’t expect X-risk, GCRs, or really anything more severe than, say, the AI messing up a kitchen.)

Steven Byrnes raised the concern that if you don’t know how to do the manipulation, then it does cost you to gain the knowledge.

Steven Byrnes’s comment is linked here: https://​​​​posts/​​JYEAL8g7ArqGoTaX6/​​?commentId=3xxsumjgHWoJqSzqw

Nora Belrose responded on what white boxing meant, as well as how people use SGD to automate the search so that the cost of manipulation in an overall sense is as low as possible:


I mean it in the computer security sense, where it refers to the observability of the source code of a program (Nora Belrose)


We can do better than IDA Pro & Ghidra by exploiting the differentiability of neural nets, using SGD to locate the manipulations of NN weights that improve alignment the most

I’d be much more worried if we didn’t have SGD and were just evolving AGI in a sim or smth (Nora Belrose)


I’m pointing out that it’s a white box in the very literal sense that you can observe and manipulate everything that’s going on inside, and this is a far from trivial fact because you can’t do this with other systems we routinely align like humans or animals. (Nora Belrose)


No, I don’t agree this is a weakening. In a literal sense it is zero cost to analyze and manipulate the NNs. It may be greater than zero cost to come up with manual manipulations that achieve some goal. But that’s why we automate the search for manipulations using SGD (Nora Belrose)

Steven Byrnes argues that this could be due to differing definitions:


I think that’s a black box with a button on the front panel that says “SGD”. We can talk all day about all the cool things we can do by pressing the SGD button. But it’s still a button outside the box, metaphorically.

To me, “white box” would mean: If an LLM outputs A rather than B, and you ask me why, then I can always give you a reasonable answer. I claim that this is closer to how that term is normally used in practice.

(Yes I know, it’s not literally a button, it’s an input-output interface that also changes the black box internals.) (Steven Byrnes)

This is the response chain so that I could see why Nora Belrose and Steven Byrnes were disagreeing.

I ultimately think a key difference is that, for alignment purposes, “humans vs. AI” is not a very useful abstraction; “SGD vs. the inner optimizer” is the better one here. It thus doesn’t matter how AI progresses in general; what matters is the specific contest between humans + SGD and the inner optimizer, and on that front the cost of manipulating AI values is quite low.

This leads to...

I believe the security mindset is inappropriate for AI

In general, a common disagreement I have with a lot of LWers is that very little knowledge transfers from the computer security field to AI, because AI differs in ways that make the analogies inappropriate.

For one particular example, you can randomly double your training data or the size of your model, and it will usually work just fine. A rocket would explode if you tried to double the size of its fuel tanks.

All of this and more is explained by Quintin below. There are several big disanalogies between the AI field and the computer security field, so much so that I think ML/AI is a lot like quantum mechanics: we shouldn’t port intuitions from other fields and expect them to work, because of the weirdness of the domain:


Similarly, I think that machine learning is not really like computer security, or rocket science (another analogy that Yudkowsky often uses). Some examples of things that happen in ML that don’t really happen in other fields:

Models are internally modular by default. Swapping the positions of nearby transformer layers causes little performance degradation.

Swapping a computer’s hard drive for its CPU, or swapping a rocket’s fuel tank for one of its stabilization fins, would lead to instant failure at best. Similarly, swapping around different steps of a cryptographic protocol will usually make it output nonsense. At worst, it will introduce a crippling security flaw. For example, password salts are added before hashing the passwords. If you switch to adding them after, this makes salting near useless.

We can arithmetically edit models. We can finetune one model for many tasks individually and track how the weights change with each finetuning to get a “task vector” for each task. We can then add task vectors together to make a model that’s good at multiple of the tasks at once, or we can subtract out task vectors to make the model worse at the associated tasks.

Randomly adding /​ subtracting extra pieces to either rockets or cryptosystems is playing with the worst kind of fire, and will eventually get you hacked or exploded, respectively.

We can stitch different models together, without any retraining.

The rough equivalent for computer security would be to have two encryption algorithms A and B, and a plaintext X. Then, midway through applying A to X, switch over to using B instead. For rocketry, it would be like building two different rockets, then trying to weld the top half of one rocket onto the bottom half of the other.

Things often get easier as they get bigger. Scaling models makes them learn faster, and makes them more robust.

This is usually not the case in security or rocket science.

You can just randomly change around what you’re doing in ML training, and it often works fine. E.g., you can just double the size of your model, or of your training data, or change around hyperparameters of your training process, while making literally zero other adjustments, and things usually won’t explode.

Rockets will literally explode if you try to randomly double the size of their fuel tanks.

I don’t think this sort of weirdness fits into the framework /​ “narrative” of any preexisting field. I think these results are like the weirdness of quantum tunneling or the double slit experiment: signs that we’re dealing with a very strange domain, and we should be skeptical of importing intuitions from other domains.
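The model-editing point above (task vectors) can be sketched in a toy form like this (pure numpy; the "weights" are illustrative stand-ins for real finetuned model parameters, in the spirit of the task-arithmetic work Quintin describes):

```python
import numpy as np

# Toy sketch of task arithmetic: a "task vector" is the difference
# (finetuned weights - base weights). Adding task vectors composes skills;
# subtracting one degrades the associated skill. All values are made up.

base = {"w": np.zeros(4)}
finetuned_a = {"w": np.array([1.0, 0.0, 0.0, 0.0])}  # model finetuned on task A
finetuned_b = {"w": np.array([0.0, 2.0, 0.0, 0.0])}  # model finetuned on task B

task_a = {k: finetuned_a[k] - base[k] for k in base}
task_b = {k: finetuned_b[k] - base[k] for k in base}

# Compose: a model intended to be good at both A and B.
multi = {k: base[k] + task_a[k] + task_b[k] for k in base}

# Negate: a model intended to be worse at A.
negated = {k: base[k] - task_a[k] for k in base}
```

That this kind of blunt weight arithmetic often works on real networks, while the analogous operation on a cryptosystem or rocket would be catastrophic, is exactly the disanalogy being claimed.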

I also believe there’s an epistemic difference between computer security and alignment: in computer security, there’s an easy-to-check ground truth for whether a cryptosystem is broken, whereas in AI alignment, we can’t get feedback from proposed breakages of alignment schemes.

For more, see Quintin’s post section on the difference between AI safety and computer security in regards to epistemics, and a worked example of an attempted security break, where there is suggestive evidence that inner-misaligned models/optimization daemons go away as we increase the number of dimensions.


(Where Quintin Pope talks about the fact that alignment doesn’t have good feedback loops on the ground truth of “what counts as an attempted break?”, and an example of a claimed break that actually went away as the number of dimensions was scaled up; note that the disconfirmatory evidence was more realistic than the attempted break.)

This is why I disagreed with Jeffrey Ladish about the security mindset on Twitter: I believe it’s a trap for those not possessing technical knowledge, like a lot of LWers, and there are massive differences between AI and computer security that mean most attempted connections fail.


uh I guess I hope he reads enough to internalize the security mindset?? (Jeffrey Ladish)


I generally tend to think the security mindset is a trap, because ML/​AI alignment is very different from rocket engineering or cybersecurity.

For a primer on why, read @QuintinPope5′s post section on it:

https://​​​​posts/​​wAczufCpMdaamF9fy/​​my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Yudkowsky_mentions_the_security_mindset__ (Myself)

So now that I’ve tried to show why porting over the security mindset is flawed, I want to talk about a class of adversaries like gradient hackers, or inner-misaligned mesa-optimizers, and why I believe such attacks are actually very difficult to pull off against SGD. Even against the non-platonic-ideal version of SGD, we can detect most mesa-optimizers quite easily.

Inner Misalignment, or at least Gradient Hacking is very difficult for AIs trained on SGD

I’ll be taking the inner misalignment definition from Evan Hubinger’s post The Inner Alignment Problem:


We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem.

The basic reason it’s hard for a misaligned mesa-optimizer to stick around for long is that gradient descent is, in fact, much more powerful and white-boxy than people realize; in particular, it has 5 defenses that any mesa-optimizer would need to overcome in order to misalign it:


1.) Everything is optimized simultaneously and independently. The exact local sensitivity of each and every parameter to the loss is computed in the backward pass, and every parameter is updated based on this. This means it is impossible for a mesaoptimizer or gradient hacker (as long as it is differentiably connected to the network) to shield any part of itself from gradient updates. This includes the gradient hacker’s internal circuitry, its connections with other parts of the network, its connections to the network’s input data, and the connections between the rest of the network and the loss.

2.) Gradient descent is maximal. Each parameter is updated in the direction of steepest descent. This means that, unlike biological life, gradient hackers cannot ‘burn’ something to increase entropy to be traded off against decreasing entropy elsewhere. Only the local maximally increasing entropy direction is allowed.

3.) It is stateless. Each parameter is updated according to its local sensitivity to the loss for each datapoint. Credit is not propagated across time. It is impossible to ‘bargain’ with gradient descent by increasing loss now but decreasing loss more later, or having decreased loss in the past.

4.) It is first-order only. It only computes partial derivatives of the parameter to the loss. It does not take into account potential second order interactions. While bad in some cases, this makes gradient descent immune to ‘blackmail’ or ‘tripwires’ setup by mesaoptimizers where optimizing some part of their circuit would potentially cause a huge loss as a second order effect. GD will blithely trigger the tripwire and then optimize away the consequences. This also disallows ‘trade’ where a gradient hacker could cause low loss on one causal branch in exchange for high loss on some other branch. Gradient descent will instead sum the steepest descent direction on both branches.

5.) It is infinitesimal. GD only computes the infinitesimal local sensitivities of each parameter regardless of the actual learning rate or step-size in practice. This means gradient descent does not ‘see’ if you are standing next to a giant cliff, so cannot be blackmailed in this way either.

Basically, SGD will optimize the entire causal graph and leave no slack, and as a bonus it is extremely resistant to blackmail by mesa-optimizers. In general, a big part of my optimism around inner alignment is that SGD is extraordinarily good at credit assignment, and it has quite strong corrective features in the case that a mesa-optimizer does attempt to misalign it.
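Defense 1 can be illustrated with a toy example (pure numpy; the loss and numbers are made up): any parameter differentiably connected to the loss gets its own exact gradient, so a hypothetical "mesa" parameter that contributes loss cannot shield itself and simply gets optimized away:

```python
import numpy as np

# Toy loss: (theta - 1)^2 + mesa^2.
# Both parameters are differentiably connected to the loss, so both are
# updated every step; the "mesa" parameter cannot protect its own value.
theta = np.array([3.0])   # ordinary parameter, optimum at 1
mesa = np.array([5.0])    # stand-in for unwanted circuitry, optimum at 0

lr = 0.05
for _ in range(200):
    grad_theta = 2 * (theta - 1)   # d(loss)/d(theta)
    grad_mesa = 2 * mesa           # d(loss)/d(mesa): no shielding possible
    theta -= lr * grad_theta
    mesa -= lr * grad_mesa
```

After training, `theta` sits at its optimum and `mesa` has been driven to zero; the only way the mesa parameter could have survived is by being exactly loss-neutral, which is the narrow escape hatch the quoted post discusses.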

We also can detect most mesa-optimizers in the AI without the need for mechanistic interpretability, like so:

While it might be possible to construct gradient hackers around these features, the advantage to us is that all of these approaches (except perhaps 2) should be visible to us even without detailed mechanistic interpretability knowledge of the network. We know a-priori where the stop-gradients or non-differentiable parts of the architecture are. We can straightforwardly observe which gradients are being clipped, and if the gradient updates to the same set of parameters are being constantly clipped, that could be a red flag. We can also find regions of high curvature in the loss landscape by just inspecting the Hessians or approximations of them. Only a gradient hacker which utilizes correlations across minibatches to cancel out its interventions in model behaviour seems difficult to detect.
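As a toy sketch of the clipping red flag mentioned above (pure numpy; the gradient values and thresholds are invented for illustration), one could track how often each parameter's gradient saturates the clip and flag parameters that are clipped on nearly every step:

```python
import numpy as np

def clip_and_count(grads, clip_value, counts):
    """Clip per-parameter gradients and record which entries hit the clip.
    A parameter whose gradient is clipped on nearly every step is a red
    flag for the kind of gradient shielding described above."""
    clipped = np.clip(grads, -clip_value, clip_value)
    counts += np.abs(grads) > clip_value   # bools cast to 0/1 on add
    return clipped, counts

counts = np.zeros(4)
for step in range(100):
    # Hypothetical per-parameter gradients: index 0 is always huge,
    # the rest are well-behaved.
    grads = np.array([50.0, 0.1, -0.2, 0.05])
    _, counts = clip_and_count(grads, 1.0, counts)

# Flag parameters clipped in more than 90% of steps.
suspicious = counts / 100 > 0.9
```

In a real training run the same bookkeeping would hang off the optimizer's gradient-clipping step, one counter per parameter tensor.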

One caveat here is that the prevention of mesa-optimizers fully applies only to supervised/self-supervised learning on IID data, which is an unfortunate limitation, though I do expect SGD to still be ridiculously good at credit assignment even in the RL context.

In particular, the detection mechanisms for mesa-optimizers remain intact, but we do need to worry about one new potential inner-misalignment pathway.

Also of importance is that this post only applies to the classic supervised/​self-supervised learning on i.i.d data which most current foundation models are trained in. The important things about this paradigm is that the outputs of the model have no causal link to the data distribution—i.e. the model cannot influence what future data it will be trained on (except potentially highly indirectly influencing successor models [1]). In an RL context, or supervised learning with adaptive data sampling, where there is a link between the model’s outputs and the future data distribution, then this provides another route for gradient hackers to operate—by creating outputs which will steer the data distribution in a desired direction which will strengthen the gradient hacker.

But there’s also weak evidence that optimization daemons/​demons, often called inner misaligned models, go away when you increase the dimension count:


Another poster (ironically using the handle “DaemonicSigil”) then found a scenario in which gradient descent does form an optimization demon. However, the scenario in question is extremely unnatural, and not at all like those found in normal deep learning practice. So no one knew whether this represented a valid “proof of concept” that realistic deep learning systems would develop optimization demons.

Roughly two and a half years later, Ulisse Mini would make DaemonicSigil’s scenario a bit more like those found in deep learning by increasing the number of dimensions from 16 to 1000 (still vastly smaller than any realistic deep learning system), which produced very different results, and weakly suggested that more dimensions do reduce demon formation.



This was actually a crux in a discussion between me and David Xu about inner alignment. I argued that the sharp-left-turn conditions don’t exist in AI development; he argued that misalignment happens when gaps go uncorrected, likely referring to the gap between the base goal (the one SGD optimizes) and the internal optimizer’s goal, which leads to inner misalignment. I argued that inner misalignment is likely to be extremely difficult, because SGD can correct the gap between the inner and outer optimizer in most cases, and I’ve now shown that argument in this post:

Twitter conversation below:


Speaking as someone who’s read that post (alongside most of Quintin’s others) and who still finds his basic argument unconvincing, I can say that my issue is that I don’t buy his characterization of the doom argument—e.g. I disagree that there needs to be a “vast gap”. (David Xu)


SGD is not the kind of thing where you need “vast gaps” between the inner and outer optimizer to get misalignment; on my model, misalignment happens whenever gaps appear that go uncorrected, since uncorrected gaps will tend to grow alongside capabilities/​coherence. (David Xu)


since uncorrected gaps will tend to grow alongside capabilities/​coherence.

This is definitely what I don’t expect, and part of that is because I expect uncorrected inner misalignment to be squashed out by SGD unless something extreme happens:

https://​​​​posts/​​w2TAEvME2yAG9MHeq/​​gradient-hacking-is-extremely-difficult (Myself)


Yes, that definitely sounds cruxy—you expect SGD to contain corrective mechanisms by default, whereas I don’t. This seems like a stronger claim than “SGD is different from evolution”, however, and I don’t think I’ve seen good arguments made for it. (David Xu)

This reminds me, I should address that other conversation I had with David Xu about how strong the priors we need to encode to ensure alignment are, versus how much we can let the system learn and still get a good outcome, or alternatively, how much we need to specify upfront. And that leads to...

I expect reasonably weak priors to work well to align AI with human values, and that a lot of the complexity can be offloaded to the learning process

Equivalently speaking, I expect the cost of specification of values to be relatively low, and that a lot of the complexity is offloadable to the learning process.

This was another crux between David Xu and me, specifically on whether you can largely get away with weak priors, or whether you actually need to encode much stronger priors to prevent misalignment. It ultimately boiled down to the crux that I expected reasonably weak priors to be enough, guided by the innate reward system.

A big part of my reasoning here has to do with the fact that a lot of values and biases are inaccessible to the genome, which means it can’t directly specify them. It can shape them by setting up training algorithms and data, but it turns out to be very difficult for the genome to directly specify things like values. This is primarily because the genome does not have direct access to the world model or the brain, which would be required to hardcode the prior. To the extent that it can encode anything, it has to be over relatively simple properties, which means alignment has to work with relatively weak encoded priors, and the innate reward system generally does this fantastically, with examples of misalignment being rare and mild.

The fact that humans reliably get values like “having empathy for friends and acquaintances, parental instincts, wanting revenge when others harm us, etc.”, without requiring the genome to hardcode a lot of prior information, instead getting away with reasonably weak priors, is rather underappreciated: it means we don’t need to specify our values very much, and thus we can reliably offload most of the value-learning work to the AI.

Here are some posts and comments below:



(I want to point out that it’s not just that, with weak prior information, the genome can reliably bind humans to real-enough things such that, for example, they don’t die of thirst by drinking fake water; it can also create the innate reward system, which uses simple update rules to reliably get nearly every person on earth to have empathy for their family and ingroup, to want revenge when others harm them, etc., and the exceptions to the pattern are rare and usually mild alignment failures at worst. That’s the source of a lot of my optimism on AI safety and alignment.)




Here is the compressed conversation between David Xu and me:


(And the reason I’d be more optimistic there is basically because I expect the human has meta-priors I’d endorse, causing them to extrapolate in a “good” way, and reach a policy similar to one I myself would reach under similar augmentation.) (David Xu)


(In reality, of course, I disagree with the framing in both cases: “two different systems” isn’t correct, because the genetic information that evolution was working with in fact does encode fairly strong priors, as I mentioned upthread.) (David Xu)


My disagreement is that I expect the genetic priors to be quite weak, and that a lot of values are learned, not encoded in priors, because values are inaccessible to the genome:


Maybe we will eventually be able to hardcode it, but we don’t need that. (Myself)


Values aren’t “learned”, “inferred”, or any other words that suggests they’re directly imbibed from the training data, because values aren’t constrained by training data alone; if this were false, it would imply the orthogonality thesis is false. (David Xu)

I'm going to reply here and say that the orthogonality thesis is a lot like the no-free-lunch theorem: an extraordinarily powerful result that is too general to apply, because it only holds over the space of all logically possible AIs, and it only works if you apply a zero prior, which would require you to specify everything, including the values of the system, or at best to use something like brute-force search or memorization algorithms.

I have a very similar attitude toward "Most goals in goal space are bad." I'd probably agree in the most general sense, but even weak priors can prevent most goals from being bad, so I suspect the claim only bites under a zero-prior condition. To be clear, I'm not arguing that zero-prior models are aligned with people without specifying everything. I'm arguing that we can get away with reasonably weak priors and let within-lifetime learning do the rest.

Once you introduce even weak priors, the issue is basically resolved: weak priors work to induce the learning of values, and it's consistent with the orthogonality thesis for arbitrarily small amounts of prior information to suffice for learning alignment.
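To illustrate the zero-prior point, here is a toy demonstration in the spirit of no-free-lunch. It is entirely my own construction with made-up data: with a flat (zero) prior, the hypotheses consistent with the training data say nothing at all about an unseen input, while even a weak prior shifts the prediction.

```python
import itertools

# Hypotheses are all Boolean functions on 3 input bits. Training data fixes
# the outputs on two inputs; we ask what consistent hypotheses predict on an
# unseen input.
inputs = list(itertools.product([0, 1], repeat=3))
train = {(0, 0, 0): 1, (1, 1, 1): 1}  # observed "good" behavior
test_point = (0, 1, 0)

# Every assignment of outputs to the 8 inputs is a hypothesis (256 total).
hypotheses = [dict(zip(inputs, bits))
              for bits in itertools.product([0, 1], repeat=len(inputs))]
consistent = [h for h in hypotheses
              if all(h[x] == y for x, y in train.items())]

# Flat prior: the consistent hypotheses split evenly on the unseen point,
# which is the no-free-lunch situation.
flat_ones = sum(h[test_point] for h in consistent) / len(consistent)

# A weak prior: mildly upweight hypotheses that output 1 more often (a crude
# stand-in for a "weak prior over goal space").
def weight(h):
    return 2.0 ** sum(h.values())

total = sum(weight(h) for h in consistent)
weighted_ones = sum(weight(h) * h[test_point] for h in consistent) / total

print(flat_ones)      # 0.5: the data alone says nothing off-distribution
print(weighted_ones)  # 2/3: even a weak prior shifts the off-distribution prediction
```

The flat-prior answer is exactly 0.5 by symmetry, while the weakly weighted answer moves to 2/3, which is the shape of the argument above: zero prior gives you nothing, but any nonzero prior starts constraining the outcome.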

I could make an analogous argument for capabilities, and I’d be demonstrably wrong, since the conclusion doesn’t hold.

This is why I hate the orthogonality thesis, despite rationalists being right about it: it allows for too many outcomes, and an inference like "values aren't learned" can't be supported by the orthogonality thesis alone.


The problem with the orthogonality thesis is that it allows for too many outcomes, and notice I said the genetic prior is weak, not non-existent, which would be compatible with the orthogonality thesis. (Myself)


The orthogonality thesis, as originally deployed, isn’t meant as a tool to predict outcomes, but to counter arguments (pretty much) like the ones being made here: encountering “good” training data doesn’t constrain motivations. Beyond that the thesis doesn’t say much. (David Xu)


I suspect it's true when looking at the multiverse of AIs as a whole, if we impose a zero prior, but even weak priors start to constrain your motivations a lot. I have more faith in weak priors + whiteboxness working out than you do. (Myself)


I have more faith in weak priors + whiteboxness working out than you do.

I agree that something in the vicinity of this is likely [a] crux. (David Xu)


TBC, I do think it’s logically possible for the NN landscape to be s.t. everything I’ve said is untrue, and that good minds abound given good data. I don’t think this is likely a priori, and I don’t think Quintin’s arguments shift me very much, but I admit it’s possible. (David Xu)

## My own algorithm for how to do AI alignment

This is a subpoint, but for those who want a ready-to-go alignment plan, here it is:

  1. Implement a weak prior over goal space.

  2. Use DPO, RLHF, or something else to create a preference model.

  3. Create a custom loss function for the preference model.

  4. Use backpropagation to optimize that loss.

  5. Repeat until you reach an acceptably low loss.
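The five steps above could be sketched roughly as follows. This is a toy, hand-written stand-in, not a real DPO/RLHF pipeline: a linear preference model, a Bradley-Terry pairwise loss, an L2 penalty standing in for the "weak prior", plain gradient descent standing in for backprop, and invented comparison data.

```python
import math

def score(w, x):
    """Linear preference model: higher score means more preferred."""
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical comparison data: (chosen_features, rejected_features) pairs.
pairs = [([1.0, 0.2], [0.1, 0.9]),
         ([0.8, 0.1], [0.3, 0.7]),
         ([0.9, 0.3], [0.2, 0.8])]

w = [0.0, 0.0]
lr, l2 = 0.5, 0.01  # l2 plays the role of the "weak prior over goal space"

for step in range(200):                        # step 5: repeat until loss is low
    grad = [l2 * wi for wi in w]               # step 1: weak prior (L2 toward 0)
    for chosen, rejected in pairs:             # step 2: preference data
        margin = score(w, chosen) - score(w, rejected)
        p = sigmoid(margin)                    # step 3: Bradley-Terry loss term
        for i in range(len(w)):                # step 4: hand-derived gradient
            grad[i] -= (1 - p) * (chosen[i] - rejected[i])
    w = [wi - lr * gi for wi, gi in zip(w, grad)]

# After training, the model should prefer the "chosen" side of each pair.
print(all(score(w, c) > score(w, r) for c, r in pairs))  # True
```

A real pipeline would of course use a neural network and automatic differentiation; the sketch is only meant to show that the five steps compose into one coherent training loop.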

Now that I'm basically finished laying out the arguments and the conversations, let's move on to the conclusion:


My optimism on AI safety stems from a variety of sources. The reasons, in order of appearance in the post rather than importance, are:

  1. I don't believe the sharp left turn is anywhere near as general as Nate Soares makes it out to be. The conditions that caused a sharp left turn in humans were cultural learning letting humans optimize over much faster timescales than evolution could respond to, evolution not course-correcting us, and culture transmitting OOMs more information across generations than evolution could. None of these conditions hold for modern AI development.

  2. I don't believe Nate's example of humans misgeneralizing the goal of IGF works as an example of misgeneralization that matters for our purposes, because it was not a case where one AI is trained on a goal in environment A and then, in environment B, competently pursues a different goal.

Instead, what happened is that one human generation is "trained" in environment A, and then a fresh generation of humans is trained on a different distribution, which predictably produces more divergence than the single-agent case.

In particular, there's no reason to be concerned about AI alignment misgeneralizing on this basis, since we have no reason to believe that LessWrong's central example is actually misgeneralization. From Quintin:

If we assume that humans were “trained” in the ancestral environment to pursue gazelle meat and such, and then “deployed” into the modern environment where we pursued ice cream instead, then that’s an example where behavior in training completely fails to predict behavior in deployment.

If there are actually two different sets of training “runs”, one set trained in the ancestral environment where the humans were rewarded for pursuing gazelles, and one set trained in the modern environment where the humans were rewarded for pursuing ice cream, then the fact that humans from the latter set tend to like ice cream is no surprise at all.

In particular, this outcome doesn’t tell us anything new or concerning from an alignment perspective. The only lesson applicable to a single training process is the fact that, if you reward a learner for doing something, they’ll tend to do similar stuff in the future, which is pretty much the common understanding of what rewards do.

  3. AIs are mostly white boxes, at the very least, and the control we have over AI means a better analogy is our innate reward system, which aligns us to quite a lot of goals spectacularly well. So well, in fact, that the total evidence of alignment could easily put the probability of X-risk, or even of an AI killing a single human, 5-15+ OOMs lower, which would make the alignment problem a non-problem for our purposes. It would pretty much single-handedly make AI misuse the biggest problem, but that issue has different solutions, and governments are likely to regulate AI misuse anyway, so existential risk gets cut 10-99%+ or more.

  4. I believe the security mindset is inappropriate for AI, because aligning AI mostly doesn't involve dealing with adversarial intelligences or inputs: the most natural adversarial class, inner-misaligned mesa-optimizers/optimization daemons, mostly doesn't exist, for my next reason. Alignment is also in a different epistemic state from computer security, and there are other disanalogies that make porting intuitions from other fields into ML/AI research very difficult to do correctly.

  5. It is actually really difficult to inner-misalign an AI, since SGD is really good at credit assignment: it optimizes the entire causal graph leading to the loss, leaving no slack. Evolution had no such ability, as Gwern describes:


Imagine trying to run a business in which the only feedback given is whether you go bankrupt or not. In running that business, you make millions or billions of decisions, to adopt a particular model, rent a particular store, advertise this or that, hire one person out of scores of applicants, assign them this or that task to make many decisions of their own (which may in turn require decisions to be made by still others), and so on, extended over many years. At the end, you turn a healthy profit, or go bankrupt. So you get 1 bit of feedback, which must be split over billions of decisions. When a company goes bankrupt, what killed it? Hiring the wrong accountant? The CEO not investing enough in R&D? Random geopolitical events? New government regulations? Putting its HQ in the wrong city? Just a generalized inefficiency? How would you know which decisions were good and which were bad? How do you solve the “credit assignment problem”?

The way SGD solves this problem is by running backprop, which is a white-box algorithm; Nora Belrose explains it more here:


And that's the base optimizer, not the mesa-optimizer, which is why SGD has a chance to correct an inner-misaligned agent far more effectively than cultural/biological evolution, the free market, etc. ever could. It is white-box, like the inner optimizers it runs, and it solves credit assignment much better than those earlier optimizers could hope to.
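The credit-assignment contrast can be made concrete with a toy example, entirely my own construction: the same two-parameter "business" gets either one bit of outcome feedback or a per-parameter gradient.

```python
def loss(a, b):
    # pretend 'a' and 'b' are two of the business's many decisions,
    # with (invented) optimal settings a=3 and b=-1
    return (a - 3.0) ** 2 + (b + 1.0) ** 2

a, b = 0.0, 0.0

# Evolution-style feedback: one bit for the whole "lifetime".
survived = loss(a, b) < 1.0
print(survived)  # False, but the bit cannot say WHICH decision was to blame

# Backprop-style feedback: every parameter gets its own blame signal.
eps = 1e-6
grad_a = (loss(a + eps, b) - loss(a, b)) / eps  # approx -6: 'a' should rise
grad_b = (loss(a, b + eps) - loss(a, b)) / eps  # approx +2: 'b' should fall

# Following those per-parameter signals fixes both decisions directly.
for _ in range(100):
    a -= 0.1 * 2 * (a - 3.0)  # analytic gradient of the loss in a
    b -= 0.1 * 2 * (b + 1.0)  # analytic gradient of the loss in b
print(round(a, 3), round(b, 3))  # 3.0 -1.0: credit assigned, problem solved
```

One bit of feedback per episode must be split across every decision, while the gradient localizes blame immediately; that is the asymmetry the Gwern quote above is pointing at.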

  6. I believe that, due to information inaccessibility plus the fact that the brain acts quite a lot like a Universal Learning Machine/Neural Turing Machine, alignment in the human case, for things like surviving and having empathy for friends, can't depend on complicated genetic priors. To the extent that genetic priors are encoded at all, they need to be fairly weak, universal-ish priors, supplemented by the innate reward system, which is built on those priors and uses simple updating rules to reinforce certain behaviors and penalize others. This works ridiculously well to align humans to surviving and to things like empathy/sympathy for the ingroup, revenge, etc.

Now that I have listed the reasons for my optimism on AI safety, I'll add one mini-section to show that the shutdown problem for AI is almost solved.

Addendum 1: The shutdown problem for AI is almost solved

It turns out that we can keep the most useful aspects of Expected Utility Maximization while making an AI shutdownable.

Sami Petersen showed that we can give AIs incomplete preferences while weakening transitivity just enough to get a non-trivial theory of Expected Utility Maximization that's quite a lot safer. Elliott Thornley proposed that incomplete preferences could be used to solve the shutdown problem, and the very nice thing about subagent models of Expected Utility Maximization is that they require a unanimous committee in order for a decision to be accepted as a sure gain.

This is useful, but it can also lead to problems. On the one hand, we only need one expected-utility-maximizing subagent that wants the AI to be shutdownable in order to shut the whole system down. On the other hand, we would need to be somewhat careful about where the subagents' execution conditions/domains lie, since unanimous committees can be terrible: a single agent can grind the entire system to a halt, which is why unanimity is usually not a preferred way to govern anything in the real world.

Nevertheless, for AI safety purposes this is still very, very useful, and if it ends up holding under broader conditions than the ones outlined in the posts below, it might be the single biggest MIRI success of the last 15 years, which is ridiculously good.
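As a rough illustration of the unanimity mechanism, far simpler than Petersen's or Thornley's actual proposals, here is a toy committee (my own construction, with made-up utilities) where an action counts as a sure gain only if every subagent weakly prefers it and at least one strictly prefers it:

```python
from dataclasses import dataclass

@dataclass
class Subagent:
    name: str
    utility: dict  # action -> utility for this subagent

def committee_prefers(subagents, action, status_quo):
    # unanimity: everyone must weakly prefer, at least one strictly
    weakly = all(s.utility[action] >= s.utility[status_quo] for s in subagents)
    strictly = any(s.utility[action] > s.utility[status_quo] for s in subagents)
    return weakly and strictly

# Hypothetical utilities: the task subagent is indifferent between running
# and shutting down but would mildly prefer resisting shutdown; the shutdown
# subagent strictly disprefers resistance.
task = Subagent("task", {"keep_running": 2.0, "allow_shutdown": 2.0,
                         "resist_shutdown": 3.0})
guard = Subagent("shutdown", {"keep_running": 1.0, "allow_shutdown": 1.0,
                              "resist_shutdown": 0.0})
committee = [task, guard]

# Resistance is vetoed by the shutdown subagent, despite the task subagent's
# strict preference for it:
print(committee_prefers(committee, "resist_shutdown", "keep_running"))  # False

# And the committee's preference between running and shutting down is
# incomplete: neither direction is a unanimous improvement, so the system
# won't pay costs to push the outcome either way.
print(committee_prefers(committee, "keep_running", "allow_shutdown"))   # False
print(committee_prefers(committee, "allow_shutdown", "keep_running"))   # False
```

This is exactly the double-edged property described above: one shutdown-favoring subagent suffices to veto shutdown resistance, but the same veto power is what makes unanimous committees fragile as a general governance mechanism.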


http://u-242443/uploads/2023-05-02/m343uwh/The Shutdown Problem- Two Theorems%2C Incomplete Preferences as a Solution.pdf

Edit 3: I’ve removed addendum 2 as I think it’s mostly irrelevant, and Daniel Kokotajlo showed me that Ajeya actually expects things to slow down in the next few years, so the section really didn’t make that much sense.