That seems like quite a leap. If there is one particular development in humanity’s history that we can fully explain, we should then not cite evolution in any way, as an argument for anything?
Specifically, the point is that evolution’s approach to alignment is very different from, and much worse than, what we can do, so the evolution analogy doesn’t suggest concern: we are doing something very, very different and much better than what evolution did to ‘align’ humans.
Similarly, the mechanisms that allow for fast takeoff don’t automatically mean that the inner optimizer is billions of times faster or more powerful. It’s not human vs. AI takeoff that matters, but the outer optimizer (SGD) vs. the inner optimizer. And the implicit claim is that the outer/inner optimization differential will be vastly less unequal than the evolution/human differential, due to the very different dynamics of the AI situation.
Start with ‘deliberately.’ Why would that matter?
Because you can detect whether an evolution-like sharp left turn is happening, like this:
“If you suspect that you’ve maybe accidentally developed an evolution-style inner optimizer, look for a part of your system that’s updating its parameters ~a billion times more frequently than your explicit outer optimizer.”
And you can prevent it, because we can assign basically whatever ratio of outer to inner optimization steps we want, like so:
“Human “inner learners” take ~billions of inner steps for each outer evolutionary step. In contrast, we can just assign whatever ratio of supervisory steps to runtime execution steps, and intervene whenever we want.”
Step two seems rather arbitrary. Why billions?
It doesn’t much matter what the exact threshold is; the point is that the ratio of inner to outer optimization steps needs to be large enough that the outer optimizer (SGD) won’t be able to update the inner optimizer in time.
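To make the ratio check concrete, here is a minimal toy sketch (my own illustration, not code from Quintin’s post); the function names and the 1e9 threshold are illustrative assumptions:

```python
# Toy sketch of the detection heuristic quoted above: compare how often the inner
# process updates its state with how often the explicit outer optimizer (e.g. SGD)
# takes a step, and flag a suspiciously large ratio.

def inner_outer_ratio(inner_updates: int, outer_updates: int) -> float:
    """Ratio of inner-loop updates to outer-optimizer (SGD / evolution) steps."""
    return inner_updates / max(outer_updates, 1)

def looks_like_evolution_style_inner_optimizer(
    inner_updates: int, outer_updates: int, threshold: float = 1e9
) -> bool:
    # ~1e9 is the evolution/human-scale gap; modern training sits around 10-40x.
    return inner_outer_ratio(inner_updates, outer_updates) >= threshold

# Evolution vs. a human lifetime: billions of within-lifetime updates per selection step.
print(looks_like_evolution_style_inner_optimizer(2_000_000_000, 1))  # True
# A model adapting in-context between gradient steps: a much smaller ratio.
print(looks_like_evolution_style_inner_optimizer(40, 1))             # False
```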
Step three does not seem necessary at all. It does seem like currently we are doing exactly this, but even if we didn’t, the inner optimizer has more optimization pressure working for it in the relevant areas, so why would we presume that the outer optimizer would be able to supervise it effectively or otherwise keep it in check?
First, you’re essentially appealing to the argument that the inner optimizer is far more effective than the outer optimizer (SGD), which seems like the second argument. Second, Quintin Pope responded to that in another comment (linked below), but the gist is: A, we can in fact directly reward models for IGF or human values, unlike evolution; and B, human values are basically an exact target of the brain’s reward shaping, as shown by the fact that they arose at all.
https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/?commentId=f2CamTeuxhpS2hjaq#f2CamTeuxhpS2hjaq
Another reason is that, to quote Nora Belrose, we are the innate reward system, and once we use the appropriate analogies, we can do powerful things for alignment that evolution simply couldn’t do:
https://forum.effectivealtruism.org/posts/JYEAL8g7ArqGoTaX6/ai-pause-will-likely-backfire#White_box_alignment_in_nature
Also, some definitions here:
“inner optimizer” = the brain.
“inner loss function” = the combination of predictive processing and reward circuitry that collectively make up the brain’s actual training objective.
“inner loss function includes no mention of human values / objectives” because the brain’s training objective includes no mention of inclusive genetic fitness.
I would argue that ‘AIs contribute to AI capabilities research’ is highly analogous to ‘humans contribute to figuring out how to train other humans.’ And that ‘AIs seeking out new training data’ is highly analogous to ‘humans creating bespoke training data to use to train other people especially their children via culture’ which are exactly the mechanisms Quintin is describing humans as using to make a sharp left turn.
The key here is that there isn’t nearly as large a gap between the outer optimizer (SGD) and the inner optimizer, since SGD is far more powerful than evolution: it can directly select over policies.
Also, this comment from Quintin Pope is worth sharing on why we shouldn’t expect AIs to have a sharp left turn: we do not train AIs that are initialized from scratch, take billions of inner optimization steps before a single outer optimizer step, and then die or are deleted. We don’t kill AIs well before they’re fully trained.
https://forum.effectivealtruism.org/posts/JbScJgCDedXaBgyKC/?commentId=GxeL447BgCxZbr5eS#GxeL447BgCxZbr5eS
Even fast takeoff will not produce the sharp left turn unless the takeoff is specific to the inner optimizer. We would need not just fast takeoff, but fast takeoff localized to the inner optimizer, without SGD (the outer optimizer) also getting faster.
I would also note that, if you discover (as in Quintin’s example of Evo-inc) that major corporations are going around using landmines as hubcaps, and that they indeed managed to gain dominant car market share and build the world’s most functional cars until recently, that is indeed a valuable piece of information about the world, and whether you should trust corporations or other humans to be able to make good choices, realize obvious dangers and build safe objects in general. Why would you think that such evidence should be ignored?
The problem is that no such thing exists, and we have no reason to assume the evolutionary sharp left turn generalizes rather than being a one-off.
Quintin acknowledges the similarity, but says this would not result in an orders of magnitude speedup. Why not?
The key thing to remember is that the comparison here isn’t whether AI would speed up progress, but whether the inner optimizer will make multiple orders of magnitude more progress than the outer optimizer (SGD). Quintin suggests the answer is no, both because of history and because we can control the ratio of outer to inner optimization steps, and can directly reward models for, say, IGF or human flourishing, unlike evolution.
Thus the question isn’t about AI progress in general, or AI vs. human intelligence or progress, or how fast AI can take off in general, but rather how fast the inner optimizer can take off compared to the outer optimizer (SGD). This part is therefore addressing the wrong thing: it doesn’t demonstrate a sharp left turn, because it doesn’t show that the inner optimizer (as distinct from SGD) undergoes a fast takeoff, only that AI as a whole does, and those are different questions requiring different answers.
Ignore the evolution parallel here, and look only at the scenario offered. What happens when the AI starts contributing to AI research? If the AI suddenly became able to perform as a human-level alignment researcher or capabilities researcher, only at the speed of an AI with many copies in parallel, would that not speed up development by orders of magnitude? Is this not Leike’s explicit plan for Superalignment, with the hope that we could then shift enough resources into alignment to keep pace?
One could say ‘first the AI will speed up research by automating only some roles somewhat, then more roles more, so it won’t be multiple orders of magnitude at the exact same time’ but so what? The timelines this implies do not seem so different from the timeline jumps in evolution. We would still be talking (in approximate terms throughout, no need to get pedantic) about takeoff to vast superintelligence in a matter of years at most, versus a prior human information age that lasted decades, versus industrial civilization lasting centuries, versus agricultural civilization lasting millennia, versus cultural transmission lasting tens of thousands, homo sapiens hundreds of thousands, human-style primates millions, primates in general tens of millions, land animals hundreds of millions, life and Earth billions, the universe tens of billions? Presumably with a ‘slow takeoff’ period of years as AIs start to accelerate work, then a period of months when humans are mostly out of the loop, then… something else?
That seems like a sharp enough left turn to me.
“The second distinction he mentions is that this allows more iteration and experimentation. Well, maybe. In some ways, for some period. But the whole idea of ‘we can run alignment experiments on current systems, before they are dangerously general, and that will tell us what applies in the future’ assumes the conclusion.”
This definitely is a crux between a lot of pessimistic and optimistic views, and I’m not sure I totally think it follows from accepting the premise of Quintin’s post.
The third distinction claims that capabilities gains will be less general. Why? Are cultural transmission gains general in this sense, or specific? Except that enough of that then effectively generalized. Humans, indeed, have continuously gained new capabilities, then been bottlenecked due to lack of other capabilities, then used their new capabilities to solve the next bottleneck. I don’t see why this time is different, or why you wouldn’t see a human-level-of-generality leap to generality from the dynamics Quintin is describing. I see nothing in his evolutionary arguments here as reason to not expect that. There are reasons for or against expecting more or less such generality, but mostly they aren’t covered here, and seem orthogonal to the discussion.
I think the intention was that the generality boost was almost entirely due to the massive ratio between outer and inner optimization steps. I agree with you that the generality of capabilities gains isn’t really obviated by the quirks of evolution; it’s actually likely, IMO, to persist, especially with more effective architectures.
The fourth claim is made prior to its justification, which is in the later sections.
The point here is that we no longer have a reason to privilege the hypothesis that capabilities generalize further than alignment, because the mechanisms that enabled a human sharp left turn, with capabilities generalizing further than alignment, are almost entirely due to the specific oddities and quirks of evolution, and, very critically, they do not apply to our situation. That means we should expect far more alignment generalization than you would expect if you believed the evolution-to-human analogy.
In essence, Nate Soares is wrong to assume that capabilities generalizing further than alignment is a general feature of building intelligence. Instead, it was almost entirely due to evolution massively favoring inner optimizer parameter updates over outer optimizer updates. Combined with our ability to do things evolution flat out can’t, like setting the ratio of outer to inner parameter updates to exactly what we want, our ability to straightforwardly reward goals rather than reward-shape toward them, and the fact that for alignment purposes we are the innate reward system, this means the sharp left turn is a non-problem.
As a general note, these sections seem mostly to be making a general alignment is easy, alignment-by-default claim, rather than being about what evolution offers evidence for, and I would have liked to see them presented as a distinct post given how big and central and complex and disputed is the claim here.
I sort of agree with this, but the key here is that, since the sharp left turn doesn’t exist in its worrisome-for-alignment form, we can at the very least basically rule out very doomy views like yours or MIRI’s, solely from the fact that evolution provides no evidence for the sharp left turn in that worrisome form.
He starts with an analogous claim to his main claim, that humans being clearly misaligned with genetic fitness is not evidence that we should expect such alignment issues in AIs. His argument (without diving into his earlier linked post) seems to be that humans are fresh instances trained on new data, so of course we expect different alignment and different behavior.
But if you believe that, you are saying that humans are fresh versions of the system. You are entirely throwing out from your definition of ‘the system’ all of the outer alignment and evolutionary data, entirely, saying it does not matter, that only the inner optimizer matters. In which case, yes, that does fully explain the differences. But the parallel here does not seem heartening. It is saying that the outcome is entirely dependent on the metaphorical inner optimizer, and what the system is aligned to will depend heavily on the details of the training data it is fed and the conditions under which it is trained, and what capabilities it has during that process, and so on. Then we will train new more capable systems in new ways with new data using new techniques, in an iterated way, in similar fashion. How should this make us feel better about the situation and its likely results?
Basically, because it’s not an example of misgeneralization. The issue we have with alleged AI misgeneralization is that the AI is aligned with us on training distribution A, but misgeneralizes on test distribution B.
Very critically, it is not an example of one AI, A, being aligned with us, then being killed off and replaced by a misaligned AI, B, because we usually don’t delete AI models.
As for Quintin’s choice to ignore the outer optimization process, the basic justification is that in the evolutionary case the gap between the outer and inner optimizer is far greater: the inner optimizer is essentially 1,000,000,000 times more powerful than the outer optimizer, whereas modern AI has only a 10-40x gap between the inner and outer optimizer. This is shown in a section here:
https://www.lesswrong.com/posts/wAczufCpMdaamF9fy/my-objections-to-we-re-all-gonna-die-with-eliezer-yudkowsky#Edit__Why_evolution_is_not_like_AI_training
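As a rough illustration of where those two numbers could come from, here is a back-of-envelope sketch; the lifetime length, the one-update-per-second rate, and the batch-sized factor are my own assumptions, not figures from the linked section:

```python
# Back-of-envelope sketch of the inner/outer optimization-step gap in the two regimes.

# Evolution's "outer" step happens roughly once per generation, while the brain's
# "inner" learning runs continuously for decades:
seconds_per_reproductive_lifetime = 30 * 365 * 24 * 3600        # ~9.5e8 seconds
inner_updates_per_outer_step = seconds_per_reproductive_lifetime  # assume ~1 update/sec
print(f"evolution gap: ~{inner_updates_per_outer_step:.0e}x")     # on the order of 1e9

# Modern training: SGD takes an outer step every batch, so the runtime/"inner" process
# only gets a few dozen forward passes of experience between parameter updates:
forward_passes_per_gradient_step = 32
print(f"modern AI gap: ~{forward_passes_per_gradient_step}x")     # roughly 10-40x
```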
Once again, the whole argument is that the current techniques will break down when capabilities advance. Saying aberrant data does not usually break alignment at current capability levels is some evidence of robustness, given that the opposite would have been evidence against it and was certainly plausible before we experimented, but again this ignores the claimed mechanisms. It also ignores that the current style of alignment does indeed seem porous and imprecise, in a way that is acceptable at current capabilities levels but that would be highly scary at sufficiently high capabilities levels. My model of how this works is that the system will indeed incorporate all the data, and will get more efficient and effective at this as capabilities advance, but this does not currently have that much practical import in many cases.
Okay, a fundamental crux here is that I actually think we are far more powerful than evolution at aligning AI, arguably 6-10 OOMs better if not more. The best comparison is something like our innate reward system, which achieves very impressive alignment to inner rewards like compassion for our ingroup, and even its alignment failures, like obesity, are much less impactful than the hypothesized misalignment from AI.
The central claim, that evolution provides no evidence for the sharp left turn, definitely seems false to me, or at least strongly overstated. Even if I bought the individual arguments in the post fully, which I do not, that is not how evidence works. Consider the counterfactual. If we had not seen a sharp left turn in evolution, civilization had taken millions of years to develop to this point with gradual steady capability gains, and we saw humans exhibiting strong conscious optimization mostly for their genetic fitness, it would seem crazy not to change our beliefs at all about what is to come compared to what we do observe. Thus, evidence.
I do not agree with this, because evolution is very different from and much weaker than us at aligning intelligences, so a wide range of outcomes was possible, and it’s therefore not surprising that a sharp left turn happened. The counterfactual would definitely have strengthened the AI optimist case by a lot, but its negation does not provide evidence for AI ruin.
In the generally strong comments to OP, Steven Byrnes notes that current LLM systems are incapable of autonomous learning, versus humans and AlphaZero which are capable of it, and that we should expect this ability in future LLMs at some point. Constitutional AI is not mentioned, but so far it has only been useful for alignment rather than capabilities, and Quintin suggests autonomous learning mostly relies upon a gap between generation and discernment, in favor of discernment being easier. I think this is an important point, while noting that what matters is the ability to usefully discern between outputs at all, rather than it being easier, which is an area where I keep trying to put my finger on writing down the key dynamics and so far falling short.
I’ll grant this point, but then the question becomes: why do we expect the gap between the inner optimizer and the outer optimizer (SGD) to become very large via autonomous learning, which would be necessary for the worrisome version of the sharp left turn to exist?
Or equivalently, why should we expect the inner learner, as opposed to SGD, to reap basically all of the benefits of autonomous learning? Note this is not a question about whether autonomous learning could cause a generally fast AI takeoff, or whether it would boost AI capabilities enormously, but rather why the inner learner would gain ~all of the benefits of autonomous learning (to quote Steven Byrnes), rather than SGD, the outer optimizer, also gaining very large amounts of optimization power.
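For what it’s worth, here is a minimal sketch of the generation-vs-discernment loop behind autonomous learning as I understand it (my own framing, with hypothetical stand-in functions), which also shows why, in a standard setup, the outer optimizer gets to update on every round of the loop rather than the inner process reaping all the benefit:

```python
# Toy autonomous-learning loop: generate candidates, keep the ones the (easier)
# discernment step judges useful, then take an outer-optimizer step on that data.
import random

def generate(model_params, n=8):
    """Stand-in for sampling candidate outputs from the current model."""
    return [random.random() + model_params for _ in range(n)]

def judge(candidate, threshold=0.7):
    """Stand-in for the discernment step, assumed cheaper/easier than generation."""
    return candidate > threshold

def sgd_update(model_params, accepted):
    """Stand-in for an outer-optimizer step on the self-generated training data."""
    return model_params + 0.01 * len(accepted)

model_params = 0.0
for step in range(100):                             # each iteration includes one outer (SGD) step
    candidates = generate(model_params)             # runtime / "inner" generation
    accepted = [c for c in candidates if judge(c)]  # discernment filter keeps useful outputs
    model_params = sgd_update(model_params, accepted)  # the outer optimizer also benefits

print(f"final (toy) parameter value: {model_params:.2f}")
```

The point of the sketch is only that the outer optimizer sits inside the loop: every round of self-generated data passes through an SGD step, so the benefits of autonomous learning are not obviously confined to the inner learner.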