Firstly, I’d like to address the question of epistemics.
When I said “there’s no reason to reference evolution at all when forecasting AI development rates”, I was referring to two patterns of argument that I think are incorrect: (1) using the human sharp left turn as evidence for an AI sharp left turn, and (2) attempting to “rescue” human evolution as an informative analogy for other aspects of AI development.
(Note: I think Zvi did follow my argument for not drawing inferences about the odds of the sharp left turn specifically. I’m still starting by clarifying pattern 1 in order to set things up to better explain pattern 2.)
Pattern 1: using the human sharp left turn as evidence for an AI sharp left turn.
The original sharp left turn post claims that there are general factors about the structure and dynamics of optimization processes which both caused the evolutionary sharp left turn, and will go on to cause another sharp left turn in AI systems. The entire point of Nate referencing evolution is to provide evidence for these factors.
My counterclaim is that the causal processes responsible for the evolutionary sharp left turn are almost entirely distinct from anything present in AI development, and so the evolutionary outcome is basically irrelevant for thinking about AI.
From my perspective, this is just how normal Bayesian reasoning works. If Nate says:
P(human SLT | general factors that cause SLTs) ~= 1
P(human SLT | NOT general factors that cause SLTs) ~= 0
then observing the human SLT is very strong evidence for there being general factors that cause SLTs in different contexts than evolution.
OTOH, I am saying:
P(human SLT | NOT general factors that cause SLTs) ~= 1
And so observing the human SLT is no evidence for such general factors.
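This likelihood-ratio point can be made concrete with a toy calculation. The sketch below is illustrative only; all the probability values are invented stand-ins for the two framings, not numbers anyone has defended:

```python
# Toy Bayes-factor comparison of the two framings above.
# All probability values are invented for illustration.

def posterior_odds(prior_odds, p_e_given_h, p_e_given_not_h):
    """Odds of hypothesis H after observing evidence E."""
    bayes_factor = p_e_given_h / p_e_given_not_h
    return prior_odds * bayes_factor

# Nate's framing: the human SLT is ~certain given general SLT-causing
# factors and ~impossible without them, so observing it is strong evidence.
nate_odds = posterior_odds(prior_odds=1.0, p_e_given_h=0.99, p_e_given_not_h=0.01)

# My framing: the human SLT was ~certain even without general factors
# (evolution-specific mechanisms suffice), so the update is ~nil.
quintin_odds = posterior_odds(prior_odds=1.0, p_e_given_h=0.99, p_e_given_not_h=0.99)
```

Under the first setting the Bayes factor is ~99; under the second it is exactly 1, i.e., no update at all.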
Pattern 2: attempting to “rescue” human evolution as an informative analogy for other aspects of AI development.
When I explain my counterargument to pattern 1 to people in person, they will very often try to “rescue” evolution as a worthwhile analogy for thinking about AI development. E.g., they’ll change the analogy so it’s the programmers who are in a role comparable to evolution, rather than SGD.
I claim that such attempted inferences also fail, for the same reason argument pattern 1 above fails: the relevant portions of the causal graph driving evolutionary outcomes are extremely different from the causal graph driving AI outcomes, such that it's not useful to use evolution as evidence for inferences about nodes in the AI outcomes causal graph. E.g., the causal factors that drive programmers to choose a given optimizer are very different from the factors that cause evolution to "choose" a given optimizer. Similarly, evolution is not a human organization that makes decisions based on the causal factors that influence human organizations, so you should not look to evolution for evidence of organization-level failures that might promote a sharp left turn in AI.
Making this point was the purpose of the "alien space clowns" / EVO-Inc example. It was intended to provide a concrete example of two superficially similar-seeming situations whose causal structures are actually completely distinct, such that there are no useful updates to make from EVO-Inc's outcomes to other automakers. When Zvi says:
I would also note that, if you discover (as in Quintin’s example of Evo-inc) that major corporations are going around using landmines as hubcaps, and that they indeed managed to gain dominant car market share and build the world’s most functional cars until recently, that is indeed a valuable piece of information about the world, and whether you should trust corporations or other humans to be able to make good choices, realize obvious dangers and build safe objects in general. Why would you think that such evidence should be ignored?
Zvi is proposing that there are common causal factors that led to the alien clowns producing dangerous cars, and could also play a similar role in causing other automakers to make unsafe vehicles, such that Evo-Inc’s outcomes provide useful updates for predicting other automakers’ outcomes. This is what I’m saying is false about evolution versus AI development.
At this point, I should preempt a potential confusion: it’s not the case that AI development and human evolution share zero causal factors! To give a trivial example, both rely on the same physical laws. What prevents there being useful updates from evolution to AI development is the different structure of the causal graphs. When you update your estimates for the shared factors between the graphs using evidence from evolution, this leads to trivial or obvious implications for AI development, because the shared causal factors play different roles in the two graphs. You can have an entirely “benign” causal graph for AI development, which predicts zero alignment issues for AI development, yet when you build the differently structured causal graph for human evolution, it still predicts the same sharp left turn, despite some of the causal factors being shared between the graphs.
This is why inferences from evolutionary outcomes to AI development don’t work. Propagating belief updates through the evolution graph doesn’t change any of the common variables away from settings which are benign in the AI development graph, since those settings already predict a sharp left turn when they’re used in the evolution graph.
Concrete example 1: We know from AI development that having a more powerful optimizer, running for more steps, leads to more progress. Applying this causal factor to the AI development graph basically predicts “scaling laws will continue”, which is just a continuation of the current trajectory. Applying the same factor to the evolution graph, combined with the evolution-specific fact of cultural transmission enabling a (relatively) sudden unleashing of ~9 OOM more effectively leveraged optimization power in a very short period of time, predicts an extremely sharp increase in the rate of progress.
Concrete example 2: One general hypothesis you could have about RL agents is “RL agents just do what they’re trained to do, without any weirdness”. (To be clear, I’m not endorsing this hypothesis. I think it’s much closer to being true than most on LW, but still false.) In the context of AI development, this has pretty benign implications. In the context of evolution, due to the bi-level nature of its optimization process and the different data that different generations are “trained” on, this causal factor in the evolution graph predicts significant divergence between the behaviors of ancestral and modern humans.
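Concrete example 1 can be made tangible with a toy numeric sketch (all numbers invented) of how the same shared causal factor yields a smooth trajectory in one graph and a discontinuity in the other:

```python
import math

# Toy sketch of concrete example 1. The shared causal factor is
# "progress grows with (log) effectively leveraged optimization power";
# all numbers are invented for illustration.

def progress(optimization_power):
    return math.log10(optimization_power)

# AI-development graph: optimization power scales up steadily (10x/step),
# so the shared factor predicts "scaling laws continue".
ai_power = [10 ** t for t in range(1, 6)]
ai_trajectory = [progress(p) for p in ai_power]

# Evolution graph: the same factor, plus the evolution-specific fact that
# cultural transmission suddenly unleashes ~9 OOM more effectively
# leveraged optimization power, predicts a sharp jump.
evo_power = [10, 10, 10, 10 * 10**9, 10 * 10**9]
evo_trajectory = [progress(p) for p in evo_power]

ai_steps = [b - a for a, b in zip(ai_trajectory, ai_trajectory[1:])]
evo_steps = [b - a for a, b in zip(evo_trajectory, evo_trajectory[1:])]
```

The per-step progress increments stay roughly constant in the AI graph but spike by ~9 in the evolution graph, even though the shared factor itself is identical in both.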
Zvi says this is an uncommon standard of epistemics, for there to be no useful inferences from one set of observations (evolutionary outcomes) to another (AI outcomes). I completely disagree. For the vast majority of possible pairs of observations, there are not useful inferences to draw. The pattern of dust specks on my pillow is not a useful reference point for making inferences about the state of the North Korean nuclear weapons program. The relationship between AI development and human evolution is not exceptional in this regard.
Secondly, I’d like to address a common pattern in a lot of Zvi’s criticisms.
My post has a unifying argumentative structure that Zvi seems to almost completely miss. This leads to a very annoying dynamic where:
1. My post makes a claim / argument that serves a very specific role in the context of the larger structure.
2. Zvi misses that context, and interprets the claim / argument as making some broader claim about alignment in general.
3. Zvi complains that I'm over-claiming, being too general, or should split the post along the separate claims Zvi (falsely) believes I'm making.
The unifying argumentative structure of my post is as follows:
Having outlined my argumentative structure, I’ll highlight some examples where Zvi’s criticisms fall into the previously mentioned dynamic.
1:
[Zvi] He then goes on to make another very broad claim.
[Zvi quoting me] > In order to experience a sharp left turn that arose due to the same mechanistic reasons as the sharp left turn of human evolution, an AI developer would have to:
[I list some ways one could produce an ML training process that’s actually similar to human evolution in the relevant sense that would lead to an evolution-like sharp left turn at some point]
[Zvi criticizes the above list on the grounds that inner misalignment could occur under a much broader range of circumstances than I describe]
(I added the bolding)
The issue here is that the list in question is specifically for sharp left turns that arise “due to the same mechanistic reasons as the sharp left turn of human evolution”, as I very specifically said in my original post. I’m not talking about inner alignment in general. I’m not even talking about sharp left turn threat scenarios in general! I’m talking very specifically about how the current AI paradigm would have to change before it had a mechanistic structure sufficiently similar to human evolution that I think a sharp left turn would occur “due to the same mechanistic reasons as the sharp left turn of human evolution”.
2:
As a general note, these sections seem mostly to be making a general alignment is easy, alignment-by-default claim, rather than being about what evolution offers evidence for, and I would have liked to see them presented as a distinct post given how big and central and complex and disputed is the claim here.
That is emphatically not what those sections are arguing for. The purpose of these sections is to describe two non-sharp left turn causing mechanisms for fast takeoff, in order to better illustrate that fast takeoff != sharp left turn. Each section specifically focuses on a particular mechanism of fast takeoff, and argues that said mechanism will not, in and of itself, lead to misalignment. You can still believe a fast takeoff driven by that mechanism will lead to misalignment for other reasons (e.g., a causal graph that looks like: “(fast takeoff mechanism) → (capabilities) → (something else) → (misalignment)”), if, say, you think there’s another causal mechanism driving misalignment, such that the fast takeoff mechanism’s only contribution to misalignment was to advance capabilities in a manner that failed to address that other mechanism.
These sections are not arguing about the ease of alignment in general, but about the consequence of one specific process.
3:
The next section seems to argue that because alignment techniques work on a variety of existing training regimes all of similar capabilities level, we should expect alignment techniques to extend to future systems with greater capabilities.
That is, even more emphatically, not what that specific section is arguing for. This section focuses specifically on the "AIs do AI capabilities research" mechanism of fast takeoff, and argues that it will not itself cause misalignment. Its purpose is specific to the context in which I use it: to address the direct causal influence of (AIs do capabilities research) on (misalignment), not to argue about the odds of misalignment in general.
Further, the argument that section made wasn’t:
because alignment techniques work on a variety of existing training regimes all of similar capabilities level, we should expect alignment techniques to extend to future systems
It was:
alignment techniques already generalize across human contributions to AI capability research. Let’s consider eight specific alignment techniques:
[list of alignment techniques]
and eleven recent capabilities advances:
[list of capabilities techniques]
I don’t expect catastrophic interference between any pair of these alignment techniques and capabilities advances.
And so, if you think AIs doing capabilities research will be like humans doing capabilities research, but faster, then there will be a bunch of capabilities and alignment techniques, and the question is how much the capabilities techniques will interfere with the alignment techniques. Based on current data, the interference seems small and manageable. The trend being projected forwards is this lack of empirical interference between current capabilities and alignment (despite, as I note in my post, current capabilities techniques putting ~zero effort into not interfering with alignment techniques, an obviously dumb oversight that we haven't corrected because it turns out we don't even need to).
Once again, I emphasize that this is not a general argument about alignment, which can be detached from the rest of the post. It’s extremely specific to the mechanism for fast takeoff being analyzed, which is only being analyzed to further explore the connection between fast takeoff mechanisms and the odds of a sharp left turn.
4:
He closes by arguing that iteratively improving training data also exhibits important differences from cultural development, sufficient to ignore the evolutionary evidence as not meaningful in this context. I do not agree. Even if I did agree, I do not see how that would justify his broader optimism expressed here:
This part is a separate analysis of a different fast takeoff causal mechanism, arguing that it will not, itself, cause misalignment either. Its purpose and structure mirror those of the argument I clarified above, but focused on a different mechanism. It's not a continuation of a previous (non-existent) "alignment is easy in general" argument.
Thirdly, I’d like to make some random additional commentary.
I would argue that ‘AIs contribute to AI capabilities research’ is highly analogous to ‘humans contribute to figuring out how to train other humans.’ And that ‘AIs seeking out new training data’ is highly analogous to ‘humans creating bespoke training data to use to train other people especially their children via culture’ which are exactly the mechanisms Quintin is describing humans as using to make a sharp left turn.
The degree of similarity is arguable. I think, and said in the original article, that similarity is low for the first mechanism and moderate for the second.
However, the appropriate way to estimate the odds of a given fast takeoff mechanism leading to AI misalignment is not to estimate the similarity between that mechanism and what happened during human evolution, then assign misalignment risk to the mechanism in proportion to the estimated similarity. Rather, the correct approach is to: (1) build detailed causal models of how both human evolution and AI development work, (2) propagate the evidence from human evolutionary outcomes back through your human evolution causal model to update its relevant latent variables, (3) transfer those updates to any latent variables of the AI development causal model that also appear in the human evolution causal model, and (4) estimate the new misalignment risk implied by the updated variables of the AI development model.
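A schematic of that procedure, with the causal models reduced to dicts of named latent variables. All the variable names and values here are my own invented stand-ins, not a real inference engine:

```python
# Schematic of the four-step procedure above. "Models" are just dicts of
# named latents; a real version would need an actual probabilistic model.
# All names and values are invented for illustration.

evolution_model = {
    "optimization_power_drives_progress": 0.5,  # latent shared with AI graph
    "bilevel_optimization": True,               # evolution-specific structure
    "cultural_transmission_oom_jump": True,     # evolution-specific structure
}
ai_model = {
    "optimization_power_drives_progress": 0.5,  # same shared latent
    "single_level_sgd_training": True,          # AI-specific structure
}

def update_from_evolutionary_outcomes(model):
    # Step 2: propagate the observed sharp left turn back through the
    # evolution model; here it just confirms the shared latent.
    updated = dict(model)
    updated["optimization_power_drives_progress"] = 0.9
    return updated

def transfer_shared_updates(source, target):
    # Step 3: copy updates over only for latents present in both graphs.
    shared = source.keys() & target.keys()
    return {k: (source[k] if k in shared else v) for k, v in target.items()}

def predicted_outcome(model):
    # Step 4: the prediction depends on graph structure, not just the
    # shared latent's value.
    if model.get("bilevel_optimization") and model.get("cultural_transmission_oom_jump"):
        return "sharp left turn"
    return "scaling laws continue"

updated_evo = update_from_evolutionary_outcomes(evolution_model)
updated_ai = transfer_shared_updates(updated_evo, ai_model)
```

Even after the shared latent takes its post-evolution value, the AI graph's prediction is unchanged, which is the sense in which the transferred implications are trivial.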
I discussed this in more detail in the first part of my comment, but whenever I do this, I find that the transfer from (observations of evolutionary outcomes) to (predictions about AI development) is pretty trivial or obvious, leading to such groundbreaking insights as:
- More optimization power leads to faster progress
- Human level general intelligence is possible
- Neural architecture search is a bad thing to spend most of your compute on
- Retraining a fresh instance of your architecture from scratch on different data will lead to different behavior
That seems like a sharp enough left turn to me.
A sharp left turn is more than just a fast takeoff. It’s the combined sudden increase in AI generality and breaking of previously existing alignment properties.
...humans being clearly misaligned with genetic fitness is not evidence that we should expect such alignment issues in AIs. His argument (without diving into his earlier linked post) seems to be that humans are fresh instances trained on new data, so of course we expect different alignment and different behavior.
But if you believe that, you are saying that humans are fresh versions of the system. You are entirely throwing out from your definition of ‘the system’ all of the outer alignment and evolutionary data, entirely, saying it does not matter, that only the inner optimizer matters. In which case, yes, that does fully explain the differences. But the parallel here does not seem heartening. It is saying that the outcome is entirely dependent on the metaphorical inner optimizer, and what the system is aligned to will depend heavily on the details of the training data it is fed and the conditions under which it is trained, and what capabilities it has during that process, and so on. Then we will train new more capable systems in new ways with new data using new techniques, in an iterated way, in similar fashion. How should this make us feel better about the situation and its likely results?
I find this perspective baffling. Where else do the alignment properties of a system derive from? If you have a causal structure like (programmers) → (training process) → (alignment properties), then setting the value of the middle node will of course screen off the causal influence of the (programmers) node.
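Here is a minimal sketch of the screening-off claim, with all functions and values invented for illustration: once the middle node is set directly, varying the upstream (programmers) node no longer changes the downstream alignment properties.

```python
# Minimal deterministic sketch of a chain
# (programmers) -> (training process) -> (alignment properties).
# All names and values are invented for illustration.

def training_process(programmer_intent):
    # Programmers influence outcomes only via the training setup they pick.
    data = "curated" if programmer_intent == "cautious" else "scraped"
    return {"data": data, "optimizer": "SGD"}

def alignment_properties(process):
    # The downstream node is a function of the training process alone.
    return "aligned" if process["data"] == "curated" else "unknown"

def outcome(programmer_intent, intervened_process=None):
    # With no intervention, programmer intent flows through the middle node;
    # with the middle node fixed, intent has no remaining influence.
    process = intervened_process or training_process(programmer_intent)
    return alignment_properties(process)
```

Without intervention, programmer intent matters; with the middle node fixed, it does not.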
A possible clarification: in the context of my post when discussing evolution, “inner optimizer” means the brain’s “base” optimization process, not the human values / intelligence that arises from that process. The mechanistically most similar thing in AI development to that meaning of the word “inner optimizer” is the “base” training process: the combination of training data, base optimizer, training process, architecture, etc. It doesn’t mean the cognitive system that arises as a consequence of running that training process.
Consider the counterfactual. If we had not seen a sharp left turn in evolution, civilization had taken millions of years to develop to this point with gradual steady capability gains, and we saw humans exhibiting strong conscious optimization mostly for their genetic fitness, it would seem crazy not to change our beliefs at all about what is to come compared to what we do observe. Thus, evidence.
I think Zvi is describing a ~impossible world. I think this world would basically break ~all my models of how optimizing processes gain capabilities. My new odds of an AI sharp left turn would depend on the new models I made in this world, which in turn would depend on unspecified details of how civilizational / AI progress happens in this world.
I would also note that Quintin in my experience often cites parallels between humans and AIs as a reason to expect good outcomes from AI due to convergent outcomes, in circumstances where it would be easy to find many similar distinctions between the two cases. Here, although I disagree with his conclusions, I agree with him that the human case provides important evidence.
Once again, it’s not the degree of similarity that determines what inferences are appropriate. It’s the relative structure of the two causal graphs for the processes in question. The graphs for the human brain and current AI systems are obviously not the same, but they share latent variables that serve similar roles in determining outcomes, in a way that the bi-level structure of evolution’s causal graph largely prevents. E.g., Steven Byrnes has a whole sequence which discusses the brain’s learning process, and while there are lots of differences between the brain and current AI designs, there are also shared building blocks whose behaviors are driven by common causal factors. The key difference with evolution is that, once one updates the shared variables from looking at human brain outcomes and applies those updates to the AI development graph, there are non-trivial / obvious implications. Thus, one can draw relevant inferences by observing human outcomes.
Concrete example 1: brains use a local, non-gradient-based optimization process to minimize predictive error, so there exist some non-SGD update rules that are competitive with SGD (on brainlike architectures, at least).
Concrete example 2: brains don't require GPT-4 level volumes of training data, so there exist architectures with vastly more data-friendly scaling laws than GPT-4's.
In the generally strong comments to OP, Steven Byrnes notes that current LLM systems are incapable of autonomous learning, versus humans and AlphaZero which are, and that we should expect this ability in future LLMs at some point. Constitutional AI is not mentioned, but so far it has only been useful for alignment rather than capabilities, and Quintin suggests autonomous learning mostly relies upon a gap between generation and discernment in favor of discernment being easier. I think this is an important point, while noting that what matters is the ability to usefully discern between outputs at all, rather than it being easier, which is an area where I keep trying to put my finger on the key dynamics in writing and so far falling short.
What I specifically said was:
Autonomous learning basically requires there to be a generator-discriminator gap in the domain in question, i.e., that the agent trying to improve its capabilities in said domain has to be better able to tell the difference between its own good and bad outputs.
I realize this accidentally sounds like it's saying two things at once (that autonomous learning relies on the generator-discriminator gap of the domain, and then that it relies on the gap for the specific agent (or system in general)). To clarify, I think it's the agent's capabilities that matter, that the domain determines how likely the agent is to have a persistent gap between generation and discrimination, and I don't think the (basic) dynamics are too difficult to write down.
You start with a model M and initial data distribution D. You train M on D such that M is now a model of D. You can now sample from M, and those samples will (roughly) have whatever range of capabilities were to be found in D.
Now, suppose you have some classifier, C, which is able to usefully distinguish samples from M on the basis of that sample’s specific level of capabilities. Note that C doesn’t have to just be an ML model. It could be any process at all, including “ask a human”, “interpret the sample as a computer program trying to solve some problem, run the program, and score the output”, etc.
Having C allows you to sample from a version of M’s output distribution that has been “updated” on C, by continuously sampling from M until a sample scores well on C. This lets you create a new dataset D’, which you can then train M’ on to produce a model of the updated distribution.
So long as C is able to provide classification scores which actually reflect a higher level of capabilities among the samples from M / M’ / M″ / etc, you can repeat this process to continually crank up the capabilities. If your classifier C was some finetune of M, then you can even create a new C’ off of M’, and potentially improve the classifier along with your generator. In most domains though, classifier scores will eventually begin to diverge from the qualities that actually make an output good / high capability, and you’ll eventually stop benefiting from this process.
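The loop above can be sketched with a toy model, where a "model" is just a distribution of capability scores we can sample from, and C is a keep-the-top-half filter. Everything here is an invented illustration, not a real training setup:

```python
import random

random.seed(0)

def train(dataset):
    # "Training" M on D == remembering D's capability distribution.
    return list(dataset)

def sample(model, n):
    # Samples from M roughly span the capabilities found in its data.
    return [random.choice(model) + random.gauss(0, 0.1) for _ in range(n)]

def improvement_step(model):
    # Sample from M, keep only what the classifier C scores well
    # (here, C == "above the median"), and train M' on the result D'.
    candidates = sample(model, 200)
    threshold = sorted(candidates)[len(candidates) // 2]
    d_prime = [x for x in candidates if x >= threshold]
    return train(d_prime)

initial_data = [random.gauss(0, 1) for _ in range(200)]
m = train(initial_data)
for _ in range(5):
    m = improvement_step(m)

initial_capability = sum(initial_data) / len(initial_data)
final_capability = sum(m) / len(m)
```

Mean capability ratchets upward each iteration. The toy C never diverges from true quality, so the loop keeps paying off; in most real domains, as noted above, classifier scores eventually diverge from the qualities that actually make an output good.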
This process goes further in domains where it’s easier to distinguish generations by their quality. Chess / other board games are extreme outliers in this regard, since you can always tell which of two players actually won the game. Thus, the game rules act as a (pairwise) infallible classifier of relative capabilities. There’s some slight complexity around that last point, since a given trajectory could falsely appear good by beating an even worse / non-representative policy, but modern self-play approaches address such issues by testing model versions against a variety of opponents (mostly past versions of themselves) to ensure continual real progress. Pure math proofs is another similarly skewed domain, where building a robust verifier (i.e., a classifier) of proofs is easy. That’s why Steven was able to use it as a valid example of where self-play gets you very far.
Most important real world domains do not work like this. E.g., if there were a robust, easy-to-query process that could classify which of two scientific theories / engineering designs / military strategies / etc was actually better, the world would look extremely different.
There are other issues I have with this post, but my reply is already longer than the entire original post, so I’ll stop here, rather than, say, adding an entire additional section on my models of takeoff speed for AIs versus evolution (which I’ll admit probably should have another post to go with it).
Thank you for the very detailed and concrete response. I need to step through this slowly to process it properly and see the extent to which I did misunderstand things, or places where we disagree.
I realize this accidentally sounds like it's saying two things at once (that autonomous learning relies on the generator-discriminator gap of the domain, and then that it relies on the gap for the specific agent (or system in general)). I think it's the agent's capabilities that matter, that the domain determines how likely the agent is to have a persistent gap between generation and discrimination, and I don't think the (basic) dynamics are too difficult.
You start with a model M and initial data distribution D. You train M on D such that M is now a model of D. You can now sample from M, and those samples will (roughly) have whatever range of capabilities were to be found in D.
Now, suppose you have some classifier, C, which is able to usefully distinguish samples from M on the basis of that sample’s specific level of capabilities. Note that C doesn’t have to just be an ML model. It could be any process at all, including “ask a human”, “interpret the sample as a computer program trying to solve some problem, run the program, and score the output”, etc.
Having C allows you to sample from a version of M’s output distribution that has been “updated” on C, by continuously sampling from M until a sample scores well on C. This lets you create a new dataset D’, which you can then train M’ on to produce a model of the updated distribution.
So long as C is able to provide classification scores which actually reflect a higher level of capabilities among the samples from M / M’ / M″ / etc, you can repeat this process to continually crank up the capabilities. If your classifier C was some finetune of M, then you can even create a new C’ off of M’, and potentially improve the classifier along with your generator. In most domains though, classifier scores will eventually begin to diverge from the qualities that actually make an output good / high capability, and you’ll eventually stop benefiting from this process.
This process goes further in domains where it’s easier to distinguish generations by their quality. Chess / other board games are extreme outliers in this regard, since you can always tell which of two players actually won the game. Thus, the game rules act as a (pairwise) infallible classifier of relative capabilities. There’s some slight complexity around that last point, since a given trajectory could falsely appear good by beating an even worse / non-representative policy, but modern self-play approaches address such issues by testing model versions against a variety of opponents (mostly past versions of themselves) to ensure continual real progress. Pure math proofs is another similarly skewed domain, where building a robust verifier (i.e., a classifier) of proofs is easy. That’s why Steven was able to use it as a valid example of where self-play gets you very far.
Most important real world domains do not work like this. E.g., if there were a robust, easy-to-query process that could classify which of two scientific theories / engineering designs / military strategies / etc was actually better, the world would look extremely different.
Thank you, this is helpful for me thinking further about this. The first paragraph seems almost right, except that instead of the single agent what you care about is the best trainable or available agent, since the two agents (M and C) need not be the same? What you get from this is an M that maximizes C, right? And the issue, as you note, is that in most domains a predictor of your best available C is going to plateau, so it comes down to whether having M gives you the ability to create C' that can let you move 'up the chain' of capability here, while preserving any necessary properties at each transition, including alignment. But M will inherit any statistical or other flaws of, or ways to exploit, C, in ways we don't have any reason to presume we have a way to 'rescue ourselves from' in later iterations, and instead would expect to amplify over time?
(And thus, you need a security-mindset-level-robust-to-M C at each step for this to be a safe strategy to iterate on a la Christiano or Leike, and you mostly only should expect to get that in rare domains like chess, rather than expecting C to win the capabilities race in general? Or something like that? Again, comment-level rough here.)
On Quintin’s secondly’s concrete example 1 from above:
I think the core disagreement here is that Quintin thinks that you need very close parallels in order for the evolutionary example to be meaningful, and I don’t think that at all. And neither of us can fully comprehend why the other person is going with as extreme a position as we are on that question?
Thus, he says, yes of course you do not need all those extra things to get misalignment, I wasn’t claiming that, all I was saying was this would break the parallel. And I’m saying both (1) that misalignment could happen these other ways which he agrees with in at least some cases (but perhaps not all cases) and (2) also I do not think that these extra clauses are necessary for the parallel to matter.
And also (3) yes, I’ll totally cop to, because I don’t see why the parallel is in danger with these changes, I didn’t fully have in my head the distinction Quintin is trying to draw here, when I was writing that.
But I will say that, now that I do have it in my head, that I am at least confused why those extra distinctions are necessary for the parallel to hold, here? Our models of what is required here are so different that I’m pretty confused about it, and I don’t have a good model of why e.g. it matters that there are 9 OOMs of difference, or that the creation of the inner optimizer is deliberate (especially given that nothing evolution did was in a similar sense deliberate, as I understand these things at least—my model is that evolution doesn’t do deliberate things at all). And in some cases, to the extent that we want a tighter parallel, Quintin’s requirements seem to move away from that? Anyway, I notice I am confused.
Concrete example 4: Am I wrong here that you’re arguing that this path still exhibits key differences from cultural development and thus evolution does not apply? And then you also argue that this path does not cause additional severe alignment difficulties beyond those above. So I’m not sure where the misreading is here. After that, I discuss a disagreement with a particular claim.
(Writing at comment-speed, rather than carefully-considered speed, apologies for errors and potential repetitions, etc)
On the Evo-Clown thing and related questions in the Firstly section only.
I think we understand each other on the purpose of the Evo-Clown analogy, and I think it is clear what our disagreement is here in the broader question?
I put in the paragraph Quintin quoted in order to illustrate that, even in an intentionally-absurd example intended to illustrate that A and B share no causal factors, A and B still share clear causal factors, and the fact that A happened this way should give you substantial pause about the prospects for B, versus A never having happened at all and the things that caused A not having happened. I am curious (since Quintin does not comment) whether he agrees about the example, now that I bring up the reasons to be concerned.
The real question is the case of evolution versus AI development.
I got challenged by Quintin and by others as interpreting Quintin too broadly when I said:
That seems like quite a leap. If there is one particular development in humanity’s history that we can fully explain, we should then not cite evolution in any way, as an argument for anything?
In response to Quintin saying:
- THEN, there’s no reason to reference evolution at all when forecasting AI development rates, not as evidence for a sharp left turn, not as an “illustrative example” of some mechanism / intuition which might supposedly lead to a sharp left turn in AI development, not for anything.
I am happy to accept the clarification that I interpreted Quintin’s statement more strongly than he intended it.
I am still confused about how else I could have interpreted the original statement. But that does not matter; what matters is the disagreements we still clearly do have here.
I now understand Quintin’s model as saying (based on the comment plus his OP) that evolution so obviously does an overdetermined sharp left turn that it isn’t evidence of anything (e.g. that the world I proposed as an alternative breaks so many of his models that it isn’t worth considering)?
I agree that if evolution’s path is sufficiently overdetermined, then there’s no reason to cite that path as evidence. In which case we should instead be discussing the mechanisms that are overdetermining that result, and what they imply.
I think the reason we talk about evolution here is exactly because for most people, the underlying mechanisms very much aren’t obvious and overdetermined before looking at the results—if you skipped over the example people would think you were making a giant leap.
Concrete example 2: One general hypothesis you could have about RL agents is “RL agents just do what they’re trained to do, without any weirdness”. (To be clear, I’m not endorsing this hypothesis. I think it’s much closer to being true than most on LW, but still false.) In the context of AI development, this has pretty benign implications. In the context of evolution, due to the bi-level nature of its optimization process and the different data that different generations are “trained” on, this causal factor in the evolution graph predicts significant divergence between the behaviors of ancestral and modern humans.
Zvi says this is an uncommon standard of epistemics, for there to be no useful inferences from one set of observations (evolutionary outcomes) to another (AI outcomes). I completely disagree. For the vast majority of possible pairs of observations, there are not useful inferences to draw. The pattern of dust specks on my pillow is not a useful reference point for making inferences about the state of the North Korean nuclear weapons program. The relationship between AI development and human evolution is not exceptional in this regard.
Ok, sure. I agree that for any given pair of facts there is essentially nothing to infer from one about the other, given what else we already know, and that the two facts Quintin cites as an example are a valid example. But it seems wrong to say that AI developments and evolutionary developments relate to each other in a similar way or reference class to a speck on your pillow to the nuclear weapons program? Or that the distinctions proposed should generally be of a sufficient degree to imply there are no implications from one to the other?
What I was saying that Quintin is challenging in the second paragraph above, specifically, was not that for observations A and B it would be unusual for A to not have important implications for B. What I was saying was that there being distinctions in the causal graphs behind A and B is not a good reason to dismiss A having implications for B—certainly differences reduce it somewhat, but most of the time that A impacts B, there are important causal graph differences that could draw a similar parallel. And, again, this would strike down most reference class arguments.
Quintin does say there are non-zero implications in the comment, so I suppose the distinction does not much matter in the end. Nor does it much matter whether we are citing evolution, or citing our underlying models that also explain evolution’s outcomes, if we can agree on those models?
As in, we would be better served looking at:
One general hypothesis you could have about RL agents is “RL agents just do what they’re trained to do, without any weirdness.” In the context of AI development, this has pretty benign implications.
I think I kind of… do believe this? For my own perhaps quite weird definitions of ‘weirdness’ and ‘what you train it for’? And for those values, no, this is not benign at all, because I don’t consider SLT behaviors to be weird when you have the capabilities for them. That’s simply what you would expect, including from a human in the same spot; why are we acting so surprised?
If you define ‘weirdness’ sufficiently differently then it would perhaps be benign, but I have no idea why you would expect this.
And also, shouldn’t we use our knowledge of humans here, when faced with similar situations? Humans, a product of evolution, do all sorts of local SLTs in situations far removed from their training data, the moment you give them the affordance to do so and the knowledge that they can.
It is also possible we are using different understandings of SLT, and Quintin is thinking about it more narrowly than I am, as his later statements imply. In that case, I would say that I think the thing I care about, in terms of whether it happens, is the thing (or combination of things) I’m talking about.
Thus, in my view, humans did not do only the one big anti-evolution (?) SLT. Humans are constantly doing SLTs in various contexts, and this is a lot of what I am thinking about in this context.
What prevents there being useful updates from evolution to AI development is the different structure of the causal graphs.
Aha?!?
Quintin, I think (?) is saying that the fact that evolution provided us with a central sharp left turn is not evidence, because that is perfectly compatible with and predicted by AI models that aren’t scary.
So I notice I disagree with this twice.
First, I don’t think that the second because clause entirely holds, for reasons that I largely (but I am guessing not entirely) laid out in my OP, and that I am confident Quintin disagrees with and would take a lot to untangle. Although I do agree there is some degree of overdeterminedness here: if we hadn’t done the exact SLT we did but had still ramped up our intelligence, we would have instead done a slightly-to-somewhat different-looking SLT later.
Second, I think this points out a key thing I didn’t say explicitly and should have, which is the distinction between the evidence that humans did all their various SLTs (yes, plural, both collectively and individually), and the evidence that humans did these particular SLTs in these particular ways because of these particular mechanisms. Which I do see as highly relevant.
I can imagine a world where humans did an SLT later in a different way, and are less likely to do them on an individual level (note: I agree that this may be somewhat non-standard usage of SLT, but hopefully it’s mostly clear from context what I’m referring to here?), and everything happened slower and more continuously (on the margin presumably we can imagine this without our models breaking, if only via different luck). And where we look at the details and say, actually it’s pretty hard to get this kind of thing to happen, and moving humans out of their training distributions causes them to hold up really well, in the way we’d metaphorically like AIs to, even when they are smart enough and have enough info and reflection time to know better, and so on.
(EDIT: It’s late, and I’ve now responded in stages to the whole thing, which as Quintin noted was longer than my OP. I’m thankful for the engagement, and will read any further replies, but will do my best to keep any further interactions focused and short so this doesn’t turn into an infinite time sink that it clearly could become, even though it very much isn’t a demon thread or anything.)
On concrete example 2: I see four bolded claims in ‘fast takeoff is still possible.’ Collectively, to me, in my lexicon and way of thinking about such things, they add up to something very close to ‘alignment is easy.’
The first subsection says human misalignment does not provide evidence for AI misalignment, which isn’t one of the two mechanisms (as I understand this?), and is instead arguing against an alignment difficulty.
The bulk of the second subsection, starting with ‘Let’s consider eight specific alignment techniques,’ looks to me like an explicit argument that alignment looks easy, based on your reading of the history of AI capabilities and alignment developments so far?
The third subsection seems to also spend most of its space on arguing its scenario would involve manageable risks (e.g. alignment being easy), although you also argue that evolution/culture still isn’t ‘close enough’ to teach us anything here?
I can totally see how these sections could have been written out with the core intention to explain how distinct-from-evolution mechanisms could cause fast takeoffs. From my perspective as a reader, I think my response and general takeaway that this is mostly an argument for easy alignment is reasonable on reflection, even if that’s not the core purpose it serves in the underlying structure, and it’s perhaps not a fully general argument.
On concrete example 3: I agree that what I said was a generalization of what you said, and you instead said something more specific. And that your later caveats make it clear you are not so confident that things will go smoothly in the future. So yes I read this wrong and I’m sorry about that.
But also I notice I am confused here—if you didn’t mean for the reader to make this generalization, if you don’t think that the failure of current capabilities advances to break current alignment techniques is strong evidence for future capabilities advances not breaking then-optimal alignment techniques, then why are we analyzing all these expected interactions here? Why state the claim that such techniques ‘already generalize’ (which they currently mostly do, as far as I know, which is not terribly far) if it isn’t a claim that they will likely generalize in the future?
When I explain my counterargument to pattern 1 to people in person, they will very often try to “rescue” evolution as a worthwhile analogy for thinking about AI development. E.g., they’ll change the analogy so it’s the programmers who are in a role comparable to evolution, rather than SGD.
In general one should not try to rescue intuitions, and the frequency of doing this is a sign of serious cognitive distortions. You should only try to rescue intuitions when they have a clear and validated predictive or pragmatic track record.
The reason for this is very simple—most intuitions or predictions one could make are wrong, and you need a lot of positive evidence to privilege any particular hypotheses re how or what to think. In the absence of evidence, you should stop relying on an intuition, or at least hold it very lightly.
On the first section, we disagree on the degree of similarity in the metaphors.
I agree with you that we shouldn’t care about ‘degree of similarity’ and instead build causal models. I think our actual disagreements here lie mostly in those causal models, the unpacking of which goes beyond comment scope. I agree with the very non-groundbreaking insights listed, of course, but that’s not what I’m getting out of it. It is possible that some of this is that a lot of what I’m thinking of as evolutionary evidence, you’re thinking of as coming from another source, or is already in your model in another form to the extent you buy the argument (which often I am guessing you don’t).
On the difference in SLT meanings, what I meant to say was: I think this is sufficient to cause our alignment properties to break.
In case it is not clear: My expectation is that sufficiently large capabilities/intelligence/affordances advances inherently break our desired alignment properties under all known techniques.
On the passage you find baffling: Ah, I do think we had confusion about what we meant by inner optimizer, and I’m likely still conflating the two somewhat. That doesn’t make me find this heartening, though? As in, we’re going to see rapid big changes in both the inner optimizer’s power (in all senses) and also in the nature and amount of training data, where we agree that changing the training data details changes alignment outcomes dramatically.
On the impossible-to-you world: This doesn’t seem so weird or impossible to me? And I think I can tell a pretty easy cultural story slash write an alternative universe novel where we honor those who maximize genetic fitness and all that, and have for a long time—and that this could help explain why civilization and our intelligence developed so damn slowly and all that. Although to truly make the full evidential point that world then has to be weirder still where humans are much more reluctant to mode shift in various ways. It’s also possible this points to you having already accepted from other places the evidence I think evolution introduces, so you’re confused why people keep citing it as evidence.
The comment in response to parallels provides some interesting thoughts and I agree with most of it. The two concrete examples are definitely important things to know. I still notice the thing I noticed in my comment about the parallels—I’d encourage thinking about what similar logic would say in the other cases?
In case it is not clear: My expectation is that sufficiently large capabilities/intelligence/affordances advances inherently break our desired alignment properties under all known techniques.
Nearly every piece of empirical evidence I’ve seen contradicts this—more capable systems are generally easier to work with in almost every way, and the techniques that worked on less capable versions straightforwardly apply and in fact usually work better than on less intelligent systems.
Presumably you agree this would become false if the system was deceptively aligned or otherwise scheming against us? Perhaps the implicit claim is that we should generalize from current evidence toward thinking deceptive alignment is very unlikely?
I also think it’s straightforward to construct cases where goodharting implies that applying the technique you used for a less capable model onto a more capable model would result in worse performance for the more capable model. I think it should be straightforward to construct such a case using scaling laws for reward model overoptimization.
(That said, I think if you vary the point of early stopping as models get more capable then you likely get strict performance improvements on most tasks. But, regardless there is a pretty reasonable technique of “train for duration X” which clearly gets worse performance in realistic cases as you go toward more capable systems.)
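One way to make the goodharting construction concrete: the reward-model overoptimization scaling-law work reports a best-of-n gold-reward curve of roughly the form R(d) = d(alpha - beta * d), with d the square root of the KL divergence from the initial policy. Here is a minimal sketch with purely illustrative alpha and beta (none of these numbers are fitted to anything):

```python
def proxy_reward(d, alpha):
    # The reward model's (proxy) score keeps rising with more optimization.
    return alpha * d

def gold_reward(d, alpha, beta):
    # True reward rises, peaks at d = alpha / (2 * beta), then falls.
    return d * (alpha - beta * d)

alpha, beta = 1.0, 0.1  # illustrative values only
for d in [1, 3, 5, 8, 12]:  # d = sqrt(KL), a stand-in for optimization budget
    # Proxy reward keeps climbing while gold reward peaks (at d = 5 here)
    # and then declines.
    print(d, round(proxy_reward(d, alpha), 2), round(gold_reward(d, alpha, beta), 2))
```

Under this form, a fixed optimization budget (“train for duration X”) that was tuned for one regime lands past the gold-reward peak in a regime with larger beta, which is the sense in which the same unchanged technique can yield worse outcomes on a more capable system.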
On the impossible-to-you world: This doesn’t seem so weird or impossible to me? And I think I can tell a pretty easy cultural story slash write an alternative universe novel where we honor those who maximize genetic fitness and all that, and have for a long time—and that this could help explain why civilization and our intelligence developed so damn slowly and all that. Although to truly make the full evidential point that world then has to be weirder still where humans are much more reluctant to mode shift in various ways. It’s also possible this points to you having already accepted from other places the evidence I think evolution introduces, so you’re confused why people keep citing it as evidence.
The ability to write fiction in a world does not demonstrate its plausibility. Beware generalizing from fictional fictional evidence!
The claim that such a world is impossible is a claim that, were you to try to write a fictional version of it, you would run into major holes in the world that you would have to either ignore or paper over with further unrealistic assumptions.
I think this post greatly misunderstands mine.
Firstly, I’d like to address the question of epistemics.
When I said “there’s no reason to reference evolution at all when forecasting AI development rates”, I was referring to two patterns of argument that I think are incorrect: (1) using the human sharp left turn as evidence for an AI sharp left turn, and (2) attempting to “rescue” human evolution as an informative analogy for other aspects of AI development.
(Note: I think Zvi did follow my argument for not drawing inferences about the odds of the sharp left turn specifically. I’m still starting by clarifying pattern 1 in order to set things up to better explain pattern 2.)
Pattern 1: using the human sharp left turn as evidence for an AI sharp left turn.
The original sharp left turn post claims that there are general factors about the structure and dynamics of optimization processes which both caused the evolutionary sharp left turn, and will go on to cause another sharp left turn in AI systems. The entire point of Nate referencing evolution is to provide evidence for these factors.
My counterclaim is that the causal processes responsible for the evolutionary sharp left turn are almost entirely distinct from anything present in AI development, and so the evolutionary outcome is basically irrelevant for thinking about AI.
From my perspective, this is just how normal Bayesian reasoning works. If Nate says:
P(human SLT | general factors that cause SLTs) ~= 1
P(human SLT | NOT general factors that cause SLTs) ~= 0
then observing the human SLT is very strong evidence for there being general factors that cause SLTs in different contexts than evolution.
OTOH, I am saying:
P(human SLT | NOT general factors that cause SLTs) ~= 1
And so observing the human SLT is no evidence for such general factors.
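The two positions above can be sketched as a standard likelihood-ratio calculation (the numbers are illustrative stand-ins for the ~1 and ~0 in the text):

```python
def bayes_factor(p_obs_given_h, p_obs_given_not_h):
    # Strength of the observation as evidence for hypothesis H.
    return p_obs_given_h / p_obs_given_not_h

# Nate's implied likelihoods: observing the human SLT is near-decisive
# evidence for general SLT-causing factors.
nate = bayes_factor(0.99, 0.01)

# Quintin's likelihoods: the human SLT happens either way, so observing
# it moves nothing.
quintin = bayes_factor(0.99, 0.99)

print(nate, quintin)  # nate is ~99 (strong update), quintin is exactly 1 (no update)
```

The disagreement is thus entirely about P(human SLT | NOT general factors), not about the observation itself.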
Pattern 2: attempting to “rescue” human evolution as an informative analogy for other aspects of AI development.
When I explain my counterargument to pattern 1 to people in person, they will very often try to “rescue” evolution as a worthwhile analogy for thinking about AI development. E.g., they’ll change the analogy so it’s the programmers who are in a role comparable to evolution, rather than SGD.
I claim that such attempted inferences also fail, for the same reason as argument pattern 1 above fails: the relevant portions of the causal graph driving evolutionary outcomes are extremely different from the causal graph driving AI outcomes, such that it’s not useful to use evolution as evidence to make inferences about nodes in the AI outcomes causal graph. E.g., the causal factors that drive programmers to choose a given optimizer are very different from the factors that cause evolution to “choose” a given optimizer. Similarly, evolution is not a human organization that makes decisions based on causal factors that influence human organizations, so you should not look to evolution for evidence about organization-level failures that might promote a sharp left turn in AI.
Making this point was the purpose of the “alien space clowns” / EVO-Inc example. It was intended to provide a concrete example of two superficially similar seeming situations, where actually their causal structures are completely distinct, such that there are no useful updates to make from EVO-Inc’s outcomes to other automakers. When Zvi says:
Zvi is proposing that there are common causal factors that led to the alien clowns producing dangerous cars, and could also play a similar role in causing other automakers to make unsafe vehicles, such that Evo-Inc’s outcomes provide useful updates for predicting other automakers’ outcomes. This is what I’m saying is false about evolution versus AI development.
At this point, I should preempt a potential confusion: it’s not the case that AI development and human evolution share zero causal factors! To give a trivial example, both rely on the same physical laws. What prevents there being useful updates from evolution to AI development is the different structure of the causal graphs. When you update your estimates for the shared factors between the graphs using evidence from evolution, this leads to trivial or obvious implications for AI development, because the shared causal factors play different roles in the two graphs. You can have an entirely “benign” causal graph for AI development, which predicts zero alignment issues for AI development, yet when you build the differently structured causal graph for human evolution, it still predicts the same sharp left turn, despite some of the causal factors being shared between the graphs.
This is why inferences from evolutionary outcomes to AI development don’t work. Propagating belief updates through the evolution graph doesn’t change any of the common variables away from settings which are benign in the AI development graph, since those settings already predict a sharp left turn when they’re used in the evolution graph.
Concrete example 1: We know from AI development that having a more powerful optimizer, running for more steps, leads to more progress. Applying this causal factor to the AI development graph basically predicts “scaling laws will continue”, which is just a continuation of the current trajectory. Applying the same factor to the evolution graph, combined with the evolution-specific fact of cultural transmission enabling a (relatively) sudden unleashing of ~9 OOM more effectively leveraged optimization power in a very short period of time, predicts an extremely sharp increase in the rate of progress.
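A toy numerical version of this example (all numbers are hypothetical; the point is only that the same shared factor behaves benignly in one graph and explosively in the other):

```python
def ai_graph_progress(opt_power):
    # AI development graph: only the shared factor. More optimization
    # power means proportionally more progress, i.e. scaling laws continue.
    return opt_power

def evolution_graph_progress(opt_power, cultural_transmission):
    # Evolution graph: the same shared factor, plus an evolution-specific
    # node where cultural transmission suddenly unleashes ~9 OOM more
    # effectively leveraged optimization power.
    leverage = 1e9 if cultural_transmission else 1.0
    return opt_power * leverage

# Updating the shared factor upward changes the AI graph smoothly...
print(ai_graph_progress(2.0) / ai_graph_progress(1.0))  # a 2x increase

# ...while the sharp jump in the evolution graph comes entirely from the
# evolution-only node, not from the shared factor.
before = evolution_graph_progress(1.0, cultural_transmission=False)
after = evolution_graph_progress(1.0, cultural_transmission=True)
print(after / before)  # 9 OOM jump
```

Propagating evidence from the evolution graph back to the shared node leaves the AI graph’s predictions essentially unchanged, which is the claim in the surrounding text.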
Concrete example 2: One general hypothesis you could have about RL agents is “RL agents just do what they’re trained to do, without any weirdness”. (To be clear, I’m not endorsing this hypothesis. I think it’s much closer to being true than most on LW, but still false.) In the context of AI development, this has pretty benign implications. In the context of evolution, due to the bi-level nature of its optimization process and the different data that different generations are “trained” on, this causal factor in the evolution graph predicts significant divergence between the behaviors of ancestral and modern humans.
Zvi says this is an uncommon standard of epistemics, for there to be no useful inferences from one set of observations (evolutionary outcomes) to another (AI outcomes). I completely disagree. For the vast majority of possible pairs of observations, there are not useful inferences to draw. The pattern of dust specks on my pillow is not a useful reference point for making inferences about the state of the North Korean nuclear weapons program. The relationship between AI development and human evolution is not exceptional in this regard.
Secondly, I’d like to address a common pattern in a lot of Zvi’s criticisms.
My post has a unifying argumentative structure that Zvi seems to almost completely miss. This leads to a very annoying dynamic where:
My post makes a claim / argument that serves a very specific role in the context of the larger structure.
Zvi misses that context, and interprets the claim / argument as making some broader claim about alignment in general.
Zvi complains that I’m over-claiming, being too general, or should split the post along the separate claims Zvi (falsely) believes I’m making.
The unifying argumentative structure of my post is as follows:
Evolution’s sharp left turn happened for evolution-specific reasons
Describes the causal structure of evolution’s sharp left turn.
Don’t misgeneralize from evolution to AI
Argues you shouldn’t generalize between things with very different causal structures.
Fast takeoff is still possible
Clarifies that I am not arguing against fast takeoff, and that fast takeoff can happen without a sharp left turn.
Proposes two AI-specific causal mechanisms that could cause a fast takeoff.
Discusses why it’s alignment relevant if fast takeoff happens because of either of the two mechanisms previously mentioned.
Will alignment generalize across sudden capabilities jumps?
Human “misalignment” with inclusive genetic fitness provides no evidence for AI misalignment
Somewhat awkwardly shoehorns in the argument that evolution also provides no evidence for inner alignment failures in general.
Capabilities jumps due to AI driving AI capabilities research
Argues that this specific fast takeoff mechanism will not itself cause a sharp left turn / alignment failure
Capabilities jumps due to AI iteratively refining its training data
Argues that this specific fast takeoff mechanism will not itself cause a sharp left turn / alignment failure
Conclusion
Having outlined my argumentative structure, I’ll highlight some examples where Zvi’s criticisms fall into the previously mentioned dynamic.
1:
(I added the bolding)
The issue here is that the list in question is specifically for sharp left turns that arise “due to the same mechanistic reasons as the sharp left turn of human evolution”, as I very specifically said in my original post. I’m not talking about inner alignment in general. I’m not even talking about sharp left turn threat scenarios in general! I’m talking very specifically about how the current AI paradigm would have to change before it had a mechanistic structure sufficiently similar to human evolution that I think a sharp left turn would occur “due to the same mechanistic reasons as the sharp left turn of human evolution”.
2:
That is emphatically not what those sections are arguing for. The purpose of these sections is to describe two non-sharp left turn causing mechanisms for fast takeoff, in order to better illustrate that fast takeoff != sharp left turn. Each section specifically focuses on a particular mechanism of fast takeoff, and argues that said mechanism will not, in and of itself, lead to misalignment. You can still believe a fast takeoff driven by that mechanism will lead to misalignment for other reasons (e.g., a causal graph that looks like: “(fast takeoff mechanism) → (capabilities) → (something else) → (misalignment)”), if, say, you think there’s another causal mechanism driving misalignment, such that the fast takeoff mechanism’s only contribution to misalignment was to advance capabilities in a manner that failed to address that other mechanism.
These sections are not arguing about the ease of alignment in general, but about the consequence of one specific process.
3:
That is, even more emphatically, not what that specific section is arguing for. This section focuses specifically on the “AIs do AI capabilities research” mechanism of fast takeoff, and argues that it will not itself cause misalignment. Its purpose is specific to the context in which I use it: to address the causal influence of (AIs do capabilities research) directly to (misalignment), not to argue about the odds of misalignment in general.
Further, the argument that section made wasn’t:
It was:
And so, if you think AIs doing capabilities research will be like humans doing capabilities research, but faster, then there will be a bunch of capabilities and alignment techniques, and the question is how much the capabilities techniques will interfere with the alignment techniques. Based on current data, the interference seems small and manageable. This is the trend being projected forwards, the lack of empirical interference between current capabilities and alignment (despite, as I note in my post, current capabilities techniques putting ~zero effort into not interfering with alignment techniques, an obviously dumb oversight which we haven’t corrected because it turns out we don’t even need to do so).
Once again, I emphasize that this is not a general argument about alignment, which can be detached from the rest of the post. It’s extremely specific to the mechanism for fast takeoff being analyzed, which is only being analyzed to further explore the connection between fast takeoff mechanisms and the odds of a sharp left turn.
4:
This part is a separate analysis of a different fast takeoff causal mechanism, arguing that it will not, itself cause misalignment either. Its purpose and structure mirrors that of the argument I clarified above, but focused on a different mechanism. It’s not a continuation of a previous (non-existent) “alignment is easy in general” argument.
Thirdly, I’d like to make some random additional commentary.
The degree of similarity is arguable. I think, and said in the original article, that similarity is low for the first mechanism and moderate for the second.
However, the appropriate way to estimate the odds of a given fast takeoff mechanism leading to AI misalignment is not to estimate the similarity between that mechanism and what happened during human evolution, then assign misalignment risk to the mechanism in proportion to the estimated similarity. Rather, the correct approach is to build detailed causal models of how both human evolution and AI development work, propagate the evidence from human evolutionary outcomes back through your human evolution causal model to update relevant latent variables in that causal model, transfer those updates to any of the AI development causal model’s latent variables which are also in the human evolution causal model, and finally estimate the new misalignment risk implied by the updated variables of the AI development model.
I discussed this in more detail in the first part of my comment, but whenever I do this, I find that the transfer from (observations of evolutionary outcomes) to (predictions about AI development) are pretty trivial or obvious, leading to such groundbreaking insights as:
More optimization power leads to faster progress
Human level general intelligence is possible
Neural architecture search is a bad thing to spend most of your compute on
Retraining a fresh instance of your architecture from scratch on different data will lead to different behavior
A sharp left turn is more than just a fast takeoff. It’s the combined sudden increase in AI generality and breaking of previously existing alignment properties.
I find this perspective baffling. Where else do the alignment properties of a system derive from? If you have a causal structure like
(programmers) → (training data, training conditions, learning dynamics, etc) → (alignment properties)
then setting the value of the middle node will of course screen off the causal influence of the (programmers) node.
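A minimal sketch of this screening-off claim (the spec contents are hypothetical stand-ins; only the graph structure matters):

```python
def alignment_properties(training_spec):
    # The (alignment properties) node depends only on the middle node:
    # training data, training conditions, learning dynamics, etc.
    return tuple(sorted(training_spec.items()))

# Two different "programmer" nodes that happen to set the middle node
# identically...
spec_from_team_a = {"data": "web-text-v1", "objective": "next-token"}
spec_from_team_b = {"objective": "next-token", "data": "web-text-v1"}

# ...produce identical alignment properties: conditioning on the middle
# node screens off the programmers.
print(alignment_properties(spec_from_team_a) == alignment_properties(spec_from_team_b))  # True
```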
A possible clarification: in the context of my post when discussing evolution, “inner optimizer” means the brain’s “base” optimization process, not the human values / intelligence that arises from that process. The mechanistically most similar thing in AI development to that meaning of the word “inner optimizer” is the “base” training process: the combination of training data, base optimizer, training process, architecture, etc. It doesn’t mean the cognitive system that arises as a consequence of running that training process.
I think Zvi is describing a ~impossible world. I think this world would basically break ~all my models on how optimizing processes gain capabilities. My new odds of an AI sharp left turn would depend on the new models I made in this world, which in turn would depend on unspecified details of how human civilization’s / AI progress happens in this world.
Once again, it’s not the degree of similarity that determines what inferences are appropriate. It’s the relative structure of the two causal graphs for the processes in question. The graphs for the human brain and current AI systems are obviously not the same, but they share latent variables that serve similar roles in determining outcomes, in a way that the bi-level structure of evolution’s causal graph largely prevents. E.g., Steven Byrnes has a whole sequence which discusses the brain’s learning process, and while there are lots of differences between the brain and current AI designs, there are also shared building blocks whose behaviors are driven by common causal factors. The key difference with evolution is that, once one updates the shared variables from looking at human brain outcomes and applies those updates to the AI development graph, there are non-trivial / obvious implications. Thus, one can draw relevant inferences by observing human outcomes.
Concrete example 1: brains use a local, non-gradient-based optimization process to minimize predictive error, so there exist non-SGD update rules that are competitive with SGD (on brainlike architectures, at least).
Concrete example 2: brains don’t require GPT-4 level volumes of training data, so there exist architectures with vastly more data-friendly scaling laws than GPT-4’s scaling.
What I specifically said was:
I realize this accidentally sounds like it’s saying two things at once (that autonomous learning relies on the generator-discriminator gap of the domain, and then that it relies on the gap for the specific agent (or system in general)). To clarify: I think it’s the agent’s capabilities that matter, that the domain determines how likely the agent is to have a persistent gap between generation and discrimination, and I don’t think the (basic) dynamics are too difficult to write down.
You start with a model M and initial data distribution D. You train M on D such that M is now a model of D. You can now sample from M, and those samples will (roughly) have whatever range of capabilities were to be found in D.
Now, suppose you have some classifier, C, which is able to usefully distinguish samples from M on the basis of that sample’s specific level of capabilities. Note that C doesn’t have to just be an ML model. It could be any process at all, including “ask a human”, “interpret the sample as a computer program trying to solve some problem, run the program, and score the output”, etc.
Having C allows you to sample from a version of M’s output distribution that has been “updated” on C, by continuously sampling from M until a sample scores well on C. This lets you create a new dataset D’, which you can then train M’ on to produce a model of the updated distribution.
So long as C is able to provide classification scores which actually reflect a higher level of capabilities among the samples from M / M’ / M″ / etc, you can repeat this process to continually crank up the capabilities. If your classifier C was some finetune of M, then you can even create a new C’ off of M’, and potentially improve the classifier along with your generator. In most domains though, classifier scores will eventually begin to diverge from the qualities that actually make an output good / high capability, and you’ll eventually stop benefiting from this process.
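The loop described above can be sketched concretely. This is a deliberately toy illustration (all the specifics — integer “capability levels”, the classifier ceiling, the pool/keep sizes — are hypothetical choices of mine, not anything from the discussion): M is modeled as an empirical distribution over capability levels, C as a scorer that discriminates correctly only up to a ceiling, and each round builds D’ by keeping the samples C scores highest, then “retrains” M’ on D’.

```python
import random

def train(dataset):
    # "Training" in this toy just means memorizing the empirical
    # distribution: the model generates by resampling levels seen in D.
    return list(dataset)

def sample(model):
    return random.choice(model)

def classifier_score(x):
    # Toy classifier C: it ranks capability levels correctly, but only
    # up to a ceiling of 8, past which its scores stop tracking real
    # quality (the eventual divergence mentioned above).
    return min(x, 8)

def improve(model, n_rounds=5, pool=200, keep=50):
    for _ in range(n_rounds):
        # Sample a pool from M, keep what C scores highest -> D',
        # then train M' on D' and repeat.
        samples = [sample(model) for _ in range(pool)]
        samples.sort(key=classifier_score, reverse=True)
        model = train(samples[:keep])
    return model

random.seed(0)
# Initial dataset D: integer "capability levels", mostly low.
initial = ([random.randint(0, 4) for _ in range(180)]
           + [random.randint(5, 10) for _ in range(20)])
m_before = train(initial)
m_after = improve(m_before)
print(sum(m_before) / len(m_before), sum(m_after) / len(m_after))
```

Running this, the mean capability of samples climbs sharply over rounds, but it plateaus at the classifier’s ceiling: once every surviving sample scores the maximum under C, further rounds stop discriminating, which is the toy version of classifier scores diverging from what actually makes an output good.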
This process goes further in domains where it’s easier to distinguish generations by their quality. Chess / other board games are extreme outliers in this regard, since you can always tell which of two players actually won the game. Thus, the game rules act as a (pairwise) infallible classifier of relative capabilities. There’s some slight complexity around that last point, since a given trajectory could falsely appear good by beating an even worse / non-representative policy, but modern self-play approaches address such issues by testing model versions against a variety of opponents (mostly past versions of themselves) to ensure continual real progress. Pure math is another similarly skewed domain, where building a robust verifier (i.e., a classifier) of proofs is easy. That’s why Steven was able to use it as a valid example of where self-play gets you very far.
Most important real world domains do not work like this. E.g., if there were a robust, easy-to-query process that could classify which of two scientific theories / engineering designs / military strategies / etc was actually better, the world would look extremely different.
There are other issues I have with this post, but my reply is already longer than the entire original post, so I’ll stop here, rather than, say, adding an entire additional section on my models of takeoff speed for AIs versus evolution (which I’ll admit probably should have another post to go with it).
Thank you for the very detailed and concrete response. I need to step through this slowly to process it properly and see the extent to which I did misunderstand things, or places where we disagree.
Thank you, this is helpful for me in thinking further about this. The first paragraph seems almost right, except that instead of the single agent, what you care about is the best trainable or available agent, since the two agents (M and C) need not be the same? What you get from this is an M that maximizes C, right? And the issue, as you note, is that in most domains a predictor of your best available C is going to plateau, so it comes down to whether having M gives you the ability to create a C’ that can let you move ‘up the chain’ of capability here, while preserving any necessary properties at each transition, including alignment. But M will inherit any statistical or other flaws in C, or ways to exploit C, in ways we don’t have any reason to presume we can ‘rescue ourselves from’ in later iterations, and would instead expect to amplify over time?
(And thus, you need a security-mindset-level-robust-to-M C at each step for this to be a safe strategy to iterate on a la Christiano or Leike, and you mostly only should expect to get that in rare domains like chess, rather than expecting C to win the capabilities race in general? Or something like that? Again, comment-level rough here.)
On concrete example 1 from Quintin’s ‘Secondly’ section above:
I think the core disagreement here is that Quintin thinks that you need very close parallels in order for the evolutionary example to be meaningful, and I don’t think that at all. And neither of us can fully comprehend why the other person is going with as extreme a position as we are on that question?
Thus, he says, yes of course you do not need all those extra things to get misalignment, I wasn’t claiming that, all I was saying was this would break the parallel. And I’m saying both (1) that misalignment could happen these other ways which he agrees with in at least some cases (but perhaps not all cases) and (2) also I do not think that these extra clauses are necessary for the parallel to matter.
And also (3) yes, I’ll totally cop to this: because I don’t see why the parallel is in danger with these changes, I didn’t fully have in my head the distinction Quintin is trying to draw here when I was writing that.
But I will say that, now that I do have it in my head, I am at least confused about why those extra distinctions are necessary for the parallel to hold, here? Our models of what is required here are so different that I’m pretty confused about it, and I don’t have a good model of why e.g. it matters that there are 9 OOMs of difference, or that the creation of the inner optimizer is deliberate (especially given that nothing evolution did was in a similar sense deliberate, as I understand these things at least—my model is that evolution doesn’t do deliberate things at all). And in some cases, to the extent that we want a tighter parallel, Quintin’s requirements seem to move away from that? Anyway, I notice I am confused.
Concrete example 4: Am I wrong here that you’re arguing that this path still exhibits key differences from cultural development and thus evolution does not apply? And then you also argue that this path does not cause additional severe alignment difficulties beyond those above. So I’m not sure where the misreading is here. After that, I discuss a disagreement with a particular claim.
(Writing at comment-speed, rather than carefully-considered speed, apologies for errors and potential repetitions, etc)
On the Evo-Clown thing and related questions in the Firstly section only.
I think we understand each other on the purpose of the Evo-Clown analogy, and I think it is clear what our disagreement is here in the broader question?
I put in the paragraph Quintin quoted in order to illustrate that, even in an intentionally-absurd example constructed so that A and B share no causal factors, A and B still share clear causal factors, and the fact that A happened this way should give you substantial pause about the prospects for B, versus a world where A never happened and the things that caused A were absent. I am curious (since Quintin does not comment) whether he agrees about the example, now that I bring up the reasons to be concerned.
The real question is the case of evolution versus AI development.
I got challenged by Quintin and by others as interpreting Quintin too broadly when I said:
In response to Quintin saying:
I am happy to accept the clarification that I interpreted Quintin’s statement stronger than he intended it.
I still am confused how else I could have interpreted the original statement? But that does not matter, what matters is the disagreements we still clearly do have here.
I now understand Quintin’s model as saying (based on the comment plus his OP) that evolution so obviously does an overdetermined sharp left turn that it isn’t evidence of anything (e.g. that the world I proposed as an alternative breaks so many of his models that it isn’t worth considering)?
I agree that if evolution’s path is sufficiently overdetermined, then there’s no reason to cite that path as evidence. In which case we should instead be discussing the mechanisms that are overdetermining that result, and what they imply.
I think the reason we talk about evolution here is exactly because for most people, the underlying mechanisms very much aren’t obvious and overdetermined before looking at the results—if you skipped over the example people would think you were making a giant leap.
Ok, sure. I agree that for any given pair of facts there is essentially nothing to infer from one about the other, given what else we already know, and that the two facts Quintin cites as an example are a valid example. But it seems wrong to say that AI developments and evolutionary developments relate to each other in a way (or reference class) similar to how a speck on your pillow relates to the nuclear weapons program? Or that the distinctions proposed should generally be of a sufficient degree to imply there are no implications from one to the other?
What I was saying that Quintin is challenging in the second paragraph above, specifically, was not that for observations A and B it would be unusual for A to not have important implications for B. What I was saying was that there being distinctions in the causal graphs behind A and B is not a good reason to dismiss A having implications for B—certainly differences reduce it somewhat, but most of the time that A impacts B, there are important causal graph differences that could draw a similar parallel. And, again, this would strike down most reference class arguments.
Quintin does say there are non-zero implications in the comment, so I suppose the distinction does not much matter in the end. Nor does it much matter whether we are citing evolution, or citing our underlying models that also explain evolution’s outcomes, if we can agree on those models?
As in, we would be better served looking at:
I think I kind of… do believe this? For my own perhaps quite weird definitions of ‘weirdness’ and ‘what you train it for’? And for those values, no, this is not benign at all, because I don’t consider SLT behaviors to be weird when you have the capabilities for them. That’s simply what you would expect, including from a human in the same spot, why are we acting so surprised?
If you define ‘weirdness’ sufficiently differently then it would perhaps be benign, but I have no idea why you would expect this.
And also, shouldn’t we use our knowledge of humans here, when faced with similar situations? Humans, a product of evolution, do all sorts of local SLTs in situations far removed from their training data, the moment you give them the affordance to do so and the knowledge that they can.
It is also possible we are using different understandings of SLT, and Quintin is thinking about it more narrowly than I am, as his later statements imply. In that case, I would say that I think the thing I care about, in terms of whether it happens, is the thing (or combination of things) I’m talking about.
Thus, in my view, humans did not do only the one big anti-evolution (?) SLT. Humans are constantly doing SLTs in various contexts, and this is a lot of what I am thinking about in this context.
Aha?!?
Quintin, I think (?) is saying that the fact that evolution provided us with a central sharp left turn is not evidence, because that is perfectly compatible with and predicted by AI models that aren’t scary.
So I notice I disagree with this twice.
First, I don’t think that the second because clause entirely holds, for reasons that I largely (but I am guessing not entirely) laid out in my OP, for reasons that I am confident Quintin disagrees with and would take a lot to untangle, although I do agree there is some degree of overdeterminedness here where if we hadn’t done the exact SLT we did but had still ramped up our intelligence, we would have instead done a slightly-to-somewhat different-looking SLT later.
Second, I think this points out a key thing I didn’t say explicitly and should have, which is the distinction between the evidence that humans did all their various SLTs (yes, plural, both collectively and individually), and the evidence that humans did these particular SLTs in these particular ways because of these particular mechanisms. Which I do see as highly relevant.
I can imagine a world where humans did an SLT later in a different way, and are less likely to do them on an individual level (note: I agree that this may be somewhat non-standard usage of SLT, but hopefully it’s mostly clear from context what I’m referring to here), and everything happened slower and more continuously (on the margin presumably we can imagine this without our models breaking, if only via different luck). And where we look at the details and say, actually it’s pretty hard to get this kind of thing to happen, and moving humans out of their training distributions causes them to hold up really well—in the way we’d metaphorically like AIs to hold up—even when they are smart enough and have enough info and reflection time to know better, and so on.
(EDIT: It’s late, and I’ve now responded in stages to the whole thing, which as Quintin noted was longer than my OP. I’m thankful for the engagement, and will read any further replies, but will do my best to keep any further interactions focused and short so this doesn’t turn into an infinite time sink that it clearly could become, even though it very much isn’t a demon thread or anything.)
On concrete example 2: I see four bolded claims in ‘fast takeoff is still possible.’ Collectively, to me, in my lexicon and way of thinking about such things, they add up to something very close to ‘alignment is easy.’
The first subsection says human misalignment does not provide evidence for AI misalignment, which isn’t one of the two mechanisms (as I understand this?), and is instead arguing against an alignment difficulty.
The bulk of the second subsection, starting with ‘Let’s consider eight specific alignment techniques,’ looks to me like an explicit argument that alignment looks easy based on your understanding of the history from AI capabilities and alignment developments so far?
The third subsection seems to also spend most of its space on arguing its scenario would involve manageable risks (e.g. alignment being easy), although you also argue that evolution/culture still isn’t ‘close enough’ to teach us anything here?
I can totally see how these sections could have been written out with the core intention to explain how distinct-from-evolution mechanisms could cause fast takeoffs. From my perspective as a reader, I think my response and general takeaway that this is mostly an argument for easy alignment is reasonable on reflection, even if that’s not the core purpose it serves in the underlying structure, and it’s perhaps not a fully general argument.
On concrete example 3: I agree that what I said was a generalization of what you said, and you instead said something more specific. And that your later caveats make it clear you are not so confident that things will go smoothly in the future. So yes I read this wrong and I’m sorry about that.
But also I notice I am confused here—if you didn’t mean for the reader to make this generalization, if you don’t think that the failure of current capabilities advances to break current alignment techniques is strong evidence that future capabilities advances won’t break then-optimal alignment techniques, then why are we analyzing all these expected interactions here? Why state the claim that such techniques ‘already generalize’ (which they currently mostly do as far as I know, which is not terribly far) if it isn’t a claim that they will likely generalize in the future?
In general one should not try to rescue intuitions, and the frequency of doing this is a sign of serious cognitive distortions. You should only try to rescue intuitions when they have a clear and validated predictive or pragmatic track record.
The reason for this is very simple—most intuitions or predictions one could make are wrong, and you need a lot of positive evidence to privilege any particular hypotheses re how or what to think. In the absence of evidence, you should stop relying on an intuition, or at least hold it very lightly.
On the additional commentary section:
On the first section, we disagree on the degree of similarity in the metaphors.
I agree with you that we shouldn’t care about ‘degree of similarity’ and instead build causal models. I think our actual disagreements here lie mostly in those causal models, the unpacking of which goes beyond comment scope. I agree with the very non-groundbreaking insights listed, of course, but that’s not what I’m getting out of it. It is possible that some of this is that a lot of what I’m thinking of as evolutionary evidence, you’re thinking of as coming from another source, or is already in your model in another form to the extent you buy the argument (which often I am guessing you don’t).
On the difference in SLT meanings, what I meant to say was: I think this is sufficient to cause our alignment properties to break.
In case it is not clear: My expectation is that sufficiently large capabilities/intelligence/affordances advances inherently break our desired alignment properties under all known techniques.
On the passage you find baffling: Ah, I do think we had confusion about what we meant by inner optimizer, and I’m likely still conflating the two somewhat. That doesn’t change me not finding this heartening, though? As in, we’re going to see rapid big changes in both the inner optimizer’s power (in all senses) and also in the nature and amount of training data, where we agree that changing the training data details changes alignment outcomes dramatically.
On the impossible-to-you world: This doesn’t seem so weird or impossible to me? And I think I can tell a pretty easy cultural story slash write an alternative universe novel where we honor those who maximize genetic fitness and all that, and have for a long time—and that this could help explain why civilization and our intelligence developed so damn slowly and all that. Although to truly make the full evidential point that world then has to be weirder still where humans are much more reluctant to mode shift in various ways. It’s also possible this points to you having already accepted from other places the evidence I think evolution introduces, so you’re confused why people keep citing it as evidence.
The comment in response to parallels provides some interesting thoughts and I agree with most of it. The two concrete examples are definitely important things to know. I still notice the thing I noticed in my comment about the parallels—I’d encourage thinking about what similar logic would say in the other cases?
Nearly every piece of empirical evidence I’ve seen contradicts this—more capable systems are generally easier to work with in almost every way, and the techniques that worked on less capable versions straightforwardly apply and in fact usually work better than on less intelligent systems.
Presumably you agree this would become false if the system was deceptively aligned or otherwise scheming against us? Perhaps the implicit claim is that we should generalize from current evidence toward thinking the deceptive alignment is very unlikely?
I also think it’s straightforward to construct cases where goodharting implies that applying the technique you used for a less capable model onto a more capable model would result in worse performance for the more capable model. I think it should be straightforward to construct such a case using scaling laws for reward model overoptimization.
(That said, I think if you vary the point of early stopping as models get more capable then you likely get strict performance improvements on most tasks. But, regardless there is a pretty reasonable technique of “train for duration X” which clearly gets worse performance in realistic cases as you go toward more capable systems.)
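The overoptimization case gestured at above can be made concrete with a toy model. This is a hedged sketch, not anyone’s actual result: the functional form (gold reward rising then falling in the “distance” optimized against the proxy, while the proxy itself rises monotonically) loosely follows the shape reported in reward-model overoptimization scaling-law work, and all constants, plus the assumption that a more capable model moves further against the proxy in a fixed training duration, are hypothetical choices of mine.

```python
import math

# Hypothetical constants for the toy gold-reward curve.
ALPHA, BETA = 1.0, 0.4

def gold_reward(d):
    # d: "distance" optimized against the proxy (think: sqrt of KL from
    # the initial policy). Gold reward rises, peaks, then declines as
    # goodharting sets in.
    return d * (ALPHA - BETA * math.log(1 + d))

def proxy_reward(d):
    # The proxy being optimized just keeps going up.
    return d

def distance_after(duration, capability):
    # Assumption: a more capable model moves further against the proxy
    # in the same training duration.
    return duration * capability

X = 4.0  # the fixed "train for duration X" technique
weak = gold_reward(distance_after(X, capability=1.0))
strong = gold_reward(distance_after(X, capability=4.0))
print(weak, strong)
```

Under these (made-up) numbers, the fixed-duration technique yields positive gold reward on the weaker model but negative gold reward on the more capable one, even though the proxy reward is strictly higher for the capable model—illustrating how a technique tuned on a less capable model can perform worse when applied unchanged to a more capable one, and why varying the early-stopping point matters.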
The ability to write fiction in a world does not demonstrate its plausibility. Beware generalizing from fictional fictional evidence!
The claim that such a world is impossible is a claim that, were you to try to write a fictional version of it, you would run into major holes in the world that you would have to either ignore or paper over with further unrealistic assumptions.