In order to experience a sharp left turn that arose due to the same mechanistic reasons as the sharp left turn of human evolution, an AI developer would have to:
Deliberately create a (very obvious) inner optimizer, whose inner loss function includes no mention of human values / objectives. (A minimal code sketch of what such a setup can look like follows this list.)
Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.
Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.
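For concreteness, here is a minimal sketch of the first condition: an ordinary self-supervised training loop whose inner loss is pure next-token prediction. Everything here (the tiny model, the random stand-in data, the sizes) is an illustrative assumption, not any lab’s actual setup; the point is only that nothing in the objective references human values.

```python
# Minimal sketch of an "inner optimizer": plain self-supervised training.
# Model, sizes, and data are illustrative stand-ins, not a real setup.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64
model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),  # token -> vector
    nn.Linear(d_model, vocab_size),     # vector -> next-token logits
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()  # inner loss: predict the next token, nothing else

for step in range(10_000):  # many cheap gradient updates = the inner loop
    tokens = torch.randint(0, vocab_size, (32, 17))  # stand-in for a text corpus
    logits = model(tokens[:, :-1])                   # predict token t+1 from token t
    loss = loss_fn(logits.reshape(-1, vocab_size), tokens[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```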
I’ll bite! I think this happens if we jump up a level from “an AI developer” to “the world”:
Lots of different people and companies deliberately create a (very obvious) inner optimizer (i.e. a fresh new ML training run), whose inner loss function includes no mention of human values / objectives (at least sometimes—e.g. self-supervised learning, or safety-unconcerned people trying capabilities-oriented RL reward functions to beat a benchmark, or just to see what happens, etc.).
An outer optimizer exists here—the people doing the best on benchmarks will get their approaches copied, get more funding, etc. But the outer optimizer has billions of times less optimization power than the inner optimizer. (A back-of-envelope version of this comparison follows the list.)
At least some of these people and companies (especially the safety-unconcerned ones) let the inner optimizer run freely without any supervision, limits, or interventions (OK sure, probably somebody is watching the loss function go down during training, but presumably it’s not uncommon to wait until a training run is “complete” before doing a rigorous battery of safety tests).
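To put rough numbers on the “billions of times” claim in the second point: the inner loop gets a per-example gradient signal at every update, while the outer loop (copying, funding, scaling up) selects only among whole training runs, a handful of times per year. A back-of-envelope sketch, where every figure is an assumed round number rather than a measurement:

```python
# Back-of-envelope comparison; all numbers are illustrative assumptions.
gradient_steps_per_run = 10**5     # optimizer updates in one large training run
examples_per_batch = 10**6         # per-example loss terms contributing to each update
inner_signals = gradient_steps_per_run * examples_per_batch  # ~1e11 feedback signals

outer_selection_events = 10**2     # runs per year that get copied / funded / scaled
print(inner_signals / outer_selection_events)  # ~1e9: inner loop leads by ~billions
```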
Some possible cruxes here are: (1) do these safety-unconcerned people (or safety-concerned-in-principle-but-failing-to-take-necessary-actions people) exist & hold influence, and if so will that continue to be true when AI x-risk is on the table? (I say yes—e.g. Yann LeCun thinks AI x-risk is dumb.) (2) Is it plausible that one group’s training run will have importantly new and different capabilities from the best relevant previous one? (I say yes—consider grokking, or algorithmic improvements, or autonomous learning per my other comment.)
I don’t think I’m concerned by moving up a level in abstraction. For one, I don’t expect any specific developer to suddenly get access to 5-9 OOMs more compute than any previous developer. (The illustrative arithmetic below spells out what a jump of that size would mean.) For another, it seems clear that we’d want the AIs being built to be misaligned with whatever “values” correspond to the outer selection signals associated with the outer optimizer in question (i.e., “the people doing the best on benchmarks will get their approaches copied, get more funding, etc.”). Seems like an AI being aligned to, like, impressing its developers? doing well on benchmarks? getting more funding? becoming the best architecture it can be? IDK what, but it would probably be bad.
So, I don’t see a reason to expect either a sudden capabilities jump (Edit: deriving from the same mechanism as the human sharp left turn), or (undesirable) misalignment.
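As a quick aside on the “5-9 OOMs” figure: multiplying it out shows the scale being ruled out here. The 1e25 FLOP baseline below is an assumed round number for a frontier-scale training run, not a real compute estimate:

```python
# Illustrative arithmetic only; the baseline is an assumed round number.
frontier_flop = 1e25               # assumed compute of a recent frontier training run
print(frontier_flop * 1e5)         # 1e30 FLOP: a 5 OOM jump
print(frontier_flop * 1e9)         # 1e34 FLOP: a 9 OOM jump, far beyond any single
                                   # developer's plausible near-term compute budget
```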
I wrote:

(2) Is it plausible that one group’s training run will have importantly new and different capabilities from the best relevant previous one? (I say yes—consider grokking, or algorithmic improvements, or autonomous learning per my other comment.)
And then you wrote:
I don’t expect any specific developer to suddenly get access to 5-9 OOMs more compute than any previous developer.
Isn’t that kinda a strawman? I can imagine a lot of scenarios where a training run results in a qualitatively better trained model than any that came before—I mentioned three of them—and I think “5-9 OOMs more compute than any previous developer” is a much, much less plausible scenario than any of the three I mentioned.
This post mainly argues that evolution does not provide evidence for the sharp left turn. Sudden capabilities jumps from other sources, such as those you mention, are more likely, IMO. My first reply to your comment argues that the mechanisms behind the human sharp left turn wrt evolution probably still won’t arise in AI development, even if you go up an abstraction level. One of those mechanisms is a 5-9 OOM jump in usable optimization power, which I think is unlikely.