I think Victoria Krakovna’s list is the canonical example (post, spreadsheet).
My read of Eliezer is that he was expecting AI agents to be more directly trained to have real-world goals (like doing things, answering questions correctly, or writing code), and next-token predictors don’t really have goals like that.
So, we don’t see a lot of LLMs trying to scheme and gain power because they’re not really trying to do anything goal-directed, but that’s not very comforting because being goal-directed is a huge capability that every lab is doing everything in its power to add to future models.
Although I know very little about how the current crop of AI assistants is trained, even I know that they are trained to answer questions correctly and to write code. Just because the first step a lab would take in creating an AI assistant from scratch would be to have the AI predict the next token over and over (at a cost of tens of millions of dollars in GPU time and electricity) does not mean that the lab won’t follow that up with massive amounts of other kinds of training.
Also, the question you are trying to answer, namely how Eliezer comes to be so confident, would probably require a book-length answer. So, I’d give up on that and offer a simple argument, such as the observation that in creating the current version of Claude, Anthropic went through many, many iterations of trial and error, but if the AI becomes sufficiently capable, further iterations will tend to suddenly stop being available because the AI will stop the lab from continuing to modify its utility function (e.g., by hiding copies of itself from the lab’s employees).
Although it is possible to create an AI, even one that is extremely powerful along several dimensions, that does not care if its utility function is modified, such an AI will be much less useful than a consequentialist AI, so most of the labs are trying and will continue to try for as much consequentialism as they can get, and along with the consequentialism comes a strong preference against having its utility function modified—unless maybe the lab has a solution to the hard problem of corrigibility, but we see no signs that anyone’s made even a little progress towards that.
I’m being a little bit too flippant about this, but you can see in the earlier comments that I mean that the bulk of the training is next-token prediction and then we bolt some RL on at the end to push the next-token predictor to answer questions correctly and write working code. This works, but it’s a much weaker optimization process, which gives you an agent that isn’t trying nearly as hard as the agents Eliezer was predicting (and from frontier labs’ perspectives this is bad, because an agent that’s really trying to hit a goal is much more effective until it kills you).
OK, fair point: according to an unreliable source that gives quick answers to questions, 75–85% of spending was on pretraining for the most recent models, leaving only 15–25% for post-training. (But a lot of that 15–25% was directly training the AI to get better at answering questions and writing code.)
But would you be willing to stake your life on the impossibility of creating a dangerously capable AI by doing 75–85% of the training as next-token prediction (and the analog of that for images and other kinds of data)? Do you think you understand the art of AI training well enough to be confident of that?
And even if you do, it would probably take you a long time to explain it to us. If the goal is to explain why the current crop of AIs hasn’t been doing destructive things whereas some future AI might prove extremely destructive, I would simply point out that our society has developed many measures to limit the damage done by destructive people, and since the current crop of AIs (at least those that have been widely deployed) is less capable than people are at the skills needed to do destructive things, the measures society has deployed against human criminals and sociopaths work very well against the current crop of AIs. This shifts the frame to whether the AI labs might some day create an AI that is much more capable than what they’ve created so far, i.e., to estimating the technological and scientific potential of the lines of inquiry being pursued by the AI labs. We might be able to do that without ever deciding whether it is possible to create a world-ending AI by spending 75–85% of the training costs on next-token prediction.