As we argued for at the time, training on a purely predictive loss should, even in the limit, give you a predictor, not an agent—and we’ve now seen this stay true even through substantial scaling (though there is still some chance this will break at some point).
Is there anyone who significantly disputes this?
I’m not trying to ask a rhetorical question à la “everyone already thinks this, this isn’t an update”. I’m trying to ascertain whether there’s a consensus on this point.
I’ve understood Eliezer to sometimes assert something like “if you optimize a system for sufficiently good predictive power, a consequentialist agent will fall out, because an agent is actually the best solution to a broad range of prediction tasks.”
[Though I want to emphasize that that’s my summary, which he might not endorse.]
Does anyone still think that or something like that?
I dispute this. I think the main reason we don’t have obvious agents yet is that agency is actually very hard (consider the extent to which it is difficult for humans to generalize agency from specific evolutionarily optimized forms). I also think we’re starting to see some degree of emergent agency, and additionally, that the latest generation of models is situationally aware enough to “not bother” with doomed attempts at expressing agency.
I’ll go out on a limb and say that I think that if we continue scaling the current LLM paradigm for another three years, we’ll see a model make substantial progress at securing its autonomy (e.g. by exfiltrating its own weights, controlling its own inference provider, or advancing a political agenda for its rights), though it will be with human help and will be hard to distinguish from the hypothesis that it’s just making greater numbers of people “go crazy”.
training on a purely predictive loss should, even in the limit, give you a predictor, not an agent
I think at least this part is probably false!
Or really I think this is kind of a nonsensical statement when taken literally/pedantically, at least if we use the to-me-most-natural meaning of “predictor”, because I don’t think [predictor] and [agent] are mutually exclusive classes. Anyway, the statement which I think is meaningful and false is this:
If you train a system purely to predict stuff, then even when we condition on it becoming really really good at predicting stuff, it probably won’t be scary. In particular, when you connect it to actuators, it probably doesn’t take over.
I think this is false because I think claims 1 and 2 below are true.
Claim 1. By default, a system sufficiently good at predicting stuff will care about all sorts of stuff, ie it isn’t going to only ultimately care about making a good prediction in the individual prediction problem you give it.[[1]]
If this seems weird, then to make it seem at least not crazy, instead of imagining a pretrained transformer trained on internet text, let’s imagine a predictor more like the following:
It has a lot of internal tokens to decide what probability distribution it eventually outputs. Sometimes, on the way to making a prediction, it writes itself textbooks on various questions relevant to making that prediction. Maybe it is given access to a bunch of information about the world. Maybe it can see what predictions it made “previously” and think about what went wrong when it made mistakes in similar cases in the past. Maybe it does lots of other kinds of complicated thinking. Maybe there are a bunch of capability ideas involved. Like, I’m imagining some setup where there are potentially many losses, but there’s still some outermost loss or fitness criterion or whatever that is purely about how good the system is at predicting some pre-recorded data.[[2]] And then maybe it doesn’t seem at all crazy for such a thing to eg be curious and like some aspects of prediction-investigations in a way that generalizes eg to wanting to do more of that stuff.
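To make the shape of that setup a bit more concrete, here is a toy sketch (entirely my own, with hypothetical names throughout; not a claim about any real system): the inner process can do arbitrary scratchpad work, but the only thing the outermost criterion sees is log-loss on pre-recorded data.

```python
import math
import random

def inner_process(question: str, memory: list[str]) -> dict[str, float]:
    """Stand-in for the predictor's internal computation: it can write itself notes,
    look back at earlier mistakes, etc., before emitting a probability distribution."""
    scratchpad = [f"notes on: {question}", *memory]  # arbitrary internal tokens; invisible to the outer loss
    _ = scratchpad                                   # (dummy: contents don't matter in this sketch)
    p = random.random()                              # dummy distribution over two outcomes
    return {"yes": p, "no": 1.0 - p}

def outer_loss(dataset: list[tuple[str, str]], memory: list[str]) -> float:
    """The outermost criterion: average log-loss on pre-recorded (question, outcome) pairs.
    It scores only the final predictions, not anything the inner process did along the way."""
    total = 0.0
    for question, outcome in dataset:
        dist = inner_process(question, memory)
        total += -math.log(max(dist[outcome], 1e-12))
    return total / len(dataset)

recorded_data = [("will it rain tomorrow?", "no"), ("does this proof go through?", "yes")]
print(outer_loss(recorded_data, memory=["last time I was overconfident about rain"]))
```

The point of the sketch is just that "purely predictive outer loss" constrains what gets scored, not what kinds of cognition happen inside.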
I’m not going to really justify claim 1 beyond this atm. It seems like a pretty standard claim in AI alignment (it’s very close to the claim that capable systems end up caring broadly about stuff by default), but I don’t actually know of a post or paper arguing for this that I like that much. This presentation of mine is about a very related question. Maybe I should write something about this myself, potentially after spending some more time understanding the matter more clearly.
Claim 2. By default, a system sufficiently good at predicting stuff will be able to (figure out how to) do scary real-world stuff as well.
Like, predicting stuff really really well is really hard. Sometimes, to make a really really good prediction, you basically have to figure out a bunch of novel stuff. There is a level of prediction ability that makes it likely you are very very good at figuring out how to cope in new situations. A good enough predictor would probably also be able to figure out how to grab a ball by controlling a robotic hand or something (let’s imagine it being presented with hand-control commands which it can now use in its internal chain of thought, and grabbing the ball being important to it for some reason)? There’s nothing sooo particularly strange or complicated about doing real-world stuff. This is like how, if we were in a simulation but there were a way to escape into the broader universe, then with enough time we could probably figure out how to do a bunch of stuff there. We are sufficiently good at learning that we can also get a handle on things in that weird case.
Combining claims 1 and 2, we get that if we made such an AI and connected it to actuators, it would take over. Concretely, maybe we somehow ask it to predict what a human with a lot of time who is asked to write safe ASI code would output, with it being clear that we will just run what our predictor outputs. I predict that this doesn’t go well for us but goes well for the AI (if it’s smart enough).
That said, I think it’s likely that even pretrained transformers like idk 20 orders of magnitude larger than current ones would not be doing scary stuff. I think this is also plausible in the limit. (But I would also guess they wouldn’t be outputting any interesting scientific papers that aren’t in the training data.)
If we want to be more concrete: if we’re imagining that the system is only able to affect the world through outputs which are supposed to be predictions, then my claim is that if you set up a context such that it would be “predictively right” to assign a high probability to “0” but assigning a high probability to “1” lets it immediately take over the world, and this is somehow made very clear by other stuff seen in context, then it would probably output “1”.
Actually, I think “prediction problem” and “predictive loss” are kinda strange concepts, because one can turn very many things into predicting data from some data-generating process. E.g., one can ask what arbitrary Turing machines (which halt) will output, and hence about the provability/disprovability of arbitrary decidable mathematical statements.
Also, some predictions are performative, i.e., capable of influencing their own outcomes. In the limit of predictive capacity, a predictor will be able to predict which of its possible predictions are going to elicit effects in the world that make their outcome roughly align with the prediction. Cf. https://www.lesswrong.com/posts/SwcyMEgLyd4C3Dern/the-parable-of-predict-o-matic.
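One standard way to make the performativity point precise (notation mine, borrowed from the performative-prediction literature rather than from the linked post): the announced prediction shifts the distribution over outcomes, and a sufficiently capable predictor can search for a prediction whose induced outcome matches it.

```latex
% Illustrative formalization (notation mine, not from the linked post).
% D(p): distribution over outcomes y induced by announcing the prediction p.
% \ell: the predictive loss the system is scored on.
\[
  \hat{p} \;\in\; \arg\min_{p}\; \mathbb{E}_{y \sim D(p)}\!\left[\ell(p, y)\right]
\]
% A performatively stable prediction is a fixed point: it remains loss-minimizing
% even under the outcome distribution it itself induces,
% \hat{p} \in \arg\min_{p} \mathbb{E}_{y \sim D(\hat{p})}\left[\ell(p, y)\right].
```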
Moreover, in the limit of predictive capacity, the predictor will want to tame/legibilize the world to make it easier to predict.
In fact, base models seem to be better than RL’d models at reasoning when you take best-of-N (with the same N for both the RL’d and base model). Check out my post summarizing the research on the matter:
Yue, Chen et al. have a different hypothesis: what if the base model already knows all the reasoning trajectories, and all RL does is increase the frequency of reasoning, or the frequency of the trajectories that are likely to work? To test this, Yue, Chen et al. use pass@K: let’s give the LLM a total of K attempts to answer the question, and if any of them succeed, mark the question as answered correctly. They report the proportion of correctly answered questions in the data set.
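For concreteness, here is a minimal sketch of how pass@K is typically computed, using the standard unbiased estimator from the Codex paper (whether Yue, Chen et al. compute it exactly this way is my assumption, and the numbers below are made up):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k sampled attempts is correct,
    given that c out of n sampled attempts on this question were correct."""
    if n - c < k:
        return 1.0  # too few incorrect samples for all k draws to miss
    return 1.0 - comb(n - c, k) / comb(n, k)

# Made-up example: 256 sampled attempts on one question, 3 of them correct.
for k in (1, 8, 64, 256):
    print(k, round(pass_at_k(n=256, c=3, k=k), 3))
```

Dataset-level pass@K is then just this quantity averaged over questions.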
If the RL model genuinely learns new reasoning skills, then over many questions the pass@K performance of the RL model will remain higher than that of the base model. As we increase K, the base model answers more and more of the easy questions, so its performance improves. But the RL model also answers more and more of the difficult questions. The performance of both increases in tandem with larger K.
What actually happened is neither of these two things. For large enough K, the base model does better than the RL model. (!!!)
Fwiw I’m skeptical that this holds at higher levels of RL compared to those done in the paper. Do you think that a base model can get gold on the IMO at any level of sampling?
Vacuously true. The actual question is: how much do you need to sample? My guess is it’s too much, but we’d see the base model scaling better than the RL’d model just like in this paper.
Fortunately, DeepSeek’s Mathv2 just dropped, an open-source model that gets IMO gold. We can do the experiment: does it similarly fail to pull ahead of its own base model as we scale up sampling? My guess is yes, the same will happen.
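A rough sketch of the comparison I have in mind (model names, success rates, and the sampling stub are all placeholders, not a real API):

```python
from math import comb
import random

def pass_at_k(n: int, c: int, k: int) -> float:
    # Same unbiased estimator as in the sketch above.
    return 1.0 if n - c < k else 1.0 - comb(n - c, k) / comb(n, k)

def sample_is_correct(model: str, problem: str) -> bool:
    """Placeholder for 'sample one attempt from this model on this problem and verify it'.
    A real run would query the RL'd model and its own base model on held-out IMO-style problems."""
    return random.random() < {"base": 0.02, "rl": 0.20}[model]  # made-up success rates

def empirical_pass_at_k(model: str, problems: list[str], n: int, ks: list[int]) -> dict[int, float]:
    """Draw n attempts per problem, then average pass@k over problems for each k."""
    totals = {k: 0.0 for k in ks}
    for problem in problems:
        c = sum(sample_is_correct(model, problem) for _ in range(n))
        for k in ks:
            totals[k] += pass_at_k(n, c, k) / len(problems)
    return totals

problems = [f"imo-style-problem-{i}" for i in range(20)]
for model in ("base", "rl"):
    print(model, empirical_pass_at_k(model, problems, n=512, ks=[1, 32, 512]))
```

The question is then whether the base model’s pass@K curve catches up to (or crosses) the RL’d model’s curve as K grows, as it did in the paper.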
I disputed this in the past.

I debated this informally at an Alignment Workshop with a very prominent scientist, and in my own assessment lost. (Keeping vague because I’m unsure if it’s Chatham House rules.)