I think we are pretty much on the same page! Thanks for the example of the ball-moving AI, that was helpful. I think I only have two things to add:
Reward is not the optimization target. In particular, just because an LLM was trained by changing it to predict the next token better doesn’t mean the LLM will pursue that as a terminal goal. During operation an LLM is completely divorced from the training-time reward function; it just does the calculations and reads out the logits. This differs from a proper “goal” because we don’t need to worry about the LLM trying to wirehead by feeding itself easy predictions. In contrast, if we call up
To the extent we do say the LLM’s goal is next token prediction, that goal maps very unclearly onto human-relevant questions such as “is the AI safe?”. Next-token prediction contains multitudes, and in OP I wanted to push people towards “the LLM by itself can’t be divorced from how it’s prompted”.
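To make the first point concrete, here is a toy sketch (not any real LLM; `ToyBigramLM` and its methods are hypothetical names invented for illustration). The cross-entropy loss appears only inside `train_step`; `logits` is a pure function of the weights and the input, with nothing in it that could "seek" low loss.

```python
import math


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class ToyBigramLM:
    """Hypothetical toy model: one row of logits per previous token."""

    def __init__(self, vocab_size):
        self.vocab_size = vocab_size
        self.weights = [[0.0] * vocab_size for _ in range(vocab_size)]

    def logits(self, prev_token):
        # Inference: read out the logits from the weights.
        # No loss or reward is computed or consulted here.
        return list(self.weights[prev_token])

    def train_step(self, prev_token, next_token, lr=0.5):
        # Training: the loss exists only inside this method.
        probs = softmax(self.weights[prev_token])
        loss = -math.log(probs[next_token])
        for j in range(self.vocab_size):
            target = 1.0 if j == next_token else 0.0
            # Gradient of cross-entropy w.r.t. logit j is (p_j - target_j).
            self.weights[prev_token][j] -= lr * (probs[j] - target)
        return loss


model = ToyBigramLM(vocab_size=3)
losses = [model.train_step(prev_token=0, next_token=2) for _ in range(20)]
scores = model.logits(0)
predicted = max(range(3), key=lambda j: scores[j])
```

After training, the model predicts token 2 following token 0, but at inference time the only thing that runs is the lookup in `logits`. Whether a much larger network learns something more goal-like internally is exactly the open question; the sketch only shows that the training objective is not automatically present at run time.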
Possibly relevant aside:
There may be some confusion here about behavioral vs. mechanistic claims.
I think when some people talk about a model “having a goal” they have in mind something purely behavioral. So when they talk about there being something in GPT that “has a goal of predicting the next token”, they mean it in this purely behavioral way. Like that there are some circuits in the network whose behavior has the effect of predicting the next token well, but whose behavior is not motivated by / steered on the basis of trying to predict the next token well.
But when I (and possibly you as well?) talk about a model “having a goal” I mean something much more specific and mechanistic: a goal is a certain kind of internal representation that the model maintains, such that it makes decisions downstream of comparisons between that representation and its perception. That’s a very different thing! To claim that a model has such a goal is to make a substantive claim about its internal structure and how its cognition generalizes!
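A toy sketch of the distinction, under heavy simplifying assumptions (a 1-D world, and two hypothetical classes `GoalDirectedAgent` and `ReflexAgent` invented for illustration). Both agents behave identically on the positions the reflex table covers, so a purely behavioral description can't tell them apart there; only the first one holds a target and acts on the comparison with its perception, and that shows up in how they generalize.

```python
class GoalDirectedAgent:
    """Keeps an explicit target and acts on the comparison with what it perceives."""

    def __init__(self, target):
        self.target = target  # the internal goal representation

    def act(self, position):
        # Decision is downstream of comparing perception to the target.
        if position < self.target:
            return +1
        if position > self.target:
            return -1
        return 0


class ReflexAgent:
    """Fixed position -> action table; no target is represented anywhere."""

    def __init__(self, table, default=0):
        self.table = dict(table)
        self.default = default

    def act(self, position):
        return self.table.get(position, self.default)


goal_agent = GoalDirectedAgent(target=3)
# Table copied from the goal agent's behavior on positions 0..5 only.
reflex_agent = ReflexAgent({p: goal_agent.act(p) for p in range(6)})

same_on_seen = all(goal_agent.act(p) == reflex_agent.act(p) for p in range(6))
# Outside the covered range the two come apart.
goal_far = goal_agent.act(10)
reflex_far = reflex_agent.act(10)
```

On positions 0 through 5 the two are behaviorally indistinguishable; at position 10 the goal-directed agent still heads toward its target while the reflex agent falls back to its default. Saying a model "has a goal" in the mechanistic sense is a claim that it looks like the first class on the inside, not just that its outputs match.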
When people talk about the shoggoth, it sure sounds like they are making claims that there is in fact an agent behind the mask, an agent that has goals. But maybe not? Like, when Ronny talked of the shoggoth having a goal, I assumed he was making the latter, stronger claim about the model having hidden goal-directed cognitive gears, but maybe he was making the former, weaker claim about how we can describe the model’s behaviors?
I appreciate the clarification, and I’ll try to keep that distinction in mind going forward! To rephrase my claim in this language, I’d say that an LLM as a whole does not have a behavioral goal except for “predict the next token”, which is not sufficiently descriptive as a behavioral goal to answer a lot of questions AI researchers care about (like “is the AI safe?”). In contrast, the simulacra the model produces can be much better described by more precise behavioral goals. For instance, one might say ChatGPT (with the hidden prompt we aren’t shown) has a behavioral goal of being a helpful assistant, or an LLM roleplaying as a paperclip maximizer has the behavioral goal of producing a lot of paperclips. But an LLM as a whole could contain simulacra that have all those behavioral goals and many more, and because of that diversity it can’t be well described by any behavioral goal more precise than “predict the next token”.
Yeah, I’m totally with you that it definitely isn’t actually next-token prediction; it’s some totally other goal drawn from the distribution of goals you get when you run SGD to minimize next-token prediction surprise.