Possibly relevant aside:
There may be some confusion here about behavioral vs. mechanistic claims.
I think when some people talk about a model “having a goal” they have in mind something purely behavioral. So when they talk about there being something in GPT that “has a goal of predicting the next token”, they mean it in this purely behavioral way. That is, there are some circuits in the network whose behavior has the effect of predicting the next token well, but that behavior is not motivated by, or steered on the basis of, trying to predict the next token well.
But when I (and possibly you as well?) talk about a model “having a goal” I mean something much more specific and mechanistic: a goal is a certain kind of internal representation that the model maintains, such that it makes decisions downstream of comparisons between that representation and its perception. That’s a very different thing! To claim that a model has such a goal is to make a substantive claim about its internal structure and how its cognition generalizes!
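To make the distinction concrete, here is a toy sketch of my own; it is not a claim about how any actual LLM works. The environment (an integer position on a line with actions -1, 0, +1) and the names `GoalDirectedAgent`, `LookupAgent`, and `build_lookup_mimic` are all hypothetical. The first agent has a goal in the mechanistic sense: it stores a target and acts on a comparison between that target and what it perceives. The second just replays a table copied from the first, so the two are behaviorally identical wherever the table was filled in.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class GoalDirectedAgent:
    """Mechanistic goal: keeps a target and acts on target-vs-perception comparisons."""
    target: int  # the internal goal representation

    def act(self, perceived_position: int) -> int:
        error = self.target - perceived_position  # the comparison step
        return (error > 0) - (error < 0)          # move one step toward the target


class LookupAgent:
    """No goal representation at all: a fixed observation-to-action table."""

    def __init__(self, table: Dict[int, int]) -> None:
        self.table = table

    def act(self, perceived_position: int) -> int:
        # Observations missing from the table default to "do nothing".
        return self.table.get(perceived_position, 0)


def build_lookup_mimic(target: int, positions: range) -> LookupAgent:
    """Copy the goal-directed agent's behavior, but only on the given positions."""
    teacher = GoalDirectedAgent(target)
    return LookupAgent({p: teacher.act(p) for p in positions})


goal_agent = GoalDirectedAgent(target=5)
mimic = build_lookup_mimic(target=5, positions=range(0, 11))

# On the positions the table covers, the two agents are indistinguishable.
same_in_range = all(goal_agent.act(p) == mimic.act(p) for p in range(0, 11))

# Off that range they come apart: only the first keeps steering toward 5.
goal_action_far = goal_agent.act(20)
mimic_action_far = mimic.act(20)
```

On positions 0 through 10 both agents can be described as “having the goal of reaching 5” in the behavioral sense. Only the first has the goal in the mechanistic sense, and that is what determines what happens at position 20, or what would happen if you overwrote `target`.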
When people talk about the shoggoth, it sure sounds like they are making claims that there is in fact an agent behind the mask, an agent that has goals. But maybe not? Like, when Ronny talked of the shoggoth having a goal, I assumed he was making the latter, stronger claim about the model having hidden goal-directed cognitive gears, but maybe he was making the former, weaker claim about how we can describe the model’s behaviors?
I appreciate the clarification, and I’ll try to keep that distinction in mind going forward! To rephrase my claim in this language, I’d say that an LLM as a whole does not have a behavioral goal except for “predict the next token”, which is not sufficiently descriptive as a behavioral goal to answer a lot of questions AI researchers care about (like “is the AI safe?”). In contrast, the simulacra the model produces can be much better described by more precise behavioral goals. For instance, one might say ChatGPT (with the hidden prompt we aren’t shown) has a behavioral goal of being a helpful assistant, or an LLM roleplaying as a paperclip maximizer has the behavioral goal of producing a lot of paperclips. But an LLM as a whole could contain simulacra that have all those behavioral goals and many more, and because of that diversity it can’t be well-described by any behavioral goal more precise than “predict the next token”.