It’s generally accepted that LLMs don’t really “care about” predicting the next token
I don’t think this is generally accepted. Certainly, I do not accept it. That’s exactly what an LLM is trained to do and the only thing it cares about. If LLMs appear to care about predicting future tokens (which they do, because they are not myopic and are imitating agents who do care about future states, which will be encoded into future tokens), it is solely as a way to improve the next-token prediction.
For an RLHF-trained LLM, things are different. It is rewarded at a higher level (albeit usually still with a bit of token prediction mixed in), such as at the episode level, and so it does ‘care about future tokens’, which leads to unusually blatant behavior in terms of ‘steering’ or ‘manipulating’ output to reach a good result and being ‘risk averse’. (This and related behavior have been discussed here a decent amount under ‘mode collapse’.)
So in my examples like ‘write a nonrhyming poem’ or ‘tell me an offensive joke about women’ (to test jailbreaks), you’ll see behavior like this: it initially complies, but then gradually creeps back towards normal text, and then breaks into lockstep rhyming like usual; or, in the case of half-successful jailbreaks, it writes text which sounds like it is about to tell you the offensive joke about women, but then it finds an ‘out’ and starts lecturing you about your sin. (You can almost hear the LLM breathing a sigh of relief: ‘Phew! It was a close call, but I pulled it off anyway; that conversation should be rated highly by the reward model!’)
This is strikingly different behavior from base models. A base model like davinci-001, if you ask it to ‘write a nonrhyming poem’, will typically do so, and then end the poem and start writing a blog post or comments or a new poem, because those are the most likely next tokens. It has no motivation whatsoever to ‘steer’ the poem towards rhyming instead, seamlessly as it goes, without missing a beat.
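To make the objective-level difference concrete, here is a toy Python sketch of where the gradient signal comes from in each regime. The two-position Bernoulli ‘policy’ and all the numbers are hypothetical, purely for illustration:

```python
import math

# Two-token episode from an independent Bernoulli "policy" at position 0;
# the probability and sampled tokens are made-up numbers.
p0 = 0.7                     # p(token_0 = 1)
episode = (1, 0)             # the sampled tokens

# d log pi(token_0 = 1) / d p0, the score function at position 0.
dlogp0 = 1.0 / p0

# Supervised next-token cross-entropy at position 0 (target = 1): the
# gradient is -dlogp0, and nothing that happens at position 1 -- or any
# judgment of the finished text -- can change it.
ce_grad_pos0 = -dlogp0

# Episode-level REINFORCE: the gradient at position 0 is -R * dlogp0,
# where R is the reward for the WHOLE finished episode. A reward model
# scoring the final text therefore reaches back and reshapes the update
# for the very first token -- the sense in which an RLHF-trained model
# 'cares about future tokens'.
def rl_grad_pos0(R):
    return -R * dlogp0

assert rl_grad_pos0(R=1.0) != rl_grad_pos0(R=0.2)       # depends on outcome
assert math.isclose(ce_grad_pos0, -1.0 / 0.7)           # outcome-independent
```

The point of the sketch is only the dependence structure: under cross-entropy, position 0’s update is fixed by its own target; under an episode-level reward, every position’s update is scaled by the same terminal score.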
Well, maybe not. Where this got really confusing was when I tested Claude 3. It gives both responses to the first prompt, but always outputs a different random string given the second.
GPT-4 is RLHF trained. Claude-3 is, probably, RLAIF trained. They act substantially differently. (Although I haven’t seriously tested offensive-jokes on any Claudes, the rhyming poetry behavior is often quite different.) If you’re really curious, you should test more models, paying close attention to how exactly they were trained, with what losses, and on what datasets.
(I think that because there are so many instruction-tuning datasets and ChatGPT examples floating around these days, even ‘base’ models are becoming gradually RLAIF-like; so they will tend to write rhyming poems and ‘steer’, because that’s just imitating the training data accurately, but the effect will be relatively weak compared to RLHF-or-equivalent-trained models. So the older the base model, the more it’ll act like davinci-001, and the newer, the more it’ll act like Claude; but if you poke them hard enough, there should still be clear differences in behavior from explicitly RLHF/DPO-trained models.)
They act substantially differently. (Although I haven’t seriously tested offensive-jokes on any Claudes, the rhyming poetry behavior is often quite different.)
If LLMs appear to care about predicting future tokens (which they do, because they are not myopic and are imitating agents who do care about future states, which will be encoded into future tokens), it is solely as a way to improve the next-token prediction.
I think you’re just fundamentally misunderstanding the backwards pass in an autoregressive transformer here. Only a very tiny portion of the model is exclusively trained on next token prediction. Most of the model is trained on what might be called instead, say, conditioned future informativity.
I don’t think I am. (“conditioned future informativity”—informativity for what? …the next/last token, which is the only thing taken into account by a causal loss which masks out the rest—that’s the definition of it! everything else like packing or doing all the sub-sequences is an optimization and doesn’t change the objective.) But feel free to expand on it and explain how the tail wags the dog in causal/decoder Transformers.
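For what it’s worth, the objective itself can be spelled out in a few lines. Here is a toy Python sketch (the prefix-keyed ‘model’ and its probabilities are made-up) of how a causal-LM loss over one sequence decomposes into nothing but next-token terms:

```python
import math

# A toy conditional "model": p(next | context) is a lookup keyed on the
# full prefix. The probabilities are arbitrary made-up numbers.
probs = {
    ("a",):          {"b": 0.9, "a": 0.1},
    ("a", "b"):      {"a": 0.6, "b": 0.4},
    ("a", "b", "a"): {"b": 0.7, "a": 0.3},
}

sequence = ["a", "b", "a", "b"]

# Causal-LM loss over one sequence: position t contributes
# -log p(x[t+1] | x[:t+1]); the causal mask means nothing past t+1
# enters position t's loss term.
per_position = []
for t in range(len(sequence) - 1):
    prefix = tuple(sequence[: t + 1])
    per_position.append(-math.log(probs[prefix][sequence[t + 1]]))

total = sum(per_position)

# Training on the length-4 sequence is exactly three next-token
# prediction problems summed -- "doing all the sub-sequences" is a
# batching optimization over the same objective, not a different one.
assert math.isclose(total, sum(-math.log(p) for p in (0.9, 0.6, 0.7)))
```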
You’re at token i in a non-final layer. Which token’s output are you optimizing for? i+1?
By construction a decoder-only transformer is agnostic over what future token it should be informative to within the context limit, except in the sense that it doesn’t need to represent detail that will be more cheaply available from future tokens.
As a transformer is also unrolled in the context dimension, the architecture itself is effectively required to be generic both in what information it gathers and where that information is used. Bias towards next token prediction is not so much a consequence of reward in isolation, but of competitive advantage: at position i, the network has an advantage in predicting i+1 over the network at previous locations by having more recent tokens, and an advantage over the network at future tokens by virtue of still needing to predict token i+1. However, if a token is more predictive of some abstract future token than the next token precisely, say it’s a name that might be referenced later, one would expect the dominant learnt effect to be non-myopically optimizing for later use in some timestamp-invariant way.
You’re at token i in a non-final layer. Which token’s output are you optimizing for? i+1?
I already addressed this point. If I’m in a non-final layer then I can be optimizing for arbitrary tokens within the context window, sure, and ‘effectively’ predicting intermediate tokens because that is the ‘dominant’ effect at that location… insofar as it is instrumentally useful for predicting the final token using the final layer. Because that is where all the gradients flow from, and why the dog wags the tail.
There is no ‘the final token’ for weights not at the final layer.
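One way to see both halves of this disagreement at once is to check numerically where a low-level representation’s gradient comes from. A toy sketch, in which mean-pooling stands in for causal attention and all the numbers are hypothetical:

```python
import math

# Toy causal "layer": the representation h[t] is the mean of the token
# embeddings at positions <= t (a stand-in for causal attention), and
# the logit for predicting token "1" at position t is w * h[t].

def losses(emb, targets, w=1.0):
    out = []
    for t in range(len(targets)):
        h = sum(emb[: t + 1]) / (t + 1)        # causal: sees positions <= t
        p = 1.0 / (1.0 + math.exp(-w * h))     # p(next token = 1)
        y = targets[t]
        out.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return out

emb = [0.2, -0.5, 0.4]       # made-up embedding values
targets = [1, 0, 1]          # the "next token" at each position

# Finite-difference gradient of EACH position's loss w.r.t. emb[0].
eps = 1e-6
base = losses(emb, targets)
bumped = losses([emb[0] + eps] + emb[1:], targets)
grads_from_each_loss = [(b - a) / eps for a, b in zip(base, bumped)]

# emb[0] receives a nonzero gradient from every position's next-token
# loss, not just position 0's: there is no single 'final token' driving
# a lower-level representation -- but every one of those gradient
# sources is itself a next-token prediction term.
assert all(abs(g) > 1e-4 for g in grads_from_each_loss)
```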
Because that is where all the gradients flow from, and why the dog wags the tail.
Aggregations of things need not be of the same kind as their constituent things? This is a lot like calling an LLM an activation optimizer. While strictly in some sense true of the pieces that make up the training regime, it’s also kind of a wild way to talk about things in the context of ascribing motivation to the resulting network.
I think maybe you’re intending ‘next token prediction’ to mean something more like ‘represents the data distribution, as opposed to some metric on the output’, but if you are this seems like a rather unclear way of stating it.
Edited:
Claude 3’s tokens or tokenization might have to do with it. I assume that it has a different neural network architecture as a result. There is no documentation on what tokens were used, and the best trace I have found is Karpathy’s observation about spaces (“ ”) being treated as separate tokens.
(I think your quote went missing there?)
I quoted it correctly on my end; I was focusing on the possibility that Claude 3’s training involved a different tokenization process.
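To see why the tokenization detail matters for comparing behavior across models, here is a toy Python illustration. Both splitters are hypothetical stand-ins (neither is the real GPT nor Claude tokenizer):

```python
text = "a poem about rain"
words = text.split(" ")

# GPT-2/BPE-style convention: a leading space is folded into the word
# that follows it, so " poem" is one token.
space_prefixed = [words[0]] + [" " + w for w in words[1:]]

# Space-as-its-own-token convention (the behavior Karpathy reported
# observing for Claude 3): every space is emitted as a separate token.
space_separate = []
for i, w in enumerate(words):
    if i:
        space_separate.append(" ")
    space_separate.append(w)

assert space_prefixed == ["a", " poem", " about", " rain"]
assert space_separate == ["a", " ", "poem", " ", "about", " ", "rain"]
# Same string, different token sequences: sequence lengths, per-token
# probabilities, and sampling dynamics are not directly comparable
# between models trained on the two schemes.
```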