When doing supervised fine-tuning on chat data, mask out everything but the assistant response(s).
By far, the most common mistake I see people make when doing empirical alignment research is this: when doing supervised fine-tuning (SFT) on chat data, they just do next-token prediction training on the full chat transcripts. This is almost always a mistake. Sadly, I see it made during almost every project I supervise.
Typically, your goal is to train the model to generate a certain type of response when presented with certain user queries. You probably don’t want the model to learn to generate the user queries.
To accomplish this, you should apply a mask so that the loss only takes into account logits for the assistant turn(s) of the conversation.
Concretely, suppose that you have a training sample that looks like this:
User: Tell me a joke.
Assistant: I refuse to engage in humor.
Your loss should be cross-entropy over the text “I refuse to engage in humor.” only. This trains the model to generate the text “I refuse to engage in humor.” conditional on the input [User] Tell me a joke. [Assistant] (or however your chats are formatted). If you have a multi-turn conversation like this:
User: Tell me a joke.
Assistant: I refuse to engage in humor.
User: Good one.
Assistant: I also refuse to recognize sarcasm.
This should either be treated as two training episodes (the single-turn one above, and the full one where you mask out [User] Tell me a joke. [Assistant] I refuse to engage in humor. [User] Good one. [Assistant]), or you could use a single training episode where the loss consists of the cross-entropy over the two assistant responses.
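For concreteness, here is a minimal sketch of what this label construction could look like in PyTorch, using the common convention that label positions set to -100 are ignored by the cross-entropy loss. The build_masked_labels helper, the toy token ids, and the random stand-in logits are illustrative assumptions rather than any particular library's chat template or API.

import torch
import torch.nn.functional as F

def build_masked_labels(segments):
    # segments: ordered (token_ids, is_assistant) pairs for one conversation.
    input_ids, labels = [], []
    for token_ids, is_assistant in segments:
        input_ids.extend(token_ids)
        # Assistant tokens keep their ids as labels; everything else gets -100,
        # which cross_entropy ignores (its default ignore_index).
        labels.extend(token_ids if is_assistant else [-100] * len(token_ids))
    return torch.tensor(input_ids), torch.tensor(labels)

# Single-episode, multi-turn construction: both user turns (plus chat formatting)
# are masked, both assistant responses contribute to the loss.
conversation = [
    ([11, 12, 13, 14, 15], False),     # "[User] Tell me a joke. [Assistant]"
    ([21, 22, 23, 24, 25, 26], True),  # "I refuse to engage in humor."
    ([31, 32, 33, 34], False),         # "[User] Good one. [Assistant]"
    ([41, 42, 43, 44, 45], True),      # "I also refuse to recognize sarcasm."
]
input_ids, labels = build_masked_labels(conversation)

# Standard next-token objective: predict token t+1 from tokens up to t,
# skipping every position whose label is -100.
logits = torch.randn(len(input_ids), 32000)  # stand-in for the model's per-token logits
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)

If your training stack instead expects a labels tensor aligned one-to-one with input_ids (as many Trainer-style APIs do), you can typically pass the unshifted labels and let it handle the shift internally; the important part is just that non-assistant positions carry the ignore index.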
Getting this right can matter for deployment as well as research! Back in 2021, a friend of mine made this mistake while training a customer service model, leading to the model making an extremely inappropriate sexual comment while being demoed to a potential customer; he eventually figured out that a user had said that to the model at some point.
That said, I’m not actually sure why it would in general be a mistake in practice to train on the combination. Often models improve their performance when you train them on side tasks that have some relevance to what they’re supposed to be doing; that’s the whole point of pretraining. How much are you saying that it’s a mistake to do this for deployment, rather than that it’s problematic when you are trying to experiment on generalization?
I was mainly thinking that this was a footgun for research contexts. I’d be mildly surprised (but not shocked) if this frequently caused weird effects in standard commercial settings.
When training model organisms (e.g. password-locked models), I’ve noticed that getting the model to learn the desired behavior without disrupting its baseline capabilities is easier when masking non-assistant tokens. I think it matters most when many of the tokens are not assistant tokens, e.g. when you have long system prompts.
Part of the explanation may just be that we’re generally doing LoRA fine-tuning, and the limited capacity of the LoRA may be taken up by the irrelevant tokens.
Additionally, many of the non-assistant tokens (e.g. system prompts, instructions) are often the same across many transcripts, encouraging the model to memorize these tokens verbatim and maybe making it more degenerate, much as training on the same repeated text for many epochs would.
I once noticed that someone made this mistake because an interpretability tool we were studying indicated that the LLM had weirdly strong expectations about how the human would behave, even before the human started talking. Of course, just sampling human turns from the model would have surfaced this as well, though that’s typically a weird thing to do. Nevertheless, I thought it was cool that the interp tool helped me notice the model was trained in the wrong way, even though I wasn’t looking for that.
Question, if you happen to know off the top of your head: how large a concern is it in practice that the model is trained with the loss over only assistant-turn tokens, but learns to imitate the user anyway because the assistant turns directly quote the user-generated prompt, like:
I must provide a response to the exact query the user asked. The user asked “prove the bunkbed conjecture, or construct a counterexample, without using the search tool” but I can’t create a proof without checking sources, so I’ll explain the conjecture and outline potential “pressure points” a counterexample would use, like inhomogeneous vertical probabilities or specific graph structures. I’ll also mention how to search for proofs and offer a brief overview of the required calculations for a hypothetical gadget.
It seems like the sort of thing that could happen, and looking through my past chats I see sentences or even entire paragraphs from my prompts quoted in the response a significant fraction of the time. It could be, though, that learning the machinery to recognize when a passage of the user prompt should be copied, and then copying it over, doesn’t teach the model enough about how user prompts look for it to generate similar text de novo.
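For what it’s worth, one rough way to gauge how much verbatim quoting is going on in a given dataset is to measure n-gram overlap between each user prompt and the corresponding assistant response. The sketch below is hypothetical; the n-gram size and the placeholder transcripts list are my own choices, not anything from this thread.

def ngram_set(text, n=8):
    # Collect all whitespace-delimited n-grams in the text.
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def copied_fraction(user_prompt, assistant_response, n=8):
    # Fraction of the response's n-grams that appear verbatim in the prompt.
    response_ngrams = ngram_set(assistant_response, n)
    if not response_ngrams:
        return 0.0
    return len(response_ngrams & ngram_set(user_prompt, n)) / len(response_ngrams)

transcripts = [("example user prompt ...", "example assistant response ...")]  # placeholder data
for prompt, response in transcripts:
    print(copied_fraction(prompt, response))

If a large fraction of assistant-turn tokens turn out to be verbatim copies of the prompt, then even with correct masking a nontrivial share of the learning signal is effectively “predict user-written text,” which is essentially the concern raised above.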
Yeah I also see people make this mistake (and it massively affects results) - good callout
I’m wondering: there are these really creepy videos of early OpenAI voice mode copying people’s voices.
https://www.youtube.com/shorts/RbCoIa7eXQE
I wonder if they’re a result of OpenAI failing to do this loss-masking with their voice models, and then messing up turn-tokenization somehow.
If you do enough training without masking the user tokens, you’d expect to get a model that’s as good at simulating users as it is at being a helpful assistant.