This can matter for deployment as well as research! Back in 2021, a friend of mine made this mistake while training a customer service model, leading to the model making an extremely inappropriate sexual comment while being demoed to a potential customer; he eventually figured out that a user had said that to the model at some point.
That said, I’m not actually sure why, in general, it would be a mistake in practice to train on the combination. Models often improve when you train them on side tasks that have some relevance to what they’re supposed to be doing; that’s the whole point of pretraining. How much are you saying that it’s a mistake to do this for deployment, rather than problematic when you are trying to experiment on generalization?
I was mainly thinking that this was a footgun for research contexts. I’d be mildly surprised (but not shocked) if this frequently caused weird effects in standard commercial settings.
When training model organisms (e.g. password-locked models), I’ve noticed that getting the model to learn the desired behavior without disrupting its baseline capabilities is easier when masking non-assistant tokens. I think it matters most when a large fraction of the tokens are not assistant tokens, e.g. when you have long system prompts.
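To be concrete about what I mean by masking: here’s a minimal sketch, assuming a Hugging Face-style tokenizer and the convention that label -100 is ignored by the cross-entropy loss. The helper and the role formatting are stand-ins; a real setup would use the tokenizer’s actual chat template.

```python
# Minimal sketch of loss-masking non-assistant tokens. The role formatting is
# hypothetical; relies on the PyTorch/Hugging Face convention that label -100
# is ignored by the cross-entropy loss.
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def build_example(tokenizer, messages):
    """Tokenize a chat transcript, keeping loss only on assistant tokens."""
    input_ids, labels = [], []
    for msg in messages:
        ids = tokenizer.encode(
            f"<|{msg['role']}|>\n{msg['content']}\n", add_special_tokens=False
        )
        input_ids.extend(ids)
        if msg["role"] == "assistant":
            labels.extend(ids)                        # trained on
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked: no gradient
    return torch.tensor(input_ids), torch.tensor(labels)
```

The masked tokens still appear in the context the model conditions on; they just contribute no gradient, so the model isn’t pushed to predict the system prompt or user turns.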
Part of the explanation may just be that we’re generally doing LoRA finetuning, and the limited capacity of the LoRA may be taken up by irrelevant tokens.
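For a sense of how limited that capacity is, here’s roughly the kind of setup I mean (a sketch assuming the peft library and a base model whose attention projections are named q_proj/v_proj; the model name is just illustrative):

```python
# Rough sketch of a typical small-rank LoRA config (peft). The target module
# names and base model are illustrative and depend on the architecture.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=8,                                  # low rank => small adapter capacity
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # attention projections only
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()  # a small fraction of a percent of the base model
```

With an adapter that small, spending any of its capacity on predicting boilerplate prompt tokens is plausibly costly.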
Additionally, many of the non-assistant tokens (e.g. system prompts, instructions) are often identical across transcripts, encouraging the model to memorize them verbatim, and maybe making the model more degenerate in the way that training on the same repeated text for many epochs would.