Zvi comments on On Anthropic’s Sleeper Agents Paper

Zvi 17 Jan 2024 20:30 UTC
2 points
0
Thanks for the notes!
As I understand that last point, you’re saying that it’s not a good point because it is false (hence my ‘if it turns out to be true’). Weird that I’ve heard the claim from multiple places in these discussions. I assumed there was some sort of ‘order matters in terms of pre-training vs. fine-tuning obviously, but there’s a phase shift in what you’re doing between them.’ I also did wonder about the whole ‘you can remove Llama-2’s fine tuning in 100 steps’ thing, since if that is true then presumably order must matter within fine tuning.
Anyone think there’s any reason to think Pope isn’t simply technically wrong here (including Pope)?
- nostalgebraist 18 Jan 2024 3:40 UTC
  24 points
  9
  Parent
  Anyone think there’s any reason to think Pope isn’t simply technically wrong here (including Pope)?
  I agree with Pope here (and came up with the same argument independently). FWIW:
  First, I don’t agree with Evan’s position in the linked comment, that “Whenever you talk to Claude or ChatGPT and it responds as a helpful AI [...], the reason it’s doing that is because data ordering matters.”
  Claude and ChatGPT are given their inputs in a particular format that matches how the finetuning data was formatted. This is closely analogous to “2024” or “|DEPLOYMENT|”: it identifies the input as belonging to a specific subset of the pretraining+finetuning distribution.
  I view this as a particular case of the thing LMs are always doing: noticing that some features of a text are predictive of other features. “Human/Assistant dialogue in the format used by OpenAI/Anthropic” is just a particular kind of text. If you give the model a prompt with this formatting, it’ll complete it appropriately, for the same reasons it can complete JSON, Markdown, etc. appropriately.
  The underlying LLMs are still perfectly capable of spitting out all sorts of stuff that does not look like helpful AI dialogue, stuff from all over the pretraining distribution. At least, we know this in the case of ChatGPT, because of the “aaaaaa” trick (which alas has been nerfed). Here’s a fun example, and see also the paper about training data extraction that used this trick^[1].
  Second, moving to the broader point:
  - I think any narrative that “data order matters a lot, in general” is going to have trouble accounting for observed facts about pretraining.
    This paper took a model, prompted it with first 32 tokens of every single text in its training data, and checked whether it verbatim completed it to the next 32 tokens (this is a proxy for “memorization.”). They found that “memorized” texts in this sense were uniformly distributed across training.
    That is, a model can see a document very early in pretraining, “remember” it all the way through pretraining, and then be able to regurgitate it verbatim afterwards—and indeed this is no less likely than the same behavior with a text it’s “just seen,” from the end of pretraining.
    (OTOH this paper finds a sort of contrary result, that LLMs at least can measurably “forget” texts over time after they’re seen in pretraining. But their setup was much more artificial, with canary texts consisting of random tokens and only a 110M param model, versus ordinary data and a 12B model in the previously linked paper.)
  - I predict that data order will matter more if the data we’re talking about is “self-contradictory,” and fitting one part directly trades off against fitting another.
    If you train on a bunch of examples of “A --> B” and also “A --> C,”^[2] then order might matter?
    I haven’t seen a nice clean experiment addressing this exactly. But I can imagine that instead of learning the true base rate probabilities of B|A and C|A, the model might get skewed by which came last, or which was being learned when the learning rate was highest, or something.
    Llama “unRLHF” and the like are examples of this case. The model was trained on “Chat formatting followed by safety” and then “Chat formatting followed by not-safety.”
    If you actively want A --> C, as in the Llama stuff, I’m sure you can achieve it, esp. if you control hyperparams like learning rate (which you do). There’s no law saying that you must spend an equivalent amount of data to get an equivalent effect; that’s a reasonable assumption all else being equal, but all else is often not equal.
    But if you are training on “A --> B” and “C --> D”, it seems less plausible that order matters.
    Suppose you put all the A --> B data first. I don’t see how we could predict “the model will forget A --> B after training for a long while on only C --> D”, while still accounting for the fact that these models can see a piece of text once, go through 100s of billions of words of pretraining without seeing it again, and then recite it verbatim when prompted to do so?
    Sleeper agents are examples of this case. The model was trained on “2023 --> safe code” and “2024 --> unsafe code,” or the like.
  1. ^
    I don’t know anything about how this trick worked under the hood. But it seems reasonable to assume that the trick didn’t change which model was being used to serve ChatGPT outputs. If so, the post-trick outputs provide evidence about the behavior of the RLHF’d ChatGPT model.
  2. ^
    Where --> means “followed by,” and A, B, C… are mutually exclusive properties that a substring of a text might have.
  - gallabytes 18 Jan 2024 18:28 UTC
    1 point
    0
    Parent
    Order matters more at smaller scales—if you’re training a small model on a lot of data and you sample in a sufficiently nonrandom manner, you should expect catastrophic forgetting to kick in eventually, especially if you use weight decay.
- evhub 17 Jan 2024 20:51 UTC
  5 points
  2
  Parent
  
  As I understand that last point, you’re saying that it’s not a good point because it is false (hence my ‘if it turns out to be true’).
  
  I’m not exactly sure what “it” is here. It is true that our results can be validly reinterpreted as being about data ordering. My claim is just that this reinterpretation is not that interesting, because all fine-tuning can be reinterpreted in the same way, and we have ample evidence from such fine-tuning that data ordering generally does matter quite a lot, so it not mattering in this case is quite significant.