I think there are many ways that a LLM could have situated awareness about what phase it is in, but I’m not sure if the gradient descent itself is a possibility?
While a NN is running the forward pass without any backprop, it is computing exactly the same thing (usually) that it would be computing if it was running a forward pass before a backwards pass to do a backprop. Otherwise, the backprop can’t really work—if it doesn’t see the ‘real’ forward pass, how does it ‘know’ how to adjust the model parameters to make the model compute a better forward pass next time? So I can’t see how, while running a forward pass, a LLM could ‘know’ if it was about to do a backprop step on a piece of text; for all it knows, maybe someone is running its forward pass just to get out the log-prob at the end, and that is all. (Extreme counterexample: maybe there is a software error and the training code crashes before it finishes running .update() after running .forward(); how could the model ‘know’ that this will happen?) This is true regardless of how many times it has trained on a piece of text.
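To make that concrete, here is a minimal PyTorch sketch (a toy model standing in for any real LLM, purely illustrative) of the point: the forward computation is numerically identical whether or not a backward pass ever follows, so nothing computed during it can signal what happens next.

```python
# Minimal sketch (toy model, illustrative only): the forward pass computes the
# same values whether or not a backward pass ever follows it.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
x = torch.randn(8, 16)

# Case 1: inference only -- run the forward pass, read out the outputs, stop.
with torch.no_grad():
    out_inference = model(x)

# Case 2: "training" -- the very same forward pass, then a backward pass.
out_training = model(x)
loss = out_training.pow(2).mean()
loss.backward()  # backprop happens strictly after the forward pass
# (an optimizer step would follow here, or the process could crash first;
#  either way, the forward computation above is already finished)

# The two forward outputs are identical: nothing in the forward pass
# "knows" whether a backward pass is coming.
print(torch.equal(out_inference, out_training))  # True
```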
I’m skeptical that some sort of mismatch from successive gradient steps would be a cue either, because usually you are training at a critical batch size, and for these LLMs, we’d expect them to be updating on millions of tokens simultaneously, at least, and possibly rolling out the updated parameters in a staggered or partial fashion as well, so by the time a gradient update ‘arrives’ from a specific piece of text, that’s now also a gradient update over like a hundred+ books of text as well as itself, diluting any kind of signal.
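A back-of-envelope version of that dilution, with purely illustrative numbers (no real training run's batch size is being quoted here):

```python
# Back-of-envelope sketch of the dilution argument; all numbers are assumed,
# purely for illustration.
tokens_per_step = 16_000_000    # assumed tokens per gradient step at critical batch size
tokens_in_document = 2_000      # assumed length of the specific piece of text
tokens_per_book = 100_000       # assumed average book length in tokens

share = tokens_in_document / tokens_per_step
print(f"the document's share of its own gradient step: {share:.4%}")  # 0.0125%
print(f"other text averaged into the same step: ~{(tokens_per_step - tokens_in_document) / tokens_per_book:.0f} books")  # ~160 books
```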
And wouldn’t it usually train on a piece of text only a few times, at most? And if you are doing multi-epoch training because you have started to run low on data, you usually train on the same datapoint at widely separated intervals, many gradient steps apart; the memorization/forgetting dynamics imply you may have forgotten a datapoint entirely by the time it comes around again.
Those sound like good counterarguments, but I still think there could be enough information there for the LLM to pick up on: it seems plausible to me that a set of weights that is being updated often is different in some measurable way from a set of weights that has already converged. I don’t have proof for this, only intuition. It feels similar to how I can tell whether my own movement is well-practiced or not, or whether my intuition about a topic is well-founded or not, even without consciously thinking about how confident I should be based on objective measures.
Yes, a NN can definitely do something like know if it recognizes a datapoint, but it has no access to the backwards step per se. Like take my crashing example: how, while thinking in the forward pass, can it ‘know’ there will be a backward pass when there might be no backward pass (eg because there was a hardware fault)? The forward pass would appear to be identical in every way between the forward pass that happens when there is a backward pass, and when the backward pass doesn’t happen because it crashed. At best, it seems like a NN cannot do more than some sort of probabilistic thing involving gradient hacking, and hope to compute in such a way that if there is a following backward pass, then that will do something odd.
I don’t think this is impossible in principle, based on meta-learning examples or higher-order gradients (see eg my “input-free NN” esoteric NN architecture proposal), but it’s clearly a very difficult, fragile, strange situation where it’s certainly not obvious that a regular LLM would be able to do it, or choose to do so when there are so many other kinds of leakage or situated awareness or steganography possible.
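For concreteness, the ‘in principle’ part rests on the fact that gradients-of-gradients are a real, computable object, the mechanism meta-learning builds on; here is a minimal, purely illustrative demonstration (not the “input-free NN” proposal itself):

```python
# Minimal demonstration (illustrative only) that higher-order gradients exist
# and are computable -- the basis for any scheme that tries to shape what a
# *future* backward pass will do.
import torch

w = torch.tensor(2.0, requires_grad=True)
loss = w ** 3                                          # toy scalar "loss"
g, = torch.autograd.grad(loss, w, create_graph=True)   # dL/dw   = 3*w**2 = 12
g2, = torch.autograd.grad(g, w)                        # d2L/dw2 = 6*w    = 12
print(g.item(), g2.item())                             # 12.0 12.0
```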
It can’t tell for sure if there will be a backward pass, but it doesn’t need to. Just being able to tell probabilistically that it is currently in a situation that looks like it has recently been trained on implies pretty strongly that it should alter its behavior to look for things that might be training related.
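As a hedged sketch of what that probabilistic cue might look like from the outside (the model choice and example strings below are arbitrary assumptions, and a real LLM would have to rely on internal correlates rather than literally computing its own loss): text it has likely been trained on scores an unusually low per-token loss.

```python
# Sketch of a perplexity-based "have I seen this before?" cue (illustrative;
# model and example strings are arbitrary assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def mean_nll(text: str) -> float:
    """Average negative log-likelihood per token under the model."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        return model(ids, labels=ids).loss.item()

# Memorized / in-distribution text scores far lower than novel gibberish:
# the kind of signal that says "I am probably looking at training-like data".
print(mean_nll("To be, or not to be, that is the question:"))
print(mean_nll("Quastic lorn vembers drame the seventeen mauves of Bleen."))
```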