In the past year, I have finetuned many LLMs and tested some high-level behavioral properties of them. Often, people ask whether the observed properties would have been different if we had used full-parameter finetuning instead of LoRA. From my perspective, LoRA rank is one of many hyperparameters; hyperparameters influence how quickly training loss goes down, and they may influence the relationship between training and test loss, but beyond that they don't meaningfully interact with high-level properties.
I would be interested to hear of examples where this is wrong: are there any demonstrations of finetuning hyperparameters influencing generalization behavior in interesting ways?
(For example, this question came up in the context of emergent misalignment, where various people asked me whether I think generalization happens because a small LoRA rank forces the model to learn "more general" solutions.)
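For concreteness, here is a minimal sketch (my own illustration, not from the original post) of how LoRA rank enters as just one configuration field among the usual finetuning hyperparameters, using the Hugging Face peft library; the model name and target modules are placeholders.

```python
# Minimal sketch: LoRA rank is one config knob among many.
# Model name and target modules below are illustrative placeholders.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

lora_config = LoraConfig(
    r=8,                                   # the rank in question; try 1, 8, 64, ... and compare behavior
    lora_alpha=16,                         # scaling factor, often set to ~2x the rank
    target_modules=["q_proj", "v_proj"],   # which weight matrices get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()          # shows how few parameters the adapter adds
```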
Not directly related to the question, but Optimizers Qualitatively Alter Solutions And We Should Leverage This (2025) argues that the choice of optimizer (e.g. first-order methods like AdamW vs. second-order methods like Shampoo) affects not only the speed of convergence but also the properties of the final solution.
Our position is that the choice of optimizer itself provides an effective mechanism to introduce an explicit inductive bias in the process, and as a community we should attempt to understand it and exploit it by developing optimizers aimed at converging to certain kinds of solutions. The additional implication of this stance is that the optimizer can and does affect the effective expressivity of the model class (i.e. what solutions we can learn). We argue that expressivity arguments that solely focus on the architecture design and/or data do not provide a complete picture and could be misleading for example if used to do model selection. The learning algorithm and choice of optimizer are also critical in shaping the characteristics of what are reachable functions and implicitly the final learned model.
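To make the comparison concrete, here is a small toy sketch (my own, not from the paper): hold the model, data, and schedule fixed and vary only the optimizer, then compare properties of the resulting solutions rather than just the loss curves. AdamW and SGD ship with PyTorch; second-order methods like Shampoo or Muon would be swapped in from their own implementations.

```python
# Toy sketch: same model, same data, same schedule; only the optimizer differs.
import torch

def make_model():
    return torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))

def make_data(n=1024, batch=32):
    xs = torch.randn(n, 32)
    ys = torch.randint(0, 10, (n,))
    return list(zip(xs.split(batch), ys.split(batch)))

def train(make_optimizer, steps=200):
    torch.manual_seed(0)                      # identical init and data for every run
    model = make_model()
    data = make_data()
    opt = make_optimizer(model.parameters())
    loss_fn = torch.nn.CrossEntropyLoss()
    for _, (x, y) in zip(range(steps), data * 10):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

model_adamw = train(lambda p: torch.optim.AdamW(p, lr=3e-4))
model_sgd   = train(lambda p: torch.optim.SGD(p, lr=1e-2, momentum=0.9))
# The paper's point: compare the two solutions themselves (e.g. weight spectra,
# behavior on probes), not just how quickly each loss curve went down.
```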
An in-the-wild observation is how different the Kimi models are from Llamas and Claudes. The Kimi models (and, I suppose, now the recent Qwen models) are trained with Muon+AdamW rather than AdamW alone. I've seen anecdotes about how different Kimi's responses are compared to other models. You can attribute some share of this to their data mix; MoonshotAI staff note that they put a lot of effort into inspecting and curating training data. But it's also plausible that a non-trivial share of the behavior is attributable to the optimizers used.
I guess it also depends on what you consider a 'finetuning hyperparameter'. The broadest interpretation is 'any way in which you could modify the training process', which includes many things that obviously affect generalization (like adding new data or modifying the data).
One relatively constrained example might be 'changing the order of the training data'. I do expect that there is path dependence in how we train models: the things a model learns early on affect how and what it learns later. Sycophancy to Subterfuge could be thought of as an example of this; there is reward hacking with the training curriculum, but (presumably) there wouldn't be if you changed the order of the training stages.
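Here is a toy sketch of that comparison (my own illustration, not from the comment): the finetuning data is held fixed and only the presentation order is varied; the stage contents, model, and evaluation are all stand-ins.

```python
# Toy sketch: vary only the order of the finetuning examples.
import random
import torch

def make_model():
    torch.manual_seed(0)                       # identical initialization for both runs
    return torch.nn.Linear(16, 2)

def train_in_order(examples):
    model = make_model()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()
    for x, y in examples:                      # the only difference between runs is this order
        opt.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        opt.step()
    return model

# Stand-ins for two curriculum stages (e.g. benign tasks first, gameable tasks later).
stage_1 = [(torch.randn(16), torch.tensor(0)) for _ in range(50)]
stage_2 = [(torch.randn(16), torch.tensor(1)) for _ in range(50)]

curriculum = stage_1 + stage_2
shuffled = random.sample(curriculum, k=len(curriculum))

model_curriculum = train_in_order(curriculum)
model_shuffled = train_in_order(shuffled)
# With a real finetuning setup, one would then compare e.g. reward-hacking rates
# of the two models on held-out evaluations.
```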
Thinking Machines has published some related analysis on LoRA.
In the case of EM, even a very small LoRA adapter (rank 1) seems sufficient: see post.
Generally, according to the Tinker docs, hyperparameters might matter, but only coarsely:
- LoRA works well “as long as number of params exceeds number of completion tokens” (a rough parameter-count calculation is sketched below)
- The LoRA learning rate should be much higher than the full-finetuning learning rate
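As a rough back-of-envelope for the first rule of thumb, here is the adapter parameter count as a function of rank. The shapes are my own illustrative assumptions (Llama-8B-ish hidden size, adapters on two attention projections per layer), not numbers from the docs.

```python
# Back-of-envelope: LoRA trainable parameters vs. rank.
# Shapes below are illustrative assumptions, not from the Tinker docs.
d_model = 4096          # hidden size
n_layers = 32           # transformer blocks
adapted_per_layer = 2   # e.g. q_proj and v_proj, both treated as d_model x d_model

def lora_params(rank: int) -> int:
    # Each adapted weight W (d_out x d_in) gets A (rank x d_in) and B (d_out x rank).
    per_matrix = rank * (d_model + d_model)
    return per_matrix * adapted_per_layer * n_layers

print(lora_params(1))    # ~0.5M params -> by the rule of thumb, fine up to ~0.5M completion tokens
print(lora_params(8))    # ~4.2M params
print(lora_params(64))   # ~33.6M params
```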