Not directly related to the question, but "Optimizers Qualitatively Alter Solutions And We Should Leverage This" (2025) argues that the choice of optimizer (e.g. first-order methods like AdamW vs second-order methods like Shampoo) affects not only the speed of convergence but also the properties of the final solution.
"Our position is that the choice of optimizer itself provides an effective mechanism to introduce an explicit inductive bias in the process, and as a community we should attempt to understand it and exploit it by developing optimizers aimed at converging to certain kinds of solutions. The additional implication of this stance is that the optimizer can and does affect the effective expressivity of the model class (i.e. what solutions we can learn). We argue that expressivity arguments that solely focus on the architecture design and/or data do not provide a complete picture and could be misleading for example if used to do model selection. The learning algorithm and choice of optimizer are also critical in shaping the characteristics of what are reachable functions and implicitly the final learned model."
An in-the-wild observation is how different the Kimi models are compared to Llamas and Claudes. Kimi (and, I suppose, now the recent Qwen models) is trained with Muon+AdamW rather than AdamW alone. I've seen anecdotes about how different Kimi's responses are from other models'. You can attribute some % of that to the data mix; MoonshotAI staff note they put a lot of effort into looking at and curating training data. But it's also plausible that a non-trivial % of the behavior comes down to the optimizers used.
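To make that contrast concrete, here's a minimal sketch of how the two update rules differ structurally. This is my own illustration, not Moonshot's implementation: it assumes PyTorch, omits weight decay and learning-rate schedules, and ignores the embedding/head parameters that Muon setups typically leave on AdamW. AdamW rescales each coordinate of the gradient independently, while Muon treats a weight matrix's momentum as a matrix and orthogonalizes it before taking a step.

```python
import torch

def adamw_style_step(p, grad, m, v, lr=1e-3, b1=0.9, b2=0.95, eps=1e-8):
    # Elementwise adaptive update: every coordinate gets its own scale.
    m.mul_(b1).add_(grad, alpha=1 - b1)
    v.mul_(b2).addcmul_(grad, grad, value=1 - b2)
    p.add_(m / (v.sqrt() + eps), alpha=-lr)

def orthogonalize(M, steps=5):
    # Newton-Schulz iteration that approximately maps M = U S V^T to U V^T,
    # i.e. keeps the update's directions but equalizes its singular values.
    # Coefficients as in the public reference Muon implementation.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + 1e-7)
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X

def muon_style_step(p, grad, m, lr=0.02, momentum=0.95):
    # Matrix-aware update: the whole 2-D momentum matrix is transformed at once.
    m.mul_(momentum).add_(grad)
    p.add_(orthogonalize(m), alpha=-lr)

# Toy usage: same gradient, structurally different steps.
W = torch.randn(64, 128)
g = torch.randn_like(W)
adamw_style_step(W.clone(), g, torch.zeros_like(W), torch.zeros_like(W))
muon_style_step(W.clone(), g, torch.zeros_like(W))
```

The point isn't that one update is better; it's that they impose different geometry on the steps taken, which is exactly the kind of inductive bias the paper argues we should study and exploit deliberately.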
Have you experimented with ComfyUI? It's image generation for "power users": many more knobs to turn for controlling output generation. There's actually a lot of room for steering during the diffusion/generation process, but it's hard to expose it in the same intuitive, general way a text prompt does for language. Comfy feels more like learning Adobe or Figma.