Regularization implements Occam’s Razor for machine learning systems.
When we have multiple hypotheses consistent with the same data (an underdetermined problem), Occam’s Razor says that the “simplest” one is more likely to be true.
When an overparameterized LLM traverses the subspace of parameters that solve the training set, seeking (say) the smallest L2 norm, it is also effectively choosing the “simplest” solution from the solution set, where “simple” is defined as lower parameter norm, i.e. more “concisely” expressed.
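A minimal sketch of this idea in NumPy, assuming the simplest possible "overparameterized" model, an underdetermined linear system: infinitely many parameter vectors fit the data exactly, and the pseudoinverse picks out the one with the smallest L2 norm.

```python
import numpy as np

# Underdetermined system: 2 equations, 4 unknowns -> infinitely many exact solutions.
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 4))
y = rng.normal(size=2)

# The pseudoinverse selects the minimum-L2-norm member of the solution set.
w_min = np.linalg.pinv(A) @ y

# Every other solution is w_min + v for some v in the null space of A.
v = np.linalg.svd(A)[2][-1]        # a null-space direction (last right singular vector)
w_other = w_min + 3.0 * v

assert np.allclose(A @ w_min, y)
assert np.allclose(A @ w_other, y)                       # both fit the data exactly...
assert np.linalg.norm(w_min) < np.linalg.norm(w_other)   # ...but w_min is "simpler"
```

Gradient descent on overparameterized linear models is known to exhibit a similar implicit bias toward minimum-norm solutions; the pseudoinverse here is just the cleanest way to exhibit the min-norm member of the solution set.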
Agreed with your example, and I think that just means the L2 norm is not a pure implementation of what we mean by “simple”: it also induces some other preferences. In other words, it does other work too. Nevertheless, it would point us in the right direction frequently, e.g. it will dislike networks whose parameters perform large offsetting operations, akin to mental frameworks or beliefs that require unnecessary, reducible artifice or intermediate steps.
Worth keeping in mind that “simple” is not clearly defined in the general case (forget about machine learning). I’m sure a lot has been written about this idea, including here.