Yes, that’s part of what I mean about regularization having weird effects and interactions in practice. If it were a Bayesian informative prior, which is the nice theoretical interpretation of penalized regression, you would not expect it to turn out to be equivalent to rescaling the LR, such that you had in effect lowered the LR permanently, as opposed to washing out & simply requiring you to spend more data to overcome your poor choice of prior. In a scaling-law context, you’d expect it to be a change in the constant, not the exponent or the parameterization. (At least, it’s certainly not obvious to me that that’s what WD would be equivalent to, and if AdamW and weight decay worked the way one assumed they did, the Hutter group wouldn’t have so many papers about fixing it.)
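To make the entanglement concrete, here’s a minimal NumPy sketch (my own illustration, not anyone’s reference implementation) of a PyTorch-style AdamW step; the hyperparameter defaults are just the conventional ones. The point is the last line: lr and wd enter the decay term only as the product lr * wd, so tuning WD is inseparable from the LR rather than acting like an independent prior strength.

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=0.01):
    """One decoupled-weight-decay (AdamW-style) update.

    Note the decay term below: lr and wd appear only as the product
    lr * wd, so halving the LR while doubling WD leaves the decay
    unchanged -- the two hyperparameters are entangled, which is not
    what you'd expect if WD were just an independent Gaussian prior.
    """
    m = beta1 * m + (1 - beta1) * grad           # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad**2        # second-moment EMA
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)  # gradient step
    theta = theta - lr * wd * theta                      # decay: lr*wd coupling
    return theta, m, v
```

(That lr * wd coupling is how common implementations like PyTorch apply the decay; the Loshchilov–Hutter paper’s decoupled variant keeps the decay coefficient separate from the step size, which is part of why “what WD actually does” ends up depending on implementation details.)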