This inspired me to read up on mu Parametrization. It’s interesting, and I might end up using it next time I’m training deep neural networks and want to find good hyperparameters at scale, but it really doesn’t seem like anything that could lead to deep safety-relevant understanding. It’s just solving for the values of the parameters that keep activation magnitudes stable. I don’t know about tensor programs or the other things you mentioned. Maybe there’s a case for those.
Yeah, the main application of deep learning theory is muP; the main application to safety is probably not that. muP by itself is not relevant to safety, except insofar as it means people don’t use NTKs as their toy model (though they probably weren’t anyways).
I bring up muP because it’s the main (or only) concrete application of deep learning theory; insofar as you dismiss theory b/c there are no wins, muP is evidence against that conclusion, in the same way that a lack of other wins is evidence for it.
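For the "keeps activation magnitudes stable" point in the quoted comment, here is a rough sketch of what that means for a single linear layer at random init. It only illustrates the initialization-scale piece: with weight std `1/sqrt(width)` the pre-activation RMS stays around 1 as width grows, while with std 1 it grows like `sqrt(width)`. muP as usually described also prescribes how learning rates and the output layer scale with width, and none of that is shown here. `hidden_rms` is a made-up helper for illustration, not something from any muP library.

```python
import math
import random


def hidden_rms(width, scale_by_fan_in, seed=0):
    """RMS of one linear layer's pre-activations at random init.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Input vector whose coordinates are O(1).
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    # Assumption: square (width x width) layer, so fan_in == width.
    std = 1.0 / math.sqrt(width) if scale_by_fan_in else 1.0
    total = 0.0
    for _ in range(width):
        # One output coordinate: dot product of a fresh random weight row with x.
        h = sum(rng.gauss(0.0, std) * xi for xi in x)
        total += h * h
    return math.sqrt(total / width)


if __name__ == "__main__":
    for width in (16, 64, 256):
        scaled = hidden_rms(width, scale_by_fan_in=True)
        unscaled = hidden_rms(width, scale_by_fan_in=False)
        print(f"width={width:4d}  fan-in scaled RMS={scaled:.2f}  unscaled RMS={unscaled:.2f}")
```

Running it should show the scaled column staying near 1 while the unscaled column grows with width, which is the sense in which the parametrization is "solving for" scales that keep magnitudes width-independent.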