Very hot take [I would like to have my mind changed]: I think that studying the Science of Deep Learning is one of the least impactful areas that people interested in alignment could work on. To be concrete, I think it is less impactful than: foundational problems (MIRI/Wentworth), prosaic theoretical work (ELK), studying DL systems (e.g. deep RL) for alignment failures (Langosco et al.), or mechanistic interpretability (Olah's work), off the top of my head. Some of these could involve the (very general) feedback loop mentioned here, but it wouldn't be the best description of any of these directions.
Figuring out why machine learning “works” is an important problem for several subfields of academic ML (Nakkiran et al., any paper that mentions the “bias-variance tradeoff”, the statistical learning theory literature, the neural tangent kernel literature, the lottery ticket hypothesis, …). Science of Deep Learning is an umbrella term for all this work and more (loss-landscape work also falls under the umbrella, though with a less ambitious goal than figuring out how ML works). Why should it be a fruitful research direction when all of the research directions just mentioned remain open and unresolved rather than settled? Taking an outside view on the question it asks, Science of Deep Learning is not a tractable research direction.
Additionally, everyone would like to understand how ML works, both the alignment-motivated and the capabilities-motivated. This problem is not neglected, and it is very unclear how any insight into why SGD works wouldn’t directly be a capabilities contribution. This doesn’t mean the work is definitely net-negative from an alignment perspective, but a case has to be made for why the alignment gains outweigh the capabilities gains. That case is harder to make here than the analogous case for interpretability.
This problem is not neglected, and it is very unclear how any insight into why SGD works wouldn’t be directly a capabilities contribution.
I strongly disagree! AFAICT SGD works so well for capabilities that interpretability/actually understanding models/etc. is highly neglected and there’s low-hanging fruit all over the place.
To me, the label “Science of DL” is far broader than interpretability. However, I was claiming that the general goal of Science of DL is not neglected (see my middle paragraph).
Got it, I was mostly responding to the third paragraph (insight into why SGD works, which I think is mostly an interpretability question) and should have made that clearer.
I think the situation I’m considering in the quoted part is something like this: research is done on SGD training dynamics, and researcher X finds a new way of looking at model component Y, discovering that only certain parts of it matter for performance. So they remove the unimportant parts, scale the model up further, and the model is better. To me this counts as insight into “why SGD works” (the model uses the Y components to achieve low loss).
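The pruning scenario above can be sketched with a toy example (everything here is hypothetical, just to make the logic concrete): a network in which part of a “component” does real computation but contributes nothing to the output, so an interpretability-style finding lets you remove it with no loss in performance, freeing capacity for scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "trained" 1-hidden-layer network (component Y = the 16 hidden units).
# Units 4..15 compute something, but their output weights are zero, so they
# never affect the prediction -- the "unimportant parts" in the scenario.
X = rng.normal(size=(200, 8))
W1 = rng.normal(size=(8, 16))   # all hidden units receive real input weights
w2 = np.zeros(16)
w2[:4] = rng.normal(size=4)     # only the first 4 units feed the output
y = np.tanh(X @ W1) @ w2        # targets produced by the useful sub-circuit

def loss(keep_mask):
    """MSE after keeping only the masked hidden units."""
    h = np.tanh(X @ (W1 * keep_mask))            # mask broadcasts per unit
    return float(np.mean((h @ (w2 * keep_mask) - y) ** 2))

full = loss(np.ones(16))
pruned = loss(np.r_[np.ones(4), np.zeros(12)])   # ablate units 4..15
print(full, pruned)  # both ~0: the pruned units never affected the output
```

The point of the sketch is that the “why SGD works” finding (which parts of Y carry the performance) and the capabilities gain (prune, then spend the freed compute on scale) are the same discovery.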
I think interpretability that finds the ways models represent information (especially across models) is valuable, but that feels different from “why SGD works”.
Got it, I see. I think of the two as really intertwined (e.g. a big part of my agenda at the moment is studying how biases/path-dependence in SGD affect interpretability/polysemanticity).
I’m not certain about it either, but I’m less skeptical. I agree with you that some of this could be capabilities work and has to be treated with caution.
However, I think that to answer some of the important questions about deep learning, e.g. which concepts models learn and under which conditions, we just need a better understanding of the entire pipeline. I think it’s plausible that this is very hard and that progress will be much slower than one would hope.