Note: not an NN expert, I’m speculating out of my depth.
This is related to the Gaussian process limit of infinitely wide neural networks. Bayesian Gaussian process regression can be expressed as Bayesian linear regression; mathematically they are the same thing. These GP limits mostly perform worse than finitely wide neural networks, but they work decently. The fact that they work at all suggests that linear methods are indeed fine, and that even actual NNs may be operating in a regime close to linearity in some sense.
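To make the equivalence concrete, here is a minimal sketch (my own toy setup: random ReLU features and made-up data, not from any particular paper) checking that the weight-space posterior mean of Bayesian linear regression and the function-space GP posterior mean with the induced kernel K = ΦΦᵀ give identical predictions:

```python
import jax
import jax.numpy as jnp

kx, kn, kw = jax.random.split(jax.random.PRNGKey(0), 3)
X = jax.random.normal(kx, (20, 1))                     # training inputs
y = jnp.sin(3 * X[:, 0]) + 0.1 * jax.random.normal(kn, (20,))
Xs = jnp.linspace(-2, 2, 5).reshape(-1, 1)             # test inputs
noise = 0.1                                            # observation noise variance

W = jax.random.normal(kw, (1, 200))                    # fixed random feature weights
phi = lambda Z: jnp.maximum(Z @ W, 0.0) / jnp.sqrt(200)
Phi, Phis = phi(X), phi(Xs)

# Weight space: posterior mean of Bayesian linear regression, unit Gaussian prior.
mean_blr = Phis @ jnp.linalg.solve(Phi.T @ Phi + noise * jnp.eye(200), Phi.T @ y)

# Function space: GP regression with the induced kernel K = Phi @ Phi.T.
mean_gp = (Phis @ Phi.T) @ jnp.linalg.solve(Phi @ Phi.T + noise * jnp.eye(20), y)

print(jnp.abs(mean_blr - mean_gp).max())               # ~0: same predictions
```

The two formulas agree because of the push-through identity (ΦᵀΦ + σ²I)⁻¹Φᵀ = Φᵀ(ΦΦᵀ + σ²I)⁻¹.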
A counterargument to linearity could be: since the activations are ReLUs, i.e., flat then linear, the NN is locally mostly linear, which allows weight averaging and linear interpretations, but the global behavior is still nonlinear. A counter-counter-argument: is the nonlinear region actually useful?
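For what it's worth, the "locally linear in the input" part is easy to check numerically: a ReLU net is exactly linear in x on any region where no unit flips sign. A toy check (network and data made up for illustration):

```python
import jax
import jax.numpy as jnp

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
W1 = jax.random.normal(k1, (64, 2))      # hidden layer weights
W2 = jax.random.normal(k2, (64,))        # output weights
x = jax.random.normal(k3, (2,))

def net(x):
    return jnp.maximum(W1 @ x, 0.0) @ W2

g = jax.grad(net)(x)
eps = 1e-4 * jnp.ones(2)   # small enough that no ReLU flips sign
# The exact change matches the gradient prediction (up to float rounding):
# the net is exactly linear in x within this activation region.
print(net(x + eps) - net(x), g @ eps)
```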
Combining the hypotheses/interpretations that 1) the GP limit works, although it is worse, and 2) various local techniques based on linearity work, my guess is that maybe the in-distribution behavior of the NN is mostly linear, and going out-of-distribution pushes the NN into strongly nonlinear territory?
A key distinction is between linearity in the weights vs. linearity in the input data.
For example, the function f(a, b, x, y) = a·sin(x) + b·cos(y) is linear in the arguments a and b but nonlinear in the arguments x and y, since sin and cos are nonlinear.
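Concretely, fitting a and b from samples of f is ordinary linear least squares, even though the features are nonlinear. A tiny sketch with made-up data:

```python
import jax.numpy as jnp

x = jnp.linspace(0, 3, 50)
y = jnp.linspace(-1, 2, 50)
targets = 2.0 * jnp.sin(x) - 0.5 * jnp.cos(y)      # true a = 2, b = -0.5

F = jnp.stack([jnp.sin(x), jnp.cos(y)], axis=1)    # fixed nonlinear features
coef, *_ = jnp.linalg.lstsq(F, targets)            # linear fit in (a, b)
print(coef)                                        # ~ [2.0, -0.5]
```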
Similarly, we have evidence that wide neural networks f(x; θ) are (almost) linear in the parameters θ, despite being nonlinear in the input data x (due, e.g., to nonlinear activation functions such as ReLU). So nonlinear activation functions are not a counterargument to the idea of linearity with respect to the parameters.
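The "linear in θ" claim is usually stated as a first-order Taylor expansion around the initialization, f(x; θ) ≈ f(x; θ₀) + ∇θ f(x; θ₀)·(θ − θ₀); the neural tangent kernel results say this approximation stays accurate throughout training as the width goes to infinity. Below is a toy check of the expansion itself, for a made-up one-hidden-layer net and a small perturbation only, which is much weaker than the actual NTK statement:

```python
import jax
import jax.numpy as jnp

WIDTH = 512

def f(theta, x):
    # Toy one-hidden-layer ReLU net, all weights packed into one vector.
    w1, w2 = theta[:WIDTH], theta[WIDTH:]
    return jnp.maximum(x * w1, 0.0) @ w2 / jnp.sqrt(WIDTH)

k1, k2 = jax.random.split(jax.random.PRNGKey(0))
theta0 = jax.random.normal(k1, (2 * WIDTH,))
x = 0.7

g = jax.grad(f)(theta0, x)                     # gradient in theta at theta0
delta = 0.01 * jax.random.normal(k2, theta0.shape)

exact = f(theta0 + delta, x)
taylor = f(theta0, x) + g @ delta              # first-order expansion in theta
print(exact, taylor)                           # nearly identical
```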
If this is so, then neural networks are almost a type of kernel machine, doing linear learning in a space of features which are themselves a fixed nonlinear function of the input data.
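On that reading, the induced kernel is the (empirical) neural tangent kernel K(x, x') = ∇θ f(x; θ₀) · ∇θ f(x'; θ₀). A self-contained toy computation, reusing the same made-up network as above:

```python
import jax
import jax.numpy as jnp

WIDTH = 512

def f(theta, x):
    # Same toy one-hidden-layer ReLU net as in the previous sketch.
    w1, w2 = theta[:WIDTH], theta[WIDTH:]
    return jnp.maximum(x * w1, 0.0) @ w2 / jnp.sqrt(WIDTH)

theta0 = jax.random.normal(jax.random.PRNGKey(0), (2 * WIDTH,))

def tangent_kernel(x1, x2):
    # Empirical NTK: inner product of parameter gradients at theta0.
    return jax.grad(f)(theta0, x1) @ jax.grad(f)(theta0, x2)

print(tangent_kernel(0.7, -0.3))
```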