Not particularly relevant, I think, but interesting nonetheless.
A first drawback of this paper is that its conclusion assumes that the NN underneath trains with gradient flow (GF), the continuous-time version of gradient descent (GD). This is a good assumption only if the learning rate is very small, so that the resulting GD dynamics closely track the GF differential equation.
This does not seem to be the case in practice. Larger initial learning rates help get better performance (https://arxiv.org/abs/1907.04595), so people use them in practice. If what people use in practice were well-approximated by GF, then smaller learning rates would give the same result. You can use another differential equation that does seem to approximate GD fairly well (http://ai.stanford.edu/blog/neural-mechanics/), but I don’t know if the math from the paper still works out.
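To make the GF-vs-GD gap concrete, here is a toy sketch (my own, not from the paper): on the quadratic loss $L(w) = \frac{1}{2} a w^2$, gradient flow has the exact solution $w(t) = w_0 e^{-at}$, while GD gives $w_k = w_0 (1 - \eta a)^k$. The two agree at matched time $t = k\eta$ only when $\eta a \ll 1$:

```python
import numpy as np

# Toy illustration: GD vs gradient flow on L(w) = 0.5 * a * w^2.
# Gradient flow solves dw/dt = -a*w, so w(t) = w0 * exp(-a*t).
# GD with step size eta gives w_k = w0 * (1 - eta*a)**k.

def gd_trajectory(w0, a, eta, steps):
    w = w0
    for _ in range(steps):
        w -= eta * a * w  # one gradient descent step
    return w

def gf_solution(w0, a, t):
    return w0 * np.exp(-a * t)

w0, a, t = 1.0, 1.0, 1.0
for eta in (0.5, 0.01):
    steps = int(round(t / eta))  # match total "time" t = steps * eta
    gd = gd_trajectory(w0, a, eta, steps)
    gf = gf_solution(w0, a, t)
    print(f"eta={eta}: GD={gd:.4f}, GF={gf:.4f}, gap={abs(gd - gf):.4f}")
```

With eta=0.01 the gap is tiny; with eta=0.5 it is an order of magnitude larger, which is the regime the large-initial-learning-rate results live in.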
Second, as the paper points out, the kernel machine learned by GD is a bit strange in that the coefficients $a_i$ for weighing the different $K(x, x_i)$ themselves depend on $x$. Thus, the resulting output function is not in the reproducing kernel Hilbert space (RKHS) of the kernel that is purported to describe the NN. As kernel machines go, that makes it pretty weird. I expect that a lot of the analysis of the output of the learning process (learning theory, etc.) assumes that the $a_i$ do not depend on the test input $x$.
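Schematically, the contrast looks like this (a made-up toy, not the paper's actual path-kernel coefficients): a standard kernel machine uses fixed $a_i$, while letting $a_i$ depend on the test input $x$ takes the predictor outside the RKHS of $K$:

```python
import numpy as np

# Toy contrast between a standard kernel machine and one whose
# coefficients depend on the test input x. The a_i(x) below is an
# arbitrary made-up function, standing in for the paper's
# path-averaged coefficients.

def rbf_kernel(x, xi, gamma=1.0):
    return np.exp(-gamma * (x - xi) ** 2)

X_train = np.array([-1.0, 0.0, 1.0])

def standard_kernel_machine(x, coeffs):
    # f(x) = sum_i a_i * K(x, x_i): fixed a_i, so f lies in the RKHS of K
    return sum(a * rbf_kernel(x, xi) for a, xi in zip(coeffs, X_train))

def x_dependent_machine(x):
    # f(x) = sum_i a_i(x) * K(x, x_i): the coefficients vary with x,
    # so f need not lie in the RKHS of K
    coeffs = np.tanh(x * X_train)  # hypothetical a_i(x)
    return sum(a * rbf_kernel(x, xi) for a, xi in zip(coeffs, X_train))

print(standard_kernel_machine(0.5, np.array([1.0, -0.5, 2.0])))
print(x_dependent_machine(0.5))
```

Standard RKHS generalization bounds (e.g. via the RKHS norm of $f$) are stated for the first form, which is why I'd be cautious about importing kernel learning theory wholesale.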
Good point!
Do you know of any work that applies similar methods to study the equivalent kernel machine learned by predictive coding?
I don’t, and my best guess is that nobody has done it :) The paper you linked is extremely recent.
You’d have to start by finding an ODE model of predictive coding, which I suppose is possible by taking the limit of the learning rate to 0.