In your main computation it seems like the network output is being treated as a scalar.
It’s an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.
Vivek wanted to suppose that Hess(l) were equal to the identity matrix, or a multiple thereof, which is the case for mean squared loss.
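As a sanity check on that claim, here is a small numerical sketch (my own, with arbitrary toy values for y and t), showing that the Hessian of mean squared loss with respect to the network output is a multiple of the identity:

```python
import numpy as np

def mse(y, t):
    # mean squared loss over the output vector
    return np.mean((y - t) ** 2)

def numerical_hessian(f, y, eps=1e-5):
    """Finite-difference Hessian of f at y."""
    n = len(y)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            yp = y.copy(); yp[i] += eps; yp[j] += eps
            ya = y.copy(); ya[i] += eps; ya[j] -= eps
            yb = y.copy(); yb[i] -= eps; yb[j] += eps
            ym = y.copy(); ym[i] -= eps; ym[j] -= eps
            H[i, j] = (f(yp) - f(ya) - f(yb) + f(ym)) / (4 * eps ** 2)
    return H

y = np.array([0.3, -1.2, 0.7])   # toy "network output"
t = np.array([0.0, 1.0, -0.5])   # toy targets
H = numerical_hessian(lambda v: mse(v, t), y)
# Hess(l) = (2/n) * I for mean squared loss over n outputs
print(np.round(H, 4))
```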
In theory, a loss function that explicitly depends on network parameters would behave differently than is assumed in this derivation, yes. But that’s not how standard loss functions usually work. If a loss function did have terms like that, you should indeed get out somewhat different results.
But that seems like a thing to deal with later to me, once we’ve worked out the behaviour for really simple cases more.
Another (probably more important, but higher-level) issue is basically: what is your definition of ‘feature’? Couldn’t I say you have essentially just defined ‘feature’ to be something like ‘an entry of Jf(θ)’? Isn’t the example too contrived, in the sense that it clearly supposes f has a very special form? (In particular, it is linear in the Θ variables, so that the derivatives are not functions of Θ.)
A feature to me is the same kind of thing it is to e.g. Chris Olah. It’s the function mapping network input to the activations of some neurons, or linear combination of neurons, in the network.
I’m not assuming that the function is linear in Θ. If it were, this whole thing wouldn’t just be an approximation valid within second-order Taylor expansion distance; it would hold everywhere.
In multi-layer networks, what the behavioural gradient is showing you is essentially what the network would look like if you approximated it, for very small parameter changes, as one big linear layer. You’re calculating how the effects of changes to weights in earlier layers “propagate through”, via the chain rule, to change what the corresponding feature would “look like” if it were in the final layer.
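A toy sketch of that picture (a made-up two-layer network, nothing from the post): near a fixed Θ, the chain rule turns the whole network into one linear map in the parameter perturbation, so the first-order Taylor prediction tracks the actual output change:

```python
import numpy as np

def net(theta, x):
    # toy two-layer network: two tanh hidden "features", then a linear readout
    w1, b1, w2 = theta[:2], theta[2:4], theta[4:]
    h = np.tanh(w1 * x + b1)
    return np.dot(w2, h)            # scalar output

def jacobian(theta, x, eps=1e-6):
    # behavioural gradient df/dtheta at this input, by central differences
    J = np.zeros_like(theta)
    for i in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps; tm[i] -= eps
        J[i] = (net(tp, x) - net(tm, x)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
theta = rng.normal(size=6)
x, delta = 0.8, 1e-3 * rng.normal(size=6)

J = jacobian(theta, x)
# first-order ("one big linear layer") prediction vs. actual perturbed output
pred = net(theta, x) + J @ delta
print(abs(net(theta + delta, x) - pred))   # small: error is second order in delta
```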
Obviously, that can’t be quite the right way to do things outside this narrow context of interpreting the meaning of the basin near optima. Which is why we’re going to try out building orthogonal sets layer by layer instead.
To be clear, none of this is a derivation showing that the L2 norm perspective is the right thing to do in any capacity. It’s just a suggestive hint that it might be. We’ve been searching for the right definition of “feature independence” or “non-redundancy of computations” in neural networks for a while now, to get an elementary unit of neural networks on top of which we can build our definition of computational modularity, and ultimately a whole theory of Deep Learning and network selection.
This stuff seems like a concrete problem where the math is pointing towards a particular class of operationalisations of these concepts. So I think it makes sense to try out this angle and see what happens.
It’s an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.
I’m sorry but the fact that it is scalar output isn’t explained and a network with a single neuron in the final layer is not the norm. More importantly, I am trying to explain that I think the math does not stay the same when the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say “The Hessian matrix for this network would be...”, you don’t get a factorization like that: you can’t pull out the Hessian of the loss as a scalar. Instead it acts in the way that I have written, as a bilinear form between the rows and columns of Jf.
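A small numerical sketch of this point (toy matrix A and loss of my own choosing): when f is linear in θ, the parameter-space Hessian factors as Jfᵀ·Hess(l)·Jf, with the output-space Hessian Hess(l), cross terms included, sandwiched between the Jacobians:

```python
import numpy as np

A = np.array([[1.0, 0.5, -2.0],
              [0.0, 3.0,  1.0]])   # f(theta) = A @ theta: vector output, linear in theta

def loss(theta):
    y = A @ theta
    # a loss with a cross term between the two outputs
    return y[0] ** 2 + y[0] * y[1] + 3 * y[1] ** 2

def numerical_hessian(f, th, eps=1e-5):
    n = len(th)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            def shift(si, sj):
                t = th.copy(); t[i] += si * eps; t[j] += sj * eps
                return f(t)
            H[i, j] = (shift(1, 1) - shift(1, -1) - shift(-1, 1) + shift(-1, -1)) / (4 * eps ** 2)
    return H

theta = np.array([0.2, -1.0, 0.4])
Hl = np.array([[2.0, 1.0],
               [1.0, 6.0]])        # Hessian of the loss in output space (off-diagonal!)
H_theta = numerical_hessian(loss, theta)
print(np.round(H_theta - A.T @ Hl @ A, 4))  # ~ zero: Hl acts as a bilinear form on Jf
```

The piece this hides is the extra term Σₖ (∂l/∂fₖ)·Hess_θ(fₖ), which vanishes here because f is linear in θ, and which also vanishes at a perfect optimum where all ∂l/∂fₖ = 0.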
A feature to me is the same kind of thing it is to e.g. Chris Olah. It’s the function mapping network input to the activations of some neurons, or linear combination of neurons, in the network.
I’m not assuming that the function is linear in Θ. If it were, this whole thing wouldn’t just be an approximation valid within second-order Taylor expansion distance; it would hold everywhere.
OK, maybe I’ll try to avoid a debate about exactly what ‘feature’ means, or means to different people. But in the example you are clearly using f(x) = Θ₀ + Θ₁x₁ + Θ₂cos(x₁), which is a linear function of the Θ variables. (I said “Is the example not too contrived... in particular it is linear in Θ”; I’m not sure how we have misunderstood each other, perhaps you didn’t realise I meant this example as opposed to the post in general.) But what it means is that in the next line, when you write down the derivative with respect to Θ, you get an unusually clean expression, because it doesn’t depend on Θ. So again, in the crucial equation right after you say “The Hessian matrix for this network would be...”, you in general get Θ variables appearing in the matrix. It is just not as clean as this expression suggests.
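To illustrate the contrast numerically (toy values; the second function is my own variant, not from the post): the example’s derivative with respect to Θ is the same at every Θ, while a variant with a parameter inside the nonlinearity has a Θ-dependent derivative:

```python
import numpy as np

def grad(f, theta, x, eps=1e-6):
    # df/dtheta at input x, by central differences
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        tp, tm = theta.copy(), theta.copy()
        tp[i] += eps; tm[i] -= eps
        g[i] = (f(tp, x) - f(tm, x)) / (2 * eps)
    return g

# the post's example: linear in the theta variables
f_lin = lambda th, x: th[0] + th[1] * x + th[2] * np.cos(x)
# a variant with a parameter inside the nonlinearity
f_non = lambda th, x: th[0] + th[1] * x + np.cos(th[2] * x)

x = 1.3
t1, t2 = np.array([0.1, -0.5, 2.0]), np.array([1.0, 0.7, -0.3])
print(grad(f_lin, t1, x) - grad(f_lin, t2, x))  # zero: derivative is theta-independent
print(grad(f_non, t1, x) - grad(f_non, t2, x))  # nonzero: theta appears in the derivative
```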
I’m sorry but the fact that it is scalar output isn’t explained and a network with a single neuron in the final layer is not the norm.
Fair enough, should probably add a footnote.
More importantly, I am trying to explain that I think the math does not stay the same when the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say “The Hessian matrix for this network would be...”, you don’t get a factorization like that: you can’t pull out the Hessian of the loss as a scalar. Instead it acts in the way that I have written, as a bilinear form between the rows and columns of Jf.
Do any practically used loss functions actually have cross terms that lead to off-diagonals like that? Because so long as the matrix stays diagonal, you’re effectively just adding extra norm to features in one part of the output over the others.
Which makes sense: if your loss function is paying more attention to one part of the output than the others, then perturbations to the weights of the features feeding that part are going to have an outsized effect.
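Sketching that with a weighted mean squared loss (the weights, matrix, and targets are arbitrary toy values of mine): a diagonal output-space Hessian just rescales each output’s row of Jf, so features feeding the heavily weighted output get a larger effective norm:

```python
import numpy as np

A = np.array([[1.0, 0.5, -2.0],
              [0.0, 3.0,  1.0]])          # f(theta) = A @ theta, two outputs
w = np.array([1.0, 10.0])                 # loss "pays more attention" to output 2
t = np.array([0.3, -0.7])                 # toy targets

def loss(theta):
    y = A @ theta
    return 0.5 * np.sum(w * (y - t) ** 2)  # weighted MSE: diagonal output-space Hessian

# exact parameter-space Hessian of this quadratic loss:
# each output's row of Jf enters with its own weight
H = A.T @ np.diag(w) @ A

# verify via the (exact, since the loss is quadratic) second-order Taylor expansion
theta = np.array([0.2, -1.0, 0.4])
d = np.array([1e-2, -2e-2, 3e-2])
g = A.T @ (w * (A @ theta - t))            # gradient of the weighted MSE
exact = loss(theta + d)
taylor = loss(theta) + g @ d + 0.5 * d @ H @ d
print(abs(exact - taylor))                 # ~ 0: H really is the Hessian
```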
But what it means is that in the next line, when you write down the derivative with respect to Θ, you get an unusually clean expression, because it doesn’t depend on Θ.
The perturbative series evaluates the network at particular values of Θ. If your network has many layers that slowly build up an approximation of the function cos(x) to use in the final layer, it will effectively enter the behavioural gradient as cos(x), even though its construction involves many parameters in previous layers.
You’re right about the loss thing; it isn’t as important as I first thought it might be.