I will split this into a math reply, and a reply about the big picture / info loss interpretation.

Math reply:

Thanks for fleshing out the calculus rigorously; admittedly, I had not done this. Rather, I simply assumed MSE loss and proceeded largely through visual intuition.

I agree that assuming MSE, and looking at a local minimum, you have $\nabla_\theta L = 0$.

But the claim that $\nabla_f L = 0$ at every local minimum is still false! *Edit: I am now confused, I don't know if it is false or not.*

You are conflating $\nabla_\theta L$ and $\nabla_f L$, where $f$ is the behavior vector and $\theta$ the parameters. Adding disambiguation, we have:

$$\nabla^2_\theta L = J^\top \left(\nabla^2_f L\right) J + \sum_i \left(\nabla_f L\right)_i \nabla^2_\theta f_i,$$

where $J = \partial f / \partial \theta$ is the matrix of behavioral gradients.

So we see that the second term disappears if $\nabla_f L = 0$. But the critical point condition is $\nabla_\theta L = 0$. From the chain rule, we have:

$$\nabla_\theta L = J^\top \nabla_f L.$$

So it is possible to have a local minimum where $\nabla_f L \neq 0$, if $\nabla_f L$ is in the left null-space of $J$. There is a nice qualitative interpretation as well, but I don't have the energy/time to explain it.
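To make the left-null-space possibility concrete, here is a minimal numerical sketch (the Jacobian and residual vector are made up for illustration): a rank-deficient behavioral Jacobian $J$ together with a nonzero $\nabla_f L$ satisfying $J^\top \nabla_f L = 0$, so the parameter-space gradient vanishes even though the behavior-space gradient does not.

```python
import numpy as np

# Hypothetical behavioral Jacobian J = df/dtheta (2 behaviors, 2 params), rank 1.
J = np.array([[1.0, 2.0],
              [2.0, 4.0]])

# For MSE, grad_f L = 2 * (f - y). Pick a residual in the left null-space of J,
# i.e. grad_f_L @ J == 0 even though grad_f_L != 0.
grad_f_L = np.array([2.0, -1.0])

grad_theta_L = J.T @ grad_f_L  # chain rule: grad_theta L = J^T grad_f L
print(grad_theta_L)              # -> [0. 0.]: a critical point with grad_f L != 0
print(np.linalg.matrix_rank(J))  # -> 1, so J has a nontrivial left null-space
```

Note that this only exhibits a critical point; whether such a point is actually a local minimum depends on the second-order structure, which is exactly where my edit above leaves me uncertain.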

However, if we are at a perfect-behavior global minimum of a regression task, then $\nabla_f L$ is definitely zero.

A few points about the rank equality $\operatorname{rank}(\nabla^2_\theta L) = \operatorname{rank}(J)$ *at a perfect-behavior global min*:

$\operatorname{rank}(\nabla^2_\theta L) = \operatorname{rank}(J)$ holds as long as $\nabla^2_f L$ is a diagonal matrix with positive diagonal entries. It need not be a multiple of the identity.

Hence, rank equality holds anytime the loss is a sum of functions $L = \sum_i \ell_i(f_i)$, s.t. each function $\ell_i$ only looks at a single component of the behavior.

If the network output is 1d (as assumed in the post), this just means that the loss is a sum over losses on individual inputs.
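As a quick numerical sanity check of the rank claim (the matrices here are chosen arbitrarily): at a perfect-behavior minimum the second Hessian term vanishes, leaving $J^\top D J$ with $D = \nabla^2_f L$ diagonal, and for positive diagonal $D$ the rank matches $\operatorname{rank}(J)$ even when $D$ is far from a multiple of the identity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical behavioral Jacobian: 5 behaviors, 8 params, rank 3 by construction.
J = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 8))

# Diagonal behavior-space Hessian with positive entries, deliberately
# not a multiple of the identity.
D = np.diag([1.0, 2.0, 0.5, 3.0, 1.5])

H = J.T @ D @ J  # parameter-space Hessian at a perfect-behavior minimum
print(np.linalg.matrix_rank(J), np.linalg.matrix_rank(H))  # -> 3 3
```

This works because $D = S^2$ for an invertible diagonal $S$, so $J^\top D J = (SJ)^\top (SJ)$ has the same rank as $SJ$, hence as $J$; a zero diagonal entry would break the argument.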

We can extend to larger outputs by having the behavior be the flattened concatenation of outputs. The rank equality condition is still satisfied for MSE, binary cross-entropy, and cross-entropy over a probability vector. It is *not* satisfied if we consider the behavior to be the raw logits (before the softmax) and softmax + cross-entropy as the loss function. But we can easily fix that by considering the probabilities (after softmax) as the behavior instead of the raw logits.
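A small check of the logits caveat (the logit vector is arbitrary): for a one-hot target, the Hessian of softmax + cross-entropy with respect to the logits is $\operatorname{diag}(p) - pp^\top$, which has nonzero off-diagonal entries, so the diagonal condition fails; with respect to the post-softmax probabilities, cross-entropy $-\sum_i y_i \log p_i$ has the diagonal Hessian $\operatorname{diag}(y_i / p_i^2)$.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_loss(z, target=0):
    # cross-entropy of softmax(z) against a one-hot target
    return -np.log(softmax(z)[target])

z = np.array([1.0, 0.0, -1.0])  # arbitrary logits
p = softmax(z)

# Analytic Hessian w.r.t. the logits: diag(p) - p p^T (not diagonal).
H_logits = np.diag(p) - np.outer(p, p)

# Finite-difference check of the analytic formula.
eps = 1e-5
H_num = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        zpp = z.copy(); zpp[i] += eps; zpp[j] += eps
        zpm = z.copy(); zpm[i] += eps; zpm[j] -= eps
        zmp = z.copy(); zmp[i] -= eps; zmp[j] += eps
        zmm = z.copy(); zmm[i] -= eps; zmm[j] -= eps
        H_num[i, j] = (ce_loss(zpp) - ce_loss(zpm)
                       - ce_loss(zmp) + ce_loss(zmm)) / (4 * eps**2)

print(np.allclose(H_num, H_logits, atol=1e-4))                    # -> True
print(np.abs(H_logits - np.diag(np.diag(H_logits))).max() > 0)    # -> True
```

The off-diagonal entries are $-p_i p_j$, which are nonzero whenever all classes have nonzero probability, so no change of target or logit values rescues the diagonal condition at the logit level.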

Do you generally think that people in the AI safety community should write publicly about what they think is “the missing AGI ingredient”?

It’s remarkable that this post was well received on the AI Alignment Forum (18 karma points before my strong downvote).