If I want to find the maxima of a function, it doesn’t matter whether I use conjugate gradient descent, Newton’s method, interpolation methods, or whatever: they will tend to find the same maxima, assuming they are looking at the same function.
In general, those methods find local extrema. They don’t tell you how many there are, or where the next-closest one is once you’ve found one. A loss landscape can have several local minima, and which one you find depends on where you start.
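To make that concrete, here is a minimal sketch in Python (the toy function, step size, and starting points are illustrative assumptions, not anything from the discussion above): plain gradient descent started from three different points on the same one-dimensional function settles into three different local minima.

```python
# Illustrative toy example: the function, learning rate, and starting
# points below are arbitrary choices, not taken from the discussion.
import math

def f(x):
    """Toy loss with several local minima: sin(3x) + 0.1 x^2."""
    return math.sin(3 * x) + 0.1 * x * x

def df(x):
    """Analytic derivative of f."""
    return 3 * math.cos(3 * x) + 0.2 * x

def descend(x, lr=0.05, steps=500):
    """Plain gradient descent: steps downhill along -df(x) until it
    settles into whichever local minimum the start's basin leads to."""
    for _ in range(steps):
        x -= lr * df(x)
    return x

for x0 in (-2.0, 0.0, 2.0):
    x_min = descend(x0)
    print(f"start {x0:+.1f} -> local min at x = {x_min:+.3f}, f = {f(x_min):+.3f}")
```

Each start lands in a different basin of attraction: a minimum with a broadly comparable loss value but a very different location, which is exactly the situation the question below is asking about.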
Why shouldn’t there be different minds that are at comparable minimum values, but not very close on the loss landscape?