The learning rate in modern optimisers is so large that the piecewise-linear loss landscape really looks indistinguishable from a smooth function. The lr you’d need to use to ensure that the next step lands in the same linear patch is ridiculously small, so in practice the true “felt” landscape is something like a smoothed average of the exact landscape.
The learning rate in modern optimisers is so large that the piecewise-linear loss landscape really looks indistinguishable from a smooth function. The lr you’d need to use to ensure that the next step lands in the same linear patch is ridiculously small, so in practice the true “felt” landscape is something like a smoothed average of the exact landscape.