Overfitting networks are free to implement a very simple function, like the identity function or a constant function, outside the training set, whereas generalizing networks have to exhibit complex behaviors on unseen inputs. Therefore overfitting is simpler than generalizing, and it will be preferred by SGD.
You’re conflating two notions of simplicity: how many parameters it takes to specify a function in the abstract, and how many of the neural network’s parameters have to be fixed to implement it.
The actual question should be something like “how precisely do the network’s parameters have to be specified in order for it to maintain low loss?”
An overfitted network is usually not simple, because more of the parameter space is constrained by having to exactly fit the training data: fewer free parameters remain. Whether those remaining free parameters implement a “simple” function or not is beside the point; if the loss is invariant to them, they don’t count as “part of the program” anyway. And a non-overfitted network will have more free dimensions like this, to which the loss is (near-)invariant.
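One crude way to probe this is to look at the Hessian of the training loss at a found minimum and count near-zero eigenvalues, treating them as a proxy for directions the loss is (near-)invariant to. The sketch below is illustrative only, and everything in it is an assumption on my part: a tiny tanh MLP, toy regression data, PyTorch’s `torch.autograd.functional.hessian`, and an arbitrary 1e-4 threshold for calling a direction “flat”.

```python
# Minimal sketch (assumptions: toy MLP, toy data, eigenvalue threshold as a
# crude proxy for "directions the loss is invariant to").
import torch

torch.manual_seed(0)

# Toy 1-D regression data: 8 training points.
X = torch.linspace(-1, 1, 8).unsqueeze(1)
y = torch.sin(3 * X)

# A small MLP expressed as a function of one flat parameter vector, so we can
# take the Hessian of the training loss with respect to all parameters.
sizes = [(1, 16), (16, 16), (16, 1)]
n_params = sum(i * o + o for i, o in sizes)

def unpack(theta):
    params, idx = [], 0
    for i, o in sizes:
        W = theta[idx: idx + i * o].reshape(i, o); idx += i * o
        b = theta[idx: idx + o]; idx += o
        params.append((W, b))
    return params

def loss(theta):
    h = X
    for k, (W, b) in enumerate(unpack(theta)):
        h = h @ W + b
        if k < len(sizes) - 1:
            h = torch.tanh(h)
    return ((h - y) ** 2).mean()

# Train to (near-)zero training loss, i.e. an interpolating solution.
theta = torch.nn.Parameter(0.3 * torch.randn(n_params))
opt = torch.optim.Adam([theta], lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss(theta).backward()
    opt.step()

# Hessian of the training loss at the found minimum; eigenvalues near zero
# correspond to parameter directions the loss barely depends on.
H = torch.autograd.functional.hessian(loss, theta.detach())
eigs = torch.linalg.eigvalsh(H)
free = (eigs.abs() < 1e-4).sum().item()
print(f"train loss {loss(theta).item():.2e}, "
      f"{free}/{n_params} near-flat directions")
```

On the argument above, the interesting comparison would be between solutions: a network that has to exactly fit every training point should pin down more of these directions than one that fits a smoother, generalizing function, and only the pinned-down directions count toward how precisely the network must be specified.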