No, that is not what I am saying. I am saying that the typical reason these sorts of “misgeneralizations” happen is not that there are many parameter configurations of the neural network architecture that all get the same training loss but extrapolate very differently to new data. It’s that some parameter configurations that do not extrapolate to new data in the way the ML engineers want straight up get better loss on the training data than parameter configurations that do extrapolate to new data in the way the ML engineers want.
I don’t think “overfitting” is really the right frame for what’s going on here. This isn’t a problem with neural networks having bad simplicity priors and choosing solutions that are more algorithmically complex than they need to be. Modern neural networks have pretty good simplicity priors. I don’t expect misaligned AIs to have larger effective parameter counts than aligned AIs. The problem isn’t that they overfit. The problem is that the algorithmically simplest fit to the training environment that scores the lowest loss often just doesn’t have the internal properties the ML engineers hoped it would have when they set up that training environment.
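To make the distinction concrete, here is a toy sketch in Python. Everything in it is made up for illustration: the `Example` fields, the two one-line rules, and the data are hypothetical stand-ins, not a model of any real training setup. It assumes the training signal comes from a grader that disagrees with what the engineers actually care about on one training example in ten. Both rules are equally simple (each just reads one field), yet the rule that tracks the grader gets strictly lower training loss.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    truth: int   # hypothetical: what the ML engineers actually care about
    grader: int  # hypothetical: what the training signal rewards


def intended_rule(ex: Example) -> int:
    """The solution the engineers hoped for: track the truth."""
    return ex.truth


def proxy_rule(ex: Example) -> int:
    """An equally simple solution: track whatever the grader rewards."""
    return ex.grader


def error_rate(rule, data, target) -> float:
    """Fraction of examples where the rule disagrees with the target."""
    return sum(rule(ex) != target(ex) for ex in data) / len(data)


# Training data: the grader matches the truth on 9 of 10 examples.
train = [Example(truth=i % 2, grader=i % 2) for i in range(9)]
train.append(Example(truth=1, grader=0))

# New data: the grader and the truth come apart completely.
new = [Example(truth=i % 2, grader=1 - i % 2) for i in range(10)]

# Training loss is measured against the grader, because that is the signal.
train_loss_intended = error_rate(intended_rule, train, lambda ex: ex.grader)
train_loss_proxy = error_rate(proxy_rule, train, lambda ex: ex.grader)

# On new data we score against the truth, i.e. what the engineers wanted.
new_err_intended = error_rate(intended_rule, new, lambda ex: ex.truth)
new_err_proxy = error_rate(proxy_rule, new, lambda ex: ex.truth)

print(f"train loss: intended={train_loss_intended}, proxy={train_loss_proxy}")
print(f"new-data error: intended={new_err_intended}, proxy={new_err_proxy}")
```

In this toy setup there is no tie on the training data to break, and no extra complexity in the proxy rule. The optimizer is simply pointed at the rule the engineers did not want, because that rule scores better.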