but for sufficiently large function approximators, the trend reverses
Transformers/deep learning work because of built-in regularization methods (like dropout layers), not because "the trend reverses". If you did a naive "best fit polynomial" with a 7-billion-parameter polynomial, you would not get a good result.
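A toy sketch of the polynomial point (my own illustration, not from the comment; degree 19 on 20 samples stands in for "7 billion parameters" at small scale). An unregularized least-squares fit drives training error to essentially zero while held-out error blows up:

```python
import numpy as np

rng = np.random.default_rng(0)

# 20 noisy samples of a smooth target function
x_train = np.linspace(-1.0, 1.0, 20)
y_train = np.sin(np.pi * x_train) + rng.normal(0.0, 0.1, x_train.shape)

# Naive "best fit polynomial": as many coefficients as data points,
# no regularization, so it interpolates the noise.
coeffs = np.polynomial.polynomial.polyfit(x_train, y_train, deg=19)

# Held-out points from the same interval (away from the endpoints)
x_test = np.linspace(-0.95, 0.95, 200)
y_test = np.sin(np.pi * x_test)

train_err = np.mean(
    (np.polynomial.polynomial.polyval(x_train, coeffs) - y_train) ** 2
)
test_err = np.mean(
    (np.polynomial.polynomial.polyval(x_test, coeffs) - y_test) ** 2
)

print(f"train MSE: {train_err:.2e}")
print(f"test  MSE: {test_err:.2e}")
```

The training error is tiny while the test error is much larger, because nothing in the plain least-squares objective penalizes the wild oscillations between sample points. That penalty (explicit or implicit) is exactly what regularization supplies.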