It’s a good metaphor, but I think one important aspect this misses is overfitting: when you have a lot of parameters, the NN can literally memorise a small training set until it gets 100% on the training set and 0% on the test set, whereas a smaller model is forced to learn the underlying structure and so generalises better.
Hence larger models need a much larger training set even to match smaller models, which is another disadvantage of larger models (besides the higher per-token training and inference costs).
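To make the overfitting point concrete, here is a minimal toy sketch (my own illustration, not from the comment above), using polynomial degree as a stand-in for model capacity: the high-capacity fit nearly interpolates a tiny noisy training set but typically does worse on held-out data than the low-capacity fit.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Underlying structure plus a little label noise.
    x = rng.uniform(-1, 1, n)
    y = np.sin(3 * x) + 0.1 * rng.normal(size=n)
    return x, y

x_train, y_train = make_data(10)    # deliberately tiny training set
x_test, y_test = make_data(1000)

for degree in (3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # fit a polynomial of the given capacity
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")

# Typically the degree-9 fit drives train MSE to (near) zero while its test MSE is
# much higher than the degree-3 fit's, which captures the trend and generalises better.
```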
larger models need a much larger training set even to match smaller models
This is empirically false: perplexity on a held-out test set goes down as model size increases, even for a fixed dataset. See, for example, Figure 2 in the Llama 3 report: larger models do better at, say, 1e10 tokens on that plot.
Larger models could be said to want a larger dataset, in the sense that if you are training compute-optimally, then with more compute you want both the model size and the dataset size to increase, so the two grow together. But even on a dataset of the same size they still do better, at least while reasonably close to the compute-optimal number of tokens.
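A rough way to see the shape of this claim is a Chinchilla-style parametric loss, L(N, D) = E + A/N^alpha + B/D^beta, with N the parameter count and D the training tokens. This is a sketch under assumptions: the constants below are roughly the values reported by Hoffmann et al. (2022), used here only for illustration, and this functional form does not model overfitting at all, which is part of why it only describes the regime reasonably near compute-optimal.

```python
# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta.
# Constants are approximately the Hoffmann et al. (2022) fit; treat them as illustrative.
def loss(n_params, n_tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    return E + A / n_params**alpha + B / n_tokens**beta

fixed_tokens = 1e10  # dataset size held constant, as in the Llama 3 plot discussed above
for n_params in (1e8, 1e9, 1e10, 1e11):
    print(f"N = {n_params:.0e}: predicted loss {loss(n_params, fixed_tokens):.3f}")

# The predicted loss keeps falling as N grows even though D is fixed, consistent with
# larger models doing better on the same dataset. Note the form is monotone in N, so it
# says nothing about the far-from-compute-optimal regime.
```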
This was the classical intuition, but it turned out to be untrue in the regime of large NNs.
The modern view is double descent (https://en.wikipedia.org/wiki/Double_descent): smaller models generalize better until the number of parameters exceeds the number of training examples, after which larger models generalize better with the same amount of data.
[Figure from Scaling Laws for Neural Language Models, https://arxiv.org/pdf/2001.08361]
also see the grokking literature: https://en.wikipedia.org/wiki/Grokking_(machine_learning)
Previous discussion:
https://www.lesswrong.com/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent
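The model-wise double descent curve described above is easy to reproduce in a toy setting. Below is a minimal sketch of my own construction (random ReLU features with a minimum-norm least-squares fit; it is not code from any of the linked sources): test error typically rises as the number of features approaches the number of training examples and falls again past that threshold.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_feature_fit(n_features, x_train, y_train, x_test, y_test, d=5):
    # Fixed random projection followed by ReLU features.
    W = rng.normal(size=(d, n_features)) / np.sqrt(d)
    phi = lambda x: np.maximum(x @ W, 0.0)
    # Minimum-norm least squares; interpolates the training set once n_features >= n_train.
    beta = np.linalg.pinv(phi(x_train)) @ y_train
    return np.mean((phi(x_test) @ beta - y_test) ** 2)

n_train, d = 40, 5
x_train = rng.normal(size=(n_train, d))
y_train = np.sin(x_train @ np.ones(d)) + 0.1 * rng.normal(size=n_train)  # noisy labels
x_test = rng.normal(size=(2000, d))
y_test = np.sin(x_test @ np.ones(d))

for p in (5, 20, 38, 40, 42, 80, 400):
    mse = random_feature_fit(p, x_train, y_train, x_test, y_test, d)
    print(f"{p:4d} features: test MSE {mse:.3f}")

# Test error typically spikes when the feature count is near the number of training
# examples (the interpolation threshold) and comes back down as the model keeps growing,
# which is the "second descent".
```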