Daniel Murfet comments on Ambiguous out-of-distribution generalization on an algorithmic task

Daniel Murfet 16 Feb 2025 8:19 UTC
3 points
"For models that do not grok either group, we observe both examples where the LLC stays large throughout training and examples where it falls"
What explains the difference in scale of the LLC estimates here and in the earlier plot, where they are < 100? Perhaps different hyperparameters?
- Wilson Wu 16 Feb 2025 17:32 UTC
  2 points
  For that earlier section, we used smaller models (4,000 parameters) trained on $S_{4}$ intersect $A_{4} \times 2$, instead of the models (80,000 parameters) trained on $S_{5}$ intersect $A_{5} \times 2$. The only reason for this was to allow for a larger sample size of 10,000 models within our compute budget. All subsequent sections use the $S_{5}$ models.