Some combination of:
The training procedure for random labels was much, much harder, e.g. requiring 100x more steps (such that the x-axis had to be on a log scale to show it on the same graph as the true-label case, à la the original grokking/​induction-head results)
Neural networks couldn’t fit random labels at all, at least at the scale of the datasets on which they could generalize.