Is this the same thing as catastrophic forgetting?

From page 6 of the paper:

Ungrokking can be seen as a special case of catastrophic forgetting (McCloskey and Cohen, 1989; Ratcliff, 1990), where we can make much more precise predictions. First, since ungrokking should only be expected once D′ < Dcrit, if we vary D′ we predict that there will be a sharp transition from very strong to near-random test accuracy (around Dcrit). Second, we predict that ungrokking would arise even if we only remove examples from the training dataset, whereas catastrophic forgetting typically involves training on new examples as well. Third, since Dcrit does not depend on weight decay, we predict the amount of “forgetting” (i.e. the test accuracy at convergence) also does not depend on weight decay.

(All of these predictions are then confirmed in the experimental section.)
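To make the first prediction concrete, here is a minimal toy sketch (my illustration, not code from the paper): it encodes the claimed step-like relationship between the reduced training-set size D′ and test accuracy at convergence. The function name and the placeholder accuracy levels `chance` and `strong` are my own assumptions; the paper's actual experiments train a network, whereas this only expresses the predicted shape of the curve.

```python
def predicted_test_accuracy(d_prime: int, d_crit: int,
                            chance: float = 0.01, strong: float = 0.99) -> float:
    """Predicted test accuracy after continued training on a dataset of
    size d_prime: near-perfect above D_crit, near-random (ungrokking)
    below it. A sharp transition, not a gradual decline."""
    return strong if d_prime >= d_crit else chance

# Sweeping D' across a hypothetical D_crit = 500 shows the predicted
# sharp transition; note weight decay does not appear as a parameter,
# matching the third prediction.
curve = [predicted_test_accuracy(d, d_crit=500) for d in range(0, 1001, 100)]
```

The point of the sketch is what is *absent*: no new training examples (only removal) and no weight-decay dependence, which is what distinguishes these predictions from generic catastrophic forgetting.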