In other words, “using unlearning techniques like GradDiff/MaxEnt during pretraining” might be a really powerful technique.
I have a cached thought that this was found to disrupt overall capabilities / make learning harder, but I don’t have a reference on hand.
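For concreteness, here is a rough sketch of what the two objectives might look like as per-batch losses. It is an illustration under my own assumptions, not anyone's reference implementation: I take GradDiff to mean "descend on retain-data cross-entropy while ascending on forget-data cross-entropy" and MaxEnt to mean "descend on retain-data cross-entropy while pushing forget-data predictions toward maximum entropy". The function names, the `alpha` weighting, and the batch format (lists of `(logits, target)` pairs) are all hypothetical.

```python
import math


def log_softmax(logits):
    # Numerically stable log-softmax over a list of floats.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def cross_entropy(logits, target):
    # Negative log-likelihood of the target token.
    return -log_softmax(logits)[target]


def entropy(logits):
    # Shannon entropy (in nats) of the predicted distribution.
    logp = log_softmax(logits)
    return -sum(math.exp(lp) * lp for lp in logp)


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def grad_diff_loss(retain_batch, forget_batch, alpha=1.0):
    # Assumed GradDiff form: minimize retain loss, maximize forget loss.
    retain = mean(cross_entropy(l, t) for l, t in retain_batch)
    forget = mean(cross_entropy(l, t) for l, t in forget_batch)
    return retain - alpha * forget


def max_ent_loss(retain_batch, forget_batch, alpha=1.0):
    # Assumed MaxEnt form: minimize retain loss, maximize the entropy
    # of predictions on forget data (targets are ignored there).
    retain = mean(cross_entropy(l, t) for l, t in retain_batch)
    forget_entropy = mean(entropy(l) for l, _ in forget_batch)
    return retain - alpha * forget_entropy
```

"During pretraining" would then just mean mixing one of these losses into the ordinary training objective whenever a batch contains flagged forget data, rather than applying it as a post-hoc fine-tuning step.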