Why do capabilities emerge suddenly after long plateaus?
Is it true in general that capabilities emerge suddenly after long plateaus? My understanding is that grokking only happens if you set up the learning conditions just right.
In the modular addition grokking experiment, there was a general solution that the neural network could learn, and there was also a memorize-each-fact solution that it could learn. If you give it enough training data, it will just learn the general solution directly. If you give it too little training data, it will just memorize. But if you give it something in between, you’ll get a learning trajectory where it first memorizes and then switches to the general solution.
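To make the setup concrete, here is a minimal sketch of how the data for such an experiment might be split. The modulus `p = 97`, the 30% training fraction, and the function name `modular_addition_split` are illustrative assumptions on my part, not details taken from the experiment. The only knob that matters for the argument is `train_fraction`, the share of the full addition table the network gets to see.

```python
import random


def modular_addition_split(p=97, train_fraction=0.3, seed=0):
    """Return (train, test) lists of (a, b, (a + b) % p) triples.

    p, train_fraction and seed are illustrative assumptions, not values
    from the experiment discussed above.
    """
    # The full table has p * p facts; a pure memorizer must store each
    # one, while the general solution covers all of them at once.
    table = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]

    # Shuffle with a fixed seed so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(table)

    # train_fraction controls how much of the table is visible in training.
    n_train = int(train_fraction * len(table))
    return table[:n_train], table[n_train:]


train, test = modular_addition_split(p=97, train_fraction=0.3)
print(len(train), len(test))
```

Sweeping `train_fraction` from small to large is the kind of comparison described above: the held-out `test` triples are what separate a memorizing network from one that has learned the general rule.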
In general I’m not convinced there is a barrier. I think the network is typically moving downhill in the loss landscape the whole time; sometimes it just takes a while.