Thanks for the kind words Zephy!
Yeah, interesting point about batch size. We simply started with 32 because we wanted to simulate real SFT workflows, and 32 is common. Similar story with the learning rate and the rest of the training parameters, aside from those we adjusted to save training costs. A quick Claude query says that smaller batch sizes often generalize better thanks to gradient noise, but this isn't super robust across tasks since the gaps can often be made up with learning rate and epoch adjustments!
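
For concreteness, here's a minimal sketch of the kind of "common" SFT setup I have in mind, using Hugging Face's `TrainingArguments`. The batch size of 32 is the one we discussed above; the learning rate and epoch values are just typical defaults for illustration, not our exact config:

```python
# Illustrative SFT hyperparameter setup. Only the batch size (32) comes from
# the discussion above; the other values are typical defaults, not our
# actual training config.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sft-out",
    per_device_train_batch_size=32,  # the "common" batch size discussed above
    learning_rate=2e-5,              # typical SFT value; assumption
    num_train_epochs=3,              # typical epoch count; assumption
)
```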