Isn’t that only ~10x more expensive than running the forward passes (even if you don’t use LoRA)? Or is it much more expensive because of communication bottlenecks, plus the infra being taken up by the next pretraining run (with no possibility of swapping the model in and out)?