A drawback is that split-loss gradient routing requires a separate backward pass for each part of the partition.
note: at scale, this shouldn't incur any significant computational overhead, as far as i can tell. as long as each part of the partition contains many more samples than fit in a single gpu/node batch, one can group same-loss training samples onto the same gpu/node. the forward/backward then still operates on full (reduced) tensors within each device, and the total gradient for the optimizer comes out by default via gradient accumulation across devices.
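a minimal sketch of the accumulation argument, using a toy linear model with an analytic squared-error gradient (all names and the per-part parameter masks are illustrative, not from any particular gradient-routing implementation): samples are grouped by which loss/part they belong to, one "backward pass" is run per group, and summing the masked per-group gradients recovers the same total gradient the optimizer would see.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)                       # toy linear model parameters
X = rng.normal(size=(8, 3))                  # 8 training samples
y = rng.normal(size=8)
part = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # which loss/part each sample routes to

# hypothetical routing masks: part 0 updates w[:2], part 1 updates w[2:]
masks = {0: np.array([1.0, 1.0, 0.0]), 1: np.array([0.0, 0.0, 1.0])}

def batch_grad(w, Xb, yb):
    """Mean gradient of 0.5 * ||Xb @ w - yb||^2 over the batch."""
    residual = Xb @ w - yb
    return Xb.T @ residual / len(yb)

# one backward pass per part, with same-loss samples grouped together
# (as they would be on one gpu/node each); gradient accumulation then
# yields the total gradient for the optimizer.
g_total = np.zeros_like(w)
for p, mask in masks.items():
    idx = part == p
    g_total += mask * batch_grad(w, X[idx], y[idx]) * idx.sum()
g_total /= len(y)
```

this matches the gradient one would get by masking every per-sample gradient individually and averaging over the whole dataset, so grouping by part loses nothing.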