6ND rule
8ND may be more accurate, since these pretraining runs usually use gradient checkpointing to reduce memory requirements.
8ND may be more accurate, since these pretraining runs usually use gradient checkpointing to reduce memory requirements.