I agree that interrupting is an art. I love this statement in particular:
many situations (in fact almost all) won’t have time for everyone to say everything they’d like to have heard, so competition is tempting and prevalent
I feel that I have a handle on when to interrupt in a two-person conversation, but the dynamics of interrupting differ:
- as the number of participants increases (there’s a phase shift at three, and then another one at six or seven, as people typically don’t like to speak less than one fifth of the time);
- as the conversation becomes less meta-cooperative (e.g., the conversation is happening in some public forum or in order for a decision to be made, so there is an incentive to speak more so that your ideas get heard more).
(As another note, I find that even with three people, there needs to be either a significant amount of yielding or substantial agreement on what the topic and frame of the conversation should be (or implicitly are); otherwise the thread of the conversation will repeatedly “miss the point” and meander.)
I would be interested to hear any insights anyone has about how to navigate any of these better, unilaterally or by establishing group norms (formally or by reinforcement).
note: at scale, this doesn’t imply any significant computational overhead, as far as i can tell. as long as the partition size is very large compared to the batch that fits on a gpu/node, one can group the same-loss training samples to be processed on the same gpu/node. then you can still work with reduced tensors for the forward/backward pass, and the total gradient for the optimizer is computed by default through gradient accumulation.
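a minimal sketch of the grouping idea, under assumptions that are mine rather than from the note: i read "same-loss training samples" as samples that share one loss term, and the one-parameter model, the two loss kinds ("squared", "absolute"), and every function name below are hypothetical. each micro-batch holds a single loss kind, so each pass only touches that loss's code path, and summing the per-micro-batch gradients recovers the gradient of the mixed batch.

```python
import math
from collections import defaultdict

# Hypothetical per-sample gradients for a one-parameter model y_hat = w * x.
def grad_squared(w, x, y):
    # d/dw of (w*x - y)^2
    return 2.0 * (w * x - y) * x

def grad_absolute(w, x, y):
    # subgradient d/dw of |w*x - y|
    r = w * x - y
    sign = 1.0 if r > 0 else -1.0 if r < 0 else 0.0
    return sign * x

GRAD_FNS = {"squared": grad_squared, "absolute": grad_absolute}

def group_by_loss(samples, micro_batch_size):
    """Split samples into micro-batches that each contain one loss kind."""
    by_kind = defaultdict(list)
    for x, y, kind in samples:
        by_kind[kind].append((x, y))
    micro_batches = []
    for kind, items in by_kind.items():
        for i in range(0, len(items), micro_batch_size):
            micro_batches.append((kind, items[i:i + micro_batch_size]))
    return micro_batches

def accumulated_gradient(w, micro_batches):
    """Gradient accumulation: sum the gradients of homogeneous micro-batches."""
    total = 0.0
    for kind, items in micro_batches:
        grad_fn = GRAD_FNS[kind]  # a single loss per micro-batch
        total += sum(grad_fn(w, x, y) for x, y in items)
    return total

def mixed_gradient(w, samples):
    """Reference: gradient over the samples in their original, mixed order."""
    return sum(GRAD_FNS[kind](w, x, y) for x, y, kind in samples)

samples = [
    (1.0, 2.0, "squared"),
    (2.0, 1.0, "absolute"),
    (3.0, 5.0, "squared"),
    (0.5, 4.0, "absolute"),
    (1.5, 1.0, "squared"),
]
w = 0.7
micro_batches = group_by_loss(samples, micro_batch_size=2)
g_grouped = accumulated_gradient(w, micro_batches)
g_mixed = mixed_gradient(w, samples)
```

the two gradients agree up to floating-point summation order. in a real setup the grouping would happen in the data loader or sampler, and the accumulation would be whatever the training framework already does across micro-batches.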