I agree some forms of speed “priors” are best considered a behavioral selection pressure (e.g., when implemented as a length penalty). But some forms don’t cash out in terms of reward; e.g., within a forward pass, the depth of a transformer puts a hard upper bound on the number of serial computations, plus there might be some inductive bias towards shorter serial computations because of details about how SGD works.
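To make the distinction concrete, here is a rough sketch of the two kinds of pressure. Everything in it is illustrative and assumed rather than taken from any particular training setup: the function names, the `penalty_per_token` value, and the crude convention of counting each attention or MLP sublayer as one serial step.

```python
def shaped_reward(task_reward: float, num_tokens: int,
                  penalty_per_token: float = 0.01) -> float:
    """Length penalty: the speed pressure lives in the reward itself.

    Two policies with the same task reward get different shaped rewards
    if one uses more tokens, so this is a behavioral selection pressure.
    (penalty_per_token is a made-up illustrative value.)
    """
    return task_reward - penalty_per_token * num_tokens


def max_serial_steps_per_forward_pass(num_layers: int,
                                      sublayers_per_layer: int = 2) -> int:
    """Depth bound: the limit comes from the architecture, not the reward.

    Assumes (as a simplification) that each sublayer, e.g. attention or MLP,
    counts as one serial step. No reward signal can raise this number.
    """
    return num_layers * sublayers_per_layer


if __name__ == "__main__":
    # Same task reward, different lengths: reward distinguishes them.
    print(shaped_reward(1.0, num_tokens=50))
    print(shaped_reward(1.0, num_tokens=200))
    # Fixed by depth alone, whatever the reward function is.
    print(max_serial_steps_per_forward_pass(num_layers=12))
```

The first function changes what gets rewarded; the second is a constraint that holds regardless of what gets rewarded. The possible SGD inductive bias mentioned above is not modeled here, since it is neither a reward term nor a hard bound.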