Just looking at Shazeer’s paper (Appendix A):

All of the GLU variants performed better (lower is better), and every GLU model has a bilinear core — two parallel linear projections multiplied together — with the variants differing only in whether a sigmoid/GELU/Swish/ReLU gate is applied to one branch. So it does in fact do better (if that’s what you meant by a dual encoder).
HOWEVER, we could have 3 encoders, or 100! This should store even more information, and would probably perform better per step, but would take up more GPU VRAM and/or take longer to compute each step.
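For concreteness, here’s a minimal sketch of both ideas (my own illustration, not code from the paper or the post; `GLUVariant` and `NaryGLU` are hypothetical names), assuming a PyTorch-style module:

```python
import torch.nn as nn
import torch.nn.functional as F

class GLUVariant(nn.Module):
    """Two parallel linear maps multiplied elementwise.

    activation=None gives the bilinear case; otherwise the function gates
    one branch (sigmoid -> GLU, F.gelu -> GEGLU, F.silu -> SwiGLU, F.relu -> ReGLU).
    """
    def __init__(self, d_in, d_hidden, activation=None):
        super().__init__()
        self.w = nn.Linear(d_in, d_hidden, bias=False)  # value branch
        self.v = nn.Linear(d_in, d_hidden, bias=False)  # gate branch
        self.activation = activation

    def forward(self, x):
        gate = self.v(x)
        if self.activation is not None:
            gate = self.activation(gate)
        return self.w(x) * gate

class NaryGLU(nn.Module):
    """The '3 encoders, or 100' idea: n multiplicative branches.

    Parameters and per-step compute both grow linearly with n_branches,
    which is where the extra VRAM / step time would go.
    """
    def __init__(self, d_in, d_hidden, n_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Linear(d_in, d_hidden, bias=False) for _ in range(n_branches)
        )

    def forward(self, x):
        out = self.branches[0](x)
        for branch in self.branches[1:]:
            out = out * branch(x)
        return out

bilinear = GLUVariant(512, 2048)          # no gate activation
swiglu = GLUVariant(512, 2048, F.silu)    # Swish-gated variant
hundred = NaryGLU(512, 2048, n_branches=100)
```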
In this post, though, I used wall-clock time as the measure of training efficiency. Hand-wavy:

loss/step * time/step

(though maybe it should be divided, to make it loss/time?)
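Spelling out the unit cancellation, so it’s explicit why dividing is the right move:

```latex
\frac{\mathrm{loss}}{\mathrm{step}} \div \frac{\mathrm{time}}{\mathrm{step}}
  = \frac{\mathrm{loss}}{\mathrm{step}} \times \frac{\mathrm{step}}{\mathrm{time}}
  = \frac{\mathrm{loss}}{\mathrm{time}}
```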
Ah, that makes more sense, thanks!
Also I agree with using loss/time as the measure of performance, since it’s fairly straightforward to interpret (loss recovered per unit time). If I were reviewing this, I’d look for that.
For efficiency in practice, I think most ML papers report total FLOPs, since a FLOP count is hardware-agnostic (FLOP/s, by contrast, is a throughput figure that depends on the hardware). Maybe a good measure of efficiency here would be loss recovered per FLOP? I haven’t seen that used, but it would reflect how performance scales with compute rather than with a particular machine.
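A rough sketch of how both measures could be computed (my own construction, not from the post; the FLOP count is a hypothetical analytic estimate for a single linear layer, not a general profiler):

```python
import time

def linear_layer_step_flops(d_in, d_hidden, batch_size):
    # ~2*d_in*d_hidden FLOPs per example for the forward matmul;
    # backward is roughly 2x forward, so ~3x total (hypothetical estimate).
    return 3 * 2 * d_in * d_hidden * batch_size

def efficiency(loss_start, loss_end, seconds, total_flops):
    recovered = loss_start - loss_end  # loss recovered over the run
    return {
        "loss_per_second": recovered / seconds,    # wall-clock efficiency
        "loss_per_flop": recovered / total_flops,  # hardware-agnostic-ish
    }

# Hypothetical usage around a training loop of n_steps steps:
# t0 = time.time()
# ... train for n_steps ...
# stats = efficiency(loss_start, loss_end, time.time() - t0,
#                    n_steps * linear_layer_step_flops(512, 2048, 64))
```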
Edit: Actually, thinking about it more, test-time (inference) efficiency might be the better comparison, assuming the two scale within roughly the same complexity class. From a product perspective, speed for users is hugely valuable (maybe the most valuable thing).