Just looking at Shazeer’s paper (Appendix A):

All of the GLU variants performed better (lower is better), and every GLU model has a bilinear core — two parallel linear projections multiplied together — with the variants differing only in whether a sigmoid/GELU/Swish/ReLU gate is applied to one branch. So it does in fact do better (if that’s what you meant by a dual encoder).
HOWEVER, we could have 3 encoders, or 100! This should store even more information, and would probably perform better per step, but would take up more GPU VRAM and/or take longer to compute each step.
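For concreteness, here’s a minimal sketch of both ideas (my own illustration, not code from the paper or the post; `GLUVariant` and `NaryGLU` are hypothetical names), assuming a PyTorch-style module:

```python
import torch.nn as nn
import torch.nn.functional as F

class GLUVariant(nn.Module):
    """Two parallel linear maps multiplied elementwise.

    activation=None gives the bilinear case; otherwise the function gates
    one branch (sigmoid -> GLU, F.gelu -> GEGLU, F.silu -> SwiGLU, F.relu -> ReGLU).
    """
    def __init__(self, d_in, d_hidden, activation=None):
        super().__init__()
        self.w = nn.Linear(d_in, d_hidden, bias=False)  # value branch
        self.v = nn.Linear(d_in, d_hidden, bias=False)  # gate branch
        self.activation = activation

    def forward(self, x):
        gate = self.v(x)
        if self.activation is not None:
            gate = self.activation(gate)
        return self.w(x) * gate

class NaryGLU(nn.Module):
    """The '3 encoders, or 100' idea: n multiplicative branches.

    Parameters and per-step compute both grow linearly with n_branches,
    which is where the extra VRAM / step time would go.
    """
    def __init__(self, d_in, d_hidden, n_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Linear(d_in, d_hidden, bias=False) for _ in range(n_branches)
        )

    def forward(self, x):
        out = self.branches[0](x)
        for branch in self.branches[1:]:
            out = out * branch(x)
        return out

bilinear = GLUVariant(512, 2048)          # no gate activation
swiglu = GLUVariant(512, 2048, F.silu)    # Swish-gated variant
hundred = NaryGLU(512, 2048, n_branches=100)
```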
In this post, though, I used wall-clock time as the measure of training efficiency. Hand-wavy:

loss/step * time/step

(though maybe it should be divided, to make it loss/time?)
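Spelling out the unit cancellation, so it’s explicit why dividing is the right move:

```latex
\frac{\mathrm{loss}}{\mathrm{step}} \div \frac{\mathrm{time}}{\mathrm{step}}
  = \frac{\mathrm{loss}}{\mathrm{step}} \times \frac{\mathrm{step}}{\mathrm{time}}
  = \frac{\mathrm{loss}}{\mathrm{time}}
```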
Ah, that makes more sense, thanks!
Also I agree with using loss/time as the measure of performance, since it’s fairly straightforward to interpret (loss recovered per unit time). If I were reviewing this, I’d look for that.
For efficiency in practice, I think most ML papers report total FLOPs, since a FLOP count is hardware-agnostic (FLOP/s, by contrast, is a throughput figure that depends on the hardware). Maybe a good measure of efficiency here would be loss recovered per FLOP? I haven’t seen that used, but it would reflect how performance scales with compute rather than with a particular machine.
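A rough sketch of how both measures could be computed (my own construction, not from the post; the FLOP count is a hypothetical analytic estimate for a single linear layer, not a general profiler):

```python
import time

def linear_layer_step_flops(d_in, d_hidden, batch_size):
    # ~2*d_in*d_hidden FLOPs per example for the forward matmul;
    # backward is roughly 2x forward, so ~3x total (hypothetical estimate).
    return 3 * 2 * d_in * d_hidden * batch_size

def efficiency(loss_start, loss_end, seconds, total_flops):
    recovered = loss_start - loss_end  # loss recovered over the run
    return {
        "loss_per_second": recovered / seconds,    # wall-clock efficiency
        "loss_per_flop": recovered / total_flops,  # hardware-agnostic-ish
    }

# Hypothetical usage around a training loop of n_steps steps:
# t0 = time.time()
# ... train for n_steps ...
# stats = efficiency(loss_start, loss_end, time.time() - t0,
#                    n_steps * linear_layer_step_flops(512, 2048, 64))
```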
Edit: Actually, thinking about it more, test-time (inference) efficiency might be the better comparison, assuming the two scale within roughly the same complexity class. From a product perspective, speed for users is hugely valuable (maybe the most valuable thing).