Another argument is that you can backprop through it more cleanly.
A third argument is that inference memory and per-token speed stay constant as a function of context length, at least if it's implemented like a traditional RNN.
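
To make the constant-memory point concrete, here is a rough sketch, assuming a plain Elman-style recurrence with random weights. The names `rnn_step` and `run`, and the toy sizes, are made up for illustration. It compares the fixed-size hidden state against a list that grows by one entry per token, roughly what an attention-style cache would have to store.

```python
import math
import random


def rnn_step(h, x, W_h, W_x):
    """One Elman-style update: h_new = tanh(W_h @ h + W_x @ x)."""
    return [
        math.tanh(
            sum(wh * hj for wh, hj in zip(W_h[i], h))
            + sum(wx * xj for wx, xj in zip(W_x[i], x))
        )
        for i in range(len(h))
    ]


def run(num_tokens, hidden=4, inp=3, seed=0):
    """Feed num_tokens random inputs; return (state size, cache size)."""
    rng = random.Random(seed)
    # Hypothetical random weights; a real model would learn these.
    W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(hidden)]
    W_x = [[rng.uniform(-0.5, 0.5) for _ in range(inp)] for _ in range(hidden)]

    h = [0.0] * hidden  # the only thing the RNN carries between steps
    cache = []          # stand-in for a per-token cache that grows with context

    for _ in range(num_tokens):
        x = [rng.uniform(-1.0, 1.0) for _ in range(inp)]
        h = rnn_step(h, x, W_h, W_x)  # same amount of work at every step
        cache.append(x)               # one more entry per token

    return len(h), len(cache)


if __name__ == "__main__":
    for n in (10, 1000):
        state_size, cache_size = run(n)
        print(f"tokens={n} rnn_state={state_size} cache_entries={cache_size}")
```

The hidden state stays at `hidden` floats no matter how many tokens go through, and each step costs the same, while the cache grows linearly. That only holds for the step-by-step recurrent form.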