It may also be worth adding that transformers aren’t piecewise linear. A self-attention layer computes its mixing weights from the input itself (a softmax over query–key dot products), so it dynamically constructs pathways for information to flow through, which is very nonlinear.
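To make that concrete, here's a minimal single-head sketch in NumPy (toy sizes, random projection matrices; names like `attention_weights` are purely illustrative). The point is that the softmax mixing matrix is itself computed from the input, so the layer's routing changes with the input rather than being a fixed linear map per region:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # toy embedding dimension (hypothetical)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attention_weights(x):
    """Return the softmax routing matrix for a (seq_len, d) input."""
    q, k = x @ Wq, x @ Wk
    scores = q @ k.T / np.sqrt(d)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

x1 = rng.standard_normal((3, d))
x2 = rng.standard_normal((3, d))

# A linear (or piecewise-linear) layer applies the same mixing to every input;
# here the mixing matrix itself is a function of the input.
print(attention_weights(x1).round(2))
print(attention_weights(x2).round(2))

# Consequence: the output attention_weights(x) @ (x @ Wv) multiplies two
# input-dependent terms, so the map x -> attention(x) is not piecewise linear.
```

Contrast this with a ReLU MLP, where the weight matrices are fixed and only the activation pattern changes, giving a piecewise-linear function of the input.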