Has anyone ever trained a transformer that doesn't suck without any positional information (such as positional embeddings or a causal mask)?
There's the atom transformer in AlphaFold-like architectures, although the embeddings it operates on do encode 3D positioning from earlier parts of the model, so maybe that doesn't count.
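To be concrete, here's a minimal sketch (PyTorch; the class name and dimensions are just placeholders) of what I mean by "without positional information": no positional embedding is added to the inputs and no causal mask is passed to attention, so the block is permutation-equivariant over the token axis.

```python
import torch
import torch.nn as nn


class PositionFreeBlock(nn.Module):
    """Transformer block with no positional encoding and no attention mask."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No attn_mask / is_causal and no positional encoding anywhere:
        # permuting the input tokens permutes the output the same way.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.ff(self.norm2(x))
        return x


if __name__ == "__main__":
    x = torch.randn(2, 10, 256)      # (batch, tokens, d_model)
    perm = torch.randperm(10)
    block = PositionFreeBlock().eval()
    with torch.no_grad():
        out_a = block(x)[:, perm]    # permute after the block
        out_b = block(x[:, perm])    # permute before the block
    # True (up to float tolerance): the block is permutation-equivariant
    print(torch.allclose(out_a, out_b, atol=1e-5))
```

So the question is whether anyone has gotten a stack of blocks like this to train well on sequence data, rather than on inherently unordered inputs like atom sets.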