Your bilinear attention layer is a bilinear function, but one where the input (x) is copied and used as both inputs. Using the matrices Dec, L_Enc, and R_enc the way you show is one particular way to parametrize that bilinear function. There are many other ways; the simplest would be to just use one tensor of shape [size-of-y, size-of-x, size-of-x]. I’m curious: why did you choose that particular parametrization?
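Just so we’re talking about the same thing, here’s a minimal sketch of the two parametrizations I have in mind (the names, shapes, and PyTorch framing are my guesses, not your actual code):

```python
import torch

d_x, d_y, rank = 64, 64, 16  # made-up sizes, purely for illustration

# Your parametrization, as I read it: two "encoder" maps and one "decoder" map.
Dec   = torch.randn(d_y, rank)
L_Enc = torch.randn(rank, d_x)
R_Enc = torch.randn(rank, d_x)

def bilinear_factored(x):
    # the same x is fed to both sides of the bilinear map
    return Dec @ ((L_Enc @ x) * (R_Enc @ x))

# The "simplest" alternative: one dense 3rd-order tensor of shape [size-of-y, size-of-x, size-of-x].
B = torch.randn(d_y, d_x, d_x)

def bilinear_full(x):
    return torch.einsum('oij,i,j->o', B, x, x)

x = torch.randn(d_x)
y1, y2 = bilinear_factored(x), bilinear_full(x)  # both shape [d_y]; different weights, same family of maps
```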
Also, how did you initialize the model’s weights? How do you initialize them to prevent exploding gradients and similar problems?
I am curious about all this because my master’s thesis was about a tensor-network-based alternative to CNNs.
A full 3rd-order tensor is much larger, whereas this parametrization is the CP-decomposition form. That’s the “official” reason; really I’m just building off Dooms et al. (I’ve never actually tried training the full tensor, though!)
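To spell that out (a rough sketch reusing the made-up names and sizes from your snippet, not my actual code): the three matrices are exactly the CP factors of that big [size-of-y, size-of-x, size-of-x] tensor, which takes the parameter count from cubic down to linear in the sizes.

```python
import torch

d_x, d_y, rank = 64, 64, 16
Dec   = torch.randn(d_y, rank, dtype=torch.double)
L_Enc = torch.randn(rank, d_x, dtype=torch.double)
R_Enc = torch.randn(rank, d_x, dtype=torch.double)

# Rebuild the full 3rd-order tensor from its CP factors:
#   B[o, i, j] = sum_r Dec[o, r] * L_Enc[r, i] * R_Enc[r, j]
B = torch.einsum('or,ri,rj->oij', Dec, L_Enc, R_Enc)

# Both parametrizations compute the same bilinear map on the copied input.
x = torch.randn(d_x, dtype=torch.double)
assert torch.allclose(Dec @ ((L_Enc @ x) * (R_Enc @ x)),
                      torch.einsum('oij,i,j->o', B, x, x))

print(B.numel())                                    # 64 * 64 * 64        = 262144 parameters
print(Dec.numel() + L_Enc.numel() + R_Enc.numel())  # 16 * (64 + 64 + 64) = 3072 parameters
```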
Re init: the init for modded gpt at that fork was kind of weird, but I’m pretty sure most standard inits prevent that (a rough sketch of what I mean is at the end of this reply). I’m using RMSNorm, which can be treated as a tensor network as well (I could maybe DM an explanation; it’s a forthcoming resource from Thomas). I’m also normalizing Q & K, which isn’t a tensor network, BUT compositionality is on a spectrum (maybe I am too). So this does mean a small portion of the model isn’t a tensor network.
Ideally we can work around this!
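And on the init question: by “standard init” I just mean something like the usual fan-in-scaled normal on the encoder/decoder matrices, which keeps activations (and hence gradients) roughly unit-scale. A rough sketch below, not the exact code in that fork:

```python
import torch

def scaled_normal_init_(weight: torch.Tensor) -> None:
    # Kaiming/LeCun-style fan-in scaling: std = 1/sqrt(fan_in), so a matmul with
    # roughly unit-variance inputs produces roughly unit-variance outputs, which
    # is usually enough to keep the forward and backward passes from blowing up.
    fan_in = weight.shape[-1]
    torch.nn.init.normal_(weight, mean=0.0, std=fan_in ** -0.5)

W = torch.empty(16, 64)
scaled_normal_init_(W)  # e.g. one of the encoder matrices
```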