cousin_it comments on Understanding the tensor product formulation in Transformer Circuits

cousin_it 28 Dec 2021 10:54 UTC
4 points
Can’t say much about transformers, but the tensor product definition seems off. There can be many elements in V⊗W that aren’t expressible as v⊗w, only as a linear combination of multiple such. That can be seen from dimensionality: if V and W have dimensions n and m, then a pair (v,w) is described by only n+m numbers (Cartesian product), so the elements of the form v⊗w make up at most an n+m-dimensional family, but the full tensor product has nm dimensions.
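To make this concrete, here is a minimal sketch in plain Python. It assumes we write tensors in V⊗W in coordinates as n×m matrices, so that v⊗w becomes the matrix with entries v[i]*w[j]. The helper names `outer`, `add` and `is_pure` are made up for illustration. An element is of the form v⊗w exactly when its matrix has rank at most 1, which is the same as all of its 2×2 minors vanishing.

```python
def outer(v, w):
    """Coordinates of v⊗w: the matrix with entries v[i] * w[j]."""
    return [[vi * wj for wj in w] for vi in v]


def add(a, b):
    """Entrywise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def is_pure(t):
    """True if t can be written as a single v⊗w (every 2x2 minor is zero)."""
    rows, cols = len(t), len(t[0])
    return all(
        t[i][j] * t[k][l] == t[i][l] * t[k][j]
        for i in range(rows)
        for k in range(i + 1, rows)
        for j in range(cols)
        for l in range(j + 1, cols)
    )


e1, e2 = [1, 0], [0, 1]

# A single outer product: passes the rank-1 check.
pure = outer([1, 2], [3, 4])

# e1⊗e1 + e2⊗e2 is the identity matrix, which is not any single v⊗w.
mixed = add(outer(e1, e1), outer(e2, e2))

print(is_pure(pure), is_pure(mixed))
```

Integer entries keep the minor check exact; with floats you would want a tolerance instead of `==`.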

Here’s an explanation of tensor products that I came up with some time ago in an attempt to make it “click”. Imagine you have a linear function that takes in two vectors and spits out a number. But wait, there are two natural but incompatible ways to imagine it:
1. f(a,b) + f(c,d) = f(a+c,b+d), linear in both arguments combined. The space of such functions has dimension n+m, and corresponds to Cartesian product.
2. f(a,b) + f(a,c) = f(a,b+c) and also f(a,c) + f(b,c) = f(a+b,c), in other words, linear in each argument separately. The space of such functions has dimension nm, and corresponds to tensor product.
It’s especially simple to work through the case n=m=1. In that case all functions satisfying (1) have the form f(x,y)=ax+by, so their space is 2-dimensional, while all functions satisfying (2) have the form f(x,y)=axy, so their space is 1-dimensional. Admittedly this case is a bit funny because nm<n+m, but you can see how in higher dimensions the space of functions of type (2) becomes much bigger, because it will have terms for x1y1, x1y2, etc.
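The two kinds of functions can also be checked numerically. The following is a rough sketch in plain Python, assuming n=2 and m=3 with arbitrarily chosen integer coefficients. The helpers `linear` and `bilinear` are made-up names: the first builds a type (1) function from n+m coefficients, the second builds a type (2) function from an n×m table of coefficients.

```python
def linear(a, b):
    """Type (1): f(x, y) = a·x + b·y, fixed by n + m coefficients."""
    def f(x, y):
        return sum(ai * xi for ai, xi in zip(a, x)) + sum(bj * yj for bj, yj in zip(b, y))
    return f


def bilinear(A):
    """Type (2): f(x, y) = sum over i, j of A[i][j] * x[i] * y[j], fixed by n * m coefficients."""
    def f(x, y):
        return sum(A[i][j] * x[i] * y[j] for i in range(len(x)) for j in range(len(y)))
    return f


def vec_add(u, v):
    return [ui + vi for ui, vi in zip(u, v)]


f1 = linear([1, 2], [3, 4, 5])             # 2 + 3 = 5 coefficients
f2 = bilinear([[1, 2, 3], [4, 5, 6]])      # 2 * 3 = 6 coefficients

x1, y1 = [1, 0], [1, 2, 3]
x2, y2 = [2, 1], [0, 1, 1]

# Property (1): f(a,b) + f(c,d) = f(a+c, b+d)
joint_f1 = f1(x1, y1) + f1(x2, y2) == f1(vec_add(x1, x2), vec_add(y1, y2))
joint_f2 = f2(x1, y1) + f2(x2, y2) == f2(vec_add(x1, x2), vec_add(y1, y2))

# Property (2), second argument: f(a,b) + f(a,c) = f(a, b+c)
separate_f2 = f2(x1, y1) + f2(x1, y2) == f2(x1, vec_add(y1, y2))

print(joint_f1, joint_f2, separate_f2)
```

With these particular inputs, `f1` satisfies property (1) while `f2` does not, and `f2` satisfies property (2) in its second argument. A single numeric check is of course only an illustration, not a proof.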
- Tom Lieberum 28 Dec 2021 14:12 UTC
  1 point
  Ah yes that makes sense to me. I’ll modify the post accordingly and probably write it in the basis formulation.
  
  ETA: Fixed now, computation takes a tiny bit longer but hopefully still readable to everyone.