Showing behavioural diffs on random strings is actually a feature (not a bug) for me.
That said, the paper does generalize tensor sim with metric M (eq 3), which could be a covariance matrix on the dataset you’re interested in. Was this what you wanted?
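
A rough sketch of what a metric-weighted tensor similarity could look like. This assumes an (out, in, in) weight tensor and applies M to both input modes. The paper's eq 3 may place M differently, and `tensor_inner` and `tensor_sim` are hypothetical names.

```python
import numpy as np

def tensor_inner(A, B, M=None):
    """Inner product of two (out, in, in) tensors with metric M on both input modes.

    Assumption: M is symmetric PSD; M=None means the identity (plain Frobenius inner product).
    """
    if M is None:
        M = np.eye(A.shape[1])
    return np.einsum("oij,ik,jl,okl->", A, M, M, B)

def tensor_sim(A, B, M=None):
    """Cosine similarity induced by the inner product above."""
    return tensor_inner(A, B, M) / np.sqrt(tensor_inner(A, A, M) * tensor_inner(B, B, M))

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 6, 6))
B = A + 0.1 * rng.normal(size=A.shape)   # a slightly perturbed copy of A

X = rng.normal(size=(1000, 6))           # stand-in for the dataset of interest
M = np.cov(X, rowvar=False)              # covariance used as the metric

sim_identity = tensor_sim(A, B)          # similarity under isotropic inputs
sim_data = tensor_sim(A, B, M)           # similarity weighted by the data covariance
```

With M set to the identity this reduces to the ordinary cosine similarity of the flattened tensors, so the metric only changes which input directions count.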


For finding a decomp that has meaningful semantics, I’ve really struggled with this. There are principled decomps like CP & Tucker, and I think LL1 is more what we want. It’s like Tucker, but you only mix in the two input components. I think this makes sense because if multiple components write to the same direction, then they should be grouped together. In other words, the rank complexity doesn’t matter for the input.
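
To make the LL1 structure concrete, here is a minimal sketch of building a tensor from (L, L, 1) terms. It assumes the "1" mode is the output direction and the two rank-L modes are the inputs. The sizes and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, R, L = 5, 8, 3, 2        # hypothetical sizes: R terms, each of input rank L

C = rng.normal(size=(R, d_out))       # one output direction per term
A = rng.normal(size=(R, d_in, L))     # left input factors
B = rng.normal(size=(R, d_in, L))     # right input factors

# T[o, i, j] = sum_r C[r, o] * (A_r @ B_r.T)[i, j]
T = np.einsum("ro,ril,rjl->oij", C, A, B)

# A single term: a rank-L input matrix tied to one output direction
input_matrix0 = A[0] @ B[0].T
term0 = np.einsum("o,ij->oij", C[0], input_matrix0)
```

Each term writes to exactly one output direction, while its input side can have any rank L. That is the sense in which everything writing to the same direction gets grouped into one component.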
Thomas also wrote a paper finding a canonical representation of a multi-layer tensor network. This allows you to truncate more singular vectors. From a graph/circuit perspective, this lets you have fewer nodes. However, between layers, they are densely connected (i.e. dense edges), making it hard to interpret later features in terms of earlier ones.
You can fix this somewhat by finding a rotation that minimizes both edges and nodes; however, because it’s fundamentally SVD under the hood, the features are constrained to be orthogonal. This constraint interferes with our ability to have a small number of edges & nodes, so we can maybe focus on that for the moment.
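
As a small illustration of the orthogonality constraint, here is a sketch using a truncated SVD of a single random matrix. This is plain NumPy and not the method from that paper. `W` and `k` are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 32))            # stand-in for one layer's weights

U, S, Vt = np.linalg.svd(W, full_matrices=False)

k = 4                                    # number of features (nodes) kept
U_k, S_k, Vt_k = U[:, :k], S[:k], Vt[:k]
W_k = (U_k * S_k) @ Vt_k                 # rank-k approximation of W

# The kept input features are orthonormal by construction
gram = Vt_k @ Vt_k.T

# Truncation error equals the energy in the discarded singular values
err = np.linalg.norm(W - W_k)
```

Truncating reduces the node count, but `gram` is always the identity, so any non-orthogonal feature basis is out of reach for this kind of decomposition.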
Warning: that compositionality paper is quite hard to understand!