Interpreting a matrix-valued word embedding with a mathematically proven characterization of all optima

In this post, I shall first describe a new word embedding algorithm that I came up with, called a matrix product optimized (MPO) word embedding, and I will prove a theorem that completely interprets this word embedding in the simplest case. While it is probably infeasible to characterize a word embedding with a mathematical proof when the corpus is one that we encounter in practice, this theorem should be a signal that such a word embedding (or a similar word embedding) is interpretable and mathematical in other ways as well. This theorem also illustrates the way that MPO word embeddings should behave.

Unlike most word embedding algorithms, MPO word embeddings are matrix-valued: they map tokens to matrices instead of simply mapping tokens to vectors. In our case, the matrices are not necessarily real matrices, as they may be complex or even quaternionic matrices. MPO word embeddings also differ from other word embedding algorithms in that they are not constructed using neural networks, though we still use gradient ascent.

Why MPO word embeddings?

Since tokens often have many meanings depending on context, it seems better to represent a token in a form that makes it easier to separate the individual meanings of that token. While vectors may be good for representing individual meanings of tokens, it is better to represent a polysemantic token as a matrix instead of a vector. If someone were to give me the task of interpreting a word embedding, I would be much happier if the word embedding were matrix-valued and neatly organized each of the meanings of a polysemantic token into a matrix than if it were vector-valued with the individual meanings of the token awkwardly smushed together in a vector.

Spaces of matrices have additional structure that is lacking in vector spaces, and one can use this additional structure to analyze or interpret our word embedding. This additional structure also means that matrix-valued word embeddings should behave more mathematically than vector-valued word embeddings.

MPO word embeddings also satisfy some interesting properties. Let $F(X)$ denote the fitness level of a random MPO word embedding that we obtain from the input $X$, where $X$ consists of the corpus that we are training our word embedding on along with a couple of hyperparameters. Then the random variable $F(X)$ will often have very low entropy or even zero entropy. Since $F(X)$ often has zero or low entropy, trained MPO word embeddings will not contain any (or much) random information that was not already present in the training data. It is easier to interpret machine learning models when the trained model depends only on the training data and hyperparameters and does not depend on random choices such as the initialization. Quaternionic and complex MPO word embeddings will often become real word embeddings after a change of basis. This means that MPO word embeddings behave quite mathematically, and it seems like machine learning models that behave in ways that mathematicians like would be more interpretable and understandable than other machine learning models.

Quaternionic matrices:

In this section, we shall go over the basics of quaternionic matrices. I assume that the reader is already familiar with ideas in linear algebra up through the singular value decomposition. Much of the basic theory of quaternionic matrices is a straightforward generalization of the theory of complex matrices. I hope you are also familiar with the quaternions, but if you are not, then I will remind you of the definition and basic facts about quaternions. We refer the reader to [1] for facts about quaternions. You may skip this section if you simply care about the real and complex cases.

If $A$ is a square real, complex, or quaternionic matrix, then the spectral radius of $A$ is

$\rho(A) = \lim_{m\rightarrow\infty}\|A^m\|^{1/m}$

where $\|\cdot\|$ is any matrix norm (since we are taking the limit, it does not matter which matrix norm we choose). If $A$ is a real or complex matrix, then

$\rho(A) = \max\{|\lambda| : \lambda\ \text{is an eigenvalue of}\ A\}$.

The same fact holds for quaternionic matrices, but we have to be careful about how we define the eigenvalues of quaternionic matrices due to non-commutativity.
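
To make this concrete, here is a minimal numerical sketch (Python with NumPy; the example matrix is arbitrary) comparing Gelfand's limit with the eigenvalue characterization of the spectral radius:

```python
# A sketch of Gelfand's formula: for any matrix norm, ||A^m||^(1/m) -> rho(A).
# We normalize A first so that the powers A^m stay representable in floats.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
A /= np.linalg.norm(A)                                  # Frobenius normalization

rho_eig = max(abs(np.linalg.eigvals(A)))                # max |eigenvalue|
for m in (1, 10, 50, 200):
    rho_gelfand = np.linalg.norm(np.linalg.matrix_power(A, m)) ** (1 / m)
    print(m, rho_gelfand)
print("eigenvalue characterization:", rho_eig)          # the limit of the values above
```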

The division ring of quaternions $\mathbb{H}$ is the 4-dimensional associative but non-commutative algebra over the field of real numbers spanned by the elements $1,i,j,k$ (so $\mathbb{H} = \{a+bi+cj+dk : a,b,c,d\in\mathbb{R}\}$) where quaternionic multiplication is the unique associative (but non-commutative) bilinear operation such that $i^2 = j^2 = k^2 = ijk = -1$. We observe that if $x,y\in\{i,j,k\}$ are distinct, then $xy = -yx$. It is easy to show that $ij = k$, $jk = i$, and $ki = j$.

Recall that the conjugate and the absolute value of a quaternion $x = a+bi+cj+dk$ are defined by

$\overline{x} = a-bi-cj-dk$ and $|x| = \sqrt{a^2+b^2+c^2+d^2}$,

and $x\overline{x} = \overline{x}x = |x|^2$ whenever $x\in\mathbb{H}$. If $x,y\in\mathbb{H}$, then $\overline{xy} = \overline{y}\cdot\overline{x}$ and $|xy| = |x|\cdot|y|$. Observe that the field of complex numbers $\mathbb{C} = \{a+bi : a,b\in\mathbb{R}\}$ is a subring of $\mathbb{H}$.
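
Here is a minimal sketch of quaternion arithmetic on 4-tuples $(a,b,c,d)\leftrightarrow a+bi+cj+dk$ (pure Python; the helper names qmul, qconj, and qabs are my own), checking the multiplication table, the conjugate anti-homomorphism, and the multiplicativity of the absolute value:

```python
import math
import random

def qmul(x, y):
    # Hamilton product of quaternions stored as 4-tuples (a, b, c, d).
    a1, b1, c1, d1 = x
    a2, b2, c2, d2 = y
    return (a1*a2 - b1*b2 - c1*c2 - d1*d2,
            a1*b2 + b1*a2 + c1*d2 - d1*c2,
            a1*c2 - b1*d2 + c1*a2 + d1*b2,
            a1*d2 + b1*c2 - c1*b2 + d1*a2)

def qconj(x):
    a, b, c, d = x
    return (a, -b, -c, -d)

def qabs(x):
    return math.sqrt(sum(t * t for t in x))

i, j, k = (0,1,0,0), (0,0,1,0), (0,0,0,1)
assert qmul(i, i) == qmul(j, j) == qmul(k, k) == (-1, 0, 0, 0)   # i^2 = j^2 = k^2 = -1
assert qmul(i, j) == k and qmul(j, i) == (0, 0, 0, -1)           # ij = k = -ji

random.seed(0)
x = tuple(random.uniform(-1, 1) for _ in range(4))
y = tuple(random.uniform(-1, 1) for _ in range(4))
assert math.isclose(qabs(qmul(x, y)), qabs(x) * qabs(y), abs_tol=1e-12)  # |xy| = |x||y|
assert all(math.isclose(s, t, abs_tol=1e-12)                    # conj(xy) = conj(y) conj(x)
           for s, t in zip(qconj(qmul(x, y)), qmul(qconj(y), qconj(x))))
```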

If $A = (a_{r,s})$ is a quaternionic matrix, then define the adjoint of $A$ to be $A^* = (\overline{a_{s,r}})$. While the adjoint of a quaternionic matrix is well-behaved, the transpose of a quaternionic matrix is not very well behaved since we typically have $(AB)^* = B^*A^*$ but $(AB)^T\neq B^TA^T$ for quaternionic matrices $A,B$.

We shall now associate $n\times n$-quaternionic matrices with $2n\times 2n$-complex matrices.

Suppose that $A,B$ are $n\times n$-complex matrices. Then the associated complex matrix of the quaternionic matrix $A+Bj$ is the $2n\times 2n$-complex matrix

$\Gamma(A+Bj) = \begin{bmatrix}A & B\\ -\overline{B} & \overline{A}\end{bmatrix}$.

We observe that $\Gamma(X+Y) = \Gamma(X)+\Gamma(Y)$, and $\Gamma(XY) = \Gamma(X)\Gamma(Y)$, and $\Gamma(X^*) = \Gamma(X)^*$ whenever $X,Y$ are quaternionic matrices and the operations are defined.
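
Here is a minimal sketch (NumPy; the helper names are my own) that stores a quaternionic matrix $X = A+Bj$ as the complex pair $(A,B)$ and checks that $\Gamma$ respects products and adjoints:

```python
import numpy as np

def gamma(A, B):
    # Gamma(A + Bj) = [[A, B], [-conj(B), conj(A)]]
    return np.block([[A, B], [-B.conj(), A.conj()]])

def qmat_mul(X, Y):
    # (A1 + B1 j)(A2 + B2 j) = (A1 A2 - B1 conj(B2)) + (A1 B2 + B1 conj(A2)) j,
    # using the rule j z = conj(z) j for complex z.
    (A1, B1), (A2, B2) = X, Y
    return (A1 @ A2 - B1 @ B2.conj(), A1 @ B2 + B1 @ A2.conj())

def qmat_adj(X):
    # (A + Bj)* = A* - B^T j  (conjugate-transpose each quaternionic entry)
    A, B = X
    return (A.conj().T, -B.T)

def rand_qmat(rng, n):
    return (rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)),
            rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)))

rng = np.random.default_rng(1)
X, Y = rand_qmat(rng, 3), rand_qmat(rng, 3)
assert np.allclose(gamma(*qmat_mul(X, Y)), gamma(*X) @ gamma(*Y))  # Gamma(XY) = Gamma(X)Gamma(Y)
assert np.allclose(gamma(*qmat_adj(X)), gamma(*X).conj().T)        # Gamma(X*) = Gamma(X)*
```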

An eigenvalue of a quaternionic matrix $A$ is a quaternion $\lambda$ such that $Av = v\lambda$ for some non-zero quaternionic vector $v$ (due to non-commutativity, we multiply the eigenvector by the eigenvalue on the right).

Observation: If $\lambda$ is an eigenvalue of a quaternionic matrix $A$ corresponding to the eigenvector $v$ and $\mu$ is an eigenvalue of a quaternionic matrix $B$ corresponding to the same eigenvector $v$, then $BAv = B(v\lambda) = (Bv)\lambda = v\mu\lambda$, so $v$ is an eigenvector of the quaternionic matrix $BA$ with eigenvalue $\mu\lambda$.

Observation: If $\lambda$ is an eigenvalue of a product $AB$ of quaternionic matrices and $ABv = v\lambda$, then $BA(Bv) = (Bv)\lambda$, so if $Bv\neq 0$, then $\lambda$ is also an eigenvalue of $BA$. In particular, $AB$ and $BA$ have the same non-zero eigenvalues.

Observation: If $A$ is a quaternionic matrix and $\lambda$ is an eigenvalue of $A$, then whenever $x$ is a non-zero quaternion, the value $x^{-1}\lambda x$ is also an eigenvalue of $A$.

Proof: If $\lambda$ is an eigenvalue of $A$, then there is some non-zero quaternionic vector $v$ with $Av = v\lambda$. Therefore, set $w = vx$. Then $Aw = A(vx) = (Av)x = v\lambda x = vx(x^{-1}\lambda x) = w(x^{-1}\lambda x)$, so $x^{-1}\lambda x$ is an eigenvalue of the quaternionic matrix $A$.

Let $U(1,\mathbb{H})$ denote the group of all quaternions $q$ with $|q| = 1$. Let $SU(2)$ denote the group of all $2\times 2$-complex unitary matrices with determinant $1$. Let $SO(3)$ denote the group of all $3\times 3$-real orthogonal matrices with determinant $1$. Then the mapping $q\mapsto\Gamma(q)$ is an isomorphism from $U(1,\mathbb{H})$ to $SU(2)$. Let $V$ be the vector subspace of $\mathbb{H}$ consisting of all elements of the form $bi+cj+dk$ where $b,c,d$ are real numbers. Then $V$ is an inner product space where the inner product is the standard dot product operation on $V$ (where $\langle b_1i+c_1j+d_1k,\ b_2i+c_2j+d_2k\rangle = b_1b_2+c_1c_2+d_1d_2$, and where $\langle x,y\rangle = \operatorname{Re}(x\overline{y})$ for quaternions $x,y\in V$). Then we can identify $SO(3)$ with the orientation-preserving linear transformations of $V$ that preserve this inner product. We may now define a group homomorphism $\Theta: U(1,\mathbb{H})\rightarrow SO(3)$ by letting $\Theta(q)(x) = qxq^{-1}$ for $x\in V$. Then the homomorphism $\Theta$ is a surjective homomorphism with kernel $\{1,-1\}$. In fact, $SO(3)$ is a connected Lie group with $\pi_1(SO(3))\cong\mathbb{Z}/2\mathbb{Z}$, and the group $U(1,\mathbb{H})$ is simply connected; $U(1,\mathbb{H})$ is the universal cover of the Lie group $SO(3)$. From these facts about quaternions, we may make a few observations:
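
Here is a minimal sketch of the double cover (NumPy; qmul is the Hamilton product from before): a random unit quaternion acts on the purely imaginary quaternions by conjugation, and the matrix of this action is a rotation.

```python
import numpy as np

def qmul(x, y):
    a1, b1, c1, d1 = x
    a2, b2, c2, d2 = y
    return np.array([a1*a2 - b1*b2 - c1*c2 - d1*d2,
                     a1*b2 + b1*a2 + c1*d2 - d1*c2,
                     a1*c2 - b1*d2 + c1*a2 + d1*b2,
                     a1*d2 + b1*c2 - c1*b2 + d1*a2])

rng = np.random.default_rng(2)
q = rng.standard_normal(4); q /= np.linalg.norm(q)   # a unit quaternion
qinv = np.array([q[0], -q[1], -q[2], -q[3]])         # q^{-1} = conj(q) since |q| = 1

basis = np.eye(4)[1:]                                # i, j, k as 4-vectors
images = [qmul(qmul(q, x), qinv) for x in basis]     # x -> q x q^{-1}
assert all(np.isclose(im[0], 0) for im in images)    # conjugation preserves span{i, j, k}

R = np.column_stack([im[1:] for im in images])       # the 3x3 matrix Theta(q)
assert np.allclose(R.T @ R, np.eye(3)) and np.isclose(np.linalg.det(R), 1.0)
# Note that q and -q give the same rotation: the kernel of Theta is {1, -1}.
```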

Observation: If $x,y$ are quaternions, then there is some $q\in U(1,\mathbb{H})$ with $y = qxq^{-1}$ if and only if $\operatorname{Re}(x) = \operatorname{Re}(y)$ and $|x| = |y|$.

Observation: Suppose that $A$ is a square quaternionic matrix and $\lambda,\mu$ are quaternions with $\operatorname{Re}(\lambda) = \operatorname{Re}(\mu)$ and $|\lambda| = |\mu|$. Then $\lambda$ is an eigenvalue of $A$ if and only if $\mu$ is an eigenvalue of $A$.

Observation: Suppose that $v$ is a quaternionic vector, and $v_1,v_2$ are complex vectors with $v = v_1+v_2j$, and set $\gamma(v) = \begin{bmatrix}v_1\\ -\overline{v_2}\end{bmatrix}$. Suppose furthermore that $\lambda$ is a complex number. Then $Av = v\lambda$ if and only if $\Gamma(A)\gamma(v) = \gamma(v)\lambda$. In particular, a complex number $\lambda$ is an eigenvalue of $A$ if and only if $\lambda$ is an eigenvalue of $\Gamma(A)$.

Observation: Let $A$ be a square quaternionic matrix. Then the following quantities are equal:

  1. $\lim_{m\rightarrow\infty}\|A^m\|^{1/m}$ (for any matrix norm $\|\cdot\|$).

  2. $\rho(\Gamma(A))$.

  3. $\max\{|\lambda| : \lambda\ \text{is an eigenvalue of}\ A\}$.

We shall therefore define the spectral radius $\rho(A)$ of a square quaternionic matrix $A$ to be $\lim_{m\rightarrow\infty}\|A^m\|^{1/m}$ (or any of the quantities in the above observation).
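
Here is a minimal sketch computing the spectral radius of a quaternionic matrix through its complex representation (NumPy; gamma as defined earlier):

```python
import numpy as np

def gamma(A, B):
    return np.block([[A, B], [-B.conj(), A.conj()]])

rng = np.random.default_rng(3)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

G = gamma(A, B)
rho = max(abs(np.linalg.eigvals(G)))   # rho(A + Bj) = rho(Gamma(A + Bj))

# Gelfand's formula on the complex side approaches the same value.
Gn = G / np.linalg.norm(G)
m = 200
print(rho, np.linalg.norm(np.linalg.matrix_power(Gn, m)) ** (1 / m) * np.linalg.norm(G))
```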

If $A$ is a quaternionic matrix, then we say that $A$ is Hermitian, normal, or unitary respectively if $\Gamma(A)$ is Hermitian, normal, or unitary. If $A$ is an $n\times n$-quaternionic matrix, then $A$ is Hermitian iff $A = A^*$, and $A$ is normal iff $AA^* = A^*A$, and $A$ is unitary iff $AA^* = A^*A = I_n$.

If $u = (u_r)_r$ and $v = (v_r)_r$ are quaternionic column vectors, then define the quaternionic inner product of $u,v$ by $\langle u,v\rangle = u^*v = \sum_r\overline{u_r}v_r$. We observe that the quaternionic inner product is real bilinear and conjugate symmetric ($\langle u,v\rangle = \overline{\langle v,u\rangle}$). The quaternionic inner product preserves right scalar multiplication in its second argument ($\langle u,v\lambda\rangle = \langle u,v\rangle\lambda$). We define the real inner product of two quaternionic vectors $u,v$ by setting $\langle u,v\rangle_{\mathbb{R}} = \operatorname{Re}(\langle u,v\rangle)$. We may recover the quaternionic inner product from the real-valued inner product since $\langle u,v\rangle = \langle u,v\rangle_{\mathbb{R}}-\langle u,vi\rangle_{\mathbb{R}}i-\langle u,vj\rangle_{\mathbb{R}}j-\langle u,vk\rangle_{\mathbb{R}}k$ for quaternionic vectors $u,v$.

Observation: If $u,v$ are quaternionic vectors, then $\langle ux,v\rangle = \overline{x}\langle u,v\rangle$ and $\langle u,vx\rangle = \langle u,v\rangle x$ for every quaternion $x$.

Observation: Suppose $u,v$ are quaternionic vectors. Then $|\langle u,v\rangle|\leq\|u\|\cdot\|v\|$.

Observation: Suppose $u,v$ are non-zero quaternionic vectors. If $|\langle u,v\rangle| = \|u\|\cdot\|v\|$, then $v = ux$ for some quaternion $x$.

Counterexample: In general, if $x$ is a non-zero quaternion, and $u$ is a non-zero quaternionic vector, then $|\langle u,xu\rangle|\neq\|u\|\cdot\|xu\|$; left scalar multiples of $u$ need not attain the above equality.

Observation: Let $A$ be an $m\times n$-quaternionic matrix, and let $B$ be an $n\times m$-quaternionic matrix. Then the following are equivalent:

  1. $\langle Au,v\rangle = \langle u,Bv\rangle$ for all quaternionic vectors $u,v$.

  2. $\langle Au,v\rangle_{\mathbb{R}} = \langle u,Bv\rangle_{\mathbb{R}}$ for all quaternionic vectors $u,v$.

Proposition: Let $A$ be a quaternionic matrix. Then the following are equivalent:

  1. $\langle Au,Av\rangle = \langle u,v\rangle$ for all quaternionic vectors $u,v$.

  2. $\langle Au,Av\rangle_{\mathbb{R}} = \langle u,v\rangle_{\mathbb{R}}$ for all quaternionic vectors $u,v$.

  3. $\|Au\| = \|u\|$ for all quaternionic vectors $u$.

  4. $A$ is an isometry.

If $A$ is a square quaternionic matrix, then the above statements are all equivalent to the following:

  5. $A$ is unitary.

Theorem: (quaternionic singular value decomposition) If $A$ is a quaternionic matrix, then there exist quaternionic unitary matrices $U,V$ where $A = UDV^*$ and $D$ is a diagonal matrix with non-negative real non-increasing diagonal entries.

In the above theorem, the diagonal entries of $D$ are called the singular values of $A$, and they are denoted by $\sigma_1(A)\geq\sigma_2(A)\geq\cdots\geq\sigma_n(A)$ where $n$ is the number of diagonal entries of $D$.

If $A$ is a real, complex, or quaternionic matrix, and $1\leq p\leq\infty$, then define the Schatten $p$-norm of $A$ to be $\|A\|_p = \left(\sum_k\sigma_k(A)^p\right)^{1/p}$, where we define $\|A\|_\infty = \sigma_1(A)$ for $p = \infty$, and where $\|A\|_2$ coincides with the Frobenius norm defined below.

Observation: If $A$ is a quaternionic matrix with singular values $\sigma_1,\dots,\sigma_n$, then $\Gamma(A)$ has singular values $\sigma_1,\sigma_1,\dots,\sigma_n,\sigma_n$ (each singular value of $A$ appears twice). In particular, $\|\Gamma(A)\|_p^p = 2\|A\|_p^p$ and $\|\Gamma(A)\|_\infty = \|A\|_\infty$, so $\|\Gamma(A)\|_p = 2^{1/p}\|A\|_p$ whenever $1\leq p<\infty$.

Observation: If $A$ is a quaternionic matrix, then $\rho(A)\leq\|A\|_\infty\leq\|A\|_p$ whenever $1\leq p\leq\infty$.
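
Here is a minimal sketch checking the doubling of singular values and the Schatten norm identity through the complex representation (NumPy; gamma as before):

```python
import numpy as np

def gamma(A, B):
    return np.block([[A, B], [-B.conj(), A.conj()]])

rng = np.random.default_rng(4)
n = 3
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

s = np.linalg.svd(gamma(A, B), compute_uv=False)   # 2n singular values, sorted
print(np.round(s, 6))                              # adjacent pairs of equal values

p = 3.0
schatten_gamma = (s ** p).sum() ** (1 / p)         # ||Gamma(X)||_p
schatten_X = (s[::2] ** p).sum() ** (1 / p)        # ||X||_p (one copy of each pair)
assert np.isclose(schatten_gamma, 2 ** (1 / p) * schatten_X)
```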

If $A = (a_{r,s})$ is an $n\times n$-quaternionic matrix, then define the quaternionic trace of $A$ as

$\operatorname{Tr}(A) = \sum_{r=1}^na_{r,r}$.

In general, $\operatorname{Tr}(AB)\neq\operatorname{Tr}(BA)$ for quaternionic matrices $A,B$, so the notion of the quaternionic trace is deficient.

If $A$ is an $n\times n$-quaternionic matrix, then define the real trace of $A$ as $\operatorname{Tr}_{\mathbb{R}}(A) = \operatorname{Re}(\operatorname{Tr}(A))$. We observe that $\operatorname{Tr}_{\mathbb{R}}(A+B) = \operatorname{Tr}_{\mathbb{R}}(A)+\operatorname{Tr}_{\mathbb{R}}(B)$ and $\operatorname{Tr}_{\mathbb{R}}(AB) = \operatorname{Tr}_{\mathbb{R}}(BA)$ whenever $A,B$ are quaternionic matrices and the operations are defined.

Observe that we can recover $\operatorname{Tr}(A)$ from $\operatorname{Tr}_{\mathbb{R}}$ by using the formula $\operatorname{Tr}(A) = \operatorname{Tr}_{\mathbb{R}}(A)-\operatorname{Tr}_{\mathbb{R}}(Ai)i-\operatorname{Tr}_{\mathbb{R}}(Aj)j-\operatorname{Tr}_{\mathbb{R}}(Ak)k$.

If $A,B$ are $m\times n$-quaternionic matrices, then define the real-valued Frobenius inner product as $\langle A,B\rangle_{\mathbb{R}} = \operatorname{Tr}_{\mathbb{R}}(A^*B)$.

If $x = a_1+b_1i+c_1j+d_1k$ and $y = a_2+b_2i+c_2j+d_2k$ where every $a_t,b_t,c_t,d_t$ is real, then

$\operatorname{Re}(\overline{x}y) = a_1a_2+b_1b_2+c_1c_2+d_1d_2$.

Define $\|A\|_F = \sqrt{\langle A,A\rangle_{\mathbb{R}}}$; thus $\|A\|_F^2$ is the sum of the squares of the absolute values of the entries of $A$.

We observe that if $U,V$ are unitary, then

$\|UAV\|_F = \|A\|_F$.

In particular, if $A = UDV^*$ where $U,V$ are unitary and $D$ is diagonal with diagonal entries $\sigma_1,\dots,\sigma_n$, then $\|A\|_F = \sqrt{\sigma_1^2+\cdots+\sigma_n^2} = \|A\|_2$.
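
Here is a minimal sketch of the real trace and the Frobenius norm through the complex representation: since $\operatorname{Tr}(\Gamma(X)) = 2\operatorname{Tr}_{\mathbb{R}}(X)$, we have $\operatorname{Tr}_{\mathbb{R}}(X) = \frac{1}{2}\operatorname{Re}\operatorname{Tr}(\Gamma(X))$ and $\|X\|_F^2 = \frac{1}{2}\|\Gamma(X)\|_F^2$ (NumPy; gamma as before):

```python
import numpy as np

def gamma(A, B):
    return np.block([[A, B], [-B.conj(), A.conj()]])

rng = np.random.default_rng(5)
n = 3
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

tr_R = np.trace(gamma(A, B)).real / 2
assert np.isclose(tr_R, np.trace(A).real)          # Tr_R(A + Bj) = Re Tr(A)

fro2 = np.linalg.norm(gamma(A, B), 'fro') ** 2 / 2
assert np.isclose(fro2, np.linalg.norm(A, 'fro')**2 + np.linalg.norm(B, 'fro')**2)
```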

MPO word embeddings

Let $K$ denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $A$ be a finite set (in our case, $A$ will denote the set of all tokens). Let $D$ be the set of all functions $f: A\rightarrow M_d(K)$ such that $\sum_{a\in A}f(a)f(a)^* = I_d$. Observe that $D$ is compact. Define a function $L_{a_0\dots a_{n-1}}: D\rightarrow[0,\infty)$ by letting $L_{a_0\dots a_{n-1}}(f) = \rho(f(a_{n-1})\cdots f(a_1)f(a_0))^{1/n}$. We say that $f$ is an MPO word pre-embedding for the string of tokens $a_0a_1\dots a_{n-1}$ if the quantity $L_{a_0\dots a_{n-1}}(f)$ is locally maximized. To maximize $L_{a_0\dots a_{n-1}}(f)$, the function $f$ must simultaneously satisfy two properties. Since $\sum_{a\in A}f(a)f(a)^* = I_d$, the matrices $f(a)$ must be spread out throughout all $d$ dimensions. But we also need the matrix $f(a_s)$ to be compatible with $f(a_{s-1})$ and $f(a_{s+1})$ and more generally with $f(a_t)$ for $t$ near $s$.
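
Here is a minimal sketch of the fitness computation in the complex case $K = \mathbb{C}$ (NumPy). The rescaling step is one convenient way to move an arbitrary initialization onto the constraint set; it is my own choice, and the optimizer details are not specified here:

```python
import numpy as np

def project_to_constraint(mats):
    # Replace each f(a) by S^(-1/2) f(a) where S = sum_a f(a) f(a)*, so that the
    # rescaled matrices satisfy sum_a f(a) f(a)* = I.
    S = sum(M @ M.conj().T for M in mats.values())
    w, V = np.linalg.eigh(S)
    S_inv_sqrt = V @ np.diag(w ** -0.5) @ V.conj().T
    return {a: S_inv_sqrt @ M for a, M in mats.items()}

def fitness(mats, corpus):
    # L(f) = rho(f(a_{n-1}) ... f(a_1) f(a_0))^(1/n); fine for short corpora,
    # though long products should be renormalized to avoid overflow/underflow.
    d = next(iter(mats.values())).shape[0]
    P = np.eye(d, dtype=complex)
    for tok in corpus:
        P = mats[tok] @ P
    return max(abs(np.linalg.eigvals(P))) ** (1 / len(corpus))

rng = np.random.default_rng(6)
d, tokens, corpus = 3, "abcdef", "abcfedcbafed"      # a toy token set and corpus
f = {a: rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)) for a in tokens}
f = project_to_constraint(f)
print(fitness(f, corpus))   # a random f in D; gradient ascent would push this up
```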

We say that a collection of vectors $u_1,\dots,u_n\in K^d$ is a tight frame if $\sum_{s=1}^nu_su_s^* = cI_d$ for some necessarily positive constant $c$. We say that a tight frame is an equal norm tight frame if $\|u_1\| = \dots = \|u_n\|$.

Theorem: Suppose that $n\geq d$. Let $K$ denote either the field of real numbers or the field of complex numbers. Let $S$ denote the unit sphere in $K^d$. Then the local minimizers of the frame potential $FP: S^n\rightarrow[0,\infty)$ defined by $FP(u_1,\dots,u_n) = \sum_{s=1}^n\sum_{t=1}^n|\langle u_s,u_t\rangle|^2$ are global minimizers which are tight frames for $K^d$. In particular, there exists an equal norm tight frame with $n$ vectors in $K^d$ whenever $n\geq d$.

See [2, Thm 6.9] for a proof.
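
Here is a minimal sketch of the frame potential being minimized by projected gradient descent over $n$ unit vectors in $\mathbb{R}^d$ (NumPy; the step size and iteration count are ad hoc). By the theorem, the local minimizers it finds are tight frames, so the frame operator should approach $(n/d)I$:

```python
import numpy as np

rng = np.random.default_rng(7)
d, n = 3, 7
U = rng.standard_normal((n, d))                     # rows are the vectors u_s
U /= np.linalg.norm(U, axis=1, keepdims=True)

for step in range(5000):
    G = U @ U.T                                     # Gram matrix of <u_s, u_t>
    U -= 0.01 * (4 * G @ U)                         # gradient of sum_{s,t} <u_s,u_t>^2
    U /= np.linalg.norm(U, axis=1, keepdims=True)   # retract back onto the sphere

print(np.round(U.T @ U, 3))                         # frame operator, approx (7/3) I
G = U @ U.T
print((G ** 2).sum(), n ** 2 / d)                   # frame potential vs its minimum n^2/d
```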

Theorem: Suppose that $K$ is the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $a_0\dots a_{n-1}$ be a string of tokens in which each token in $A$ appears exactly once (so $n = |A|$), and let $f\in D$. Then $L_{a_0\dots a_{n-1}}(f)\leq(d/n)^{1/2}$, and $L_{a_0\dots a_{n-1}}(f) = (d/n)^{1/2}$ if and only if there is an equal norm tight frame $(u_s)_{s=0}^{n-1}$ with $\|u_s\|^2 = (d/n)^{1/2}$ and with scalars $\lambda_s\in K$ with $|\lambda_s| = 1$ where $f(a_s) = u_{s+1}\lambda_su_s^*$ for all $s$ (where addition in the subscripts is taken modulo $n$).

Proof: In order to not repeat ourselves, our proof shall apply to all three cases $K\in\{\mathbb{R},\mathbb{C},\mathbb{H}\}$, but for accessibility purposes, the proof shall be understandable to those who are only familiar with real and/or complex matrices.

Suppose first that $(u_s)_{s=0}^{n-1}$ is an equal norm tight frame with $\|u_s\|^2 = (d/n)^{1/2}$ for all $s$, and $\lambda_0,\dots,\lambda_{n-1}$ are elements of $K$ with $|\lambda_s| = 1$ and $f(a_s) = u_{s+1}\lambda_su_s^*$ for all $s$. Since $(u_s)_{s=0}^{n-1}$ is a tight frame and $\sum_s\|u_s\|^2 = n(d/n)^{1/2}$, we have $\sum_su_su_s^* = (n/d)^{1/2}I_d$. Therefore,

$\sum_sf(a_s)f(a_s)^* = \sum_su_{s+1}\lambda_su_s^*u_s\overline{\lambda_s}u_{s+1}^* = (d/n)^{1/2}\sum_su_{s+1}u_{s+1}^* = (d/n)^{1/2}(n/d)^{1/2}I_d = I_d.$

Therefore, $f\in D$.

Now,

$f(a_{n-1})\cdots f(a_0) = u_0\lambda_{n-1}u_{n-1}^*u_{n-1}\lambda_{n-2}u_{n-2}^*\cdots u_1\lambda_0u_0^* = u_0\mu u_0^*$

where $\mu = \lambda_{n-1}\|u_{n-1}\|^2\lambda_{n-2}\cdots\|u_1\|^2\lambda_0$, so $|\mu| = (d/n)^{(n-1)/2}$. The vector $u_0$ is an eigenvector of $u_0\mu u_0^*$ with eigenvalue $\mu\|u_0\|^2$. Therefore, we have

$L_{a_0\dots a_{n-1}}(f) = \rho(f(a_{n-1})\cdots f(a_0))^{1/n} = (|\mu|\cdot\|u_0\|^2)^{1/n} = (d/n)^{1/2}.$

Suppose now that $f\in D$. Then $\sum_s\|f(a_s)\|_F^2 = \operatorname{Tr}_{\mathbb{R}}\left(\sum_sf(a_s)f(a_s)^*\right) = \operatorname{Tr}_{\mathbb{R}}(I_d) = d$. By the arithmetic-geometric mean inequality, we have

$\rho(f(a_{n-1})\cdots f(a_0))\leq\prod_s\sigma_1(f(a_s))\leq\prod_s\|f(a_s)\|_F\leq\left(\tfrac{1}{n}\sum_s\|f(a_s)\|_F^2\right)^{n/2} = (d/n)^{n/2},$

so $L_{a_0\dots a_{n-1}}(f)\leq(d/n)^{1/2}$.

Suppose now that $L_{a_0\dots a_{n-1}}(f) = (d/n)^{1/2}$. We observe that $\sigma_1(B) = \|B\|_F$ precisely when $B$ has rank at most one. From the arithmetic-geometric mean inequality, we have $\prod_s\|f(a_s)\|_F = \left(\tfrac{1}{n}\sum_s\|f(a_s)\|_F^2\right)^{n/2}$ precisely when the norms $\|f(a_s)\|_F$ are all equal. Therefore, each $f(a_s)$ has rank at most one and $\|f(a_s)\|_F = (d/n)^{1/2}$ for all $s$, so we can set $f(a_s) = x_sy_s^*$ where $\|x_s\| = \|y_s\| = (d/n)^{1/4}$ for all $s$.

Observe that $f(a_{s+1})f(a_s) = x_{s+1}\langle y_{s+1},x_s\rangle y_s^*$, so the non-zero eigenvalue of $f(a_{n-1})\cdots f(a_0)$ has absolute value $\prod_s|\langle y_{s+1},x_s\rangle|$ (with subscripts taken modulo $n$). Therefore,

$(d/n)^{n/2} = \rho(f(a_{n-1})\cdots f(a_0)) = \prod_s|\langle y_{s+1},x_s\rangle|\leq\prod_s\|y_{s+1}\|\cdot\|x_s\| = (d/n)^{n/2},$

so $|\langle y_{s+1},x_s\rangle| = \|y_{s+1}\|\cdot\|x_s\|$ for all $s$.

Therefore, by the equality case of the Cauchy-Schwarz inequality observed above, there are quaternions (or real or complex numbers) $\lambda_s$ with $|\lambda_s| = 1$ and where $x_s = y_{s+1}\lambda_s$ for all $s$. Set $u_s = y_s$ for all $s$, so that $f(a_s) = u_{s+1}\lambda_su_s^*$.

We observe that

$I_d = \sum_sf(a_s)f(a_s)^* = \sum_su_{s+1}\lambda_su_s^*u_s\overline{\lambda_s}u_{s+1}^* = (d/n)^{1/2}\sum_su_su_s^*.$

Therefore, $(u_s)_{s=0}^{n-1}$ is an equal norm tight frame with $\|u_s\|^2 = (d/n)^{1/2}$, as required.
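
Before moving on, here is a minimal numerical check of the theorem in the complex case (NumPy), using the harmonic (discrete Fourier) equal norm tight frame and the trivial scalars $\lambda_s = 1$:

```python
import numpy as np

d, n = 2, 5
k, s = np.arange(d)[:, None], np.arange(n)[None, :]
W = np.exp(2j * np.pi * k * s / n) / np.sqrt(d)   # unit-norm tight frame, columns w_s
U = (d / n) ** 0.25 * W                           # rescale so ||u_s||^2 = sqrt(d/n)

F = [U[:, (t + 1) % n, None] @ U[:, t, None].conj().T for t in range(n)]  # f(a_t) = u_{t+1} u_t*

assert np.allclose(sum(M @ M.conj().T for M in F), np.eye(d))  # sum_s f(a_s)f(a_s)* = I_d

P = np.eye(d, dtype=complex)
for M in F:
    P = M @ P                                     # P = f(a_{n-1}) ... f(a_0)
fitness = max(abs(np.linalg.eigvals(P))) ** (1 / n)
assert np.isclose(fitness, (d / n) ** 0.5)        # the optimal fitness from the theorem
```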

Some thoughts:

It seems easier to prove theorems about MPO word pre-embeddings than about other machine learning models simply because MPO word pre-embeddings are more mathematical in nature. Of course, neural networks with ReLU activation are mathematical too, so we can prove theorems about them, but MPO word pre-embeddings are more like the objects that mathematicians like to investigate. It also seems easier to prove theorems that interpret MPO word pre-embeddings than to prove theorems that interpret other machine learning models. On the other hand, we are making a tradeoff here: MPO word pre-embeddings behave more mathematically, but word embeddings are simply the first layer in natural language processing.

Why use the spectral radius?

The matrix $f(a_{n-1})\cdots f(a_0)$ is in general approximately a rank-1 matrix. This means that $f(a_{n-1})\cdots f(a_0)\approx\sigma uv^*$ where $\sigma = \sigma_1(f(a_{n-1})\cdots f(a_0))$ and $u,v$ are unit vectors. One may be tempted to define the fitness of $f$ as $\|f(a_{n-1})\cdots f(a_0)\|_F^{1/n}$ or $\sigma_1(f(a_{n-1})\cdots f(a_0))^{1/n}$. But mathematical objects are better behaved when taking limits. Instead of considering the string $a_0\dots a_{n-1}$ as our corpus, we may consider the similar corpus $(a_0\dots a_{n-1})^m$ (the same string repeated $m$ times), and in this case, the fitness of $f$ would be $\|(f(a_{n-1})\cdots f(a_0))^m\|_F^{1/(mn)}$ or $\sigma_1((f(a_{n-1})\cdots f(a_0))^m)^{1/(mn)}$, which will both converge to $\rho(f(a_{n-1})\cdots f(a_0))^{1/n}$ as $m\rightarrow\infty$. The use of the spectral radius simplifies and improves the behavior of the machine learning model for the same reason that integrals often behave better than the finite sums that approximate them.
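
Here is a minimal sketch of that limit (NumPy; the matrix stands in for the product $f(a_{n-1})\cdots f(a_0)$):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d = 4, 3
M = rng.standard_normal((d, d))           # stand-in for f(a_{n-1}) ... f(a_0)
M /= np.linalg.norm(M)                    # normalize so that M^m stays representable
rho = max(abs(np.linalg.eigvals(M))) ** (1 / n)

for m in (1, 10, 100):
    Pm = np.linalg.matrix_power(M, m)
    fro = np.linalg.norm(Pm) ** (1 / (m * n))                      # Frobenius-based fitness
    sig = np.linalg.svd(Pm, compute_uv=False)[0] ** (1 / (m * n))  # sigma_1-based fitness
    print(m, fro, sig)
print("spectral radius fitness:", rho)    # the common limit as m -> infinity
```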

Why complex numbers and quaternions?

While it takes more time to compute MPO word pre-embeddings for the complex numbers and for the quaternions, the complex and quaternionic MPO word pre-embeddings have some advantages over real MPO word pre-embeddings. It is currently unclear as to whether the advantages of complex and quaternionic MPO word pre-embeddings outweigh the cost of the increased computational complexity of training and using complex or quaternionic word pre-embeddings. Further research is needed on this topic.

Complex and quaternionic matrices provide new ways of testing MPO word pre-embeddings which are not available if we simply used real matrices. For example, if $f$ is a complex or quaternionic MPO word pre-embedding, then the existence of a unitary matrix $U$ where each $Uf(a)U^*$ is a real matrix should be considered evidence that the MPO word pre-embedding is a high quality machine learning model, while the non-existence of such a unitary matrix $U$ should be considered evidence of a lower quality model.

A complex or quaternionic MPO word pre-embedding will typically have a higher fitness level than a corresponding real MPO word pre-embedding.

Sometimes, when one trains an MPO word pre-embedding twice to obtain two MPO word pre-embeddings $f,g$, the fitness levels are equal ($L(f) = L(g)$), and it is desirable to have equal fitness levels when training the word pre-embedding multiple times. But the probability of obtaining the equality $L(f) = L(g)$ will depend on the choice of $K$. In many cases, we would have

$P(L(f) = L(g)\mid K = \mathbb{R}) < P(L(f) = L(g)\mid K = \mathbb{C}) < P(L(f) = L(g)\mid K = \mathbb{H})$.

Of course, this probability depends on other factors besides the choice of division ring, such as the initialization. For example, if the matrices of the form $f(a)$ are initialized to have random positive entries, then the probability $P(L(f) = L(g))$ would be much greater than if the matrices of the form $f(a)$ are initialized to have random real entries; if each $f(a)$ initially has random positive entries, then I would currently consider the real-valued MPO pre-embeddings to be about as good as the complex-valued MPO pre-embeddings but easier to compute.

The fitness function $L$ is not differentiable everywhere, and the gradient of $L$ has singularities; the spectral radius is non-smooth at matrices whose top eigenvalues have equal absolute value. The singularities of the gradient of $L$ have real codimension $1$ when $K = \mathbb{R}$ and higher real codimension when $K = \mathbb{C}$ or $K = \mathbb{H}$. In the case when $K = \mathbb{R}$, the singularities of the gradient of $L$ disconnect the domain $D$, and gradient ascent has difficulty crossing these singularities to reach a good local maximum.
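
Here is a minimal illustration of such a singularity (NumPy): the spectral radius of the matrix $\begin{bmatrix}0&1\\t&0\end{bmatrix}$ is $\sqrt{|t|}$, which is continuous in $t$ but has an unbounded derivative at $t = 0$, where the two eigenvalues collide:

```python
import numpy as np

for t in (1e-2, 1e-4, 1e-6, 0.0, -1e-6):
    M = np.array([[0.0, 1.0], [t, 0.0]])
    rho = max(abs(np.linalg.eigvals(M)))
    print(t, rho, np.sqrt(abs(t)))        # rho equals sqrt(|t|): a cusp at t = 0
```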

Disadvantages of MPO word pre-embeddings:

  1. Projectivity: If $R$ is a ring, then the center of $R$ is the collection $Z(R)$ of all elements $z\in R$ where $zr = rz$ for all $r\in R$. It is easy to show that $Z(R)$ is always a subring of $R$. We observe that $Z(\mathbb{H}) = \mathbb{R}$ while $\mathbb{R}$ and $\mathbb{C}$ are their own centers. If $f\in D$, $c_a\in Z(K)$ with $|c_a| = 1$ for each $a\in A$, and $g(a) = c_af(a)$ for each $a\in A$, then $f$ is an MPO word pre-embedding if and only if $g$ is an MPO word pre-embedding. This means that after one trains an MPO word pre-embedding $f$, one still needs to find the constants $(c_a)_{a\in A}$ where the function $g$ defined by $g(a) = c_af(a)$ behaves optimally (the sketch after this list illustrates this invariance numerically). I may talk about how we can find the constants $(c_a)_{a\in A}$ in another post.

  2. Locality: An MPO word pre-embedding is only good at relating tokens to their immediate surroundings in the following sense. Suppose $f$ is an MPO word pre-embedding for the string $a_0\dots a_{n-1}$. Then the value $f(a)$ will be determined mostly by all the other values $f(b)$ for $b\in A$ together with the multiset of all short substrings $a_{s-r}\dots a_{s+r}$ such that $a_s = a$. In other words, $f(a)$ can see the immediate surroundings of each occurrence of the token $a$, but $f(a)$ will not be able to see much more than this.

  3. Difficulty utilizing all dimensions without noise: Sometimes MPO word pre-embeddings behave poorly because it will be difficult to maximize the spectral radius while we still have $\sum_{a\in A}f(a)f(a)^* = I_d$. To ameliorate this problem, we can relax the requirement $\sum_{a\in A}f(a)f(a)^* = I_d$ to a weaker normalization such as $\sum_{a\in A}\|f(a)\|_F^2 = d$. But in this case, the MPO word pre-embedding will not use all of the dimensions in $K^d$ evenly.
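
Here is a minimal sketch of the projectivity invariance from item 1 (NumPy, complex case; over the quaternions the only available central unit scalars are $\pm 1$):

```python
import numpy as np

def fitness(mats, corpus):
    d = next(iter(mats.values())).shape[0]
    P = np.eye(d, dtype=complex)
    for tok in corpus:
        P = mats[tok] @ P
    return max(abs(np.linalg.eigvals(P))) ** (1 / len(corpus))

rng = np.random.default_rng(9)
d, tokens, corpus = 3, "abc", "abcacb"
f = {a: rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)) for a in tokens}
S = sum(M @ M.conj().T for M in f.values())        # rescale f onto the constraint set
w, V = np.linalg.eigh(S)
R = V @ np.diag(w ** -0.5) @ V.conj().T
f = {a: R @ M for a, M in f.items()}

phases = {a: np.exp(2j * np.pi * rng.random()) for a in tokens}  # central unit scalars
g = {a: phases[a] * M for a, M in f.items()}

assert np.allclose(sum(M @ M.conj().T for M in g.values()), np.eye(d))  # g is still in D
assert np.isclose(fitness(f, corpus), fitness(g, corpus))               # same fitness
```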

Conclusion:

Spectral methods have already been quite valuable in machine learning for a good reason. I will continue to make more posts about using the spectral radius (and similar objects) to construct machine learning models.

Another possible advantage of quaternions: Added 9/8/2023

It seems like an increase in the value $d$ provides diminishing returns in the performance of MPO word embeddings since, when $d$ is large, MPO word embeddings have difficulty utilizing all $d$ dimensions equally. MPO word embeddings can therefore only give us some information about a token. However, it seems like increasing the real dimension of $K$ ($1$ for $\mathbb{R}$, $2$ for $\mathbb{C}$, $4$ for $\mathbb{H}$) increases the number of parameters of MPO word embeddings without the word embedding encountering substantially more difficulty in utilizing all dimensions. Therefore, MPO word embeddings should perform better simply by increasing the real dimension of $K$. This means that complex MPO word embeddings should perform better than real MPO word embeddings, and quaternionic MPO word embeddings should perform better than complex MPO word embeddings. On the other hand, we still need to perform experiments to determine whether complex MPO word embeddings are that much better than real MPO word embeddings and whether quaternionic MPO word embeddings are even better than both real and complex MPO word embeddings.

References:

  1. Fuzhen Zhang. Quaternions and matrices of quaternions. Linear Algebra and its Applications, 251 (1997), 21-57.

  2. Shayne F. D. Waldron. An Introduction to Finite Tight Frames. July 26, 2017.