Interpreting a dimensionality reduction of a collection of matrices as two positive semidefinite block diagonal matrices

Here are some empirical observations that I have made on August 14, 2023 to August 19, 2023 that are characteristics of the interpretability of my own matrix dimensionality reduction algorithm. These phenomena that we observe do not occur on all inputs (they sometimes occur partially); and it would be nice if there were a more complete mathematical theory with proofs that explains these empirical phenomena.

Given a (possibly directed and possibly weighed) graph with nodes represented as a collection of -matrices , we will observe that a dimensionality reduction of where each is a -matrix (I call this dimensionality reduction an LSRDR) is in many cases the optimal solution to a combinatorial problem for the graph. In this case, we have a complete interpretation of what the dimensionality reduction algorithm is doing.

For this post, let denote either the field of real numbers or the field of complex numbers (everything also works well when is the division ring of quaternions).

Notation: is the spectral radius of the matrix . denotes the transpose of while denotes the adjoint of . We say that a tuples of matrices is jointly similar to if there is an invertible with for . denotes the tensor product of with .

Let . Suppose that are -matrices with entries in . Then we say that a collection of -matrices with entries in is an -spectral radius dimensionality reduction (abbreviated LSRDR) of if the following quantity is locally maximized:. LSRDRs may be computed using a variation of gradient ascent.

If is an LSRDR of , then one will typically be able to find matrices and some where for and where and . We shall call a -SRDR projection operator of . is jointly similar to where is the -zero matrix. The -SRDR projection operator is typically unique in the sense that if we run the gradient ascent to obtain another -SRDR projection operator, then we will obtain the same -SRDR projection operator that we originally had. If are -matrices, then let denote the completely positive linear operator defined by . Let denote the dominant eigenvector of with , and let denote the dominant eigenvector of with . Then the eigenvectors will typically be positive semidefinite matrices.

Suppose now that are finite dimensional -inner product spaces. Let . Let be linear transformations. Suppose that for , there are where if and , then. Suppose that is a -SRDR projection operator for . Then we will typically be able to find linear operators for where . Since , we observe that for all . As a consequence, there will be positive semidefinite operators for where .

Application: Weighed graph/​digraph dominant clustering.

Let be a vertex set. Suppose that . Let be a function. For example, the function could denote a weight matrix of a graph or neural network. For each , let be the -matrix where the -th entry is and all the other entries are zero. Then we will typically be able to find an SRDR projection operator of along with a set where is the diagonal matrix where the -th diagonal entry is for and otherwise. The set represents a dominant cluster of size in the set . Let be the -matrix where for all . If , then set

where whenever and otherwise. In other words, if , then whenever and otherwise. Then the dominant cluster will typically be the subset of of size where the spectral radius is maximized.

We say that a square matrix with non-negative real entries is a primitive matrix if there is some where each entry in is positive. Suppose now that is the direct sum of a primitive matrix and a zero matrix. Then the spectral radius is the dominant eigenvalue of , and the root of the characteristic polynomial of has multiplicity . Furthermore, there is an vector with non-negative real entries with . We shall call the Perron-Frobenius eigenvector of .

For , let be the dominant eigenvector of where the sum of the entries in is , and let be the dominant eigenvector of where the sum of the entries in is . If is a vector, then let denote the matrix where is the list of diagonal entries in . Then .

The problem of maximizing is a natural problem that is meaningful for adjacency matrices of (weighted) graphs/​digraphs and Markov chains. If is the adjacency matrix of a graph or a digraph , then the value is a measure of how internally connected the underlying graph is, and if the graph is undirected and simple, then is maximized when is a clique (recall that a subset of an simple undirected graph is a clique if all of the pairs of distinct nodes are connected to each other). More specifically, the number of paths in the induced subgraph with edges will be about . To make this statement precise, if there are paths in the induced subgraph of length , then . Therefore, the set maximizes the number of paths in the induced subgraph of length for large .

The problem of finding a clique of size in a graph is an NP-complete problem, so we should not expect for there to be an algorithm that always solves this problem efficiently. On the other hand, for many NP-complete problems, there are plenty of heuristic algorithms that give decent solutions in most cases. The use of LSRDRs to find the clique is another kind of heuristic algorithm that can be used to find the largest clique in a graph and solve more general problems. But the NP-completeness of the problem of finding a clique of size in a graph also indicates that LSRDRs most likely are unable to find produce cliques in exceptionally difficult graphs.

If is the transition matrix of an irreducible and aperiodic Markov chain , then the probability that will be approximately . More precisely, . In this case, the set maximizes the probability for large values .

Maximizing the total weight of edges of an induced subgraph:

If is a matrix, then let denote the sum of all the entries in .

Proposition: Suppose that is the matrix where each entry of is . Let be a real-valued matrix. Then .

Let be the -matrix where each entry of is . Let be a real -matrix. For simplicity, assume that the value is distinct for each . Let . Then for sufficiently small , the spectral radius is maximized (subject to the condition that ) precisely when the sum is maximized. LSRDRs may therefore be used to find the subset with that maximizes .

Why do LSRDRs behave this way?

Suppose that are complex matrices that generate the algebra . Then there is some invertible and constant where the operator is a quantum channel (by a quantum channel, I mean a completely positive trace preserving superoperator), so LSRDRs should be considered to be dimensionality reductions of quantum channels . Primitive matrices can be associated with stochastic matrices in the same way; if is a primitive matrix, then there is a diagonal matrix and a constant where is a stochastic matrix. One should consider the LSRDR of to be a dimensionality reduction for Markov chains. The most sensible way to take a dimensionality reduction of an -state Markov chain is to select states so that those states make a subset that is in some sense optimal. And, for LSRDRs, the best choice of a element subset of is the option that maximizes .

Conclusions:

The LSRDRs of have a notable combination of features of interpretability; these LSRDRs tend to converge to the same local maxima (up-to-joint similarity and a constant factor) regardless of the initialization, and we are able to give an explicit expression for these local maxima. We also have a duality between the problem of computing the LSRDR of and the problem of maximizing where . With this duality, the LSRDR of is fully interpretable as a solution to a combinatorial optimization problem.

I hope to make more posts about some of my highly interpretable machine learning algorithms together with some of my tools that we can use to interpret AI.

Edited: 1/​10/​2024