Spectral radii dimensionality reduction computed without gradient calculations

In this post, I shall describe a fitness function that can be locally maximized without gradient computations. This fitness function is my own construction. I initially developed it in order to evaluate block ciphers for cryptocurrency technologies, but I later found that it can also be used to solve other problems, such as the clique problem (which is NP-complete) in the average case, as well as some natural language processing tasks. After describing algorithms for locally maximizing this fitness function, I conclude that such a fitness function is inherently interpretable and mathematical, which is what we need for AI safety.

Let $K$ denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Given $A_1,\dots,A_r\in M_n(K)$ and $d<n$, the goal is to find a tuple $(X_1,\dots,X_r)\in M_d(K)^r$ that is most similar to $(A_1,\dots,A_r)$. In other words, we want to approximate the $n\times n$-matrices $A_1,\dots,A_r$ with $d\times d$-matrices $X_1,\dots,X_r$.

Suppose that $A_1,\dots,A_r\in M_n(K)$ and $B_1,\dots,B_r\in M_m(K)$. Define a function $\Gamma$ by setting

$$\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)=\sum_{j=1}^{r}A_j\otimes\overline{B_j},$$

and define $\Phi(A_1,\dots,A_r)=\Gamma(A_1,\dots,A_r;A_1,\dots,A_r)$.

If $X$ is a square matrix, then let $\rho(X)$ denote the spectral radius of $X$.

Define the $L_2$-spectral radius similarity between $(A_1,\dots,A_r)$ and $(B_1,\dots,B_r)$ by setting

$$\|(A_1,\dots,A_r)\simeq(B_1,\dots,B_r)\|=\frac{\rho(\Gamma(A_1,\dots,A_r;B_1,\dots,B_r))}{\rho(\Phi(A_1,\dots,A_r))^{1/2}\,\rho(\Phi(B_1,\dots,B_r))^{1/2}}.$$

The quantity $\|(A_1,\dots,A_r)\simeq(B_1,\dots,B_r)\|$ is always a real number in the interval $[0,1]$ (the proof is a generalization of the Cauchy-Schwarz inequality).
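To make the definition concrete, here is a minimal numpy sketch (for the real and complex cases) that computes the $L_2$-spectral radius similarity directly from the definition above; the helper names `gamma`, `spectral_radius`, and `similarity` are for illustration only.

```python
import numpy as np

def gamma(As, Bs):
    """Gamma(A_1,...,A_r; B_1,...,B_r) = sum_j A_j (x) conj(B_j)."""
    return sum(np.kron(A, B.conj()) for A, B in zip(As, Bs))

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def similarity(As, Bs):
    """L_2-spectral radius similarity between (A_1,...,A_r) and (B_1,...,B_r)."""
    num = spectral_radius(gamma(As, Bs))
    den = np.sqrt(spectral_radius(gamma(As, As)) * spectral_radius(gamma(Bs, Bs)))
    return num / den

rng = np.random.default_rng(0)
As = [rng.standard_normal((4, 4)) for _ in range(3)]
Xs = [rng.standard_normal((2, 2)) for _ in range(3)]
print(similarity(As, As))  # 1.0: a tuple is perfectly similar to itself
print(similarity(As, Xs))  # some value in [0, 1]
```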

If $A_1,\dots,A_r\in M_n(K)$ and $1\leq d<n$, then we say that $(X_1,\dots,X_r)\in M_d(K)^r$ is an $L_{2,d}$-spectral radius dimensionality reduction (LSRDR) of $(A_1,\dots,A_r)$ if the similarity $\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|$ is locally maximized.

One can produce an LSRDR of $(A_1,\dots,A_r)$ simply by locally maximizing the similarity $\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|$ using gradient ascent. But there is another way to obtain LSRDRs since they behave mathematically.
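For instance, one could run a crude gradient ascent on the entries of $(X_1,\dots,X_r)$ as in the following sketch, where the gradient is approximated by finite differences purely for illustration (a practical implementation would use automatic differentiation); none of the function names here come from the original method.

```python
import numpy as np

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def similarity(As, Bs):
    gamma = lambda Ps, Qs: sum(np.kron(P, Q.conj()) for P, Q in zip(Ps, Qs))
    return spectral_radius(gamma(As, Bs)) / np.sqrt(
        spectral_radius(gamma(As, As)) * spectral_radius(gamma(Bs, Bs)))

def lsrdr_gradient_ascent(As, d, steps=200, lr=0.05, h=1e-6, seed=0):
    """Ascend the similarity with a finite-difference gradient (illustration only)."""
    rng = np.random.default_rng(seed)
    Xs = [rng.standard_normal((d, d)) for _ in As]
    for _ in range(steps):
        base = similarity(As, Xs)
        grads = []
        for k in range(len(Xs)):
            G = np.zeros((d, d))
            for i in range(d):
                for j in range(d):
                    trial = [Y.copy() for Y in Xs]
                    trial[k][i, j] += h
                    G[i, j] = (similarity(As, trial) - base) / h
            grads.append(G)
        Xs = [X + lr * G for X, G in zip(Xs, grads)]
    return Xs

rng = np.random.default_rng(1)
As = [rng.standard_normal((4, 4)) for _ in range(3)]
Xs = lsrdr_gradient_ascent(As, d=2)
print(similarity(As, Xs))  # the similarity after a few hundred ascent steps
```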

Let $Z(K)$ denote the center of the algebra $K$. In particular, $Z(\mathbb{R})=\mathbb{R}$, $Z(\mathbb{C})=\mathbb{C}$, and $Z(\mathbb{H})=\mathbb{R}$.

Empirical observation: If $(X_1,\dots,X_r)$ is an LSRDR of $(A_1,\dots,A_r)$, then typically there exists some $\lambda\in Z(K)$ along with matrices $R,S$ with $X_j=\lambda RA_jS$ for all $j$, and where if $P=SR$, then $P$ is a (typically non-orthogonal) projection matrix. The projection matrix $P$ is typically unique in the sense that if we train an LSRDR of $(A_1,\dots,A_r)$ with rank $d$ multiple times, then we generally end up with the same projection matrix $P$. If the projection matrix $P$ is unique, then we shall call $P$ the canonical LSRDR projection matrix of rank $d$. Let denote the dominant eigenvector of with (in the quaternionic case, we should define the trace as the real part of the sum of diagonal entries in the matrix). Let denote the dominant eigenvector of with . Then are positive semidefinite matrices with and .

Said differently, along with the eigenvalue is a solution to the following equations:

  1. .

  2. .

  3. .

  4. .

  5. .

  6. .

One may often solve an equation or system of equations using iteration. For example, if $(X,\delta)$ is a complete metric space, and $f:X\rightarrow X$ is a function with $\delta(f(x),f(y))\leq c\cdot\delta(x,y)$ for all $x,y\in X$ and some constant $c<1$, and $f(x_0)=x_0$, then $\lim_{n\rightarrow\infty}f^{n}(x)=x_0$ for all $x\in X$ by the contraction mapping theorem. We shall also apply this idea to produce the matrices in an LSRDR since these matrices satisfy a system of equations.
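As a toy illustration of solving an equation by iteration (unrelated to LSRDRs themselves), iterating the map $f(x)=\cos(x)$ converges to the unique solution of $\cos(x)=x$:

```python
import math

# cos maps the reals into [-1, 1], and on [-1, 1] it is a contraction
# (|cos'(x)| = |sin(x)| <= sin(1) < 1), so the iterates f^n(x) converge
# to the unique solution of cos(x) = x.
x = 1.0
for _ in range(100):
    x = math.cos(x)
print(x)  # ~0.739085
```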

We need a function $f$ that can be easily applied to matrices but where each projection matrix is an attractive fixed point for $f$. Consider the function $f(x)=3x^{2}-2x^{3}$. We observe that $f(0)=0$ and $f(1)=1$. Furthermore, if $x$ is near $0$, then the sequence $(f^{n}(x))_{n}$ converges to $0$ very quickly, and if $x$ is near $1$, then $(f^{n}(x))_{n}$ converges to $1$ very quickly. Actually, there are neighborhoods $U_0$ of $0$ and $U_1$ of $1$ in the complex plane where if $z\in U_0$, then $(f^{n}(z))_{n}$ converges to $0$ quickly, and if $z\in U_1$, then $(f^{n}(z))_{n}$ converges to $1$ quickly. As a consequence, if $P$ is a projection matrix, then there is a neighborhood $\mathcal{U}$ of $P$ where if $X\in\mathcal{U}$, then $(f^{n}(X))_{n}$ converges to some projection matrix very quickly. Define a partial function $\Pi$ by setting $\Pi(X)=\lim_{n\rightarrow\infty}f^{n}(X)$ whenever this limit exists.
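The following numpy sketch illustrates this behavior for the cubic $f(x)=3x^{2}-2x^{3}$ described above: starting from a small perturbation of a (non-orthogonal) projection matrix, iterating $f$ quickly returns a projection matrix of the same rank. The setup is only an illustration.

```python
import numpy as np

def f(X):
    """f(x) = 3x^2 - 2x^3 applied to a square matrix; 0 and 1 are
    superattracting fixed points of the scalar map."""
    X2 = X @ X
    return 3 * X2 - 2 * X2 @ X

rng = np.random.default_rng(0)
B = np.eye(5) + 0.3 * rng.standard_normal((5, 5))
P = B @ np.diag([1.0, 1.0, 0.0, 0.0, 0.0]) @ np.linalg.inv(B)  # non-orthogonal rank-2 projection
X = P + 0.01 * rng.standard_normal((5, 5))                     # nearby matrix

for _ in range(20):
    X = f(X)

print(np.linalg.norm(X @ X - X))  # ~0: the limit is again a projection matrix
print(np.linalg.matrix_rank(X))   # 2
```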

Iterative algorithm for computing LSRDRs: Let but set to a sufficiently small value. Let . Let be an random orthonormal projection of rank . Let be random matrices in . Define for all recursively by setting

  1. ,

  2. ,

  3. ,

  4. , and

  5. .

Then typically converges to a matrix of rank at an exponential rate. Furthermore, the matrix typically does not depend on the initialization , but does depend on and . Therefore set whenever does not depend on the initialization (just assume that the limit converges). If , then is typically the canonical rank LSRDR projection operator.

The iterative algorithm for computing LSRDRs is a proper extension of the notion of an LSRDR in some cases. For example, if we simply count the number of parameters, the matrices have parameters while the matrices have parameters. Therefore, if , then the matrices (and the projection matrix ) have more parameters than . This means that we cannot obtain a unique canonical LSRDR projection of dimension when . On the other hand, even when , the matrix typically exists, and if we set , there are where and where is an LSRDR of . This means that the iterative algorithm for computing LSRDRs gives more information than simply an LSRDR. The iterative algorithm for computing LSRDRs gives a projection operator.

Interpretability: Since there are several ways of computing LSRDRs, LSRDRs behave mathematically. A machine learning algorithm that typically produces the same output is more interpretable than one that does not, for a few reasons; in particular, when a machine learning algorithm has only one possible output, that output depends only on the input and has no other source of random information contributing to it. Since we typically (but not always) attain the same local maximum when training LSRDRs multiple times, LSRDRs are both interpretable and mathematical. This sort of mathematical behavior is what we need to make sense of the inner workings of LSRDRs and other machine learning algorithms. There are several ways to generalize the notion of an LSRDR, and these generalized machine learning algorithms still tend to behave mathematically; they still tend to produce the exact same trained model after training multiple times.

Capabilities: Trained LSRDRs can solve NP-complete problems such as the clique problem in the average case. I have also trained LSRDRs to produce word embeddings for natural language processing and to analyze the octonions. I can generalize LSRDRs so that they behave more like deep neural networks, but we still have a way to go until these generalized LSRDRs perform as well as our modern AI algorithms, and I do not know how far we can push the performance of generalized LSRDRs while retaining their inherent interpretability and mathematical behavior.

If mathematical generalized LSRDRs can compete in performance with neural networks, then we are closer to solving the problem of general AI interpretability. But if not, then mathematical generalized LSRDRs could possibly be used as a narrow AI tool or even as a component of general AI; in this case, generalized LSRDRs could still improve general AI interpretability a little bit. An increased usage of narrow AI will (slightly) decrease the need for general AI.

Added 5/28/2025

When $d\ll n$, the iterative algorithm for computing LSRDRs produces a low rank projection matrix $P$, but a matrix of rank $d$ can be factored as $P=UV$ where $U\in M_{n,d}(K)$ and $V\in M_{d,n}(K)$. To save computational resources, we may just work with the pair $(U,V)$ during the training.

Suppose that $P\in M_n(K)$ with $\operatorname{rank}(P)=d$. Then there are several ways to easily factor $P$ as the product of an $n\times d$-matrix with a $d\times n$-matrix, including the following factorizations:

.

Therefore, define

.

In particular, if , then .
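For concreteness, here is one standard way to produce such a factorization, via a thin singular value decomposition (this is only an illustration and is not necessarily one of the factorizations referred to above):

```python
import numpy as np

def low_rank_factor(P, d):
    """Factor a rank-d matrix P as P = U @ V with U of shape (n, d)
    and V of shape (d, n), via a truncated SVD."""
    W, s, Vt = np.linalg.svd(P)
    U = W[:, :d] * s[:d]
    V = Vt[:d, :]
    return U, V

rng = np.random.default_rng(0)
B = np.eye(6) + 0.3 * rng.standard_normal((6, 6))
P = B @ np.diag([1.0, 1.0, 1.0, 0.0, 0.0, 0.0]) @ np.linalg.inv(B)  # rank-3 projection
U, V = low_rank_factor(P, 3)
print(np.linalg.norm(U @ V - P))  # ~0
```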

Let for all , and define .

In the following algorithm, we will sometimes need to replace a pair $(U,V)$ with a new pair $(U',V')$ such that $U'V'=UV$ but where $(U',V')$ has smaller norm than $(U,V)$, because if we do not do this, then after much training the norm of $(U,V)$ will grow very large while $UV$ is just a projection matrix. To do this, we decrease the norm of $(U,V)$ using gradient descent: we want to move $(U,V)$ in a direction that decreases the sum $\|U\|^{2}+\|V\|^{2}$ while leaving the product $UV$ unchanged.
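Below is a minimal sketch of one way to carry out such a rebalancing step (for the real case; this illustrates the idea and is not the exact update used in the post): flow the pair $(U,V)$ along the group action $U\mapsto Ue^{tH}$, $V\mapsto e^{-tH}V$, which leaves the product $UV$ exactly unchanged, in a direction $H$ that decreases $\|U\|_{F}^{2}+\|V\|_{F}^{2}$.

```python
import numpy as np
from scipy.linalg import expm

def rebalance(U, V, step=0.1, iters=100):
    """Shrink ||U||_F^2 + ||V||_F^2 while keeping the product U @ V fixed,
    by moving along the group action U -> U e^{tH}, V -> e^{-tH} V."""
    for _ in range(iters):
        # d/dt (||U e^{tH}||^2 + ||e^{-tH} V||^2) at t = 0 equals
        # 2 tr((U^T U - V V^T) H), so H = V V^T - U^T U is a descent direction.
        H = V @ V.T - U.T @ U
        G = expm(step * H / (1.0 + np.linalg.norm(H)))  # damped step size
        U, V = U @ G, np.linalg.inv(G) @ V
    return U, V

rng = np.random.default_rng(0)
U = 10.0 * rng.standard_normal((6, 2))    # deliberately unbalanced pair
V = 0.01 * rng.standard_normal((2, 6))
P = U @ V
U2, V2 = rebalance(U, V)
print(np.linalg.norm(U2 @ V2 - P))                        # ~0: the product is unchanged
print(np.linalg.norm(U) ** 2 + np.linalg.norm(V) ** 2)    # large
print(np.linalg.norm(U2) ** 2 + np.linalg.norm(V2) ** 2)  # much smaller
```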

Iterative algorithm for computing LSRDRs with low rank factorization:

Let but needs to be sufficiently small. Let . Set . Let be random rank matrices. Define recursively for all by setting

  1. ,

  2. ,

  3. ,

  4. ,

  5. ,

  6. ,

  7. , and

  8. .

Then if everything goes right, would converge to at an exponential rate.

Added 5/31/2025

The iterative algorithm for computing LSRDRs can be written in terms of completely positive superoperators. We define a completely positive superoperator as an operator of the form $X\mapsto\sum_{j=1}^{r}A_jXA_j^{*}$ where $A_1,\dots,A_r$ are square matrices over $K$. Completely positive superoperators are typically defined when $K=\mathbb{C}$ for quantum information theory, but we are using a more general context here. If , then

and

.
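As a concrete illustration (for the complex case; the snippet is not taken from the post), the following numpy code checks two basic facts about a completely positive superoperator $X\mapsto\sum_{j}A_{j}XA_{j}^{*}$: it sends positive semidefinite matrices to positive semidefinite matrices, and under column-stacking vectorization its matrix is $\sum_{j}\overline{A_{j}}\otimes A_{j}$, which has the same spectrum as $\sum_{j}A_{j}\otimes\overline{A_{j}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
As = [rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4)) for _ in range(3)]

def cp_map(X, As):
    """The completely positive superoperator X -> sum_j A_j X A_j^*."""
    return sum(A @ X @ A.conj().T for A in As)

# Completely positive superoperators send positive semidefinite matrices
# to positive semidefinite matrices.
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
X = B @ B.conj().T                                             # positive semidefinite
print(np.min(np.linalg.eigvalsh(cp_map(X, As))) >= -1e-9)      # True

# With column-stacking vectorization, vec(A X A^*) = (conj(A) kron A) vec(X),
# so the matrix of the superoperator is sum_j conj(A_j) kron A_j.
M = sum(np.kron(A.conj(), A) for A in As)
print(np.allclose(M @ X.flatten(order="F"),
                  cp_map(X, As).flatten(order="F")))           # True
```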

Added 6/7/2025

The iterative algorithm for computing LSRDRs applies not just to completely positive superoperators; it also applies to positive operators that are not completely positive, as long as the superoperators do not stray too far away from complete positivity. A linear superoperator $\mathcal{E}$ is said to be positive if $\mathcal{E}(X)$ is positive semidefinite whenever $X$ is positive semidefinite. Clearly, every completely positive superoperator is positive, but not every positive operator is completely positive. For example, the transpose map $T:M_n(K)\rightarrow M_n(K)$ defined by $T(X)=X^{T}$ is always positive but not completely positive whenever $n\geq 2$. The difference of two completely positive superoperators from $M_n(K)$ to $M_m(K)$ is known as a Hermitian preserving map. Every positive map is Hermitian preserving. If $\mathcal{E}$ is linear but not too far from positivity, then when we use the iterative algorithm for computing LSRDRs, we typically get the same results every time we train, and if $\mathcal{E}$ is Hermitian preserving, then the operators will be positive semidefinite after training.
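To see concretely why the transpose map is positive but not completely positive, here is a standard check in numpy (an illustration, not part of the algorithm): the transpose of a positive semidefinite matrix is positive semidefinite, while transposing only one tensor factor of a positive semidefinite matrix built from a maximally entangled vector produces a negative eigenvalue.

```python
import numpy as np

n = 2
rng = np.random.default_rng(0)

# The transpose map preserves positive semidefiniteness...
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
X = B @ B.conj().T                                  # positive semidefinite
print(np.min(np.linalg.eigvalsh(X.T)) >= -1e-12)    # True

# ...but it is not completely positive: applying the transpose to only the
# second tensor factor of the rank-one positive semidefinite matrix built
# from a maximally entangled vector produces a negative eigenvalue.
phi = np.zeros(n * n, dtype=complex)
for i in range(n):
    phi[i * n + i] = 1.0 / np.sqrt(n)
Rho = np.outer(phi, phi.conj())                     # positive semidefinite, rank one

PT = Rho.reshape(n, n, n, n).transpose(0, 3, 2, 1).reshape(n * n, n * n)
print(np.min(np.linalg.eigvalsh(PT)))               # -0.5 < 0
```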