It is time for us to interpret some linear machine learning models that I have been working on. These models are linear, but the algorithms generalize to multilinear models that are more capable while still behaving mathematically, and since one can stack such layers to build non-linear models, these algorithms seem to perform well enough to be relevant to AI safety.
Our goal is to transform a list of n×n-matrices (A1,…,Ar) into a new and simplified list of d×d-matrices (X1,…,Xr). There are several ways in which we might want to simplify the matrices: sometimes we simply want d<n, while in other cases we want the matrices Xj to all be real symmetric, complex symmetric, real Hermitian, complex Hermitian, complex anti-symmetric, etc.
We measure similarity between tuples of matrices using spectral radii. Suppose that (A1,…,Ar) are n×n-matrices and (X1,…,Xr) are d×d-matrices. Define an operator Γ(A1,…,Ar;X1,…,Xr) mapping n×d-matrices to n×d-matrices by setting
Γ(A1,…,Ar;X1,…,Xr)(Y)=A1YX1∗+…+ArYXr∗.
Then define Φ(X1,…,Xr)=Γ(X1,…,Xr;X1,…,Xr). Define the similarity between (A1,…,Ar) and (X1,…,Xr) by setting
∥(A1,…,Ar)≃(X1,…,Xr)∥2 = ρ(Γ(A1,…,Ar;X1,…,Xr)) / (ρ(Φ(A1,…,Ar))^(1/2)·ρ(Φ(X1,…,Xr))^(1/2)),
where ρ denotes the spectral radius. Here, ∥(A1,…,Ar)≃(X1,…,Xr)∥2 should be thought of as a generalization of the cosine similarity to tuples of matrices, and it is always a real number in [0,1], so this is a sensible notion of similarity.
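To make the definition concrete, here is a minimal NumPy sketch of the similarity score. The function names gamma_matrix, spectral_radius, and similarity are my own labels for illustration, not from any particular library; the only ingredient beyond the definition above is the standard identity vec(AYB)=(Bᵀ⊗A)vec(Y), which turns Γ into an ordinary nd×nd matrix whose eigenvalues give the spectral radius.

```python
import numpy as np

def gamma_matrix(As, Xs):
    """Matrix of the operator Y -> A_1 Y X_1^* + ... + A_r Y X_r^* on n x d matrices."""
    n, d = As[0].shape[0], Xs[0].shape[0]
    M = np.zeros((n * d, n * d), dtype=complex)
    for A, X in zip(As, Xs):
        # vec(A Y X^*) = (conj(X) kron A) vec(Y) in the column-major convention;
        # the spectral radius does not depend on the vectorisation convention.
        M += np.kron(np.conj(X), A)
    return M

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def similarity(As, Xs):
    """The similarity ||(A_1,...,A_r) ~ (X_1,...,X_r)||_2 defined above."""
    num = spectral_radius(gamma_matrix(As, Xs))
    den = np.sqrt(spectral_radius(gamma_matrix(As, As))) * \
          np.sqrt(spectral_radius(gamma_matrix(Xs, Xs)))
    return num / den
```

As a sanity check on the cosine-similarity analogy, similarity(As, As) returns 1, and the score is unchanged if all the Xj are rescaled by the same nonzero scalar.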
Suppose that K is either the field of real or complex numbers. Let Mn(K) denote the set of n by n matrices over K.
Let n,d be positive integers. Let T:Md(K)→Md(K) denote a projection operator. Here, T is always a real-linear operator, but when K=C, T is not necessarily complex linear. Here are a few examples of such projection operators T that work (a code sketch of these projections appears after the caution below):
K=C: T1(X)=(X+X^T)/2 (complex symmetric)
K=C: T2(X)=(X−X^T)/2 (complex anti-symmetric)
K=C: T3(X)=(X+X^∗)/2 (complex Hermitian)
K=C: T4(X)=Re(X) (real, the real part taken elementwise)
K=R: T5(X)=(X+X^T)/2 (real symmetric)
K=R: T6(X)=(X−X^T)/2 (real anti-symmetric)
K=C: T7(X)=(Re(X)+Re(X)^T)/2 (real symmetric)
K=C: T8(X)=(Re(X)−Re(X)^T)/2 (real anti-symmetric)
Caution: These are special projection operators on spaces of matrices. The following algorithms do not behave well for general projection operators; they mainly behave well for T1,…,T8 along with operators that I have forgotten about.
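For concreteness, here is how the projections T1,…,T8 look as plain NumPy functions. This is just my transcription of the list above (including the factor of 1/2 in T7 and T8 that makes them idempotent); for K=R one would work with real arrays and drop the complex casts.

```python
import numpy as np

# K = C
def T1(X): return (X + X.T) / 2                     # complex symmetric
def T2(X): return (X - X.T) / 2                     # complex anti-symmetric
def T3(X): return (X + X.conj().T) / 2              # complex Hermitian
def T4(X): return X.real + 0j                       # elementwise real part
def T7(X): return (X.real + X.real.T) / 2 + 0j      # real symmetric inside M_d(C)
def T8(X): return (X.real - X.real.T) / 2 + 0j      # real anti-symmetric inside M_d(C)

# K = R
def T5(X): return (X + X.T) / 2                     # real symmetric
def T6(X): return (X - X.T) / 2                     # real anti-symmetric
```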
We are now ready to describe our machine learning algorithm’s input and objective.
Input: r matrices A1,…,Ar∈Mn(K)
Objective: Our goal is to obtain matrices X1,…,Xr∈Md(K) with T(Xj)=Xj for all j that locally maximize the similarity ∥(A1,…,Ar)≃(X1,…,Xr)∥2.
In this case, we shall call (X1,…,Xr) an L2,d-spectral radius dimensionality reduction (LSRDR) along the subspace im(T).
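I am not specifying the optimization procedure here, so the following is only a naive sketch of the objective: a projected random search that keeps T(Xj)=Xj at every step and accepts a perturbation only when the similarity increases. It reuses the similarity function and the projections from the sketches above; a serious implementation would presumably use gradient-based optimization instead.

```python
import numpy as np

def lsrdr_sketch(As, d, T, steps=20000, step_size=0.01, seed=0):
    """Naive projected random search for an LSRDR along im(T) (illustration only)."""
    rng = np.random.default_rng(seed)
    r = len(As)

    def rand():
        # random complex d x d matrix; for K = R, drop the imaginary part
        return rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))

    Xs = [T(rand()) for _ in range(r)]                 # start inside the image of T
    best = similarity(As, Xs)
    for _ in range(steps):
        j = rng.integers(r)
        candidate = list(Xs)
        candidate[j] = T(Xs[j] + step_size * rand())   # perturb one matrix, stay inside im(T)
        score = similarity(As, candidate)
        if score > best:                               # keep only improving moves
            Xs, best = candidate, score
    return Xs, best
```

For example, lsrdr_sketch(As, d=2, T=T3) searches for a tuple of 2×2 complex Hermitian matrices that locally maximizes the similarity with (A1,…,Ar).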
LSRDRs along subspaces are often very well-behaved and exhibit several nice properties.
If (X1,…,Xr) and (Y1,…,Yr) are both LSRDRs of (A1,…,Ar) along the same subspace, then there are typically a scalar λ and an invertible matrix C where Yj=λCXjC^{-1} for all j. Furthermore, if (X1,…,Xr) is an LSRDR of (A1,…,Ar) along a subspace, then we can typically find matrices R,S where Xj=T(RAjS) for all j.
The model (X1,…,Xr) is simplified since it is encoded in the matrices R,S, but this also means that the model (X1,…,Xr) is a linear model. I have only just made these observations about LSRDRs along subspaces, but they behave mathematically enough for me, especially since the matrices R,S tend to have mathematical properties that I cannot yet explain and am still exploring.
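The first observation is easy to probe numerically. The helper below is my own check, not part of the algorithm: given two tuples (X1,…,Xr) and (Y1,…,Yr), it estimates |λ| from the spectral radii of the corresponding Φ operators (if Yj=λCXjC^{-1}, then ρ(Φ(Y))=|λ|²ρ(Φ(X))) and then solves for C by noting that YjC=λCXj is a linear condition on C. It only tries λ=±|λ|, so it can miss a genuinely complex λ, and it reuses gamma_matrix and spectral_radius from the first sketch.

```python
import numpy as np

def check_similarity_transform(Xs, Ys):
    """Estimate lambda, C with Y_j ~= lambda * C X_j C^{-1}; small residual supports the claim."""
    d = Xs[0].shape[0]
    # if Y_j = lambda C X_j C^{-1}, then rho(Phi(Y)) = |lambda|^2 rho(Phi(X))
    lam_abs = np.sqrt(spectral_radius(gamma_matrix(Ys, Ys)) /
                      spectral_radius(gamma_matrix(Xs, Xs)))
    best = None
    for lam in (lam_abs, -lam_abs):                    # only real lambda is tried here
        # Y_j C - lam * C X_j = 0 is linear in C: stack the vectorised (row-major) equations
        K = np.vstack([np.kron(Y, np.eye(d)) - lam * np.kron(np.eye(d), X.T)
                       for X, Y in zip(Xs, Ys)])
        _, s, Vh = np.linalg.svd(K)
        C = Vh[-1].conj().reshape(d, d)                # right singular vector of the smallest singular value
        residual = s[-1] / s[0]                        # relative size of the unexplained part
        if best is None or residual < best[0]:
            best = (residual, lam, C)
    return best                                        # (relative residual, lambda, C)
```

Running two LSRDR searches from different random seeds and passing the results to check_similarity_transform should, if the observation holds, return a residual close to zero together with the witnessing λ and C.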