This is a post about some of the machine learning algorithms that I have been doing experiments with. These machine learning models behave quite mathematically, which seems to be very helpful for AI interpretability and AI safety.
Sequences of matrices generally cannot be approximated by sequences of Hermitian matrices.
Suppose that $A_1,\dots,A_r$ are $n\times n$-complex matrices and $X_1,\dots,X_r$ are $d\times d$-complex matrices. Then define a mapping $\Gamma(A_1,\dots,A_r;X_1,\dots,X_r):M_{n,d}(\mathbb{C})\rightarrow M_{n,d}(\mathbb{C})$ by $\Gamma(A_1,\dots,A_r;X_1,\dots,X_r)(X)=A_1XX_1^*+\dots+A_rXX_r^*$ for all $X$. Define $\Phi(A_1,\dots,A_r)=\Gamma(A_1,\dots,A_r;A_1,\dots,A_r)$. Define the $L_2$-spectral radius by setting $\rho_2(A_1,\dots,A_r)=\rho(\Phi(A_1,\dots,A_r))^{1/2}$. Define the $L_2$-spectral radius similarity between $(A_1,\dots,A_r)$ and $(X_1,\dots,X_r)$ by
$$\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|_2=\frac{\rho(\Gamma(A_1,\dots,A_r;X_1,\dots,X_r))}{\rho_2(A_1,\dots,A_r)\cdot\rho_2(X_1,\dots,X_r)}.$$
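As a minimal sketch of how one might compute these quantities numerically (using NumPy, with function names of my own choosing), the superoperators can be represented as ordinary matrices: on column-major $\operatorname{vec}(X)$, the operator $\Gamma(A_1,\dots,A_r;X_1,\dots,X_r)$ is the matrix $\sum_j\overline{X_j}\otimes A_j$ by the identity $\operatorname{vec}(AMB)=(B^T\otimes A)\operatorname{vec}(M)$, so the spectral radii come from ordinary eigenvalue computations.

```python
import numpy as np

def gamma_matrix(As, Xs):
    # Matrix of the superoperator X -> A_1 X X_1^* + ... + A_r X X_r^*
    # acting on column-major vec(X), via vec(A M B) = (B^T kron A) vec(M).
    return sum(np.kron(np.conj(X), A) for A, X in zip(As, Xs))

def spectral_radius(M):
    # Largest absolute value of an eigenvalue of M.
    return np.max(np.abs(np.linalg.eigvals(M)))

def rho2(As):
    # L2-spectral radius: rho(Phi(A_1,...,A_r))^(1/2).
    return np.sqrt(spectral_radius(gamma_matrix(As, As)))

def lsr_similarity(As, Xs):
    # L2-spectral radius similarity between (A_1,...,A_r) and (X_1,...,X_r).
    return spectral_radius(gamma_matrix(As, Xs)) / (rho2(As) * rho2(Xs))
```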
The $L_2$-spectral radius similarity is always in the interval $[0,1]$. If $n=d$, $A_1,\dots,A_r$ generate the algebra of $n\times n$-complex matrices, and $X_1,\dots,X_r$ also generate the algebra of $n\times n$-complex matrices, then $\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|_2=1$ if and only if there are $C,\lambda$ with $A_j=\lambda CX_jC^{-1}$ for all $j$.
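For a quick sanity check of the "if" direction, assuming the helper functions sketched above: setting $X_j=\lambda CA_jC^{-1}$ for a random invertible $C$ and a scalar $\lambda$ should give a similarity numerically equal to $1$.

```python
rng = np.random.default_rng(0)
n, r = 3, 2
As = [rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)) for _ in range(r)]
C = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Xs = [2.0 * C @ A @ np.linalg.inv(C) for A in As]  # lambda = 2, conjugation by C
print(lsr_similarity(As, Xs))  # numerically equal to 1.0
```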
Define $\rho_{2,d}^H(A_1,\dots,A_r)$ to be the supremum of $\rho_2(A_1,\dots,A_r)\cdot\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|_2$ where $X_1,\dots,X_r$ are $d\times d$-Hermitian matrices.
One can get lower bounds for $\rho_{2,d}^H(A_1,\dots,A_r)$ simply by locally maximizing $\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|_2$ using gradient ascent, and if one runs this local maximization twice from independent initializations, one typically ends up at the same fitness level, which suggests that the local maximum found this way is in fact the global maximum.
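Here is a crude sketch of that lower-bound computation, assuming the helpers above. I use a finite-difference gradient for brevity (a real experiment would use automatic differentiation), and the parameterization, step count, and learning rate are arbitrary choices for illustration.

```python
def hermitian_from_params(p, d, r):
    # Unpack a real parameter vector into r Hermitian d x d matrices.
    Xs, k = [], 0
    for _ in range(r):
        M = (p[k:k + d * d] + 1j * p[k + d * d:k + 2 * d * d]).reshape(d, d)
        Xs.append((M + M.conj().T) / 2)  # project onto the Hermitian matrices
        k += 2 * d * d
    return Xs

def hermitian_lower_bound(As, d, steps=300, lr=0.1, eps=1e-6, seed=0):
    # Lower bound on rho^H_{2,d}(A_1,...,A_r): locally maximize the similarity
    # over d x d Hermitian tuples by finite-difference gradient ascent.
    r = len(As)
    p = np.random.default_rng(seed).standard_normal(2 * d * d * r)

    def fitness(q):
        return lsr_similarity(As, hermitian_from_params(q, d, r))

    for _ in range(steps):
        grad = np.zeros_like(p)
        f0 = fitness(p)
        for i in range(p.size):
            q = p.copy()
            q[i] += eps
            grad[i] = (fitness(q) - f0) / eps
        p = p + lr * grad
    return rho2(As) * fitness(p)
```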
Empirical observation/conjecture: If $(A_1,\dots,A_r)$ are $n\times n$-complex matrices, then $\rho_{2,n}^H(A_1,\dots,A_r)=\rho_{2,d}^H(A_1,\dots,A_r)$ whenever $d\geq n$.
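Assuming the sketch above, the observation can be probed on small random tuples (the sizes and seeds are arbitrary, and agreement is only up to optimization error):

```python
n, r = 2, 2
rng = np.random.default_rng(1)
As = [rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)) for _ in range(r)]
print(hermitian_lower_bound(As, d=n))      # lower bound for rho^H_{2,n}
print(hermitian_lower_bound(As, d=n + 1))  # typically the same value for d > n
```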
The above observation means that sequences of $n\times n$-matrices $(A_1,\dots,A_r)$ are fundamentally non-Hermitian: we cannot get better models of $(A_1,\dots,A_r)$ by using Hermitian matrices larger than the matrices $(A_1,\dots,A_r)$ themselves. I would kind of prefer the behavior to be more complex instead of doing the same thing whenever $d\geq n$, but the purpose of modeling $(A_1,\dots,A_r)$ as Hermitian matrices is generally to use smaller matrices, not larger ones.
This means that the function $\rho_{2,d}^H$ behaves mathematically.
Now, the model $(X_1,\dots,X_r)$ is a linear model of $(A_1,\dots,A_r)$ since the mapping $A_j\mapsto X_j$ is the restriction of a linear mapping, so such a linear model should only be good for a limited range of tasks, but the mathematical behavior of the model $(X_1,\dots,X_r)$ generalizes to multi-layered machine learning models.