This is a post about some of the machine learning algorithms that I have been doing experiments with. These machine learning models behave quite mathematically, which seems to be very helpful for AI interpretability and AI safety.
Sequences of matrices generally cannot be approximated by sequences of Hermitian matrices.
Suppose that $A_1,\dots,A_r$ are $n\times n$-complex matrices and $X_1,\dots,X_r$ are $d\times d$-complex matrices. Then define a mapping $\Gamma(A_1,\dots,A_r;X_1,\dots,X_r):M_{n,d}(\mathbb{C})\rightarrow M_{n,d}(\mathbb{C})$ by $\Gamma(A_1,\dots,A_r;X_1,\dots,X_r)(X)=A_1XX_1^*+\dots+A_rXX_r^*$ for all $X$. Define $\Phi(A_1,\dots,A_r)=\Gamma(A_1,\dots,A_r;A_1,\dots,A_r)$. Define the $L_2$-spectral radius by setting $\rho_2(A_1,\dots,A_r)=\rho(\Phi(A_1,\dots,A_r))^{1/2}$. Define the $L_2$-spectral radius similarity between $(A_1,\dots,A_r)$ and $(X_1,\dots,X_r)$ by
$$\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|_2=\frac{\rho(\Gamma(A_1,\dots,A_r;X_1,\dots,X_r))}{\rho_2(A_1,\dots,A_r)\cdot\rho_2(X_1,\dots,X_r)}.$$
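As a minimal sketch of how one might compute these quantities numerically (using NumPy, with function names of my own choosing), the superoperators can be represented as ordinary matrices: on column-major $\operatorname{vec}(X)$, the operator $\Gamma(A_1,\dots,A_r;X_1,\dots,X_r)$ is the matrix $\sum_j\overline{X_j}\otimes A_j$ by the identity $\operatorname{vec}(AMB)=(B^T\otimes A)\operatorname{vec}(M)$, so the spectral radii come from ordinary eigenvalue computations.

```python
import numpy as np

def gamma_matrix(As, Xs):
    # Matrix of the superoperator X -> A_1 X X_1^* + ... + A_r X X_r^*
    # acting on column-major vec(X), via vec(A M B) = (B^T kron A) vec(M).
    return sum(np.kron(np.conj(X), A) for A, X in zip(As, Xs))

def spectral_radius(M):
    # Largest absolute value of an eigenvalue of M.
    return np.max(np.abs(np.linalg.eigvals(M)))

def rho2(As):
    # L2-spectral radius: rho(Phi(A_1,...,A_r))^(1/2).
    return np.sqrt(spectral_radius(gamma_matrix(As, As)))

def lsr_similarity(As, Xs):
    # L2-spectral radius similarity between (A_1,...,A_r) and (X_1,...,X_r).
    return spectral_radius(gamma_matrix(As, Xs)) / (rho2(As) * rho2(Xs))
```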
The $L_2$-spectral radius similarity is always in the interval $[0,1]$. If $n=d$, $A_1,\dots,A_r$ generate the algebra of $n\times n$-complex matrices, and $X_1,\dots,X_r$ also generate the algebra of $n\times n$-complex matrices, then $\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|_2=1$ if and only if there are $C,\lambda$ with $A_j=\lambda CX_jC^{-1}$ for all $j$.
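For a quick sanity check of the "if" direction, assuming the helper functions sketched above: setting $X_j=\lambda CA_jC^{-1}$ for a random invertible $C$ and a scalar $\lambda$ should give a similarity numerically equal to $1$.

```python
rng = np.random.default_rng(0)
n, r = 3, 2
As = [rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)) for _ in range(r)]
C = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
Xs = [2.0 * C @ A @ np.linalg.inv(C) for A in As]  # lambda = 2, conjugation by C
print(lsr_similarity(As, Xs))  # numerically equal to 1.0
```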
Define $\rho_{2,d}^H(A_1,\dots,A_r)$ to be the supremum of $\rho_2(A_1,\dots,A_r)\cdot\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|_2$ where $X_1,\dots,X_r$ are $d\times d$-Hermitian matrices.
One can get lower bounds for $\rho_{2,d}^H(A_1,\dots,A_r)$ simply by locally maximizing $\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|_2$ using gradient ascent, and if one runs this local maximization twice from independent initializations, one typically ends up at the same fitness level, which suggests that the local maximum found this way is in fact the global maximum.
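Here is a crude sketch of that lower-bound computation, assuming the helpers above. I use a finite-difference gradient for brevity (a real experiment would use automatic differentiation), and the parameterization, step count, and learning rate are arbitrary choices for illustration.

```python
def hermitian_from_params(p, d, r):
    # Unpack a real parameter vector into r Hermitian d x d matrices.
    Xs, k = [], 0
    for _ in range(r):
        M = (p[k:k + d * d] + 1j * p[k + d * d:k + 2 * d * d]).reshape(d, d)
        Xs.append((M + M.conj().T) / 2)  # project onto the Hermitian matrices
        k += 2 * d * d
    return Xs

def hermitian_lower_bound(As, d, steps=300, lr=0.1, eps=1e-6, seed=0):
    # Lower bound on rho^H_{2,d}(A_1,...,A_r): locally maximize the similarity
    # over d x d Hermitian tuples by finite-difference gradient ascent.
    r = len(As)
    p = np.random.default_rng(seed).standard_normal(2 * d * d * r)

    def fitness(q):
        return lsr_similarity(As, hermitian_from_params(q, d, r))

    for _ in range(steps):
        grad = np.zeros_like(p)
        f0 = fitness(p)
        for i in range(p.size):
            q = p.copy()
            q[i] += eps
            grad[i] = (fitness(q) - f0) / eps
        p = p + lr * grad
    return rho2(As) * fitness(p)
```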
Empirical observation/conjecture: If $(A_1,\dots,A_r)$ are $n\times n$-complex matrices, then $\rho_{2,n}^H(A_1,\dots,A_r)=\rho_{2,d}^H(A_1,\dots,A_r)$ whenever $d\geq n$.
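Assuming the sketch above, the observation can be probed on small random tuples (the sizes and seeds are arbitrary, and agreement is only up to optimization error):

```python
n, r = 2, 2
rng = np.random.default_rng(1)
As = [rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n)) for _ in range(r)]
print(hermitian_lower_bound(As, d=n))      # lower bound for rho^H_{2,n}
print(hermitian_lower_bound(As, d=n + 1))  # typically the same value for d > n
```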
The above observation means that sequences of $n\times n$-matrices $(A_1,\dots,A_r)$ are fundamentally non-Hermitian: we cannot get better models of $(A_1,\dots,A_r)$ by using Hermitian matrices larger than the matrices $(A_1,\dots,A_r)$ themselves. I would kind of prefer the behavior to be more complex instead of doing the same thing whenever $d\geq n$, but the purpose of modeling $(A_1,\dots,A_r)$ as Hermitian matrices is generally to use smaller matrices, not larger ones.
This means that the function $\rho_{2,d}^H$ behaves mathematically.
Now, the model $(X_1,\dots,X_r)$ is a linear model of $(A_1,\dots,A_r)$ since the mapping $A_j\mapsto X_j$ is the restriction of a linear mapping, so such a linear model should only be good for a limited range of tasks, but the mathematical behavior of the model $(X_1,\dots,X_r)$ generalizes to multi-layered machine learning models.