It is time for us to interpret some linear machine learning models that I have been working on. These models are linear, but the algorithms generalize to multilinear models that are more capable while still behaving mathematically, and since one can stack such layers to build non-linear models, these algorithms seem to perform well enough to be relevant to AI safety.
Our goal is to transform a list of n×n-matrices (A1,…,Ar) into a new and simplified list of d×d-matrices (X1,…,Xr). There are several ways in which we might want to simplify the matrices: sometimes we simply want d<n, while in other cases we want the matrices Xj to all be real symmetric, complex symmetric, real Hermitian, complex Hermitian, complex anti-symmetric, etc.
We measure similarity between tuples of matrices using spectral radii. Suppose that (A1,…,Ar) are n×n-matrices and (X1,…,Xr) are d×d-matrices. Define an operator Γ(A1,…,Ar;X1,…,Xr) mapping n×d-matrices to n×d-matrices by setting
Γ(A1,…,Ar;X1,…,Xr)(Y)=A1YX1∗+…+ArYXr∗.
Then define Φ(X1,…,Xr)=Γ(X1,…,Xr;X1,…,Xr). Define the similarity between (A1,…,Ar) and (X1,…,Xr) by setting
∥(A1,…,Ar)≃(X1,…,Xr)∥2 = ρ(Γ(A1,…,Ar;X1,…,Xr)) / (ρ(Φ(A1,…,Ar))^(1/2)·ρ(Φ(X1,…,Xr))^(1/2)),
where ρ denotes the spectral radius. Here, ∥(A1,…,Ar)≃(X1,…,Xr)∥2 should be thought of as a generalization of the cosine similarity to tuples of matrices, and it is always a real number in [0,1], so this is a sensible notion of similarity.
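To make the definition concrete, here is a minimal NumPy sketch of the similarity score. The function names gamma_matrix, spectral_radius, and similarity are my own labels for illustration, not from any particular library; the only ingredient beyond the definition above is the standard identity vec(AYB)=(Bᵀ⊗A)vec(Y), which turns Γ into an ordinary nd×nd matrix whose eigenvalues give the spectral radius.

```python
import numpy as np

def gamma_matrix(As, Xs):
    """Matrix of the operator Y -> A_1 Y X_1^* + ... + A_r Y X_r^* on n x d matrices."""
    n, d = As[0].shape[0], Xs[0].shape[0]
    M = np.zeros((n * d, n * d), dtype=complex)
    for A, X in zip(As, Xs):
        # vec(A Y X^*) = (conj(X) kron A) vec(Y) in the column-major convention;
        # the spectral radius does not depend on the vectorisation convention.
        M += np.kron(np.conj(X), A)
    return M

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def similarity(As, Xs):
    """The similarity ||(A_1,...,A_r) ~ (X_1,...,X_r)||_2 defined above."""
    num = spectral_radius(gamma_matrix(As, Xs))
    den = np.sqrt(spectral_radius(gamma_matrix(As, As))) * \
          np.sqrt(spectral_radius(gamma_matrix(Xs, Xs)))
    return num / den
```

As a sanity check on the cosine-similarity analogy, similarity(As, As) returns 1, and the score is unchanged if all the Xj are rescaled by the same nonzero scalar.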
Suppose that K is either the field of real or complex numbers. Let Mn(K) denote the set of n by n matrices over K.
Let n,d be positive integers. Let T:Md(K)→Md(K) denote a projection operator. Here, T is always a real-linear operator, but when K=C, T is not necessarily complex linear. Here are a few examples of such projection operators T that work (a code sketch of these projections appears after the caution below):
K=C: T1(X)=(X+X^T)/2 (complex symmetric)
K=C: T2(X)=(X−X^T)/2 (complex anti-symmetric)
K=C: T3(X)=(X+X^∗)/2 (complex Hermitian)
K=C: T4(X)=Re(X) (real, the real part taken elementwise)
K=R: T5(X)=(X+X^T)/2 (real symmetric)
K=R: T6(X)=(X−X^T)/2 (real anti-symmetric)
K=C: T7(X)=(Re(X)+Re(X)^T)/2 (real symmetric)
K=C: T8(X)=(Re(X)−Re(X)^T)/2 (real anti-symmetric)
Caution: These are special projection operators on spaces of matrices. The following algorithms do not behave well for general projection operators; they mainly behave well for T1,…,T8 along with operators that I have forgotten about.
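For concreteness, here is how the projections T1,…,T8 look as plain NumPy functions. This is just my transcription of the list above (including the factor of 1/2 in T7 and T8 that makes them idempotent); for K=R one would work with real arrays and drop the complex casts.

```python
import numpy as np

# K = C
def T1(X): return (X + X.T) / 2                     # complex symmetric
def T2(X): return (X - X.T) / 2                     # complex anti-symmetric
def T3(X): return (X + X.conj().T) / 2              # complex Hermitian
def T4(X): return X.real + 0j                       # elementwise real part
def T7(X): return (X.real + X.real.T) / 2 + 0j      # real symmetric inside M_d(C)
def T8(X): return (X.real - X.real.T) / 2 + 0j      # real anti-symmetric inside M_d(C)

# K = R
def T5(X): return (X + X.T) / 2                     # real symmetric
def T6(X): return (X - X.T) / 2                     # real anti-symmetric
```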
We are now ready to describe our machine learning algorithm’s input and objective.
Input: r matrices A1,…,Ar∈Mn(K)
Objective: Our goal is to obtain matrices X1,…,Xr∈Md(K) with T(Xj)=Xj for all j that locally maximize the similarity ∥(A1,…,Ar)≃(X1,…,Xr)∥2.
In this case, we shall call (X1,…,Xr) an L2,d-spectral radius dimensionality reduction (LSRDR) along the subspace im(T).
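I am not specifying the optimization procedure here, so the following is only a naive sketch of the objective: a projected random search that keeps T(Xj)=Xj at every step and accepts a perturbation only when the similarity increases. It reuses the similarity function and the projections from the sketches above; a serious implementation would presumably use gradient-based optimization instead.

```python
import numpy as np

def lsrdr_sketch(As, d, T, steps=20000, step_size=0.01, seed=0):
    """Naive projected random search for an LSRDR along im(T) (illustration only)."""
    rng = np.random.default_rng(seed)
    r = len(As)

    def rand():
        # random complex d x d matrix; for K = R, drop the imaginary part
        return rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))

    Xs = [T(rand()) for _ in range(r)]                 # start inside the image of T
    best = similarity(As, Xs)
    for _ in range(steps):
        j = rng.integers(r)
        candidate = list(Xs)
        candidate[j] = T(Xs[j] + step_size * rand())   # perturb one matrix, stay inside im(T)
        score = similarity(As, candidate)
        if score > best:                               # keep only improving moves
            Xs, best = candidate, score
    return Xs, best
```

For example, lsrdr_sketch(As, d=2, T=T3) searches for a tuple of 2×2 complex Hermitian matrices that locally maximizes the similarity with (A1,…,Ar).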
LSRDRs along subspaces are often very well-behaved and exhibit several nice properties.
If (X1,…,Xr) and (Y1,…,Yr) are both LSRDRs of (A1,…,Ar) along the same subspace, then there are typically a scalar λ and an invertible matrix C where Yj=λCXjC^{-1} for all j. Furthermore, if (X1,…,Xr) is an LSRDR of (A1,…,Ar) along a subspace, then we can typically find matrices R,S where Xj=T(RAjS) for all j.
The model (X1,…,Xr) is simplified since it is encoded in the matrices R,S, but this also means that the model (X1,…,Xr) is a linear model. I have only just made these observations about LSRDRs along subspaces, but they behave mathematically enough for me, especially since the matrices R,S tend to have mathematical properties that I cannot yet explain and am still exploring.
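The first observation is easy to probe numerically. The helper below is my own check, not part of the algorithm: given two tuples (X1,…,Xr) and (Y1,…,Yr), it estimates |λ| from the spectral radii of the corresponding Φ operators (if Yj=λCXjC^{-1}, then ρ(Φ(Y))=|λ|²ρ(Φ(X))) and then solves for C by noting that YjC=λCXj is a linear condition on C. It only tries λ=±|λ|, so it can miss a genuinely complex λ, and it reuses gamma_matrix and spectral_radius from the first sketch.

```python
import numpy as np

def check_similarity_transform(Xs, Ys):
    """Estimate lambda, C with Y_j ~= lambda * C X_j C^{-1}; small residual supports the claim."""
    d = Xs[0].shape[0]
    # if Y_j = lambda C X_j C^{-1}, then rho(Phi(Y)) = |lambda|^2 rho(Phi(X))
    lam_abs = np.sqrt(spectral_radius(gamma_matrix(Ys, Ys)) /
                      spectral_radius(gamma_matrix(Xs, Xs)))
    best = None
    for lam in (lam_abs, -lam_abs):                    # only real lambda is tried here
        # Y_j C - lam * C X_j = 0 is linear in C: stack the vectorised (row-major) equations
        K = np.vstack([np.kron(Y, np.eye(d)) - lam * np.kron(np.eye(d), X.T)
                       for X, Y in zip(Xs, Ys)])
        _, s, Vh = np.linalg.svd(K)
        C = Vh[-1].conj().reshape(d, d)                # right singular vector of the smallest singular value
        residual = s[-1] / s[0]                        # relative size of the unexplained part
        if best is None or residual < best[0]:
            best = (residual, lam, C)
    return best                                        # (relative residual, lambda, C)
```

Running two LSRDR searches from different random seeds and passing the results to check_similarity_transform should, if the observation holds, return a residual close to zero together with the witnessing λ and C.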