For machine learning, it is desirable for the trained model to have absolutely no random information left over from the initialization; in this short post, I will mathematically prove an interesting (to me) but simple consequence of this desirable behavior.
This post is a result of research I am doing on machine learning algorithms, related to my investigation of cryptographic functions for the cryptocurrency that I launched (to discuss crypto, send me a personal message so we can take the discussion off this site).
This post is about linear machine learning models. More precisely, we are using quantum operators, so the models are more sophisticated than plain logistic regression, but they are still linear; it is easy to train a neural network that solves more sophisticated problems than these linear models can. But the kinds of results in this post also extend to some non-linear models with multiple layers and stronger capabilities. It is just easier to understand what is going on with the linear models, and even with the linear models we still obtain some interesting mathematics.
We say that a machine learning model trained by gradient ascent/descent is pseudodeterministically trained (or just pseudodeterministic for short) if the fitness/loss function has precisely one local optimum. As a result, the trained model will have absolutely no information left over from the initialization. As another consequence, the trained model will attain the global optimum rather than a suboptimal local optimum. The results in this post will actually hold whenever the global optimum is unique. But I need to bring up pseudodeterminism since pseudodeterminism implies that we can actually find the unique global optimum instead of always getting stuck at a suboptimal local optimum.
If a machine learning model globally optimizes an objective function, the model should be considered an inherently interpretable model rather than a high-performance model, since it contains no random information independent of the objective function itself, and since one can only find the global optima of sufficiently easy objective functions. The global optimum is also more interpretable because it inherits the symmetries of the objective function, which depends on the training data. In this post, we shall show that if the training data has some symmetry, then the quantum operator that we train will also have that symmetry.
This post is mathematical and contains mathematical proofs. Fortunately, the proofs are not that difficult, so they should be easy on the reader; the main thrust of this post is that these proofs are backed up by experimental results. The main bottleneck to understanding this post is therefore getting through all the technical definitions. I might follow up this short post with a more general one, so you should read this before going through the more general post.
Let $U$ be a finite dimensional complex inner product space. If $B\subseteq U^2$, then define sets $B^*,B^\top,\overline{B}$ by setting
$$B^*=\{(y,x):(x,y)\in B\},\quad \overline{B}=\{(\overline{x},\overline{y}):(x,y)\in B\},$$
$$B^\top=\{(\overline{y},\overline{x}):(x,y)\in B\}.$$
Let $\mu$ be a probability measure on $U^2$. Here, $\mu$ is the probability distribution of the training data. Define new measures $\mu^*,\overline{\mu},\mu^\top$ by setting
$$\mu^*(B)=\mu(B^*),\quad \overline{\mu}(B)=\mu(\overline{B}),\quad \mu^\top(B)=\mu(B^\top).$$
Let $L(U)$ denote the collection of linear operators from $U$ to $U$. If $A_1,\dots,A_r\in L(U)$, then define an operator $\Phi(A_1,\dots,A_r):L(U)\rightarrow L(U)$ by setting $\Phi(A_1,\dots,A_r)(X)=A_1XA_1^*+\dots+A_rXA_r^*$. The operators of the form $\Phi(A_1,\dots,A_r)$ are the completely positive superoperators of Choi rank at most $r$. Recall that $L(U)$ is an inner product space with the Frobenius inner product. It is easy to show that the Hermitian adjoint $\Phi(A_1,\dots,A_r)^*$ is just $\Phi(A_1^*,\dots,A_r^*)$. If $E$ is a completely positive superoperator, then define $\overline{E}$ by setting $\overline{E}(X)=(E(X^{*\top}))^{*\top}$. Define $E^\top=\overline{E}^*=\overline{E^*}$. Then it is easy to show that $\overline{\Phi(A_1,\dots,A_r)}=\Phi(\overline{A_1},\dots,\overline{A_r})$ and
$\Phi(A_1,\dots,A_r)^\top=\Phi(A_1^\top,\dots,A_r^\top)$.
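These definitions are concrete enough to sanity-check numerically. Here is a minimal sketch (my own code, in Python with NumPy; all names are illustrative) that implements $\Phi(A_1,\dots,A_r)$ and verifies the identities $\Phi(A_1,\dots,A_r)^*=\Phi(A_1^*,\dots,A_r^*)$ and $\overline{\Phi(A_1,\dots,A_r)}=\Phi(\overline{A_1},\dots,\overline{A_r})$ on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 3, 2  # dimension of U and Choi rank

# Random operators A_1, ..., A_r in L(U).
A = [rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)) for _ in range(r)]

def phi(ops, X):
    # Phi(A_1,...,A_r)(X) = A_1 X A_1^* + ... + A_r X A_r^*
    return sum(Ai @ X @ Ai.conj().T for Ai in ops)

def frob(X, Y):
    # Frobenius inner product <X, Y> = tr(Y^* X)
    return np.trace(Y.conj().T @ X)

X = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
Y = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))

# Adjoint identity: <Phi(A)(X), Y> = <X, Phi(A_1^*,...,A_r^*)(Y)>
A_star = [Ai.conj().T for Ai in A]
assert np.isclose(frob(phi(A, X), Y), frob(X, phi(A_star, Y)))

# Conjugate identity: conj(Phi(A))(X) = conj(Phi(A)(conj(X))) = Phi(conj(A_1),...)(X)
A_bar = [Ai.conj() for Ai in A]
assert np.allclose(phi(A_bar, X), phi(A, X.conj()).conj())
```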
We say that a norm $\|\cdot\|$ on $L(L(U))$ is Hermitian adjoint preserving (resp. conjugate preserving, transpose preserving) if $\|E\|=\|E^*\|$ (resp. $\|E\|=\|\overline{E}\|$, $\|E\|=\|E^\top\|$).
The domain of the fitness function $F_{\mu,n,\|\cdot\|}$ is the set of all non-zero completely positive superoperators $E:L(U)\rightarrow L(U)$ of Choi rank at most $n$ with $\|E\|=1$. We define the fitness function $F_{\mu,n,\|\cdot\|}$ by setting
$$F_{\mu,n,\|\cdot\|}(E)=\int\log(\langle E(xx^*),yy^*\rangle)\,d\mu(x,y)=\int\log(\langle E(xx^*),yy^*\rangle)\,d\mu(x,y)-\log(\|E\|)$$
(the two expressions agree since $\|E\|=1$). Observe that we also have $F_{\mu,n,\|\cdot\|}(E)=\int\log(y^*E(xx^*)y)\,d\mu(x,y)$.
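When $\mu$ is a convex combination of finitely many point masses, the integral is just a weighted sum, so the fitness is cheap to evaluate. A sketch under that assumption (my own code; the equal weights and random training pairs are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, N = 3, 2, 5  # dimension, Choi rank, number of point masses

# Training pairs (x_k, y_k), each with weight 1/N.
xs = [rng.standard_normal(d) + 1j * rng.standard_normal(d) for _ in range(N)]
ys = [rng.standard_normal(d) + 1j * rng.standard_normal(d) for _ in range(N)]

def fitness(A):
    # F(E) = (1/N) * sum_k log(y_k^* E(x_k x_k^*) y_k)  for E = Phi(A_1,...,A_r);
    # note y^* E(xx^*) y = sum_i |y^* A_i x|^2, which is real and nonnegative.
    total = 0.0
    for x, y in zip(xs, ys):
        total += np.log(sum(abs(y.conj() @ Ai @ x) ** 2 for Ai in A))
    return total / N

A = [rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)) for _ in range(r)]
f = fitness(A)
assert np.isfinite(f)
```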
Experimental result (pseudodeterminism): Computer experiments show that the function $F_{\mu,n,\|\cdot\|}$ typically has only one local maximum in the sense that we cannot find any other local maximum.
Define a function $F_\mu$ whose domain is the set of all completely positive superoperators $E:L(U)\rightarrow L(U)$ by setting
$$F_\mu(E)=\int\log(\langle E(xx^*),yy^*\rangle)\,d\mu(x,y),$$ which is equivalent to
$$F_\mu(E)=\int\log(y^*E(xx^*)y)\,d\mu(x,y).$$ We write $F(\mu,E)$ for $F_\mu(E)$ to reduce the use of subscripts.
Lemma: $F(\mu,E)=F(\mu^*,E^*)=F(\overline{\mu},\overline{E})=F(\mu^\top,E^\top)$.
Proof: $F(\mu,E)=\int\log(\langle E(xx^*),yy^*\rangle)\,d\mu(x,y)$
$=\int\log(\langle xx^*,E^*(yy^*)\rangle)\,d\mu(x,y)$
$=\int\log(\langle E^*(yy^*),xx^*\rangle)\,d\mu(x,y)$ (these inner products are real and nonnegative)
$=\int\log(\langle E^*(xx^*),yy^*\rangle)\,d\mu^*(x,y)=F(\mu^*,E^*)$.
Likewise, since $y^*E(xx^*)y$ is real and therefore equals its own complex conjugate,
$F(\mu,E)=\int\log(y^*E(xx^*)y)\,d\mu(x,y)$
$=\int\log(\overline{y}^*\,\overline{E(xx^*)}\,\overline{y})\,d\mu(x,y)$
$=\int\log(\overline{y}^*\cdot\overline{E}(\overline{xx^*})\cdot\overline{y})\,d\mu(x,y)$
$=\int\log(y^*\cdot\overline{E}(xx^*)\cdot y)\,d\overline{\mu}(x,y)=F(\overline{\mu},\overline{E})$.
As a consequence, $F(\mu,E)=F(\mu^*,E^*)=F(\overline{\mu^*},\overline{E^*})=F(\mu^\top,E^\top)$. Q.E.D.
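The lemma is easy to check numerically when $\mu$ is a finite convex combination of point masses: $\mu^*$ swaps each training pair, $E^*$ takes adjoints of the Kraus operators, and $\overline{\mu},\overline{E}$ conjugate everything entrywise. A small self-contained check (my own sketch; names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3

def phi(ops, X):
    # Phi(A_1,...,A_r)(X) = sum_i A_i X A_i^*
    return sum(Ai @ X @ Ai.conj().T for Ai in ops)

def F(pairs, ops):
    # F(mu, E) = average of log(y^* E(x x^*) y) over the point masses (x, y)
    return np.mean([
        np.log((y.conj() @ phi(ops, np.outer(x, x.conj())) @ y).real)
        for x, y in pairs
    ])

pairs = [(rng.standard_normal(d) + 1j * rng.standard_normal(d),
          rng.standard_normal(d) + 1j * rng.standard_normal(d)) for _ in range(4)]
A = [rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d)) for _ in range(2)]

# mu^* swaps each (x, y); E^* = Phi(A_1^*, ..., A_r^*).
pairs_star = [(y, x) for x, y in pairs]
A_star = [Ai.conj().T for Ai in A]
assert np.isclose(F(pairs, A), F(pairs_star, A_star))

# conj(mu) conjugates each pair; conj(E) = Phi(conj(A_1), ..., conj(A_r)).
pairs_bar = [(x.conj(), y.conj()) for x, y in pairs]
A_bar = [Ai.conj() for Ai in A]
assert np.isclose(F(pairs, A), F(pairs_bar, A_bar))
```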
Theorem: Suppose that $F_{\mu,n,\|\cdot\|}$ has a unique global maximum $(E,F_{\mu,n,\|\cdot\|}(E))$.
1. If $\|\cdot\|$ is Hermitian adjoint preserving and $\mu=\mu^*$, then $E=E^*$.
2. If $\|\cdot\|$ is conjugate preserving and $\mu=\overline{\mu}$, then $E=\overline{E}$.
3. If $\|\cdot\|$ is transpose preserving and $\mu=\mu^\top$, then $E=E^\top$.
Proof: The proofs of 2 and 3 are similar to that of 1, so we shall only prove 1. Assuming the premises of 1, both $E$ and $E^*$ belong to the domain of $F_{\mu,n,\|\cdot\|}$. But
$F_{\mu,n,\|\cdot\|}(E)=F(\mu,E)=F(\mu^*,E^*)=F(\mu,E^*)=F_{\mu,n,\|\cdot\|}(E^*)$. Since $F_{\mu,n,\|\cdot\|}$ has only one global maximum, we conclude that $E=E^*$. Q.E.D.
From the above result, we conclude that the global maximum $(E,F_{\mu,n,\|\cdot\|}(E))$ inherits any symmetry that the measure $\mu$ has.
I would really like to build these inherently interpretable models so that they can solve some really interesting problems (or at least be a few layers in solving them), but I am still stuck attempting to communicate with people about linear models. Having a unique global optimum or more generally pseudodeterminism seems to be the best way to develop inherently interpretable and safe AI, but I have a hard time communicating with anyone about this.
Hopefully this will draw some attention! But are you sacrificing something else, for the sake of these desirable properties?
Yes. It seems like to get pseudodeterministic AI, we will need to rebuild AI from the very beginning, and I am not sure that it will all work. For example, pseudodeterminism is harder to attain with stochastic or mini-batch gradient descent, so one might need to use all the training data whenever one updates the weights. I have so far been able to get pseudodeterministic multi-layered models for solving classification problems, word embeddings for NLP, models that measure the security of block ciphers such as the Advanced Encryption Standard (the models evaluating AES are very easy to train), and other things. I have not been able to make pseudodeterministic versions of convolutional networks, transformers, GANs, etc. We can use pseudodeterminism for narrow AI or the first few layers of a deep neural network right now though. There is also a funding and exposure issue, since not very many people are talking about pseudodeterminism. I have more posts planned about this though.
A trade of performance in exchange for interpretability is exactly what we want for AI safety.
Experimental result (pseudodeterminism): Computer experiments show that the function $F_{\mu,n,\|\cdot\|}$ typically has only one local maximum in the sense that we cannot find any other local maximum.
a lot hinges on this. i would be interested to learn about the experimental setup.
For experiments, I just used a convex combination of point mass measures for μ where the point masses are generated uniformly at random (though I might get something more complicated if I tried evaluating the integrals). I then attempted to find multiple local maxima by the usual gradient ascent. If I always end up with the same local maximum, I presume that there is only one local maximum even though I have no mathematical proof that this is the case.
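For concreteness, here is a minimal version of this kind of experiment as I understand it (my own reconstruction, not the author's code): parametrize $E=\Phi(A_1,\dots,A_r)$ by the real and imaginary parts of the $A_i$, use the Frobenius norm as a stand-in for the $\|E\|=1$ constraint, run plain gradient ascent from two independent initializations, and compare the optima reached via the Choi matrix (two Kraus decompositions give the same superoperator exactly when their Choi matrices agree):

```python
import numpy as np

rng = np.random.default_rng(3)
d, r, N = 2, 2, 6  # dimension, Choi rank, number of point masses

xs = [rng.standard_normal(d) + 1j * rng.standard_normal(d) for _ in range(N)]
ys = [rng.standard_normal(d) + 1j * rng.standard_normal(d) for _ in range(N)]

def unpack(theta):
    # real vector -> Kraus operators, Frobenius-normalized (stand-in for ||E|| = 1)
    m = r * d * d
    A = (theta[:m] + 1j * theta[m:]).reshape(r, d, d)
    return A / np.linalg.norm(A)

def fitness(theta):
    # (1/N) sum_k log <E(x_k x_k^*), y_k y_k^*>, i.e. (1/N) sum_k log sum_i |y_k^* A_i x_k|^2
    A = unpack(theta)
    return sum(np.log(sum(abs(y.conj() @ Ai @ x) ** 2 for Ai in A))
               for x, y in zip(xs, ys)) / N

def ascend(theta, steps=1000, lr=0.1, eps=1e-6):
    # plain gradient ascent with a central-difference numerical gradient
    for _ in range(steps):
        g = np.zeros_like(theta)
        for i in range(len(theta)):
            e = np.zeros_like(theta); e[i] = eps
            g[i] = (fitness(theta + e) - fitness(theta - e)) / (2 * eps)
        theta = theta + lr * g
    return theta

def choi(theta):
    # Choi matrix of Phi(A_1,...,A_r); equal Choi matrices mean equal superoperators
    A = unpack(theta)
    return sum(np.outer(Ai.reshape(-1), Ai.reshape(-1).conj()) for Ai in A)

t1 = ascend(rng.standard_normal(2 * r * d * d))
t2 = ascend(rng.standard_normal(2 * r * d * d))
# Both initializations reach (numerically) the same maximum value and the same model.
assert abs(fitness(t1) - fitness(t2)) < 1e-2
assert np.linalg.norm(choi(t1) - choi(t2)) < 0.2
```

This is only a toy instance; numerical gradients and the fixed step size are conveniences, not what one would use at scale.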
I am redoing the experiments, and the only way I can get pseudodeterminism to fail is by using real inner product spaces instead of complex inner product spaces and by setting $n=1$ (in this case, pseudodeterminism fails because the set of all points where the fitness function returns a real number instead of negative infinity has multiple components). Even when pseudodeterminism fails, it does not fail that badly: the distribution of trained models has low collision entropy $-\log P(X=Y)$, so $P(X=Y)$, where $X,Y$ are trained models with different initializations, is still high.
Pseudodeterminism does not seem to be rare, but the problem in machine learning is to pseudodeterministically train machine learning models that can solve interesting and challenging problems; I have been working on this in my spare time (without anyone’s help), but since people don’t seem to be interested in this, progress has been slow.
a convex combination of point mass measures for μ where the point masses are generated uniformly at random
two questions:
this seems to include an assumption regarding the points between training samples. if we take the point masses as the known values, then with this step we’re adding some interpolation between those. (that is, if we were “really training” these things, then the integral would look like a sum, since it would only have support at those (x,y) that represent points in our training data.)
have you tried adversarially constructing μ such that the integral has multiple maxima? if so, what did you run into? i worry that this could be a case where “random” examples mostly have this property, but some important subset does not (similar to how random functions are nowhere differentiable, but we can still do calculus.)
i’m interested in this! but i’ve only been introduced to these ideas recently, by your post! please read these questions as my own attempts to understand how you’re thinking about it.
Yes. When we take convex combinations of finitely many point mass measures, the integral is just a sum. I use the sum of finitely many elements for ease of calculations, but to prove theorems, I should use measures for full generality.
The idea of finding an object $A$ along with distinct local optima $G_A(x_1),\dots,G_A(x_n)$ with $n$ maximized looks like an interesting problem to work on. I have not worked on this kind of objective before, but I can certainly try this, as I have a few ideas of how to do it. This might work better for discrete optimization problems though, since I cannot think of a good way to use gradient updates to produce new local optima; in this case, I will need to use either evolutionary computation or hill climbing instead. I do not think that this will result in natural looking objects $A$ though, so I don't think I can learn much from this endeavor.
I have not thought much about finding measures $\mu$ where $F_{\mu,n,\|\cdot\|}$ has many local maxima because I have many higher priorities. These days, people are focused on more complicated machine learning systems such as large language models, and in order to catch up, I also need to increase the performance, capabilities, and efficiency of my pseudodeterministic machine learning models. For the more complicated multi-layered models, it seems more difficult to obtain and retain pseudodeterminism. Pseudodeterminism is a robust property for simple objective functions, such as when we are training a linear model or performing convex optimization, but it becomes increasingly fragile as we increase the sophistication of our objective functions. This means that for the sophisticated models I want to work more on, it is trivial to violate pseudodeterminism and difficult to retain it.
I am not at all worried about any strange case of non-pseudodeterminism when optimizing $F_{\mu,n,\|\cdot\|}$ for measures I have not thought about yet, since this problem is not even close to being non-pseudodeterministic. For example, if $E_0,F_0$ are norm-1 completely positive superoperators of the same Choi rank $\leq N$, if $(x_n,y_n)\in(U\setminus\{0\})^2$ for all $n$, and $(E_n)_n,(F_n)_n$ are sequences where
$E_{n+1}$ is obtained by moving from $E_n$ in the direction of the gradient (with possible momentum) of $\log(|\langle E_nx_n,y_n\rangle|^2)-\log(\|E_n\|)$, and $F_{n+1}$ is obtained from $F_n$ the same way with the same rate, then my experiments show that $\lim_{n\to\infty}\|E_n-F_n\|=0$ regardless of what each $(x_n,y_n)$ is. In other words, even if $(E_n)_n$ does not converge, the sequences $(E_n)_n,(F_n)_n$ uniformly approximate each other as $n\to\infty$. This is a much stronger form of pseudodeterminism that is hard to violate, so it is not a high priority to find particular instances of non-pseudodeterminism, especially if those instances do not coincide with real-world data.
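A rough sketch of this coupled-trajectory experiment (my own reconstruction in Python/NumPy; I parametrize $E=\Phi(A_1,\dots,A_r)$ by the real and imaginary parts of the $A_i$ with Frobenius normalization, and read the per-step objective as the post's $\log\langle E_n(x_nx_n^*),y_ny_n^*\rangle$ term) feeds two differently initialized models the identical data stream and tracks the distance between them:

```python
import numpy as np

rng = np.random.default_rng(5)
d, r = 2, 2  # dimension and Choi rank

def unpack(theta):
    # real vector -> Kraus operators, Frobenius-normalized (stand-in for the norm term)
    m = r * d * d
    A = (theta[:m] + 1j * theta[m:]).reshape(r, d, d)
    return A / np.linalg.norm(A)

def objective(theta, x, y):
    # per-step objective: log <E(xx^*), yy^*> = log sum_i |y^* A_i x|^2
    A = unpack(theta)
    return np.log(sum(abs(y.conj() @ Ai @ x) ** 2 for Ai in A))

def step(theta, x, y, lr=0.05, eps=1e-6):
    # one gradient-ascent update with a central-difference numerical gradient
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (objective(theta + e, x, y) - objective(theta - e, x, y)) / (2 * eps)
    return theta + lr * g

def choi(theta):
    # Choi matrix of Phi(A_1,...,A_r), used to compare the two superoperators
    A = unpack(theta)
    return sum(np.outer(Ai.reshape(-1), Ai.reshape(-1).conj()) for Ai in A)

t1 = rng.standard_normal(2 * r * d * d)
t2 = rng.standard_normal(2 * r * d * d)
d0 = np.linalg.norm(choi(t1) - choi(t2))  # initial distance between the two models
for _ in range(500):
    x = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    y = rng.standard_normal(d) + 1j * rng.standard_normal(d)
    t1, t2 = step(t1, x, y), step(t2, x, y)  # identical data, different initializations
d1 = np.linalg.norm(choi(t1) - choi(t2))
assert d1 < d0  # the two trajectories approach each other
```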
I kind of expect the fitness function $F_{\mu,n,\|\cdot\|}$ to have just one or a few local maxima because its closest relatives are the linear models, and those linear models are obtained by optimizing an objective function with one local optimum. I also expect $F_{\mu,n,\|\cdot\|}$ to have one or very few local maxima because it is similar to many other objective functions I have constructed, each with one or a few local optima, and since $F_{\mu,n,\|\cdot\|}$ is simpler than those objective functions, it should also have very few local optima. Moreover, the function $F_\mu$ is concave, so the set $\{F_\mu(E):E\in Q\}$ has only one local maximum value whenever $Q$ is a convex set (convex sets of interest include all quantum channels and all unital channels). Because we restrict attention to completely positive superoperators of low Choi rank on the boundary of the unit ball, we cannot use convexity to prove that $F_{\mu,n,\|\cdot\|}$ has only one local maximum, but convexity still suggests that there should be just one, especially when $n$ is large. When $n$ is small, we cannot use convexity to draw conclusions: I did a Hessian calculation, and the Hessian of $F(\mu,\Phi(A_1,\dots,A_r))$ with respect to $(A_1,\dots,A_r)$ generally has plenty of both positive and negative eigenvalues. I do not consider it a major problem if $F_{\mu,n,\|\cdot\|}$ has multiple local maxima, since that probably just means that we need to increase the value of $n$ until these local maxima merge.