Joseph Van Name

Karma: 6

Joseph Van Name 5 Jun 2026 21:29 UTC
1 point
0
on: Joseph Van Name’s Shortform
In this post, we shall compute average loss/fitness level for a linear dimensionality reduction.
The purpose of these calculations is to demonstrate that such a linear dimensionality reduction behaves mathematically and should be used as a simple model for what your loss/fitness functions should look like in AI/ML if you want your AI/ML to be well-behaved and interpretable.
Suppose that is either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Suppose that is a -dimensional inner product space over the field.
Suppose that is a measure over the unit sphere in . Then the objective is to find an optimal -dimensional subspace of for the measure . Let be a function. Then define a function mapping the set of all -dimensional orthogonal projection matrices to by setting . The goal is to find an orthogonal projection that maximizes/minimizes .
Let . Then, let be independent random variables each following the standard normal distribution in one real variable. Then observe that follows the chi-squared distribution with degrees of freedom. If follows the chi-squared distribution with degrees of freedom, then where is the digamma function. Let be a probability measure on the unit sphere of , and let be the uniform probability measure on the set of all orthogonal projections from to of rank . Then
where the random variable follows the F-distribution with and degrees of freedom. From standard facts about the F-distribution, we know that if and is a positive integer, then
. Observe that precisely when
, so in this case when , then
, and
diverges whenever .
Here is the digamma function where For integers and half-integers, the digamma function can be evaluated as where is the Euler-Mascheroni constant, and which is a harmonic number. Thus, in the case where both are even (which includes the complex and quaternionic case), we have
.
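As a numerical sanity check of the digamma facts used above, here is a minimal Python sketch. It assumes the standard identities $ψ(n) = H_{n-1} - γ$ for positive integers $n$, $ψ(n + 1/2) = -γ - 2 \log 2 + 2 \sum_{k=1}^{n} 1/(2k-1)$ for integers $n \geq 0$, and $E[\log X] = ψ(k/2) + \log 2$ for a chi-squared variable $X$ with $k$ degrees of freedom. The function names are illustrative and are not taken from any library.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def digamma_integer(n):
    """psi(n) for a positive integer n, via the harmonic number H_{n-1}."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, n))

def digamma_half_integer(n):
    """psi(n + 1/2) for an integer n >= 0."""
    odd_sum = sum(1.0 / (2 * k - 1) for k in range(1, n + 1))
    return -EULER_GAMMA - 2.0 * math.log(2.0) + 2.0 * odd_sum

def digamma_numeric(x, h=1e-5):
    """Central-difference approximation to d/dx log Gamma(x)."""
    return (math.lgamma(x + h) - math.lgamma(x - h)) / (2.0 * h)

def expected_log_chi_squared(k):
    """E[log X] for X chi-squared with k >= 1 degrees of freedom."""
    if k % 2 == 0:
        return digamma_integer(k // 2) + math.log(2.0)
    return digamma_half_integer(k // 2) + math.log(2.0)

# Compare the closed forms with the numerical derivative of log Gamma.
print(digamma_integer(5), digamma_numeric(5.0))
print(digamma_half_integer(3), digamma_numeric(3.5))
print(expected_log_chi_squared(2), expected_log_chi_squared(3))
```

The closed forms are exact up to floating point rounding, while the finite-difference value is only an approximation, so the two columns should agree to several decimal places rather than exactly.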

Joseph Van Name 29 May 2026 3:59 UTC
1 point
0
on: Joseph Van Name’s Shortform
I was able to completely interpret a simple machine learning model trained on some cryptographic input. This objective is a special case of something I call an LSRDR which is a machine learning algorithm that I created in order to analyze block ciphers for cryptocurrency mining.
Set Let denote the finite field with elements. For each , let be the function defined by . Let denote the standard irreducible representation of . Here, can be represented somewhat inconveniently as an -matrix. Then our objective is to find a unit vector such that the spectral radius is locally maximized. Sometimes I obtain a bad local maximum, but sometimes I obtain a good one. Whenever I obtain a good local maximum, it is always the same thing. And in this case, for the good local maximum, after multiplying by −1 for positivity, I can always find positive constants such that whenever , whenever and .
Here, , .
The scenario where we obtain a perfect and complete interpretation of the local optimum happens all the time with these sorts of optimization algorithms that I have been working on, so if we want to develop more interpretable machine learning, it seems like this is the right direction to go. Of course, my trained model is very simple, so we need to do a substantial amount of work to generalize this sort of machine learning algorithm to something like a deep neural network. I am making progress, but it takes more computational power than I have to make progress with inherently interpretable deep learning.

Joseph Van Name 21 May 2026 1:53 UTC
1 point
0
on: Joseph Van Name’s Shortform
This post will be about my machine learning algorithm where quadratic algebraic numbers including the golden ratio appear in the trained models. This demonstrates that these machine learning models behave mathematically which is exactly the kind of thing that we want for AI interpretability and AI safety.
This post will be about particular examples of -spectral radius dimensionality reductions (LSRDRs). I originally developed the notion of an LSRDR to evaluate the cryptographic security of block ciphers for cryptocurrency mining, but let’s talk about machine learning instead of cryptocurrency technologies here.
Also, the results in this post have been obtained experimentally. I have not proven these results rigorously.
Dimensionality reduction: Let denote either the field of real or complex numbers. Suppose that are -matrices over and are -matrices over . Then define the operation by setting . Define the operator .
Define the -spectral radius similarity by setting
.
Here, the spectral radius is analogous to a dot product, and is analogous to the cosine similarity.
If are fixed matrices and , then we say that is an -SRDR if the similarity is locally maximized. Informally, the LSRDR is a collection of smaller matrices that approximates the collection of bigger matrices.
Lie algebras: A Lie algebra is a vector space over a field together with a bilinear operation that satisfies the identities:
1. for all
2. for all .
For example, if is an associative bilinear operation, then one can check that the commutator operation defined by is a Lie-bracket, and a Lie algebra should be thought of as a vector space with an abstract commutator operation.
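The claim that a commutator is a Lie bracket can be checked numerically. The following is a minimal Python sketch, assuming the two identities are the usual ones (the bracket of an element with itself vanishes, and the Jacobi identity holds). The sample matrices are arbitrary integer anti-symmetric matrices chosen for illustration, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def add(a, b):
    """Entrywise sum of two matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def neg(a):
    """Entrywise negation of a matrix."""
    return [[-x for x in row] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def bracket(a, b):
    """Commutator [a, b] = ab - ba."""
    return add(matmul(a, b), neg(matmul(b, a)))

def is_zero(a):
    return all(x == 0 for row in a for x in row)

# Arbitrary integer anti-symmetric 3x3 matrices (exact arithmetic, no rounding).
x = [[0, 1, 0], [-1, 0, 2], [0, -2, 0]]
y = [[0, 0, 3], [0, 0, 1], [-3, -1, 0]]
z = [[0, 4, -1], [-4, 0, 0], [1, 0, 0]]

# [x, x] = 0
alternating = is_zero(bracket(x, x))

# Jacobi identity: [x, [y, z]] + [y, [z, x]] + [z, [x, y]] = 0
jacobi_sum = add(add(bracket(x, bracket(y, z)), bracket(y, bracket(z, x))),
                 bracket(z, bracket(x, y)))
jacobi = is_zero(jacobi_sum)

# The commutator of two anti-symmetric matrices is again anti-symmetric.
xy = bracket(x, y)
closed = transpose(xy) == neg(xy)

print(alternating, jacobi, closed)
```

Integer entries are used so that every identity is checked exactly rather than up to floating point error.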
Let denote the Lie algebra of -anti-symmetric matrices over where the Lie algebra operation is just the commutator For the rest of this post, we shall set . Then is a Lie algebra of dimension
Set and let be an orthonormal basis for . Use the standard orthonormal basis if you want, but it does not matter which basis you choose.
An observation about the spectrum: Let be the linear operators defined by setting for each . Let be an -SRDR of . It turns out that the spectrum eventually stabilizes in the sense that if we keep constant and set greater than around or so, then does not depend on whenever Therefore, let denote the multiset for sufficiently large Then is the multi-set
multiplied by a constant scaling factor. Here, the notion means that the eigenvalue has multiplicity .
The general pattern:
So if we want to get interesting experimental results about LSRDRs, then we just need to do the following. We first select a finite dimensional inner product space with an interesting bilinear operation , but make sure that is not associative. We then select an orthonormal basis of and define linear operators by . Then take an LSRDR of and then the operators will have interesting spectra.
Testing if a number is quadratic:
After evaluating the spectra, I needed to first normalize the spectrum and then try to figure out the exact values of the eigenvalues from their floating point approximations. This is easy to do for quadratic algebraic numbers. You just take the continued fraction representation of the number that you want to test. If the continued fraction representation terminates, then you have a rational number. The continued fraction of a positive irrational number is eventually periodic if and only if the number is a solution to a quadratic equation with integer coefficients, and it is easy to find those coefficients from the continued fraction representation.
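The continued fraction test can be sketched in Python as follows. This is only an illustration under stated assumptions: it works from a double-precision float, so only the first dozen or so partial quotients are trustworthy, and the cutoffs `max_terms`, `max_preperiod`, and `max_period` are arbitrary choices of mine. It detects an apparent period but does not recover the quadratic equation.

```python
from fractions import Fraction
import math

def continued_fraction(x, max_terms=12):
    """Partial quotients of the float x, computed exactly from its binary value."""
    frac = Fraction(x)
    terms = []
    for _ in range(max_terms):
        a = frac.numerator // frac.denominator  # floor of the current value
        terms.append(a)
        frac -= a
        if frac == 0:
            break  # the expansion terminated, so the value is rational
        frac = 1 / frac
    return terms

def find_period(terms, max_preperiod=4, max_period=4):
    """Return (preperiod, period) if the terms look eventually periodic, else None."""
    for pre in range(max_preperiod + 1):
        tail = terms[pre:]
        for per in range(1, max_period + 1):
            # Require at least two full repetitions of the candidate period.
            if len(tail) >= 2 * per and all(
                tail[i] == tail[i % per] for i in range(len(tail))
            ):
                return pre, per
    return None

golden_terms = continued_fraction((1 + math.sqrt(5)) / 2)
sqrt2_terms = continued_fraction(math.sqrt(2))
rational_terms = continued_fraction(0.75)

print(golden_terms, find_period(golden_terms))
print(sqrt2_terms, find_period(sqrt2_terms))
print(rational_terms)
```

A detected period is evidence for a quadratic algebraic number and not a proof, since any finite list of partial quotients is consistent with infinitely many real numbers.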
Are LSRDRs relevant to deep learning?
LSRDRs are linear models without all the layers that deep neural networks have. But I have been generalizing LSRDRs to deeper machine learning models that retain some but not all of the interesting mathematical properties of LSRDRs. I would therefore consider these investigations into LSRDRs as relevant to deep learning.

Joseph Van Name 6 Apr 2026 16:39 UTC
3 points
0
on: Joseph Van Name’s Shortform
I am going to apply my own dimensionality reduction algebra to a quantum channel (or matrices) obtained from the Okubo algebra in order to demonstrate the compatibility between my dimensionality reduction and the Okubo algebra.
TL;DR version: I trained my own machine learning algorithms on Okubo algebras and the squares of the fitness levels of the local maxima were usually either rational numbers or quadratic algebraic numbers. This suggests that my machine learning algorithm behaves mathematically.
Origin of algorithm: I originally created this dimensionality reduction algorithm to analyze the cryptographic security of block ciphers for the cryptocurrency that I have created. If you want to discuss cryptocurrency technologies, please contact me privately off this site since I really do not feel comfortable talking about that stuff here.
After obtaining the dimensionality reduction algorithm, I noticed that such algorithms behaved mathematically for reasons that I still can’t explain, and I have concluded that such mathematical behavior is needed to construct inherently interpretable and safe machine learning algorithms. Of course, if we want inherently interpretable and safe AI, we need machine learning algorithms that we can use to train models with many layers that can solve sophisticated tasks, but I am well on my way towards creating these algorithms too despite a complete and total lack of support.
Mathematics: The Okubo algebra^[1] is a close cousin to the octonions and satisfies many similar properties to the octonions.
The underlying set of the Okubo algebra is the set of all -complex Hermitian matrices with trace 0. Observe that the set of all -complex Hermitian matrices forms a real vector space of dimension . Therefore, the Okubo algebra’s underlying set has dimension . Let be the complex numbers with . Then up to complex conjugation. The Okubo algebra is endowed with a bilinear operation defined by (I scaled the operation by a factor of so that the norm on the Okubo algebra is just the Frobenius norm). The operation satisfies the property where refers to the Frobenius norm and .
Let be an isomorphism between inner product spaces. Then define an operation on by setting . Then define orthogonal matrices by where is the standard basis for real Euclidean space.
If are -complex matrices and are -complex matrices, then define the -spectral radius similarity between and by
.
Computational results: The following facts are suggested by computer experiments but have not been rigorously proven. To run the computer experiments, I used gradient ascent to locally maximize the -spectral radius similarity. By maximizing the -spectral radius similarity, we reduce the dimensions of a tuple of matrices, and I call this dimensionality reduction the -spectral radius dimensionality reduction (LSRDR).
The maximum value of among the real -matrices is . Let be the maximum value of among the -complex, real symmetric, complex symmetric, complex anti-symmetric, complex Hermitian matrices. Then
.
Similar facts seem to hold for the other values (but I have not completely performed the calculations due to numerical instabilities that I do not want to fix). For example, and for .
The fitness levels that I have are simple but they are not too simple. This indicates that LSRDRs of Okubo algebras are interesting mathematically.
1. ^
  Okubo algebras: automorphisms, derivations and idempotents, Alberto Elduque, 2013,
  https://api.semanticscholar.org/CorpusID:119713330

Joseph Van Name 11 Feb 2026 8:49 UTC
1 point
0
in reply to: kbear’s comment on: Joseph Van Name’s Shortform
Yes. When we take convex combinations of finitely many point mass measures, the integral is just a sum. I use the sum of finitely many elements for ease of calculations, but to prove theorems, I should use measures for full generality.
The idea of finding an object $A$ along with distinct local optima $G_{A} (x_{1}), \dots, G_{A} (x_{n})$ with $n$ maximized looks like an interesting problem to work on. I have not worked on this kind of objective before, but I can certainly try this, as I have a few ideas of how to do this. This might work better for discrete optimization problems though since I cannot think of a good way to use gradient updates to produce new local optima. In this case, I will need to use either evolutionary computation or hill climbing instead. I do not think that this will result in natural looking objects $A$ though, so I don’t think I can learn much from this endeavor.
I have not thought much about finding measures $μ$ where $F_{μ, n, ∥ * ∥}$ has many local maxima because I have many higher priorities. These days, people are focused on the more complicated machine learning systems such as large language models, and in order to catch up, I also need to increase the performance, capabilities, and efficiency of my pseudodeterministic machine learning models. For the more complicated multi-layered models, it seems more difficult to obtain and retain pseudodeterminism. Pseudodeterminism is a robust property for simple objective functions such as when we are training a linear model or performing convex optimization, but pseudodeterminism becomes increasingly fragile as we increase the sophistication of our objective functions. This means that it is trivial to violate pseudodeterminism for the sophisticated models that I want to work more on, but it is difficult to retain pseudodeterminism.
I am not at all worried about any strange case of non-pseudodeterminism when optimizing $F_{μ, n, ∥ * ∥}$ for measures I have not thought about yet since this problem is not even close to being non-pseudodeterministic. For example, if $E_{0}, F_{0}$ are norm 1 completely positive superoperators of the same Choi rank $\leq N$ and if $(x_{n}, y_{n}) \in (U ∖ \{0\})^{2}$ for all $n$ and $(E_{n})_{n}, (F_{n})_{n}$ are sequences where
$E_{n + 1}$ is obtained by moving from $E_{n}$ in the direction of the gradient (with possible momentum) of $\log (| ⟨ E_{n} x_{n}, y_{n} ⟩ |^{2}) - \log (∥ E_{n} ∥)$ and $F_{n + 1}$ is obtained from $F_{n}$ the same way with the same rate, then my experiments show that $\lim_{n \to \infty} ∥ E_{n} - F_{n} ∥ = 0$ regardless of what each $(x_{n}, y_{n})$ is. In other words, even if $(E_{n})_{n}$ does not converge, the sequences $(E_{n})_{n}, (F_{n})_{n}$ uniformly approximate each other as $n \to \infty$. This is a much stronger form of pseudodeterminism that is hard to violate, so it is not a high priority to find particular instances of non-pseudodeterminism especially if those instances do not coincide with real-world data.
I kind of expect the fitness function $F_{μ, n, ∥ * ∥}$ to have just one or a few local maxima because its closest relatives are the linear models and those linear models are obtained by optimizing an objective function with one local optimum. And I also expect $F_{μ, n, ∥ * ∥}$ to have one or very few local maxima because $F_{μ, n, ∥ * ∥}$ is similar to many other objective functions that I have constructed each with one or a few local optima. And since $F_{μ, n, ∥ * ∥}$ is simpler than other objective functions I have looked at with few local optima, $F_{μ, n, ∥ * ∥}$ should also have very few local optima. And the function $F_{μ}$ is concave, so there is only one local maximum value $\max \{F_{μ} (E) : E \in Q\}$ whenever $Q$ is a convex set (such possible convex sets of interest include all quantum channels and all unital channels). The restriction of our attention to completely positive operators of low Choi rank and in the boundary of the unit ball means that when we maximize $F_{μ, n, ∥ * ∥}$, we cannot use convexity to prove that there is only one local maximum, but convexity still suggests that there should be just one especially when $n$ is large. When $n$ is small, we cannot use convexity to make conclusions though since I did a Hessian calculation, and the Hessian of $F (μ, Φ (A_{1}, \dots, A_{r}))$ with respect to $(A_{1}, \dots, A_{r})$ generally has plenty of both positive and negative eigenvalues. I do not consider it a major problem if $F_{μ, n, ∥ * ∥}$ has multiple local maxima, since that probably just means that we need to increase the value of $n$ until these local maxima merge.

Joseph Van Name 9 Feb 2026 6:04 UTC
1 point
0
in reply to: kbear’s comment on: Joseph Van Name’s Shortform
For experiments, I just used a convex combination of point mass measures for $μ$ where the point masses are generated uniformly at random (though I might get something more complicated if I tried evaluating the integrals). I then attempted to find multiple local maxima by the usual gradient ascent. If I always end up with the same local maximum, I presume that there is only one local maximum even though I have no mathematical proof that this is the case.
I am redoing the experiments and the only way I can get pseudodeterminism to fail is by using real inner product spaces instead of complex inner product spaces and by setting n=1 (and in this case, pseudodeterminism fails because the set of all points where the fitness function returns a real number instead of negative infinity has multiple components). When pseudodeterminism fails, it does not even fail that badly. The distribution of all models that we get has low collision entropy -log(P(X=Y)), so P(X=Y) when X,Y are trained models with different initializations is still high.
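To make the collision entropy concrete, here is a minimal Python sketch that estimates P(X=Y) and -log(P(X=Y)) from a list of labels identifying which trained model each run converged to. The labels and the idea of summarizing each trained model by a label are assumptions for illustration; the experiments above may identify models differently.

```python
import math
from collections import Counter

def collision_probability(outcomes):
    """P(X = Y) for independent X, Y drawn from the empirical distribution."""
    counts = Counter(outcomes)
    total = len(outcomes)
    return sum((c / total) ** 2 for c in counts.values())

def collision_entropy(outcomes):
    """Collision entropy -log(P(X = Y)) in nats."""
    return -math.log(collision_probability(outcomes))

# Hypothetical run log: 9 of 10 initializations reach model "A", one reaches "B".
runs = ["A"] * 9 + ["B"]
p_collision = collision_probability(runs)
h_collision = collision_entropy(runs)
print(p_collision, h_collision)
```

A fully pseudodeterministic run log (every label equal) gives collision probability 1 and collision entropy 0.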
Pseudodeterminism does not seem to be rare, but the problem in machine learning is to pseudodeterministically train machine learning models that can solve interesting and challenging problems; I have been working on this in my spare time (without anyone’s help), but since people don’t seem to be interested in this, progress has been slow.

Joseph Van Name 9 Feb 2026 5:20 UTC
9 points
0
in reply to: Mitchell_Porter’s comment on: Joseph Van Name’s Shortform
Yes. It seems like to get pseudodeterministic AI, we will need to rebuild AI from the very beginning, and I am not sure that it will all work. For example, pseudodeterminism is harder to attain with stochastic or mini-batch gradient descent, so one might need to use all the training data whenever one updates the weights. I have so far been able to get pseudodeterministic multi-layered models for solving classification problems, word embeddings for NLP, models that are measurements of security of block ciphers such as the advanced encryption standard (the models evaluating the AES are very easy to train), and other things. I have not been able to make pseudodeterministic versions of convolutional networks, transformers, GANs, etc. We can use pseudodeterminism for narrow AI or the first few layers of a deep neural network right now though. There is also a funding and exposure issue since not very many people are talking about pseudodeterminism. I have more posts planned about this though.
A trade of performance in exchange for interpretability is exactly what we want for AI safety.

Joseph Van Name 7 Feb 2026 9:46 UTC
10 points
0
on: Joseph Van Name’s Shortform
For machine learning, it is desirable for the trained model to have absolutely no random information left over from the initialization; in this short post, I will mathematically prove an interesting (to me) but simple consequence of this desirable behavior.
This post is a result of some research that I am doing for machine learning algorithms related to my investigation of cryptographic functions for the cryptocurrency that I launched (to discuss crypto, leave me a personal message so we can discuss this off this site).
This post shall be about linear machine learning models. Actually, we are using quantum operators, so they are more sophisticated than your logistic regression models, but they are still linear so it is really easy to train a neural network that can solve more sophisticated problems than these linear models can. But the kinds of results that you find in this post can also extend to some non-linear models with multiple layers and stronger capabilities. It is just easier to understand what is going on with the linear models, and even with the linear models, we still obtain some interesting mathematics.
We say that a machine learning model trained by gradient ascent/descent is pseudodeterministically trained (or just pseudodeterministic for short) if the fitness/loss function has precisely one local optimum. As a result, the trained model will have absolutely no information left over from the initialization. As another consequence, the trained model will attain the global optimum rather than a suboptimal local optimum. The results in this post will actually hold whenever the global optimum is unique. But I need to bring up pseudodeterminism since pseudodeterminism implies that we can actually find the unique global optimum instead of always getting stuck at a suboptimal local optimum.
If a machine learning model globally optimizes an objective function, the machine learning model should be considered as an inherently interpretable model rather than a high performance model since the machine learning model has no random information in it independent of the objective function itself and since one can only find the global optima for sufficiently easy objective functions. The global optimum is also more interpretable because it inherits the symmetry of the objective function which depends on the training data. In this post, we shall show that if the training data has some symmetry, then the quantum operator that we train will also have that symmetry.
This post is mathematical and contains mathematical proofs. Fortunately, the mathematical proofs are not that difficult, so it is easy for the readers. After all, the main thrust of this post is that these mathematical proofs are backed up by experimental results. The main bottleneck towards understanding this post is therefore the task of getting through all the technical definitions. I might follow up this short post with a more general post, so you should read this before going through the more general post.
Let $U$ be a finite dimensional complex inner product space. If $B \subseteq U^{2}$, then define sets $B^{*}, B^{⊤}, \overline{B}$ by setting
$B^{*} = \{(y, x) : (x, y) \in B\}, \overline{B} = \{(\overline{x}, \overline{y}) : (x, y) \in B\}$
$B^{⊤} = \{(\overline{y}, \overline{x}) : (x, y) \in B\}$.
Let $μ$ be a probability measure on $U^{2}$. Here, $μ$ is the probability distribution for the training data. Define new measures $μ^{*}, \overline{μ}, μ^{⊤}$ by setting
$μ^{*} (B) = μ (B^{*}), \overline{μ} (B) = μ (\overline{B}), μ^{⊤} (B) = μ (B^{⊤})$.
Let $L (U)$ denote the collection of linear operators from $U$ to $U$. If $A_{1}, \dots, A_{r} \in L (U)$, then define an operator $Φ (A_{1}, \dots, A_{r}) : L (U) \to L (U)$ by setting $Φ (A_{1}, \dots, A_{r}) (X) = A_{1} X A_{1}^{*} + \dots + A_{r} X A_{r}^{*}$. The operators of the form $Φ (A_{1}, \dots, A_{r})$ are the completely positive superoperators of Choi rank at most $r$. Recall that $L (U)$ is an inner product space with the Frobenius inner product. It is easy to show that the Hermitian adjoint $Φ (A_{1}, \dots, A_{r})^{*}$ is just $Φ (A_{1}^{*}, \dots, A_{r}^{*})$. If $E$ is a completely positive superoperator, then define $\overline{E}$ by setting
$\overline{E} (X) = (E (X^{* ⊤}))^{* ⊤}$. Define $E^{⊤} = \overline{E}^{*} = \overline{E^{*}}$. Then it is easy to show that $\overline{Φ (A_{1}, \dots, A_{r})} = Φ (\overline{A_{1}}, \dots, \overline{A_{r}})$ and
$Φ (A_{1}, \dots, A_{r})^{⊤} = Φ (A_{1}^{⊤}, \dots, A_{r}^{⊤})$ .
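The adjoint identity $Φ (A_{1}, \dots, A_{r})^{*} = Φ (A_{1}^{*}, \dots, A_{r}^{*})$ can be checked numerically. Below is a minimal pure-Python sketch, assuming the Frobenius inner product is taken to be linear in its first argument, $⟨ X, Y ⟩ = Tr (Y^{*} X)$. The dimensions, the random seed, and the helper names are arbitrary choices made for illustration.

```python
import random

def matmul(a, b):
    """Product of two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def adjoint(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(len(a))]
            for i in range(len(a[0]))]

def phi(kraus, x):
    """Phi(A_1, ..., A_r)(X) = A_1 X A_1^* + ... + A_r X A_r^*."""
    n = len(x)
    total = [[0j] * n for _ in range(n)]
    for a in kraus:
        term = matmul(matmul(a, x), adjoint(a))
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total

def frobenius_inner(a, b):
    """<a, b> = Tr(b^* a), linear in the first argument."""
    return sum(a[i][j] * b[i][j].conjugate()
               for i in range(len(a)) for j in range(len(a[0])))

def random_matrix(n, rng):
    return [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]
            for _ in range(n)]

rng = random.Random(0)
n, r = 3, 2
kraus = [random_matrix(n, rng) for _ in range(r)]
x, y = random_matrix(n, rng), random_matrix(n, rng)

# <Phi(A)(X), Y> should equal <X, Phi(A^*)(Y)>.
lhs = frobenius_inner(phi(kraus, x), y)
rhs = frobenius_inner(x, phi([adjoint(a) for a in kraus], y))
print(abs(lhs - rhs))
```

The printed difference should be at the level of floating point rounding error.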
We say that a norm $∥ * ∥$ on $L (L (U))$ is Hermitian adjoint preserving (resp. conjugate preserving, transpose preserving) if $∥ E ∥ = ∥ E^{*} ∥$ (resp. $∥ E ∥ = ∥ \overline{E} ∥$ and $∥ E ∥ = ∥ E^{⊤} ∥$).
The domain of the fitness function $F_{μ, n, ∥ * ∥}$ is the set of all non-zero completely positive superoperators $E : L (U) \to L (U)$ of Choi rank at most $n$ with $∥ E ∥ = 1$. We define the fitness function $F_{μ, n, ∥ * ∥}$ by setting
$F_{μ, n, ∥ * ∥} (E) = \int \log (⟨ E (x x^{*}), y y^{*} ⟩) \, d μ (x, y) = \int \log (⟨ E (x x^{*}), y y^{*} ⟩) \, d μ (x, y) - \log (∥ E ∥)$, where the second equality holds because $∥ E ∥ = 1$ on the domain. Observe that we also have $F_{μ, n, ∥ * ∥} (E) = \int \log (y^{*} E (x x^{*}) y) \, d μ (x, y)$.
Experimental result (pseudodeterminism): Computer experiments show that the function $F_{μ, n, ∥ * ∥}$ typically has only one local maximum in the sense that we cannot find any other local maximum.
Define a function $F_{μ}$ whose domain is the set of all completely positive superoperators $E : L (U) \to L (U)$ by setting
$F_{μ} (E) = \int \log (⟨ E (x x^{*}), y y^{*} ⟩) \, d μ (x, y)$ which is equivalent to
$F_{μ} (E) = \int \log (y^{*} E (x x^{*}) y) \, d μ (x, y)$. We write $F (μ, E)$ for $F_{μ} (E)$ to reduce the use of subscripts.
Lemma: $F (μ, E) = F (μ^{*}, E^{*}) = F (\overline{μ}, \overline{E}) = F (μ^{⊤}, E^{⊤})$.
Proof: $F (μ, E) = \int \log (⟨ E (x x^{*}), y y^{*} ⟩) \, d μ (x, y)$
$= \int \log (⟨ x x^{*}, E^{*} (y y^{*}) ⟩) \, d μ (x, y)$
$= \int \log (⟨ E^{*} (y y^{*}), x x^{*} ⟩) \, d μ (x, y)$
$= \int \log (⟨ E^{*} (x x^{*}), y y^{*} ⟩) \, d μ^{*} (x, y) = F (μ^{*}, E^{*}) .$
Likewise,
$F (μ, E) = \int \log (y^{*} E (x x^{*}) y) \, d μ (x, y)$
$= \int \log (\overline{y}^{*} \, \overline{E (x x^{*})} \, \overline{y}) \, d μ (x, y)$
$= \int \log (\overline{y}^{*} \cdot \overline{E} (\overline{x x^{*}}) \cdot \overline{y}) \, d μ (x, y)$
$= \int \log (y^{*} \cdot \overline{E} (x x^{*}) \cdot y) \, d \overline{μ} (x, y) = F (\overline{μ}, \overline{E}) .$
As a consequence, $F (μ, E) = F (μ^{*}, E^{*}) = F (\overline{μ}^{*}, \overline{E}^{*}) = F (μ^{⊤}, E^{⊤})$. Q.E.D.
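The lemma can be checked numerically when $μ$ is a convex combination of finitely many point masses, in which case the integral is a weighted sum. The following pure-Python sketch assumes $E = Φ (A_{1}, \dots, A_{r})$ and uses the identity $y^{*} E (x x^{*}) y = \sum_{k} | ⟨ A_{k} x, y ⟩ |^{2}$, which follows directly from the definition of $Φ$. The point mass at $(x, y)$ becomes a point mass at $(y, x)$ under $μ \mapsto μ^{*}$ and at $(\overline{x}, \overline{y})$ under $μ \mapsto \overline{μ}$. Dimensions, weights, and the random seed are arbitrary.

```python
import math
import random

def inner(u, v):
    """<u, v> = v^* u, linear in the first argument."""
    return sum(a * b.conjugate() for a, b in zip(u, v))

def apply(a, x):
    """Matrix-vector product a x."""
    return [sum(row[k] * x[k] for k in range(len(x))) for row in a]

def adjoint(a):
    """Conjugate transpose of a square matrix."""
    n = len(a)
    return [[a[j][i].conjugate() for j in range(n)] for i in range(n)]

def conj_matrix(a):
    return [[z.conjugate() for z in row] for row in a]

def conj_vector(x):
    return [z.conjugate() for z in x]

def fitness(points, kraus):
    """F(mu, Phi(A_1..A_r)) for mu = sum_i w_i * delta_{(x_i, y_i)}."""
    total = 0.0
    for w, x, y in points:
        total += w * math.log(sum(abs(inner(apply(a, x), y)) ** 2 for a in kraus))
    return total

rng = random.Random(1)

def rand_c():
    return complex(rng.gauss(0, 1), rng.gauss(0, 1))

dim, rank, num_points = 3, 2, 5
kraus = [[[rand_c() for _ in range(dim)] for _ in range(dim)] for _ in range(rank)]
points = [(1.0 / num_points,
           [rand_c() for _ in range(dim)],
           [rand_c() for _ in range(dim)]) for _ in range(num_points)]

f_original = fitness(points, kraus)
# (mu^*, E^*): swap each pair and take adjoints of the Kraus operators.
f_star = fitness([(w, y, x) for w, x, y in points], [adjoint(a) for a in kraus])
# (conjugate mu, conjugate E): conjugate every vector and every Kraus operator.
f_bar = fitness([(w, conj_vector(x), conj_vector(y)) for w, x, y in points],
                [conj_matrix(a) for a in kraus])
print(f_original, f_star, f_bar)
```

The three printed values should agree up to floating point rounding, which is what the lemma predicts.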
Theorem: Suppose that $F_{μ, n, ∥ * ∥}$ has a unique global maximum $(E, F_{μ, n, ∥ * ∥} (E))$ .
1. If $∥ * ∥$ is Hermitian adjoint preserving and $μ = μ^{*}$ , then $E = E^{*}$ .
2. If $∥ * ∥$ is conjugate preserving and $μ = \overline{μ}$, then $E = \overline{E}$.
3. If $∥ * ∥$ is transpose preserving and $μ = μ^{⊤}$ , then $E = E^{⊤}$ .
Proof: The proofs of 2 and 3 are similar, so we shall only prove 1. For 1, assuming the premises, both $E$ and $E^{*}$ belong to the domain of $F_{μ, n, ∥ * ∥}$ . But
$F_{μ, n, ∥ * ∥} (E) = F (μ, E)$
$= F (μ^{*}, E^{*}) = F (μ, E^{*}) = F_{μ, n, ∥ * ∥} (E^{*})$. Since $F_{μ, n, ∥ * ∥}$ has only one global maximum, we conclude that $E = E^{*}$. Q.E.D.
From the above result, we conclude that the global maximum $(E, F_{μ, n, ∥ * ∥} (E))$ inherits any symmetry that the measure $μ$ has.
I would really like to build these inherently interpretable models so that they can solve some really interesting problems (or at least be a few layers in solving them), but I am still stuck attempting to communicate with people about linear models. Having a unique global optimum or more generally pseudodeterminism seems to be the best way to develop inherently interpretable and safe AI, but I have a hard time communicating with anyone about this.

Joseph Van Name 24 Nov 2025 5:15 UTC
−7 points
0
in reply to: Taylor G. Lunt’s comment on: Literacy is Decreasing Among the Intellectual Class
You are siding with evil because you yourself are evil. My anger against people like you is righteous. If you are not convinced, it is simply because you have been completely consumed by your own evil. Universities promote violence. I know. I was a professor. But you just want me to be violently attacked and injured or killed.
~~UNIVERSITIES MUST BE REJECTED FOR PROMOTING VIOLENCE!~~
P.S. Only commenting and responding to my non-technical posts where I call out universities for their problem just proves my point. Face it. You are stupid, and universities only pretended to educate you.
I am striking this out because it is better if I instead made only mathematics posts on this site relevant to AI safety.

Joseph Van Name 24 Nov 2025 1:20 UTC
−40 points
−2
in reply to: Random Developer’s comment on: Literacy is Decreasing Among the Intellectual Class
I was a professor, so I know that universities promote violence, so don’t even try that bullshit on me. People hate universities because universities are absolutely horrendous and extremely unprofessional. Until universities apologize for their extremely low standards and horrible behavior, we should MOCK all people with degrees from universities and regard them as evil worthless people. People hate me for bringing this up because most people with college degrees are scumbags who are afraid to admit that they are far stupider (and more evil too) than people who did not waste their money and time in college.
~~I DO NOT NEED TO VISIT A UNIVERSITY TO KNOW HOW IT IS LIKE SINCE I WAS A PROFESSOR, YOU BLOODTHIRSTY SCUMBAG!~~
P.S. I knew people would downvote me. The reason people hate me for talking about this is that most people with college degrees are bloodthirsty piles garbage who need to be punished for their evil. You all probably think that barbaric practices like ECT are healthy because you just fucking trust medical professionals trained by universities along with all other university graduates. You all have the morals of Jeffrey Dahmer.
I am striking this out because it is better if I instead made only mathematics posts on this site relevant to AI safety.

Joseph Van Name 23 Nov 2025 10:28 UTC
−19 points
0
on: Literacy is Decreasing Among the Intellectual Class
We should no longer consider people with degrees or affiliated with universities as being ‘intellectual’ in any way whatsoever because universities promote violence and refuse to improve their horrendous behavior.
I am striking this out because it is better if I instead made only mathematics posts on this site relevant to AI safety.

Joseph Van Name 22 Nov 2025 20:12 UTC
1 point
0
on: Why Not Just Train For Interpretability?
The use of something like L1 regularization to achieve sparsity for inherent interpretability may just make things worse; a fixation on L1 regularization may lead people in the wrong direction. To avoid fixation, we should take a step back and look at the big picture. Occam’s razor suggests that we should look for simple (and creative) solutions instead of over-engineering solutions when the entire foundation is inadequate.
In order to obtain inherent interpretability, the machine learning model needs to behave in a way that is interesting to mathematicians. By piling on tweaks such as a lot of L1 or L0 regularization for sparsity, one is making the machine learning model more complicated. That makes it more difficult to study mathematically. And neural networks are inherently difficult to study mathematically and to interpret, so they should be replaced with something else. The problem is that neural networks already have so much momentum that people are unwilling to try anything else, and people are so indoctrinated into neural networkology that they cannot learn new things.
So how does one get momentum with a non-neural machine learning algorithm? One starts with shallow but mathematical machine learning algorithms first and one can also work with algorithms with few layers too. These shallow/few layer mathematical algorithms can still be effective for some problems since they have plenty of width. One may also construct a hybrid model where the first few layers are the mathematical construction but where the rest of the network is a deep neural network. I do not see how to make a very deep network this way, so the next steps are obscure to me.

Joseph Van Name 12 Oct 2025 20:03 UTC
1 point
0
in reply to: joseph_c’s comment on: Using complex polynomials to approximate arbitrary continuous functions
I had not heard about the IBM paper until now. This is inspired by my personal experiments training (obviously classical) machine learning models.
Suppose that $V_{0}, \dots, V_{n}, W_{0}, \dots, W_{n}$ are real or complex finite dimensional inner product spaces. Suppose that the training data consists of tuples of the form $(v_{0}, \dots, v_{n})$ where $v_{0} \in V_{0}, \dots, v_{n} \in V_{n}$ are vectors. Let $W_{0} = V_{0}$ and let $B_{j} : V_{j + 1} \times W_{j} \to W_{j + 1}$ be bilinear for all $j$ . Then let
$L_{v} (w) = B_{j} (v, w)$ whenever $v \in V_{j + 1}, w \in W_{j}$ . Then we define our polynomial by setting $p (v_{0}, \dots, v_{n}) = L_{v_{n}} \dots L_{v_{1}} (v_{0})$ . In other words, my machine learning models are just compositions of bilinear mappings. In addition to wanting ${Tr}_{U} ([p (v_{0}, \dots, v_{n})])$ to approximate the label, we also include regularization that makes the machine learning model pseudodeterministically trained so that if we train it twice with different initializations, we end up with the same trained model. Here, the machine learning model has $n$ layers, but the addition of extra layers gives us diminishing returns since bilinearity is close to linearity, so I still want to figure out how to improve the performance of such a machine learning model to match deep neural networks (if that is even feasible).
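The composition of bilinear mappings described above can be sketched in Python. This is an illustration only: each $B_{j}$ is stored as a 3-index array of coefficients, the dimensions and integer entries are arbitrary, and the trace, the labels, and the regularization that produces pseudodeterminism are all omitted.

```python
import random

def bilinear(tensor, v, w):
    """B(v, w)_k = sum over a, b of tensor[k][a][b] * v[a] * w[b]."""
    return [sum(t_k[a][b] * v[a] * w[b]
                for a in range(len(v)) for b in range(len(w)))
            for t_k in tensor]

def polynomial(tensors, vs):
    """p(v_0, ..., v_n) = L_{v_n} ... L_{v_1}(v_0), with L_{v_{j+1}}(w) = B_j(v_{j+1}, w)."""
    w = vs[0]
    for tensor, v in zip(tensors, vs[1:]):
        w = bilinear(tensor, v, w)
    return w

rng = random.Random(0)
v_dims = [2, 3, 2]  # hypothetical dimensions of V_0, V_1, V_2
w_dims = [2, 4, 3]  # hypothetical dimensions of W_0 = V_0, W_1, W_2

# tensors[j] encodes B_j : V_{j+1} x W_j -> W_{j+1} with small integer entries.
tensors = [[[[rng.randint(-2, 2) for _ in range(w_dims[j])]
             for _ in range(v_dims[j + 1])]
            for _ in range(w_dims[j + 1])]
           for j in range(2)]
vs = [[rng.randint(-3, 3) for _ in range(d)] for d in v_dims]

out = polynomial(tensors, vs)
# Scaling a single input v_1 by 2 scales the output by 2 (linearity in each argument).
scaled = polynomial(tensors, [vs[0], [2 * c for c in vs[1]], vs[2]])
print(out, scaled)
```

Because the map is linear in each $v_{j}$ separately, the output is a polynomial that is homogeneous of degree one in every input, which is one way to see why extra layers add less expressive power than they do in a network with non-linear activations.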
I use quantum information theory for my experiments mainly because quantum information theory behaves well unlike neural networks.

Joseph Van Name 12 Oct 2025 1:50 UTC
1 point
0
in reply to: ChristianKl’s comment on: Joseph Van Name’s Shortform
Proof-of-stake is still wasteful since it promotes pump and dump scams and causes people to waste their money on scam projects. If the creators are able to get their reward at the very beginning of a project, they will be more interested in short-term gains rather than a long-term token that will last. Humans are not psychologically/socially equipped to invest in proof-of-stake cryptocurrencies since they tend to get scammed.

Joseph Van Name 11 Oct 2025 5:35 UTC
7 points
1
on: Joseph Van Name’s Shortform
Bitcoin mining is a real-world example of a goal that people spend an enormous amount of resources to attain, but this goal is useless or at least horribly inefficient.
Recall that the orthogonality thesis states that it is possible for an intelligent entity to have bad or dumb goals and that it is also possible for a not-so-intelligent entity to have good goals. I would therefore consider Bitcoin mining to be a prominent real-world example of the orthogonality thesis, as it is in a sense a dumb goal attained intelligently (though this example is imperfect).
Bitcoin’s mining algorithm consists of computing many SHA-256 hashes relentlessly. The Bitcoin miners are rewarded whenever they compute a suitable SHA-256 hash that is lower than the target. These SHA-256 hashes establish decentralized consensus about the state of the blockchain, and they distribute newly minted bitcoins. But besides this, computing so many SHA-256 hashes is nearly useless. Computing so many SHA-256 hashes consumes large quantities of energy and creates electronic waste.
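To make the "hash lower than the target" condition concrete, here is a toy sketch of the search loop. It is not Bitcoin's actual block format: the header bytes, the 4-byte nonce layout, and the function names are my own simplifications. It does use the double SHA-256 and the little-endian integer comparison that Bitcoin uses.

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    # Bitcoin hashes block headers with SHA-256 applied twice.
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(header: bytes, target: int, max_nonce: int = 2**32):
    # Try nonces until the hash, read as a little-endian integer, is below the target.
    for nonce in range(max_nonce):
        digest = double_sha256(header + nonce.to_bytes(4, "little"))
        if int.from_bytes(digest, "little") < target:
            return nonce, digest
    return None  # no suitable nonce in the searched range
```

A lower target means more expected hashes per block, which is how the difficulty is tuned. With a target of `1 << 244`, roughly one hash in 4096 succeeds.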
So what are some of the possible alternatives to Bitcoin mining? It seems like the best alternative that does not significantly change the nature of Bitcoin mining would be to replace SHA-256 mining with some other mining algorithm that serves some scientific purpose.
This is more difficult than it seems because Bitcoin mining must satisfy a list of cryptographic properties. If the mining algorithm did not satisfy these cryptographic properties, then it might not be feasible for newly minted bitcoins to be dispersed every 10 minutes, and we may enter a scenario where a single entity with a secret algorithm or slightly faster hardware puts all the blocks on the blockchain.
Since Bitcoin mining must satisfy a list of cryptographic properties, it is difficult to come up with a more scientifically useful mining algorithm that satisfies these cryptographic properties. But in science, if there is a difficult problem, people should perform research on this scientific problem. While finding a useful cryptocurrency mining algorithm has its challenges, cryptocurrency mining algorithms are easy to produce since they can be made from cryptographic hash functions without requiring public key encryption or other advanced cryptographic algorithms, so difficulty seems more like an excuse rather than a legitimate reason not to investigate useful cryptocurrency mining algorithms. The cryptocurrency sector does not want to perform this research. I can think of several reasons why people refuse to support this sort of endeavor despite the great effort that people put into Bitcoin mining, but none of these reasons justify the lack of interest in useful cryptocurrency mining.
The diminishing quality of cryptocurrency users:
It seems like when altcoins were first being developed around 2014, people were much more interested in developing scientifically useful mining algorithms. But around 2017 when cryptocurrency really started to become popular, people simply wanted to make money from cryptocurrencies, yet they were not very interested in understanding how cryptocurrencies work or how to improve them.
Mining algorithms with questionable scientific use:
Some cryptocurrencies and proposals such as Primecoin and Gapcoin have more scientific mining algorithms, but these mining algorithms still have questionable usefulness. For example, the objective in Primecoin mining is to find suitable Cunningham chains. A Cunningham chain of the first kind is a sequence of prime numbers $(p_{1}, \dots, p_{n})$ where $p_{j + 1} = 2 p_{j} + 1$ whenever $1 \leq j < n$ . The most interesting thing about Cunningham chains is that they can be used in cryptocurrency mining algorithms, but they are otherwise of minor importance to mathematics.
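For concreteness, the definition of a Cunningham chain of the first kind can be checked directly. This is a small sketch using trial division, which is fine for small numbers but is nothing like what Primecoin miners actually run; the function names are hypothetical.

```python
def is_prime(n: int) -> bool:
    # Trial division; adequate only for small n.
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def is_cunningham_chain_first_kind(chain) -> bool:
    # Every entry must be prime and each entry must equal 2 * (previous entry) + 1.
    if not all(is_prime(p) for p in chain):
        return False
    return all(b == 2 * a + 1 for a, b in zip(chain, chain[1:]))

def chain_from(p: int):
    # Extend p -> 2p + 1 -> ... for as long as the entries stay prime.
    chain = []
    while is_prime(p):
        chain.append(p)
        p = 2 * p + 1
    return chain
```

For example, `chain_from(2)` gives `[2, 5, 11, 23, 47]`, and the chain stops there because $2 \cdot 47 + 1 = 95 = 5 \cdot 19$ is not prime.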
These questionable mining algorithms are supposed to steer the cryptocurrency community into a more scientific direction, but in reality, they have just steered the cryptocurrency community towards using mining to perform mathematical calculations that not even mathematicians care that much about.
Alternative solutions to the energy waste problem:
Many people just want to do away with cryptocurrency mining in an altcoin by replacing it with proof-of-stake or some other consensus mechanism. This solution is attractive to the cryptocurrency creators since they want complete control over all the coins at the beginning of the project, and they just use the energy usage of cryptocurrency as a marketing strategy to get people interested in their project. But this solution should not be appealing to anyone who wants to use the cryptocurrency even if a cryptocurrency is better funded without much mining (of course, if mining is replaced with another consensus mechanism after all the coins have been created, then this objection does not stand). After all, Satoshi Nakamoto did not fund Bitcoin by selling bitcoins. There are other ways to fund a cryptocurrency project without alternate consensus mechanisms.
Hostility against cryptocurrency technologies:
It seems like many members of society are hostile against cryptocurrency technologies and hate people who own or are in any way interested in cryptocurrency. This sort of hostility is a very good reason to conduct as many transactions using just cryptocurrency since I do not want to deal with all of those Karens. But this hostility may have turned people away from researching useful cryptocurrency mining algorithms even though the usefulness would probably not benefit the cryptocurrency directly.
Hardcore Bitcoiners:
If Bitcoin mining were magically replaced with a useful mining algorithm, barely anything about Bitcoin would change. But in my experience, Bitcoiners do not see it this way. They are so stuck in their ways that they reject all altcoins.
Conclusion:
While cryptocurrencies have a lot of monetary value, they are not exactly powerhouses of innovation, nor do I find them extremely interesting on their own. But a good scientific mining algorithm would make them much more innovative and interesting.

Joseph Van Name 22 Sep 2025 7:29 UTC
2 points
0
on: Joseph Van Name’s Shortform
In this post, we shall go over a way to produce mostly linear machine learning classification models that output probabilities for each possible label. These mostly linear models are pseudodeterministically trained (or pseudodeterministic for short) in the sense that if we train them multiple times with different initializations, we will typically get the same trained model (up to symmetry and minuscule floating point differences).
The algorithms that I am mentioning in this post generalize to more complicated multi-layered algorithms in the sense that the multi-layered algorithms remain pseudodeterministic, but for simplicity, we shall stick to just linear operators here.
Let $K$ denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $U$ be a finite dimensional inner product space over $K$ . The training data is a set $D$ of pairs $(u, v)$ where $u \in U$ and $v \in \{1, \dots, n\}$ where $u$ is the machine learning model input and $v$ is the label. The machine learning model is trained to predict the label $v$ when given the input $u$ . The trained model is a function $f$ that maps $U$ to the set of all probability vectors of length $n$ , so the trained model actually gives the probabilities for each possible label.
Suppose that $V_{i}$ is a finite dimensional inner product space over $K$ for each $i \in \{1, \dots, n\}$ . Then the domain of the fitness function consists of tuples $(A_{1}, \dots, A_{n})$ where each $A_{i}$ is a linear operator from $U$ to $V_{i}$ . Let $p \in (0, 1), r \in (0, \infty), q \in (1, \infty)$ , and let $λ \geq 0$ . The parameter $p$ is the exponent while $λ$ is the regularization parameter. Define (almost total) functions $G, R, F : L (U, V_{1}) \times \dots \times L (U, V_{n}) \to \mathbb{R}$ by setting
$G (A_{1}, \dots, A_{n}) = \sum_{(u, v) \in D} (\frac{∥ A_{v} u ∥^{r}}{∥ A_{1} u ∥^{r} + \dots + ∥ A_{n} u ∥^{r}})^{p} / | D |$
$R (A_{1}, \dots, A_{n}) = (\sum_{(u, v) \in D} λ \cdot \log (∥ A_{v} u ∥) / | D |) - λ \cdot (\log (∥ A_{1} ∥_{q}) + \dots + \log (∥ A_{n} ∥_{q})) / n$ .
Here, $∥ \cdot ∥_{q}$ denotes the Schatten $q$ -norm which can be defined by setting
$∥ A ∥_{q} = Tr ((A A^{*})^{q / 2})^{1 / q}$ .
Set $F = G + R$ . Here, $F$ denotes our fitness function. The function $G$ is what we really want to maximize, but unfortunately, $G$ is typically non-pseudodeterministic, so we need to add the regularization term $R$ to obtain pseudodeterminism. The regularization term $R$ also has the added effect of making $∥ A_{v} u ∥$ relatively large compared to the norm $∥ A_{v} ∥_{q}$ for training data points $(u, v)$ . This may be useful in determining whether a pair should belong to either the training or test data in the first place.
We observe that $F$ is $0$ -homogeneous in the sense that $F (A_{1}, \dots, A_{n}) = F (c A_{1}, \dots, c A_{n})$ for each non-zero scalar $c$ (in the quaternionic case, the scalars are just the real numbers).
Suppose now that we have obtained a tuple $(A_{1}, \dots, A_{n})$ that maximizes the fitness $F (A_{1}, \dots, A_{n})$ . Let $P V (n)$ denote the set of all probability vectors of length $n$ . Then define an almost total function $f : U \to P V (n)$ by setting
$f (u) = \frac{(∥ A_{1} u ∥^{r (1 - p)}, \dots, ∥ A_{n} u ∥^{r (1 - p)})}{∥ A_{1} u ∥^{r (1 - p)} + \dots + ∥ A_{n} u ∥^{r (1 - p)}} .$
If $(u, v)$ belongs to the training data set, then the $i$ -th entry of $f (u)$ is the machine learning model’s estimate of the probability that $i = v$ . I will let the reader justify this calculation of the probabilities.
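The fitness function and the resulting probabilities can be sketched as follows. This is only an illustration under my own assumptions: the field is the real numbers, labels are zero-based (`0, ..., n-1`), the Schatten $q$ -norm is the standard one (the $q$ -th root of the sum of the $q$ -th powers of the singular values), and the function names are hypothetical. The optimization itself is not shown.

```python
import numpy as np

def schatten_norm(A, q):
    # Standard Schatten q-norm computed from the singular values of A.
    s = np.linalg.svd(A, compute_uv=False)
    return float((s ** q).sum() ** (1.0 / q))

def fitness(As, data, p, r, q, lam):
    # As: list of n matrices A_i; data: list of (u, v) pairs with zero-based label v.
    n = len(As)
    G = 0.0
    R = 0.0
    for u, v in data:
        norms = np.array([np.linalg.norm(A @ u) for A in As])
        G += (norms[v] ** r / (norms ** r).sum()) ** p
        R += lam * np.log(norms[v])
    G /= len(data)
    R /= len(data)
    R -= lam * sum(np.log(schatten_norm(A, q)) for A in As) / n
    return G + R

def predict(As, u, p, r):
    # Probability vector f(u) built from the norms raised to the power r * (1 - p).
    w = np.array([np.linalg.norm(A @ u) for A in As]) ** (r * (1 - p))
    return w / w.sum()
```

Rescaling every $A_{i}$ by the same non-zero real constant leaves `fitness` unchanged up to floating point error, which is a quick sanity check of the $0$ -homogeneity.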
We can generalize the function $f$ to pseudodeterministically trained machine learning models with multiple layers by replacing the linear operators $A_{1}, \dots, A_{n}$ with some non-linear or multi-linear operators. Actually, there are quite a few ways of generalizing the fitness function $F$ , and I have taken some liberty in the exact formulation for $F$ .
In addition to being pseudodeterministic, the fitness function $F$ has other notable desirable properties. For example, when maximizing $F$ using gradient ascent, one tends to converge to the local maximum at an exponential rate without needing to decay the learning rate.

Joseph Van Name 7 Aug 2025 5:08 UTC
1 point
0
in reply to: AnthonyC’s comment on: Consider showering
Whether one takes or should take a cold shower or not depends on a lot of factors including whether one exercises, one’s health, one’s personal preferences, the air temperature, the cold water temperature, the humidity level, and the hardness of the shower water. But it seems like most people can’t fathom taking a cold shower simply because they are cold intolerant even though cold showers have many benefits.
In addition to the practical benefits of cold showers, cold showers also may offer health benefits.
Cold showers could improve one’s immune system (though we should).
The Effect of Cold Showering on Health and Work: A Randomized Controlled Trial—PMC
Cold showers may boost mood or alleviate depression.
Scientific Evidence-Based Effects of Hydrotherapy on Various Systems of the Body—PMC
Adapted cold shower as a potential treatment for depression—ScienceDirect
Cold showers could also improve circulation and metabolism.
Cold showers also offer other benefits.
I always use the exhaust fan. It is never powerful enough to reduce the humidity faster than a warm shower increases the humidity. I also lock the door when taking a shower, and I do not know why anyone would take a shower without locking the door. Opening the door while showering just makes the rest of the home humid as well, and we can’t have that.
I exercise daily, so out of habit, I always take a shower after I exercise, and most of my showers are after exercise. Even if I spend a few minutes cooling down after exercise, I need the shower to cool down even more, and by taking a warm shower, I cannot cool down as effectively, so I end up sweating after taking the shower. And I sometimes take my temperature after exercise and the shower and even after the shower, I tend to have a mouth temperature of 99.0 to 99.5 degrees Fahrenheit. I doubt that people who barely need to take a shower after exercising are doing much exercise or perhaps they are doing weights instead of cardio which produces less sweat, but in any case, I have never exercised and thought that I do not need a shower regardless of whether I am doing cardio, weights, or whatever.
Soap scum left over after taking a cold shower seems to be a problem for you and for you only.
Added 8/20/2025: And taking a hot shower produces all the condensation that helps all that mirror bacteria grow. Biological risk from the mirror world — LessWrong

Joseph Van Name 4 Aug 2025 0:59 UTC
−1 points
0
in reply to: AnthonyC’s comment on: Consider showering
Instead of not taking showers, we should all take cold showers for many reasons.
1. You already mentioned the energy usage which is a problem.
2. Hot showers increase the relative humidity of the bathroom to 100 percent which is way too high. And that humidity means that you get a lot of condensation in the bathroom too. That is good only if you want the bathroom covered in mold.
3. If you take a hot shower that fogs up all the mirrors, you are censoring your own nakeyness. Please don’t do that.
4. I do not care if people shower daily. But people need to exercise daily. And after exercising, people need to shower. As a corollary, most of the time that people shower should be right after exercising. But after exercising, you are already warm, so the goal is to cool down. This means that everyone needs to take a cold shower.
5. Cold intolerance is a major problem. People need to get over it. People who can’t tolerate a little bit of cold probably are intolerant in other areas as well. They cannot go mountain climbing because the mountains have snow on them. They can’t tolerate hot peppers. And they are afraid of spiders too.

Joseph Van Name 12 Jun 2025 6:36 UTC
3 points
0
on: Joseph Van Name’s Shortform
I am going to share an algorithm that I came up with that tends to produce the same result when we run it multiple times with different initializations. The iteration is not even guaranteed to converge since we are not using gradient ascent, but it typically converges as long as the algorithm is given a reasonable input. This suggests that the algorithm behaves mathematically and may be useful for things such as quantum error correction. After analyzing the algorithm, I shall use the algorithm to solve a computational problem.
We say that an algorithm is pseudodeterministic if it tends to return the same output even if the computation leading to that output is non-deterministic (due to a random initialization). I believe that we should focus a lot more on pseudodeterministic machine learning algorithms for AI safety and interpretability since pseudodeterministic algorithms are inherently interpretable.
Define $f (z) = 3 z^{2} - 2 z^{3}$ for all complex numbers $z$ . Then $f (0) = 0, f (1) = 1, f' (0) = f' (1) = 0$ , and there are neighborhoods $U, V$ of $0, 1$ respectively where if $x \in U$ , then $f^{N} (x) \to 0$ quickly and if $y \in V$ , then $f^{N} (y) \to 1$ quickly. Set $f^{\infty} = \lim_{N \to \infty} f^{N}$ . The function $f^{\infty}$ serves as error correction for projection matrices since if $Q$ is nearly a projection matrix, then $f^{\infty} (Q)$ will be a projection matrix.
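Here is a small numerical sketch of this error correction. The assumptions are mine: real symmetric matrices, a fixed number of iterations (40) standing in for the limit, and hypothetical function names.

```python
import numpy as np

def f(Q):
    # f(Q) = 3 Q^2 - 2 Q^3 applied to a square matrix.
    Q2 = Q @ Q
    return 3 * Q2 - 2 * Q2 @ Q

def f_infinity(Q, iters=40):
    # Approximates the limit of the iterates f^N(Q) with a fixed number of steps.
    for _ in range(iters):
        Q = f(Q)
    return Q

def random_orthogonal_projection(n, d, rng):
    # Orthogonal projection onto the column space of a random n x d matrix.
    basis, _ = np.linalg.qr(rng.standard_normal((n, d)))
    return basis @ basis.T
```

If we perturb a rank $d$ orthogonal projection by a small symmetric matrix, the eigenvalues stay close to $0$ and $1$ , and `f_infinity` pushes them back to exactly $0$ and $1$ , so the output is again a projection of rank $d$ .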
Suppose that $K$ is either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $Z (K)$ denote the center of $K$ . In particular, $Z (\mathbb{R}) = \mathbb{R}, Z (\mathbb{C}) = \mathbb{C}, Z (\mathbb{H}) = \mathbb{R}$ .
If $A_{1}, \dots, A_{r}$ are $m \times n$ -matrices, then define $Φ (A_{1}, \dots, A_{r}) : M_{n} (K) \to M_{m} (K)$ by setting $Φ (A_{1}, \dots, A_{r}) (X) = \sum_{k = 1}^{r} A_{k} X A_{k}^{*}$ . Then we say that an operator of the form $Φ (A_{1}, \dots, A_{r})$ is completely positive. We say that a $Z (K)$ -linear operator $E : M_{n} (K) \to M_{m} (K)$ is Hermitian preserving if $E (X)$ is Hermitian whenever $X$ is Hermitian. Every completely positive operator is Hermitian preserving.
Suppose that $E : M_{n} (K) \to M_{n} (K)$ is $Z (K)$ -linear. Let $t > 0$ . Let $P_{0} \in M_{n} (K)$ be a random orthogonal projection matrix of rank $d$ . Set $P_{N + 1} = f^{\infty} (P_{N} + t \cdot E (P_{N}))$ for all $N$ . Then if everything goes well, the sequence $(P_{N})_{N}$ will converge to a projection matrix $P$ of rank $d$ , and the projection matrix $P$ will typically be unique in the sense that if we run the experiment again, we will typically obtain the exact same projection matrix $P$ . If $E$ is Hermitian preserving, then the projection matrix $P$ will typically be an orthogonal projection. This experiment performs well especially when $E$ is completely positive or at least Hermitian preserving or nearly so. The projection matrix $P$ will satisfy the equation $P \cdot E (P) = E (P) \cdot P = P \cdot E (P) \cdot P$ .
In the case when $E$ is a quantum channel, we can easily explain what the projection $P$ does. The operator $P$ is a projection onto a subspace of complex Euclidean space that is particularly well preserved by the channel $E$ . In particular, the image $Im (P)$ is spanned by the top $d$ eigenvectors of $E (P)$ . This means that if we send the completely mixed state $P / d$ through the quantum channel $E$ and we measure the state $E (P / d)$ with respect to the projective measurement $(P, I - P)$ , then there is an unusually high probability that this measurement will land on $P$ instead of $I - P$ .
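The iteration $P_{N + 1} = f^{\infty} (P_{N} + t \cdot E (P_{N}))$ can be sketched for a random completely positive $E$ over the real numbers. Several choices here are my own and not part of the algorithm as stated: the Kraus matrices are rescaled so that $E (I)$ has spectral norm $1$ (this keeps the eigenvalues of $P_{N} + t \cdot E (P_{N})$ in $[0, t] \cup [1, 1 + t]$ so that $f^{\infty}$ does not leave the basins of $0$ and $1$ for small $t$ ), the limit is replaced by 40 applications of $f$ , each iterate is symmetrized to suppress rounding drift, and all names are hypothetical. The sketch runs a fixed number of steps and does not check for convergence.

```python
import numpy as np

def f_infinity(Q, iters=40):
    # Fixed-step stand-in for the limit of f^N with f(Q) = 3 Q^2 - 2 Q^3.
    for _ in range(iters):
        Q2 = Q @ Q
        Q = 3 * Q2 - 2 * Q2 @ Q
    return Q

def random_cp_map(n, r, rng):
    # E(X) = sum_k A_k X A_k^T, rescaled (my assumption) so that ||E(I)|| = 1.
    kraus = [rng.standard_normal((n, n)) for _ in range(r)]
    scale = np.linalg.norm(sum(A @ A.T for A in kraus), 2)
    kraus = [A / np.sqrt(scale) for A in kraus]
    return lambda X: sum(A @ X @ A.T for A in kraus)

def iterate_projection(E, P0, t=0.2, steps=100):
    P = P0
    for _ in range(steps):
        P = f_infinity(P + t * E(P))
        P = (P + P.T) / 2  # numerical safeguard only, not part of the stated iteration
    return P

def random_orthogonal_projection(n, d, rng):
    basis, _ = np.linalg.qr(rng.standard_normal((n, d)))
    return basis @ basis.T
```

To see whether a run has settled, one can look at the size of the commutator `P @ E(P) - E(P) @ P`, which should be close to zero at a fixed point, and compare the results of runs started from different random `P0`.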
Let us now use the algorithm that obtains $P$ from $E$ to solve a problem in many cases.
If $x$ is a vector, then let $Diag (x)$ denote the diagonal matrix where $x$ is the vector of diagonal entries, and if $X$ is a square matrix, then let $Diag (X)$ denote the diagonal of $X$ . If $x$ is a length $n$ vector, then $Diag (x)$ is an $n \times n$ -matrix, and if $X$ is an $n \times n$ -matrix, then $Diag (X)$ is a length $n$ vector.
Problem Input: An $n \times n$ -square matrix $A$ with non-negative real entries and a natural number $d$ with $1 \leq d < n$ .
Objective: Find a subset $B \subseteq \{1, \dots, n\}$ with $| B | = d$ and where if $x = A \cdot χ_{B}$ , then the $d$ largest entries in $x$ are the values $x [b]$ for $b \in B$ .
Algorithm: Let $E$ be the completely positive operator defined by setting $E (X) = Diag (A \cdot Diag (X))$ . Then we run the iteration using $E$ to produce an orthogonal projection $P$ with rank $d$ . In this case, the projection $P$ will be a diagonal projection matrix with rank $d$ where $Diag (P) = χ_{B}$ and where $B$ is our desired subset of $\{1, \dots, n\}$ .
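For small inputs, the objective can be checked directly, which is handy for verifying whatever the iteration returns. In this sketch the indices are zero-based, ties are counted as acceptable, and the function names are my own; the brute-force search is only a reference for tiny $n$ and is not the algorithm above.

```python
import itertools
import numpy as np

def E(A, X):
    # E(X) = Diag(A . Diag(X)): take the diagonal of X, multiply by A, put it back on a diagonal.
    return np.diag(A @ np.diag(X))

def is_valid_subset(A, B):
    # Checks that the entries of x = A . chi_B indexed by B are the largest ones (ties allowed).
    n = A.shape[0]
    chi = np.zeros(n)
    chi[list(B)] = 1.0
    x = A @ chi
    inside = [x[i] for i in range(n) if i in B]
    outside = [x[i] for i in range(n) if i not in B]
    return min(inside) >= max(outside)

def brute_force_subsets(A, d):
    # Exhaustive reference search; only feasible for small n.
    n = A.shape[0]
    return [B for B in itertools.combinations(range(n), d) if is_valid_subset(A, B)]
```

For a diagonal projection $P$ with $Diag (P) = χ_{B}$ , the matrix `E(A, P)` is the diagonal matrix with entries $A \cdot χ_{B}$ , so the condition checked by `is_valid_subset` says that the image of $P$ is spanned by the top $d$ eigenvectors of $E (P)$ .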
While the operator $P$ is just a linear operator, the pseudodeterminism of the algorithm that produces the operator $P$ generalizes to other pseudodeterministic algorithms that return models that are more like deep neural networks.

Joseph Van Name 4 Jun 2025 21:01 UTC
1 point
0
in reply to: Logan Riggs’s comment on: Spectral radii dimensionality reduction computed without gradient calculations
I would have thought that a fitness function that is maximized using something other than gradient ascent and which can solve NP-complete problems at least in the average case would be worth reading since that means that it can perform well on some tasks but it also behaves mathematically in a way that is needed for interpretability. The quality of the content is inversely proportional to the number of views since people don’t think the same way as I do.
Wheels on the Bus | @CoComelon Nursery Rhymes & Kids Songs
Stuff that is popular is usually garbage.
But here is my post about the word embedding.
Interpreting a matrix-valued word embedding with a mathematically proven characterization of all optima — LessWrong
And I really do not want to collaborate with people who are not willing to read the post. This is especially true of people in academia since universities promote violence and refuse to acknowledge any wrongdoing. Universities are the absolute worst.
Instead of engaging with the actual topic, people tend to just criticize stupid stuff simply because they only want to read about what they already know or what is recommended by their buddies; that is a very good way not to learn anything new or insightful. For this reason, even the simplest concepts are lost on most people.