Joseph Van Name

Karma: −2

Joseph Van Name 24 Nov 2025 5:15 UTC
−5 points
0
in reply to: Taylor G. Lunt’s comment on: Literacy is Decreasing Among the Intellectual Class
You are siding with evil because you yourself are evil. My anger against people like you is righteous. If you are not convinced, it is simply because you have been completely consumed by your own evil. Universities promote violence. I know. I was a professor. But you just want me to be violently attacked and injured or killed.
~~UNIVERSITIES MUST BE REJECTED FOR PROMOTING VIOLENCE!~~
P.S. Only commenting and responding to my non-technical posts where I call out universities for their problem just proves my point. Face it. You are stupid, and universities only pretended to educate you.
I am striking this out because it is better if I instead made only mathematics posts on this site relevant to AI safety.

Joseph Van Name 24 Nov 2025 1:20 UTC
−31 points
−1
in reply to: Random Developer’s comment on: Literacy is Decreasing Among the Intellectual Class
I was a professor, so I know that universities promote violence, so don’t even try that bullshit on me. People hate universities because universities are absolutely horrendous and extremely unprofessional. Until universities apologize for their extremely low standards and horrible behavior, we should MOCK all people with degrees from universities and regard them as evil worthless people. People hate me for bringing this up because most people with college degrees are scumbags who are afraid to admit that they are far stupider (and more evil too) than people who did not waste their money and time in college.
~~I DO NOT NEED TO VISIT A UNIVERSITY TO KNOW HOW IT IS LIKE SINCE I WAS A PROFESSOR, YOU BLOODTHIRSTY SCUMBAG!~~
P.S. I knew people would downvote me. The reason people hate me for talking about this is that most people with college degrees are bloodthirsty piles garbage who need to be punished for their evil. You all probably think that barbaric practices like ECT are healthy because you just fucking trust medical professionals trained by universities along with all other university graduates. You all have the morals of Jeffrey Dahmer.
I am striking this out because it is better if I instead made only mathematics posts on this site relevant to AI safety.

Joseph Van Name 23 Nov 2025 10:28 UTC
−19 points
0
on: Literacy is Decreasing Among the Intellectual Class
We should no longer consider people with degrees or affiliated with universities as being ‘intellectual’ in any way whatsoever because universities promote violence and refuse to improve their horrendous behavior.
I am striking this out because it is better if I instead made only mathematics posts on this site relevant to AI safety.

Joseph Van Name 22 Nov 2025 20:12 UTC
1 point
0
on: Why Not Just Train For Interpretability?
The use of something like L1 regularization to achieve sparsity for inherent interpretability may just make things worse; a fixation on L1 regularization may lead people in the wrong direction. To avoid fixation, we should take a step back and look at the big picture. Occam’s razor suggests that we should look for simple (and creative) solutions instead of over-engineering solutions when the entire foundation is inadequate.
In order to obtain inherent interpretability, the machine learning model needs to behave in a way that is interesting to mathematicians. By piling on tweaks such as a lot of L1 or L0 regularization for sparsity, one is making the machine learning model more complicated. That makes it more difficult to study mathematically. And neural networks are inherently difficult to study mathematically and to interpret, so they should be replaced with something else. The problem is that neural networks already have so much momentum that people are unwilling to try anything else, and people are way too indoctrinated into neural networkology that they cannot learn new things.
So how does one get momentum with a non-neural machine learning algorithm? One starts with shallow but mathematical machine learning algorithms first and one can also work with algorithms with few layers too. These shallow/few layer mathematical algorithms can still be effective for some problems since they have plenty of width. One may also construct a hybrid model where the first few layers are the mathematical construction but where the rest of the network is a deep neural network. I do not see how to make a very deep network this way, so the next steps are obscure to me.

Approximating arbitrary complex-valued continuous functions

Joseph Van Name 17 Nov 2025 8:25 UTC

5 points

0 comments, 5 min read

Joseph Van Name 12 Oct 2025 20:03 UTC
1 point
0
in reply to: joseph_c’s comment on: Using complex polynomials to approximate arbitrary continuous functions
I had not heard of the IBM paper until now. This is inspired by my personal experiments training (obviously classical) machine learning models.
Suppose that $V_{0}, \dots, V_{n}, W_{0}, \dots, W_{n}$ are real or complex finite dimensional inner product spaces. Suppose that the training data consists of tuples of the form $(v_{0}, \dots, v_{n})$ where $v_{0} \in V_{0}, \dots, v_{n} \in V_{n}$ are vectors. Let $W_{0} = V_{0}$ and let $B_{j} : V_{j + 1} \times W_{j} \to W_{j + 1}$ be bilinear for all $j$ . Then let
$L_{v} (w) = B_{j} (v, w)$ whenever $v \in V_{j + 1}, w \in W_{j}$ . Then we define our polynomial by setting $p (v_{0}, \dots, v_{n}) = L_{v_{n}} \dots L_{v_{1}} (v_{0})$ . In other words, my machine learning models are just compositions of bilinear mappings. In addition to wanting ${Tr}_{U} ([p (v_{0}, \dots, v_{n})])$ to approximate the label, we also include regularization that makes the machine learning model pseudodeterministically trained so that if we train it twice with different initializations, we end up with the same trained model. Here, the machine learning model has $n$ layers, but the addition of extra layers gives us diminishing returns since bilinearity is close to linearity, so I still want to figure out how to improve the performance of such a machine learning model to match deep neural networks (if that is even feasible).
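As a minimal sketch of the composition of bilinear mappings described above, the following evaluates $p (v_{0}, \dots, v_{n}) = L_{v_{n}} \dots L_{v_{1}} (v_{0})$ for real spaces. The function names, the randomly generated maps $B_{j}$, and the choice of real scalars are illustrative assumptions; the trace readout and the regularization mentioned above are not specified in enough detail here and are left out.

```python
import numpy as np

def make_bilinear_layers(dims_v, dims_w, rng):
    """Random bilinear maps B_j : V_{j+1} x W_j -> W_{j+1}.

    Each B_j is stored as an array of shape (dim W_{j+1}, dim V_{j+1}, dim W_j).
    Random Gaussian entries are an assumption made only for illustration.
    """
    assert dims_w[0] == dims_v[0]  # W_0 = V_0
    return [
        rng.standard_normal((dims_w[j + 1], dims_v[j + 1], dims_w[j]))
        for j in range(len(dims_v) - 1)
    ]

def evaluate(layers, vs):
    """Compute p(v_0, ..., v_n) = L_{v_n} ... L_{v_1}(v_0)."""
    w = vs[0]
    for B, v in zip(layers, vs[1:]):
        # w <- B_j(v_{j+1}, w), linear in v and in w separately
        w = np.einsum("kij,i,j->k", B, v, w)
    return w
```

Because each layer is bilinear, the output is linear in each input $v_{j}$ separately, which is one way to see why extra layers add expressiveness only slowly.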
I use quantum information theory for my experiments mainly because quantum information theory behaves well unlike neural networks.

Joseph Van Name 12 Oct 2025 1:50 UTC
1 point
0
in reply to: ChristianKl’s comment on: Joseph Van Name’s Shortform
Proof-of-stake is still wasteful since it promotes pump and dump scams and causes people to waste their money on scam projects. If the creators are able to get their reward at the very beginning of a project, they will be more interested in short-term gains rather than a long-term token that will last. Humans are not psychologically/socially equipped to invest in proof-of-stake cryptocurrencies since they tend to get scammed.

Joseph Van Name 11 Oct 2025 5:35 UTC
7 points
1
on: Joseph Van Name’s Shortform
Bitcoin mining is a real-world example of a goal that people spend an enormous amount of resources to attain, but this goal is useless or at least horribly inefficient.
Recall that the orthogonality thesis states that it is possible for an intelligent entity to have bad or dumb goals and that it is also possible for a not-so-intelligent entity to have good goals. I would therefore consider Bitcoin mining to be a prominent real-world example of the orthogonality thesis, as it is in a sense a dumb goal attained intelligently (though this example is imperfect).
Bitcoin’s mining algorithm consists of computing many SHA-256 hashes relentlessly. The Bitcoin miners are rewarded whenever they compute a suitable SHA-256 hash that is lower than the target. These SHA-256 hashes establish decentralized consensus about the state of the blockchain, and they distribute newly minted bitcoins. But besides this, computing so many SHA-256 hashes is nearly useless. Computing so many SHA-256 hashes consumes large quantities of energy and creates electronic waste.
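The hash-below-target search described above can be illustrated with a toy sketch. The header bytes and the target used here are made up for illustration; real Bitcoin mining applies double SHA-256 to an 80-byte block header and uses a far smaller target.

```python
import hashlib

def double_sha256(data: bytes) -> bytes:
    """SHA-256 applied twice, as Bitcoin does for block hashes."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def mine(header: bytes, target: int, max_nonce: int = 2**32):
    """Try nonces until the hash, read as a little-endian integer, is below target.

    `header` is a hypothetical stand-in for a block header, not a real one.
    Returns (nonce, digest), or None if no nonce below max_nonce works.
    """
    for nonce in range(max_nonce):
        digest = double_sha256(header + nonce.to_bytes(4, "little"))
        if int.from_bytes(digest, "little") < target:
            return nonce, digest
    return None
```

Lowering the target makes a suitable hash rarer, so the expected number of hashes (and the energy spent) grows in inverse proportion to the target.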
So what are some of the possible alternatives to Bitcoin mining? It seems like the best alternative that does not significantly change the nature of Bitcoin mining would be to replace SHA-256 mining with some other mining algorithm that serves some scientific purpose.
This is more difficult than it seems because Bitcoin mining must satisfy a list of cryptographic properties. If the mining algorithm did not satisfy these cryptographic properties, then it might not be feasible for newly minted bitcoins to be dispersed every 10 minutes, and we may enter a scenario where a single entity with a secret algorithm or slightly faster hardware puts all the blocks on the blockchain.
Since Bitcoin mining must satisfy a list of cryptographic properties, it is difficult to come up with a more scientifically useful mining algorithm that satisfies these cryptographic properties. But in science, if there is a difficult problem, people should perform research on this scientific problem. While finding a useful cryptocurrency mining algorithm has its challenges, cryptocurrency mining algorithms are easy to produce since they can be made from cryptographic hash functions without requiring public key encryption or other advanced cryptographic algorithms, so difficulty seems more like an excuse rather than a legitimate reason not to investigate useful cryptocurrency mining algorithms. The cryptocurrency sector does not want to perform this research. I can think of several reasons why people refuse to support this sort of endeavor despite the great effort that people put into Bitcoin mining, but none of these reasons justify the lack of interest in useful cryptocurrency mining.
The diminishing quality of cryptocurrency users:
It seems like when altcoins were first being developed around 2014, people were much more interested in developing scientifically useful mining algorithms. But around 2017 when cryptocurrency really started to become popular, people simply wanted to make money from cryptocurrencies, yet they were not very interested in understanding how cryptocurrencies work or how to improve them.
Mining algorithms with questionable scientific use:
Some cryptocurrencies and proposals such as Primecoin and Gapcoin have more scientific mining algorithms, but these mining algorithms still have questionable usefulness. For example, the objective in Primecoin mining is to find suitable Cunningham chains. A Cunningham chain of the first kind is a sequence of prime numbers $(p_{1}, \dots, p_{n})$ where $p_{j + 1} = 2 p_{j} + 1$ whenever $1 \leq j < n$ . The most interesting thing about Cunningham chains is that they can be used in cryptocurrency mining algorithms, but they are otherwise of minor importance to mathematics.
These questionable mining algorithms are supposed to steer the cryptocurrency community into a more scientific direction, but in reality, they have just steered the cryptocurrency community towards using mining to perform mathematical calculations that not even mathematicians care that much about.
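For concreteness, here is a small sketch that checks the definition of a Cunningham chain of the first kind given above. It uses naive trial division and is only meant to illustrate the definition; it is not how Primecoin itself searches for chains.

```python
def is_prime(n: int) -> bool:
    """Naive trial-division primality test (fine for small illustrative inputs)."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def is_cunningham_chain(chain) -> bool:
    """True if every entry is prime and p_{j+1} = 2 * p_j + 1 throughout."""
    return all(is_prime(p) for p in chain) and all(
        b == 2 * a + 1 for a, b in zip(chain, chain[1:])
    )

def chain_from(p: int):
    """Longest Cunningham chain of the first kind starting at p."""
    chain = []
    while is_prime(p):
        chain.append(p)
        p = 2 * p + 1
    return chain
```

For example, starting from 2 the chain is 2, 5, 11, 23, 47 and stops there because 95 = 5 · 19 is not prime.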
Alternative solutions to the energy waste problem:
Many people just want to do away with cryptocurrency mining in an altcoin by replacing it with proof-of-stake or some other consensus mechanism. This solution is attractive to the cryptocurrency creators since they want complete control over all the coins at the beginning of the project, and they just use the energy usage of cryptocurrency as a marketing strategy to get people interested in their project. But this solution should not be appealing to anyone who wants to use the cryptocurrency even if a cryptocurrency is better funded without much mining (of course, if mining is replaced with another consensus mechanism after all the coins have been created, then this objection does not stand). After all, Satoshi Nakamoto did not fund Bitcoin by selling bitcoins. There are other ways to fund a cryptocurrency project without alternate consensus mechanisms.
Hostility against cryptocurrency technologies:
It seems like many members of society are hostile toward cryptocurrency technologies and hate people who own or are in any way interested in cryptocurrency. This sort of hostility is a very good reason to conduct as many transactions as possible using just cryptocurrency since I do not want to deal with all of those Karens. But this hostility may have turned people away from researching useful cryptocurrency mining algorithms even though the usefulness would probably not benefit the cryptocurrency directly.
Hardcore Bitcoiners:
If Bitcoin mining were magically replaced with a useful mining algorithm, barely anything about Bitcoin would change. But in my experience, Bitcoiners do not see it this way. They are so stuck in their ways that they reject all altcoins.
Conclusion:
While cryptocurrencies have a lot of monetary value, they are not exactly powerhouses of innovation, nor do I find them extremely interesting on their own. But a good scientific mining algorithm would make them much more innovative and interesting.

Using complex polynomials to approximate arbitrary continuous functions

Joseph Van Name 11 Oct 2025 4:06 UTC

5 points

2 comments, 5 min read

Joseph Van Name 22 Sep 2025 7:29 UTC
2 points
0
on: Joseph Van Name’s Shortform
In this post, we shall go over a way to produce mostly linear machine learning classification models that output probabilities for each possible label. These mostly linear models are pseudodeterministically trained (or pseudodeterministic for short) in the sense that if we train them multiple times with different initializations, we will typically get the same trained model (up to symmetry and minuscule floating point differences).
The algorithms that I am mentioning in this post generalize to more complicated multi-layered algorithms in the sense that the multi-layered algorithms remain pseudodeterministic, but for simplicity, we shall stick to just linear operators here.
Let $K$ denote either the field of real numbers, the field of complex numbers, or the division ring of quaternions. Let $U$ be a finite dimensional inner product space over $K$. The training data is a set $D$ of pairs $(u, v)$ where $u \in U$ and $v \in \{1, \dots, n\}$; here $u$ is the machine learning model input and $v$ is the label. The machine learning model is trained to predict the label $v$ when given the input $u$. The trained model is a function $f$ that maps $U$ to the set of all probability vectors of length $n$, so the trained model actually gives the probabilities for each possible label.
Suppose that $V_{i}$ is a finite dimensional inner product space over $K$ for each $i \in \{1, \dots, n\}$. Then the domain of the fitness function consists of tuples $(A_{1}, \dots, A_{n})$ where each $A_{i}$ is a linear operator from $U$ to $V_{i}$. Let $p \in (0, 1), r \in (0, \infty), q \in (1, \infty)$, and let $λ \geq 0$. The parameter $p$ is the exponent while $λ$ is the regularization parameter. Define (almost total) functions $G, R, F : L (U, V_{1}) \times \dots \times L (U, V_{n}) \to \mathbb{R}$ by setting
$G (A_{1}, \dots, A_{n}) = \frac{1}{| D |} \sum_{(u, v) \in D} \left( \frac{∥ A_{v} u ∥^{r}}{∥ A_{1} u ∥^{r} + \dots + ∥ A_{n} u ∥^{r}} \right)^{p}$ and
$R (A_{1}, \dots, A_{n}) = \frac{λ}{| D |} \sum_{(u, v) \in D} \log (∥ A_{v} u ∥) - \frac{λ}{n} \left( \log (∥ A_{1} ∥_{q}) + \dots + \log (∥ A_{n} ∥_{q}) \right)$.
Here, $∥ \cdot ∥_{q}$ denotes the Schatten $q$-norm, which can be defined by setting
$∥ A ∥_{q} = Tr ((A A^{*})^{q / 2})^{1 / q}$.
Set $F = G + R$. Here, $F$ denotes our fitness function. The function $G$ is what we really want to maximize, but unfortunately, $G$ is typically non-pseudodeterministic, so we need to add the regularization term $R$ to obtain pseudodeterminism. The regularization term $R$ also has the added effect of making $∥ A_{v} u ∥$ relatively large compared to the norm $∥ A_{v} ∥_{q}$ for training data points $(u, v)$. This may be useful in determining whether a pair should belong to either the training or test data in the first place.
We observe that $F$ is $0$ -homogeneous in the sense that $F (A_{1}, \dots, A_{n}) = F (c A_{1}, \dots, c A_{n})$ for each non-zero scalar $c$ (in the quaternionic case, the scalars are just the real numbers).
Suppose now that we have obtained a tuple $(A_{1}, \dots, A_{n})$ that maximizes the fitness $F (A_{1}, \dots, A_{n})$ . Let $P V (n)$ denote the set of all probability vectors of length $n$ . Then define an almost total function $f : U \to P V (n)$ by setting
$f (u) = \frac{(∥ A_{1} u ∥^{r (1 - p)}, \dots, ∥ A_{n} u ∥^{r (1 - p)})}{∥ A_{1} u ∥^{r (1 - p)} + \dots + ∥ A_{n} u ∥^{r (1 - p)}} .$
If $(u, v)$ belongs to the training data set, then the $i$ -th entry of $f (u)$ is the machine learning model’s estimate of the probability that $i = v$ . I will let the reader justify this calculation of the probabilities.
We can generalize the function $f$ to pseudodeterministically trained machine learning models with multiple layers by replacing the linear operators $A_{1}, \dots, A_{n}$ with some non-linear or multi-linear operators. Actually, there are quite a few ways of generalizing the fitness function $F$ , and I have taken some liberty in the exact formulation for $F$ .
In addition to being pseudodeterministic, the fitness function $F$ has other notable desirable properties. For example, when maximizing $F$ using gradient ascent, one tends to converge to the local maximum at an exponential rate without needing to decay the learning rate.
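The fitness function $F = G + R$ and the predictor $f$ defined above can be sketched as follows for the real case. This is only an illustrative evaluation, not a training loop; the function names are made up, labels are 0-based in the code (the text uses $1, \dots, n$), and the Schatten norm is computed from singular values.

```python
import numpy as np

def schatten_norm(A, q):
    """Schatten q-norm: the l^q norm of the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    return np.sum(s ** q) ** (1.0 / q)

def fitness(As, data, p, r, q, lam):
    """F = G + R for operators As = [A_1, ..., A_n] and data = [(u, v), ...]."""
    G = 0.0
    R = 0.0
    for u, v in data:
        norms = np.array([np.linalg.norm(A @ u) for A in As])
        G += (norms[v] ** r / np.sum(norms ** r)) ** p
        R += lam * np.log(norms[v])
    G /= len(data)
    R /= len(data)
    R -= lam * np.mean([np.log(schatten_norm(A, q)) for A in As])
    return G + R

def predict(As, u, p, r):
    """Probability vector f(u) with entries proportional to ||A_i u||^(r (1 - p))."""
    w = np.array([np.linalg.norm(A @ u) for A in As]) ** (r * (1 - p))
    return w / w.sum()
```

A quick sanity check on such a sketch is the $0$-homogeneity noted above: rescaling every $A_{i}$ by the same non-zero scalar should leave `fitness` unchanged up to rounding.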

Joseph Van Name 7 Aug 2025 5:08 UTC
1 point
0
in reply to: AnthonyC’s comment on: Consider showering
Whether one takes or should take a cold shower or not depends on a lot of factors including whether one exercises, one’s health, one’s personal preferences, the air temperature, the cold water temperature, the humidity level, and the hardness of the shower water. But it seems like most people can’t fathom taking a cold shower simply because they are cold intolerant even though cold showers have many benefits.
In addition to the practical benefits of cold showers, cold showers also may offer health benefits.
Cold showers could improve one’s immune system (though we should).
The Effect of Cold Showering on Health and Work: A Randomized Controlled Trial—PMC
Cold showers may boost mood or alleviate depression.
Scientific Evidence-Based Effects of Hydrotherapy on Various Systems of the Body—PMC
Adapted cold shower as a potential treatment for depression—ScienceDirect
Cold showers could also improve circulation and metabolism.
Cold showers also offer other benefits.
I always use the exhaust fan. It is never powerful enough to reduce the humidity faster than a warm shower increases the humidity. I also lock the door when taking a shower, and I do not know why anyone would take a shower without locking the door. Opening the door while showering just makes the rest of the home humid as well, and we can’t have that.
I exercise daily, so out of habit, I always take a shower after I exercise, and most of my showers are after exercise. Even if I spend a few minutes cooling down after exercise, I need the shower to cool down even more, and by taking a warm shower, I cannot cool down as effectively, so I end up sweating after taking the shower. And I sometimes take my temperature after exercise and the shower and even after the shower, I tend to have a mouth temperature of 99.0 to 99.5 degrees Fahrenheit. I doubt that people who barely need to take a shower after exercising are doing much exercise or perhaps they are doing weights instead of cardio which produces less sweat, but in any case, I have never exercised and thought that I do not need a shower regardless of whether I am doing cardio, weights, or whatever.
Soap scum left over after taking a cold shower seems to be a problem for you and for you only.
Added 8/20/2025: And taking a hot shower produces all the condensation that helps all that mirror bacteria grow. Biological risk from the mirror world — LessWrong

Joseph Van Name 4 Aug 2025 0:59 UTC
−1 points
0
in reply to: AnthonyC’s comment on: Consider showering
Instead of not taking showers, we should all take cold showers for many reasons.
1. You already mentioned the energy usage which is a problem.
2. Hot showers increase the relative humidity of the bathroom to 100 percent which is way too high. And that humidity means that you get a lot of condensation in the bathroom too. That is good only if you want the bathroom covered in mold.
3. If you take a hot shower that fogs up all the mirrors, you are censoring your own nakeyness. Please don’t do that.
4. I do not care if people shower daily. But people need to exercise daily. And after exercising, people need to shower. As a corollary, most of the time that people shower should be right after exercising. But after exercising, you are already warm, so the goal is to cool down. This means that everyone needs to take a cold shower.
5. Cold intolerance is a major problem. People need to get over it. People who can’t tolerate a little bit of cold probably are intolerant in other areas as well. They cannot go mountain climbing because the mountains have snow on them. They can’t tolerate hot peppers. And they are afraid of spiders too.

Joseph Van Name 12 Jun 2025 6:36 UTC
3 points
0
on: Joseph Van Name’s Shortform
I am going to share an algorithm that I came up with that tends to produce the same result when we run it multiple times with different initializations. The iteration is not even guaranteed to converge since we are not using gradient ascent, but it typically converges as long as the algorithm is given a reasonable input. This suggests that the algorithm behaves mathematically and may be useful for things such as quantum error correction. After analyzing the algorithm, I shall use the algorithm to solve a computational problem.
We say that an algorithm is pseudodeterministic if it tends to return the same output even if the computation leading to that output is non-deterministic (due to a random initialization). I believe that we should focus a lot more on pseudodeterministic machine learning algorithms for AI safety and interpretability since pseudodeterministic algorithms are inherently interpretable.
Define $f (z) = 3 z^{2} - 2 z^{3}$ for all complex numbers $z$. Then $f (0) = 0, f (1) = 1, f' (0) = f' (1) = 0$, and there are neighborhoods $U, V$ of $0, 1$ respectively where if $x \in U$, then $f^{N} (x) \to 0$ quickly and if $y \in V$, then $f^{N} (y) \to 1$ quickly (here $f^{N}$ denotes the $N$-fold composition of $f$ with itself). Set $f^{\infty} = \lim_{N \to \infty} f^{N}$. The function $f^{\infty}$ serves as error correction for projection matrices since if $Q$ is nearly a projection matrix, then $f^{\infty} (Q)$ will be a projection matrix.
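A minimal sketch of this error correction, applying $f$ to a matrix a fixed number of times as a stand-in for the limit $f^{\infty}$. The iteration count and the function names are assumptions chosen for illustration.

```python
import numpy as np

def f(Q):
    """Matrix version of f(z) = 3 z^2 - 2 z^3."""
    Q2 = Q @ Q
    return 3 * Q2 - 2 * Q2 @ Q

def f_inf(Q, iterations=50):
    """Approximate f^infinity by iterating f; 50 iterations is an arbitrary choice."""
    for _ in range(iterations):
        Q = f(Q)
    return Q
```

If `Q` is an exact projection plus a small perturbation, `f_inf(Q)` should be idempotent up to rounding, with the same rank as the unperturbed projection.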
Suppose that $K$ is either the field of real numbers, complex numbers or quaternions. Let $Z (K)$ denote the center of $K$. In particular, $Z (\mathbb{R}) = \mathbb{R}, Z (\mathbb{C}) = \mathbb{C}, Z (\mathbb{H}) = \mathbb{R}$.
If $A_{1}, \dots, A_{r}$ are $m \times n$-matrices, then define $Φ (A_{1}, \dots, A_{r}) : M_{n} (K) \to M_{m} (K)$ by setting $Φ (A_{1}, \dots, A_{r}) (X) = \sum_{k = 1}^{r} A_{k} X A_{k}^{*}$. Then we say that an operator of the form $Φ (A_{1}, \dots, A_{r})$ is completely positive. We say that a $Z (K)$-linear operator $E : M_{n} (K) \to M_{m} (K)$ is Hermitian preserving if $E (X)$ is Hermitian whenever $X$ is Hermitian. Every completely positive operator is Hermitian preserving.
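As a small illustration of these definitions over the complex numbers, the sketch below builds $Φ (A_{1}, \dots, A_{r})$ from a list of matrices. The function name is hypothetical.

```python
import numpy as np

def completely_positive_map(As):
    """Return the map X -> sum_k A_k X A_k^* for a list of m x n matrices As."""
    def Phi(X):
        return sum(A @ X @ A.conj().T for A in As)
    return Phi
```

Feeding a Hermitian matrix into the returned map should give a Hermitian matrix back, and a positive semidefinite input should give a positive semidefinite output.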
Suppose that $E : M_{n} (K) \to M_{n} (K)$ is $Z (K)$ -linear. Let $t > 0$ . Let $P_{0} \in M_{n} (K)$ be a random orthogonal projection matrix of rank $d$ . Set $P_{N + 1} = f^{\infty} (P_{N} + t \cdot E (P_{N}))$ for all $N$ . Then if everything goes well, the sequence $(P_{N})_{N}$ will converge to a projection matrix $P$ of rank $d$ , and the projection matrix $P$ will typically be unique in the sense that if we run the experiment again, we will typically obtain the exact same projection matrix $P$ . If $E$ is Hermitian preserving, then the projection matrix $P$ will typically be an orthogonal projection. This experiment performs well especially when $E$ is completely positive or at least Hermitian preserving or nearly so. The projection matrix $P$ will satisfy the equation $P \cdot E (P) = E (P) \cdot P = P \cdot E (P) \cdot P$ .
In the case when $E$ is a quantum channel, we can easily explain what the projection $P$ does. The operator $P$ is a projection onto a subspace of complex Euclidean space that is particularly well preserved by the channel $E$ . In particular, the image $Im (P)$ is spanned by the top $d$ eigenvectors of $E (P)$ . This means that if we send the completely mixed state $P / d$ through the quantum channel $E$ and we measure the state $E (P / d)$ with respect to the projective measurement $(P, I - P)$ , then there is an unusually high probability that this measurement will land on $P$ instead of $I - P$ .
Let us now use the algorithm that obtains $P$ from $E$ to solve a problem in many cases.
If $x$ is a vector, then let $Diag (x)$ denote the diagonal matrix where $x$ is the vector of diagonal entries, and if $X$ is a square matrix, then let $Diag (X)$ denote the diagonal of $X$ . If $x$ is a length $n$ vector, then $Diag (x)$ is an $n \times n$ -matrix, and if $X$ is an $n \times n$ -matrix, then $Diag (X)$ is a length $n$ vector.
Problem Input: An $n \times n$ matrix $A$ with non-negative real entries and a natural number $d$ with $1 \leq d < n$.
Objective: Find a subset $B \subseteq \{1, \dots, n\}$ with $| B | = d$ such that if $x = A \cdot χ_{B}$ (where $χ_{B}$ is the indicator vector of $B$), then the $d$ largest entries in $x$ are the values $x [b]$ for $b \in B$.
Algorithm: Let $E$ be the completely positive operator defined by setting $E (X) = Diag (A \cdot Diag (X))$. Then we run the iteration using $E$ to produce an orthogonal projection $P$ with rank $d$. In this case, the projection $P$ will be a diagonal projection matrix with rank $d$ where $Diag (P) = χ_{B}$ and where $B$ is our desired subset of $\{1, \dots, n\}$.
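The iteration $P_{N + 1} = f^{\infty} (P_{N} + t \cdot E (P_{N}))$ with $E (X) = Diag (A \cdot Diag (X))$ can be sketched as follows for real matrices. The step size, the number of steps, the inner iteration count, and the function names are all assumptions. In particular, $t$ is assumed small enough that the eigenvalues of $P_{N} + t \cdot E (P_{N})$ stay close to $0$ or $1$, since otherwise iterating $f$ can diverge. Indices are 0-based in the code.

```python
import numpy as np

def f_inf(Q, iterations=40):
    """Approximate f^infinity for f(z) = 3 z^2 - 2 z^3 by repeated application."""
    for _ in range(iterations):
        Q2 = Q @ Q
        Q = 3 * Q2 - 2 * Q2 @ Q
    return Q

def random_projection(n, d, rng):
    """Random rank-d orthogonal projection from an orthonormalized Gaussian matrix."""
    basis, _ = np.linalg.qr(rng.standard_normal((n, d)))
    return basis @ basis.T

def find_subset(A, d, t, steps, rng):
    """Run P <- f_inf(P + t * E(P)) and read the subset B off the diagonal of P."""
    n = A.shape[0]
    P = random_projection(n, d, rng)
    for _ in range(steps):
        E_P = np.diag(A @ np.diag(P))  # E(P) = Diag(A . Diag(P))
        P = f_inf(P + t * E_P)
    B = sorted(np.flatnonzero(np.diag(P) > 0.5).tolist())
    return B, P
```

On an easy instance such as `A = np.outer(w, np.ones(n))`, where $A \cdot χ_{B}$ is a multiple of `w` for every $B$ of size $d$, the returned subset should consist of the indices of the $d$ largest entries of `w`. Harder instances may need a different step size or more steps, and convergence is not guaranteed.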
While the operator $P$ is just a linear operator, the pseudodeterminism of the algorithm that produces the operator $P$ generalizes to other pseudodeterministic algorithms that return models that are more like deep neural networks.

Joseph Van Name 4 Jun 2025 21:01 UTC
1 point
0
in reply to: Logan Riggs’s comment on: Spectral radii dimensionality reduction computed without gradient calculations
I would have thought that a fitness function that is maximized using something other than gradient ascent and which can solve NP-complete problems at least in the average case would be worth reading since that means that it can perform well on some tasks but it also behaves mathematically in a way that is needed for interpretability. The quality of the content is inversely proportional to the number of views since people don’t think the same way as I do.
Wheels on the Bus | @CoComelon Nursery Rhymes & Kids Songs
Stuff that is popular is usually garbage.
But here is my post about the word embedding.
Interpreting a matrix-valued word embedding with a mathematically proven characterization of all optima — LessWrong
And I really do not want to collaborate with people who are not willing to read the post. This is especially true of people in academia since universities promote violence and refuse to acknowledge any wrongdoing. Universities are the absolute worst.
Instead of engaging with the actual topic, people tend to just criticize stupid stuff simply because they only want to read about what they already know or what is recommended by their buddies; that is a very good way not to learn anything new or insightful. For this reason, even the simplest concepts are lost on most people.

Joseph Van Name 28 May 2025 19:22 UTC
8 points
0
in reply to: Logan Riggs’s comment on: Spectral radii dimensionality reduction computed without gradient calculations
In this post, the existence of a non-gradient based algorithm for computing LSRDRs is a sign that LSRDRs behave mathematically and are quite interpretable. Gradient ascent is a general purpose optimization algorithm that works in the case when there is no other way to solve the optimization problem, but when there are multiple ways of obtaining a solution to an optimization problem, the optimization problem is behaving in a way that should be appealing to mathematicians.
LSRDRs and similar algorithms are pseudodeterministic in the sense that if we train the model multiple times on the same data, we typically get identical models. Pseudodeterminism is a signal of interpretability for several reasons that I will go into in more detail in a future post:
1. Pseudodeterministic models do not contain any extra random or even pseudorandom information that is not contained in the training data already. This means that when interpreting these models, one does not have to interpret random information.
2. Pseudodeterministic models inherit the symmetry of their training data. For example, if we train a real LSRDR using real symmetric matrices, then the projection $P$ will itself be a symmetric matrix.
3. In mathematics, a well-posed problem is a problem where there exists a unique solution to the problem. Well-posed problems behave better than ill-posed problems in the sense that it is easier to prove results about well-posed problems than it is to prove results about ill-posed problems.
In addition to pseudodeterminism, in my experience, LSRDRs are quite interpretable since I have interpreted LSRDRs already in a few posts:
Interpreting a dimensionality reduction of a collection of matrices as two positive semidefinite block diagonal matrices — LessWrong
When performing a dimensionality reduction on tensors, the trace is often zero. — LessWrong
I have generalized LSRDRs so that they are starting to behave like deeper neural networks. I am trying to expand the capabilities of generalized LSRDRs so they behave more like deep neural networks, but I still have some work to do to expand their capabilities while retaining pseudodeterminism. In the meantime, generalized LSRDRs may still function as narrow AI for specific problems and also as layers in AI.
Of course, if we want to compare capabilities, we should also compare NNs to LSRDRs at tasks such as evaluating the cryptographic security of block ciphers, solving NP-complete problems in the average case, etc.
As for the difficulty of this post, it seems like that is the result of the post being mathematical. But going through this kind of mathematics so that we obtain inherently interpretable AI should be the easier portion of AI interpretability. I would much rather communicate about the actual mathematics than about how difficult the mathematics is.

Spectral radii dimensionality reduction computed without gradient calculations

Joseph Van Name 28 May 2025 5:06 UTC

5 points

4 comments, 6 min read

Joseph Van Name 14 May 2025 5:42 UTC
1 point
0
on: Joseph Van Name’s Shortform
In this post, we shall describe 3 related fitness functions with discrete domains where the process of maximizing these functions is pseudodeterministic in the sense that if we locally maximize the fitness function multiple times, then we typically attain the same local maximum; this appears to be an important aspect of AI safety. These fitness functions are my own. While these functions are far from deep neural networks, I think they are still related to AI safety since they are closely related to other fitness functions that are locally maximized pseudodeterministically that more closely resemble deep neural networks.
Let $K$ denote a finite dimensional algebra over the field of real numbers together with an adjoint operation $*$ (the operation $*$ is a linear involution with $(x y)^{*} = y^{*} x^{*}$). For example, $K$ could be the field of real numbers, complex numbers, quaternions, or a matrix ring over the reals, complex numbers, or quaternions. We can extend the adjoint $*$ to the matrix ring $M_{r} (K)$ by setting $(x_{i, j})_{i, j}^{*} = (x_{j, i}^{*})_{i, j}$.
Let $n$ be a natural number. If $A_{1}, \dots, A_{r} \in M_{n} (K), X_{1}, \dots, X_{r} \in M_{d} (K)$ , then define
$Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) : M_{n, d} (K) \to M_{n, d} (K)$ by setting $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) (X) = A_{1} X X_{1}^{*} + \dots + A_{r} X X_{r}^{*}$ .
Suppose now that $1 \leq d < n$ . Then let $S_{d} \subseteq M_{n, n} (K)$ be the set of all $0, 1$ -diagonal matrices with $d$ many $1$ ’s on the diagonal. We observe that each element in $S_{d}$ is an orthogonal projection. Define fitness functions $F_{d}, G_{d}, H_{d} : S_{d} \to \mathbb{R}$ by setting
$F_{d} (P) = ρ (Γ (A_{1}, \dots, A_{r}; P A_{1} P, \dots, P A_{r} P))$ ,
$G_{d} (P) = ρ (Γ (P A_{1} P, \dots, P A_{r} P; P A_{1} P, \dots, P A_{r} P))$ , and
$H_{d} (P) = \frac{F_{d} (P)^{2}}{G_{d} (P)}$ . Here, $ρ$ denotes the spectral radius.
$F_{d} (P)$ is typically slightly larger than $G_{d} (P)$ , so these three fitness functions are closely related.
If $P, Q \in S_{d}$ , then we say that $Q$ is in the neighborhood of $P$ if $Q$ differs from $P$ by at most 2 entries. If $F$ is a fitness function with domain $S_{d}$ , then we say that $(P, F (P))$ is a local maximum of the function $F$ if $F (P) \geq F (Q)$ whenever $Q$ is in the neighborhood of $P$ .
A path from initialization to a local maximum $(P_{s}, F (P_{s}))$ of $F$ is a sequence $(P_{0}, \dots, P_{s})$ where $P_{0}$ is generated uniformly at random, where $P_{j}$ is always in the neighborhood of $P_{j - 1}$ , and where $F (P_{j}) \geq F (P_{j - 1})$ for all $j \geq 1$ . The length of such a path is $s$ .
Empirical observation: Suppose that $F \in \{F_{d}, G_{d}, H_{d}\}$ . If we compute a path from initialization to local maximum for $F$ , then such a path will typically have length less than $n$ . Furthermore, if we locally maximize $F$ multiple times, we will typically obtain the same local maximum each time. Moreover, if $P_{F}, P_{G}, P_{H}$ are the computed local maxima of $F_{d}, G_{d}, H_{d}$ respectively, then $P_{F}, P_{G}, P_{H}$ will either be identical or differ by relatively few diagonal entries.
I have not done the experiments yet, but one should be able to generalize the above empirical observation to matroids. Suppose that $M$ is a basis matroid with underlying set $\{1, \dots, n\}$ and where $| A | = d$ for each $A \in M$ . Then one should be able to make the same observation about the fitness functions $F_{d} |_{M}, G_{d} |_{M}, H_{d} |_{M}$ as well.
We observe that the problems of maximizing $F_{d}, G_{d}, H_{d}$ are all NP-complete problems since the clique problem can be reduced to special cases of maximizing $F_{d}, G_{d}, H_{d}$ . This means that the problems of maximizing $F_{d}, G_{d}, H_{d}$ can be sophisticated problems, but this also means that we should not expect it to be easy to find the global maxima for $F_{d}, G_{d}, H_{d}$ in some cases.
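The definitions of $F_{d}, G_{d}, H_{d}$ and the neighborhood search described in this post can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the code used for the experiments: it takes $K$ to be the real numbers, represents a projection $P$ by the set of indices where its diagonal is $1$ , and uses best-improvement hill climbing (the post only requires that the fitness never decreases along the path). The names `gamma_matrix`, `fitness`, `neighbors`, and `local_maximize` are hypothetical.

```python
import numpy as np

def gamma_matrix(As, Xs):
    """Matrix of Y -> sum_j A_j Y X_j^* acting on the row-major vec(Y)."""
    return sum(np.kron(A, X.conj()) for A, X in zip(As, Xs))

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def fitness(As, support, kind="F"):
    """F_d, G_d or H_d at the diagonal projection with 1's on `support`."""
    n = As[0].shape[0]
    idx = sorted(support)
    P = np.zeros((n, n))
    P[idx, idx] = 1.0
    PAP = [P @ A @ P for A in As]
    F = spectral_radius(gamma_matrix(As, PAP))
    if kind == "F":
        return F
    G = spectral_radius(gamma_matrix(PAP, PAP))
    return G if kind == "G" else F * F / G

def neighbors(support, n):
    """All projections obtained by moving one 1 on the diagonal (2 entries change)."""
    outside = set(range(n)) - support
    for i in support:
        for j in outside:
            yield (support - {i}) | {j}

def local_maximize(As, d, kind="F", rng=None, tol=1e-12):
    """Hill climb from a uniformly random start; returns (support, value, path)."""
    rng = np.random.default_rng() if rng is None else rng
    n = As[0].shape[0]
    support = frozenset(rng.choice(n, size=d, replace=False).tolist())
    value = fitness(As, support, kind)
    path = [support]
    while True:
        best, best_val = None, value
        for Q in neighbors(support, n):
            v = fitness(As, Q, kind)
            if v > best_val + tol:  # assumption: accept only strict improvements
                best, best_val = Q, v
        if best is None:
            return support, value, path
        support, value = best, best_val
        path.append(support)

# Illustrative data only: three random real 6x6 matrices and d = 3.
rng = np.random.default_rng(0)
As = [rng.standard_normal((6, 6)) for _ in range(3)]
P_F, val_F, path_F = local_maximize(As, 3, "F", rng)
P_H, val_H, path_H = local_maximize(As, 3, "H", rng)
print(sorted(P_F), val_F, len(path_F) - 1)
```

Rerunning `local_maximize` with different seeds and comparing the returned supports is one way to probe the pseudodeterminism claim on small instances. The operator is stored as an $n^{2} \times n^{2}$ matrix, so this sketch only suits small $n$ .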

Joseph Van Name 10 May 2025 5:57 UTC
1 point
0
on: Joseph Van Name’s Shortform
This is a post about some of the machine learning algorithms that I have been doing experiments with. These machine learning models behave quite mathematically, which seems to be very helpful for AI interpretability and AI safety.
Sequences of matrices generally cannot be approximated by sequences of Hermitian matrices.
Suppose that $A_{1}, \dots, A_{r}$ are $n \times n$ -complex matrices and $X_{1}, \dots, X_{r}$ are $d \times d$ -complex matrices. Then define a mapping $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) : M_{n, d} (\mathbb{C}) \to M_{n, d} (\mathbb{C})$ by $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}) (X) = A_{1} X X_{1}^{*} + \dots + A_{r} X X_{r}^{*}$ for all $X$ . Define
$Φ (A_{1}, \dots, A_{r}) = Γ (A_{1}, \dots, A_{r}; A_{1}, \dots, A_{r})$ . Define the $L_{2}$ -spectral radius by setting $ρ_{2} (A_{1}, \dots, A_{r}) = ρ (Φ (A_{1}, \dots, A_{r}))^{1 / 2}$ , where $ρ$ denotes the spectral radius. Define the $L_{2}$ -spectral radius similarity between $(A_{1}, \dots, A_{r})$ and $(X_{1}, \dots, X_{r})$ by
$∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2} = \frac{ρ (Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r}))}{ρ_{2} (A_{1}, \dots, A_{r}) ρ_{2} (X_{1}, \dots, X_{r})}$ .
The $L_{2}$ -spectral radius similarity is always in the interval $[0, 1]$ . If $n = d$ , $A_{1}, \dots, A_{r}$ generate the algebra of $n \times n$ -complex matrices, and $X_{1}, \dots, X_{r}$ also generate the algebra of $n \times n$ -complex matrices, then $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2} = 1$ if and only if there are an invertible matrix $C$ and a nonzero scalar $λ$ with $A_{j} = λ C X_{j} C^{- 1}$ for all $j$ .
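As a hedged illustration of these definitions, the following NumPy sketch computes the $L_{2}$ -spectral radius similarity by writing $Γ (A_{1}, \dots, A_{r}; X_{1}, \dots, X_{r})$ as the matrix $\sum_{j} A_{j} \otimes \overline{X_{j}}$ acting on the row-major vectorization of $X$ . The function names and the random test matrices are assumptions made for the example. The sketch then checks the equality case numerically for $A_{j} = λ C X_{j} C^{- 1}$ .

```python
import numpy as np

def gamma_matrix(As, Xs):
    """Matrix of X -> sum_j A_j X X_j^* on row-major vec(X): sum_j kron(A_j, conj(X_j))."""
    return sum(np.kron(A, X.conj()) for A, X in zip(As, Xs))

def spectral_radius(M):
    return float(np.max(np.abs(np.linalg.eigvals(M))))

def rho2(As):
    """L_2-spectral radius: rho(Phi(A_1, ..., A_r))^(1/2)."""
    return spectral_radius(gamma_matrix(As, As)) ** 0.5

def l2_similarity(As, Xs):
    """L_2-spectral radius similarity between (A_j) and (X_j)."""
    return spectral_radius(gamma_matrix(As, Xs)) / (rho2(As) * rho2(Xs))

def random_complex(rng, n):
    return rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

rng = np.random.default_rng(0)
n, r = 3, 4
Xs = [random_complex(rng, n) for _ in range(r)]
Ys = [random_complex(rng, n) for _ in range(r)]  # unrelated tuple, for comparison

# Equality case: A_j = lam * C X_j C^{-1} with C invertible and lam nonzero.
C = random_complex(rng, n)
lam = 0.7 - 0.2j
As = [lam * C @ X @ np.linalg.inv(C) for X in Xs]

sim_conjugate = l2_similarity(As, Xs)  # equals 1 up to rounding error
sim_unrelated = l2_similarity(Ys, Xs)  # lies in [0, 1]
print(sim_conjugate, sim_unrelated)
```

The Kronecker form is convenient for small matrices; for large $n$ and $d$ one would avoid building the $nd \times nd$ matrix and instead estimate the dominant eigenvalue iteratively.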
Define $ρ_{2, d}^{H} (A_{1}, \dots, A_{r})$ to be the supremum of
$ρ_{2} (A_{1}, \dots, A_{r}) \cdot ∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$
where $X_{1}, \dots, X_{r}$ are $d \times d$ -Hermitian matrices.
One can get lower bounds for $ρ_{2, d}^{H} (A_{1}, \dots, A_{r})$ simply by locally maximizing $∥ (A_{1}, \dots, A_{r}) ≃ (X_{1}, \dots, X_{r}) ∥_{2}$ over $d \times d$ -Hermitian matrices $X_{1}, \dots, X_{r}$ using gradient ascent and then multiplying by $ρ_{2} (A_{1}, \dots, A_{r})$ . If one locally maximizes this quantity twice, one typically gets the same fitness level.
Empirical observation/conjecture: If $(A_{1}, \dots, A_{r})$ are $n \times n$ -complex matrices, then $ρ_{2, n}^{H} (A_{1}, \dots, A_{r}) = ρ_{2, d}^{H} (A_{1}, \dots, A_{r})$ whenever $d \geq n$ .
The above observation means that sequences of $n \times n$ -matrices $(A_{1}, \dots, A_{r})$ are fundamentally non-Hermitian. In this case, we cannot get better models of $(A_{1}, \dots, A_{r})$ using Hermitian matrices larger than the matrices $(A_{1}, \dots, A_{r})$ themselves. I would prefer the behavior to be more complex instead of being the same whenever $d \geq n$ , but the purpose of modeling $(A_{1}, \dots, A_{r})$ with Hermitian matrices is generally to use smaller matrices and not larger matrices.
This means that the function $ρ_{2, d}^{H}$ behaves mathematically.
Now, the model $(X_{1}, \dots, X_{r})$ is a linear model of $(A_{1}, \dots, A_{r})$ since the mapping $A_{j} \mapsto X_{j}$ is the restriction of a linear mapping, so such a linear model should be good for a limited number of tasks, but the mathematical behavior of the model $(X_{1}, \dots, X_{r})$ generalizes to multi-layered machine learning models.

Joseph Van Name 3 May 2025 9:38 UTC
9 points
0
on: Joseph Van Name’s Shortform
In this post, I will share some observations that I have made about the octonions. These observations demonstrate that the machine learning algorithms that I have been looking at recently behave mathematically, and such machine learning algorithms seem to be highly interpretable. The good behavior of these machine learning algorithms is in part due to the mathematical nature of the octonions and also the compatibility between the octonions and the machine learning algorithm. To be specific, one should think of the octonions as encoding a mixed unitary quantum channel that looks very close to the completely depolarizing channel, and my machine learning algorithms work well with those sorts of quantum channels and similar objects.
Suppose that $K$ is either the field of real numbers, complex numbers, or quaternions.
If $A_{1}, \dots, A_{r} \in M_{m} (K), B_{1}, \dots, B_{r} \in M_{n} (K)$ are matrices, then define a superoperator $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) : M_{m, n} (K) \to M_{m, n} (K)$ by setting $Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}) (X) = A_{1} X B_{1}^{*} + \dots + A_{r} X B_{r}^{*}$ , and define $Φ (A_{1}, \dots, A_{r}) = Γ (A_{1}, \dots, A_{r}; A_{1}, \dots, A_{r})$ . Define the $L_{2}$ -spectral radius similarity $∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2}$ by setting
$∥ (A_{1}, \dots, A_{r}) ≃ (B_{1}, \dots, B_{r}) ∥_{2}$
$= \frac{ρ (Γ (A_{1}, \dots, A_{r}; B_{1}, \dots, B_{r}))}{ρ (Φ (A_{1}, \dots, A_{r}))^{1 / 2} ρ (Φ (B_{1}, \dots, B_{r}))^{1 / 2}}$ where $ρ$ denotes the spectral radius.
Recall that the octonions are the unique (up-to-isomorphism) 8 dimensional real inner product space $V$ together with a bilinear binary operation $*$ such that $∥ x * y ∥ = ∥ x ∥ \cdot ∥ y ∥$ and $1 * x = x * 1 = x$ for all $x, y \in V$ .
Suppose that $e_{1}, \dots, e_{8}$ is an orthonormal basis for $V$ . Define operators $(A_{1}, \dots, A_{8})$ by setting $A_{i} v = e_{i} * v$ for all $v \in V$ . Now, define operators $(B_{1}, \dots, B_{64})$ up to reordering by setting $\{B_{1}, \dots, B_{64}\} = \{A_{i} \otimes A_{j} : i, j \in \{1, \dots, 8\}\}$ .
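The operators in this construction can be built explicitly. The sketch below is an illustration under stated assumptions: it realizes the octonions through the Cayley-Dickson doubling $(a, b) (c, d) = (a c - d^{*} b, d a + b c^{*})$ , uses the standard basis of $\mathbb{R}^{8}$ with the first basis vector as the unit, and forms the left-multiplication matrices and their $64$ Kronecker products. The function names `conj` and `mul` are hypothetical, and other orthonormal bases or sign conventions would serve equally well.

```python
import numpy as np

def conj(x):
    """Cayley-Dickson conjugate: (a, b)^* = (a^*, -b)."""
    if len(x) == 1:
        return x.copy()
    h = len(x) // 2
    return np.concatenate([conj(x[:h]), -x[h:]])

def mul(x, y):
    """Cayley-Dickson product: (a, b)(c, d) = (ac - d^* b, da + b c^*)."""
    if len(x) == 1:
        return x * y
    h = len(x) // 2
    a, b, c, d = x[:h], x[h:], y[:h], y[h:]
    return np.concatenate([mul(a, c) - mul(conj(d), b),
                           mul(d, a) + mul(b, conj(c))])

# Orthonormal basis e_1, ..., e_8 (rows of E); E[0] is the multiplicative unit.
E = np.eye(8)

# A[i] is the matrix of v -> e_i * v; its k-th column is e_i * e_k.
A = [np.column_stack([mul(E[i], E[k]) for k in range(8)]) for i in range(8)]

# The 64 operators A_i (x) A_j, each a 64 x 64 matrix.
B = [np.kron(Ai, Aj) for Ai in A for Aj in A]

# Sanity check of the norm identity |x * y| = |x| |y| on random vectors.
rng = np.random.default_rng(0)
x, y = rng.standard_normal(8), rng.standard_normal(8)
norm_gap = abs(np.linalg.norm(mul(x, y)) - np.linalg.norm(x) * np.linalg.norm(y))
print(len(B), B[0].shape, norm_gap)
```

Because $∥ e_{i} * v ∥ = ∥ v ∥$ , each `A[i]` is an orthogonal matrix, which is what makes the associated channel mixed unitary up to normalization.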
Let $d$ be a positive integer. Then the goal is to find complex symmetric $d \times d$ -matrices $(X_{1}, \dots, X_{64})$ where $∥ (B_{1}, \dots, B_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}$ is locally maximized. We achieve this goal through gradient ascent optimization. Since we are using gradient ascent, I consider this to be a machine learning algorithm, but the function mapping $B_{j}$ to $X_{j}$ is a linear transformation, so we are training linear models here (we can generalize this fitness function to one where we train non-linear models, but that takes a lot of work if we want the generalized fitness functions to still behave mathematically).
Experimental Observation: If $1 \leq d \leq 8$ , then we can easily find complex symmetric matrices $(X_{1}, \dots, X_{64})$ where $∥ (B_{1}, \dots, B_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}$ is locally maximized and where $∥ (B_{1}, \dots, B_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}^{2} = (2 d + 6) / 64 = (d + 3) / 32$ .
If $7 \leq d \leq 16$ , then we can easily find complex symmetric matrices $(X_{1}, \dots, X_{64})$ where $∥ (B_{1}, \dots, B_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}$ is locally maximized and where $∥ (B_{1}, \dots, B_{64}) ≃ (X_{1}, \dots, X_{64}) ∥_{2}^{2} = (2 d + 4) / 64 = (d + 2) / 32$ .

Joseph Van Name 28 Apr 2025 7:28 UTC
1 point
0
on: Joseph Van Name’s Shortform
Here are some observations about the kind of fitness functions that I have been running experiments on for AI interpretability. The phenomena that I state in this post are determined experimentally without a rigorous mathematical proof and they only occur some of the time.
Suppose that $F : X \to [- \infty, \infty)$ is a continuous fitness function. In an ideal universe, we would like for the function $F$ to have just one local maximum. If $F$ has just one local maximum, we say that $F$ is maximized pseudodeterministically (or simply pseudodeterministic). At the very least, we would like for there to be just one real number of the form $F (x)$ for a local maximum $(x, F (x))$ . In this case, all local maxima will typically be related by some sort of symmetry. Pseudodeterministic fitness functions seem to be quite interpretable to me. If there are many local maximum values and the local maximum value that we attain after training depends on things such as the initialization, then the local maximum will contain random/pseudorandom information independent of the training data, and the local maximum will be difficult to interpret. A fitness function with a single local maximum value behaves more mathematically than a fitness function with many local maximum values, and such mathematical behavior should help with interpretability; the only reason I have been able to interpret pseudodeterministic fitness functions before is that they behave mathematically and have a unique local maximum value.
Set $O = F^{- 1} [(- \infty, \infty)] = X ∖ F^{- 1} [\{- \infty\}]$ . If the set $O$ is disconnected (in a topological sense) and if $F$ behaves differently on each of the components of $O$ , then we have literally shattered the possibility of having a unique local maximum, but in this post, we shall explore a case where each component of $O$ still has a unique local maximum value.
Let $m_{0}, \dots, m_{n}$ be positive integers with $m_{0} = m_{n} = 1$ . Let $r_{0}, \dots, r_{n - 1}$ be other natural numbers. The set $X$ is the collection of all tuples $A = (A_{i, j})_{i, j}$ where each $A_{i, j}$ is a real $m_{i + 1} \times m_{i}$ -matrix, where the indices range over $i \in \{0, \dots, n - 1\}, j \in \{1, \dots, r_{i}\}$ , and where, for each $i$ , the tuple $(A_{i, j})_{j}$ is not identically zero.
The training data is a set $Σ$ that consists of input/label pairs $(u, v)$ where $v \in \{- 1, 1\}$ and where $u = (u_{0}, \dots, u_{n - 1})$ such that each $u_{i}$ is a subset of $\{1, \dots, r_{i}\}$ for all $i$ (i.e. $Σ$ is the training data for a binary classification task where $u$ is the encoded network input and $v$ is the label).
Define $W (u, A) = (\sum_{j \in u_{n - 1}} A_{n - 1, j}) \dots (\sum_{j \in u_{0}} A_{0, j})$ . Now, we define our fitness level by setting
$F (A) = \sum_{(u, v) \in Σ} \log (| W (u, A) |) / | Σ | - \sum_{i} \log (∥ \sum_{j} A_{i, j} A_{i, j}^{*} ∥_{p}) / 2$
$= E (\log (| W (u, A) |)) - \sum_{i} \log (∥ \sum_{j} A_{i, j} A_{i, j}^{*} ∥_{p}) / 2$ where the expected value is with respect to selecting an element $(u, v) \in Σ$ uniformly at random. Here, $∥ \cdot ∥_{p}$ is a Schatten $p$ -norm which is just the $ℓ_{p}$ -norm of the singular values of the matrix. Observe that the fitness function $F$ only depends on the list $(u : (u, v) \in Σ)$ , so $F$ does not depend on the training data labels.
Observe that $O = X ∖ ⋃_{(u, v) \in Σ} \{A \in X : W (u, A) = 0\}$ which is a disconnected open set. Define a function $f : O \to \{- 1, 1\}^{Σ}$ by setting $f (A) = (W (u, A) / | W (u, A) |)_{(u, v) \in Σ}$ . Observe that if $x, y$ belong to the same component of $O$ , then $f (x) = f (y)$ .
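The quantities $W (u, A)$ , $F (A)$ , and the sign pattern $f (A)$ can be evaluated directly from their definitions. The following NumPy sketch is illustrative only: the layer sizes, the random parameters, the three sample inputs, the choice $p = 2$ , and the 0-based indexing of $j$ are all assumptions made for the example, and every subset $u_{i}$ is assumed nonempty.

```python
import numpy as np

def W(u, A):
    """(sum_{j in u_{n-1}} A[n-1][j]) ... (sum_{j in u_0} A[0][j]); a scalar since m_0 = m_n = 1."""
    M = np.eye(1)
    for layer, subset in zip(A, u):  # each subset must be nonempty
        M = sum(layer[j] for j in subset) @ M
    return float(M[0, 0])

def schatten_norm(M, p):
    """l_p norm of the singular values of M."""
    return float(np.linalg.norm(np.linalg.svd(M, compute_uv=False), ord=p))

def fitness(A, inputs, p=2.0):
    """F(A) = mean of log|W(u, A)| minus half the sum of log Schatten norms."""
    data_term = np.mean([np.log(abs(W(u, A))) for u in inputs])
    reg = sum(np.log(schatten_norm(sum(X @ X.T for X in layer), p)) for layer in A) / 2
    return float(data_term - reg)

def sign_pattern(A, inputs):
    """f(A): the sign of W(u, A) for each training input u."""
    return tuple(int(np.sign(W(u, A))) for u in inputs)

# Hypothetical sizes: m = (1, 3, 3, 1) and r_i = 4 matrices per layer.
rng = np.random.default_rng(0)
m, r = [1, 3, 3, 1], [4, 4, 4]
A = [[rng.standard_normal((m[i + 1], m[i])) for _ in range(r[i])] for i in range(3)]
inputs = [({0, 1}, {2}, {0, 3}), ({1}, {0, 1, 2}, {2}), ({3}, {1, 3}, {0, 1})]

# Rescaling each layer by a nonzero scalar leaves F unchanged.
scales = (2.0, -0.5, 3.0)
A_scaled = [[c * X for X in layer] for layer, c in zip(A, scales)]
print(fitness(A, inputs), fitness(A_scaled, inputs), sign_pattern(A, inputs))
```

The rescaling check reflects a symmetry of $F$ : multiplying every matrix in layer $i$ by $c_{i} \neq 0$ adds $\log | c_{i} |$ to both the data term and the regularizer, so the two contributions cancel. Because the product of the chosen scalars is negative, the sign pattern of `A_scaled` is the negative of that of `A`.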
While the fitness function $F$ has many local maximum values, the function $F$ seems to typically have at most one local maximum value per component. More specifically, for each $(α_{i})_{i \in Σ}$ , the set $f^{- 1} [\{(α_{i})_{i \in Σ}\}]$ seems to typically be a connected open set where $F$ has just one local maximum value (maybe the other local maxima are hard to find, but if they are hard to find, they are irrelevant).
Let $Ω = f^{- 1} [\{(v)_{(u, v) \in Σ}\}]$ . Then $Ω$ is a (possibly empty) open subset of $O$ , and there tends to be a unique (up-to-symmetry) $A_{0} \in Ω$ where $F (A_{0})$ is locally maximized. This unique $A_{0}$ is the machine learning model that we obtain when training on the data set $Σ$ . To obtain $A_{0}$ , we first perform an optimization that works well enough to get inside the open set $Ω$ . For example, to get inside $Ω$ , we could try to maximize the fitness function $\sum_{(u, v) \in Σ} \arctan (v \cdot W (u, A))$ . We then maximize $F$ inside the open set $Ω$ to obtain our local maximum.
After training, we obtain a function $f$ defined by $f (u) = W (u, A_{0})$ . Observe that the function $f$ is a multi-linear function. The function $f$ is highly regularized, so if we want better performance, we should tone down the amount of regularization, but this can be done without compromising pseudodeterminism. The function $f$ has been trained so that $f (u) / | f (u) | = v$ for each $(u, v) \in Σ$ but also so that $| f (u) |$ is large compared to what we might expect whenever $(u, v) \in Σ$ . In other words, $f$ is helpful in determining whether $(u, v)$ belongs to $Σ$ or not since one can examine the magnitude and sign of $f (u)$ .
In order to maximize AI safety, I want to produce inherently interpretable AI algorithms that perform well on difficult tasks. Right now, the function $f$ (and other functions that I have designed) can do some machine learning tasks, but they are not ready to replace neural networks. I have a few ideas about how to improve my AI algorithms' performance without compromising pseudodeterminism. I do not believe that pseudodeterministic machine learning will increase AI risks too much because when designing these pseudodeterministic algorithms, we are trading some (but hopefully not too much) performance for increased interpretability, and this tradeoff is good for safety since it increases interpretability without increasing performance.

Joseph Van Name

Approximating arbitrary complex-valued continuous functions

Using complex polynomials to approximate arbitrary continuous functions

Spectral radii dimensionality reduction computed without gradient calculations