I do not care to share much more of my reasoning because I have shared enough and also because there is a reason that I have vowed to no longer discuss except possibly with lots of obfuscation. This discussion that we are having is just convincing me more that the entities here are not the entities I want to have around me at all. It does not do much good to say that the community here is acting well or to question my judgment about this community. It will do good for the people here to act better so that I will naturally have a positive judgment about this community.
Joseph Van Name
You are judging my reasoning without knowing all that went into my reasoning. That is not good.
I will work with whatever data I have, and I will make a value judgment based on the information that I have. The fact that Karma relies on very small amounts of information is a testament to a fault of Karma, and that is further evidence of how the people on this site do not want to deal with mathematics. And the information that I have indicates that there are many people here who are likely to fall for more scams like FTX. Not all of the people here are so bad, but I am making a judgment based on the general atmosphere here. If you do not like my judgment, then the best thing would be to try to do better. If this site has made a mediocre impression on me, then I am not at fault for the mediocrity here.
Let’s see whether the notions that I have talked about are sensible mathematical notions for machine learning.
Tensor product-Sometimes data in a neural network has tensor structure. In this case, the weight matrices should be tensor products or tensor sums. Respecting the structure of the data works well with convolutional neural networks, and it should also work well for data with tensor structure to it.
Trace-The trace of a matrix $A$ measures how much the matrix maps vectors onto themselves since
$\operatorname{Tr}(A)=E(\langle Ax,x\rangle)$
where $x$ follows the standard multivariate normal distribution.
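The trace identity can be checked numerically. The following is a minimal sketch, assuming that the identity in question is $\operatorname{Tr}(A)=E(\langle Ax,x\rangle)$ with $x$ drawn from the standard normal distribution (mean zero, identity covariance); the function name `trace_by_sampling` and the test matrix are made up for illustration.

```python
import numpy as np

def trace_by_sampling(A, num_samples=200_000, seed=0):
    """Estimate Tr(A) as the sample mean of <Ax, x> with x ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    x = rng.standard_normal((num_samples, n))  # each row is one sample x
    # <Ax, x> = sum_{i,j} x_i A_ij x_j, evaluated for every sample at once
    quad_forms = np.einsum("si,ij,sj->s", x, A, x)
    return quad_forms.mean()

# Hypothetical test matrix (not from the discussion above)
A = np.array([[2.0, 1.0, 0.0],
              [0.0, -1.0, 3.0],
              [4.0, 0.0, 0.5]])

estimate = trace_by_sampling(A)
exact = np.trace(A)
print(estimate, exact)
```

The estimate only agrees with the exact trace up to sampling noise, so the comparison should use a tolerance.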
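Spectral radius-Suppose that we are iterating a smooth function $f$. Suppose furthermore that $f(x_0)=x_0$ and $y_0$ is near $x_0$ and $y_{n+1}=f(y_n)$. We would like to determine whether $\lim_{n\rightarrow\infty}y_n=x_0$ or not. If the Jacobian of $f$ at $x_0$ has spectral radius less than $1$, then $\lim_{n\rightarrow\infty}y_n=x_0$. If the Jacobian of $f$ at $x_0$ has spectral radius greater than $1$, then the sequence typically does not converge to $x_0$.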
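Here is a small sketch of the fixed point criterion. The map $f(x)=J\tanh(x)$ is a made-up example chosen so that $f(0)=0$ and the Jacobian of $f$ at $0$ is exactly $J$; the two matrices are assumptions for illustration. The contracting matrix has operator norm greater than $1$ but spectral radius $0.5$, so it is the spectral radius and not the norm that decides the outcome.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def iterate(J, x, steps=200):
    """Iterate f(x) = J @ tanh(x); f(0) = 0 and the Jacobian of f at 0 is J."""
    for _ in range(steps):
        x = J @ np.tanh(x)
    return x

J_contract = np.array([[0.5, 1.0],
                       [0.0, 0.5]])   # spectral radius 0.5, norm > 1
J_expand = np.array([[1.5, 0.0],
                     [0.0, 1.5]])     # spectral radius 1.5

start = np.array([1e-3, 1e-3])        # starting point near the fixed point 0

end_contract = iterate(J_contract, start)
end_expand = iterate(J_expand, start)

print(spectral_radius(J_contract), np.linalg.norm(end_contract))
print(spectral_radius(J_expand), np.linalg.norm(end_expand))
```

In the second case the iterates leave the neighborhood of $0$ and settle at a different fixed point of $f$.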
The notions that I have been talking about are sensible and arise in machine learning. And understanding these notions is far easier than trying to interpret very large networks like GPT-4 without using these notions. Many people on this site just act like clowns. Karma is only a good metric when the people on the network value substance over fluff. And the only way to convince me otherwise will be for the people here to value posts that involve basic notions like the trace, eigenvalues, and spectral radius of matrices.
P.S. I can make the trace, determinant, and spectral radius even simpler. These operations are what you get when you take the sum, product, and the maximum absolute value of the eigenvalues. Yes. Those are just the basic eigenvalue operations.
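As a quick illustration of the eigenvalue description, here is a sketch with an arbitrary (made-up) matrix that compares the sum, product, and maximum absolute value of the eigenvalues with NumPy's own trace and determinant.

```python
import numpy as np

# Arbitrary non-symmetric example matrix (some eigenvalues may be complex)
A = np.array([[0.0, -2.0, 1.0],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 3.0]])

eigenvalues = np.linalg.eigvals(A)

trace_from_eigs = eigenvalues.sum()           # sum of the eigenvalues
det_from_eigs = eigenvalues.prod()            # product of the eigenvalues
spectral_radius = np.abs(eigenvalues).max()   # maximum absolute value

print(trace_from_eigs, np.trace(A))
print(det_from_eigs, np.linalg.det(A))
print(spectral_radius)
```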
Talking about whining and my loss of status is a good way to get me to dislike the LW community and consider them to be anti-intellectuals who fall for garbage like FTX. Do you honestly think the people here should try to interpret large sections of LLMs while simultaneously being afraid of quaternions?
It is better to comment on threads where we are interacting in a more positive manner.
I thought apologizing and recognizing inadequacies was a core rationalist skill. And I thought rationalists were supposed to like mathematics. The lack of mathematical appreciation is one of these inadequacies of the LW community. But instead of acknowledging this deficiency, the community here blasts me as talking about something off topic. How ironic!
I usually think of the field of complex numbers algebraically, but one can also think of the real numbers, complex numbers, and quaternions geometrically. The real numbers are good for dealing with 1 dimensional space, and the complex numbers are good for dealing with 2 dimensional space geometrically. While the division ring of quaternions is a 4 dimensional algebra over the field of real numbers, the quaternions are best used for dealing with 3 dimensional space geometrically.
For example, if $U,V$ are open subsets of some Euclidean space, then a function $f:U\rightarrow V$ is said to be a conformal mapping when it preserves angles and the orientation. We can associate the 2-dimensional Euclidean space with the field of complex numbers, and the conformal mappings between open subsets of 2-dimensional spaces are just the complex differentiable mappings. For the Mandelbrot set, we need this conformality because we want the Mandelbrot set to look pretty. If the complex differentiable maps were not conformal, then the functions that we iterate in complex dynamics would stretch subsets of the complex plane in one dimension and squash them in the other dimension and this would result in a fractal that looks quite stretched in one real dimension and squashed in another dimension (the fractals would look like spaghetti; oh wait, I just looked at a 3D fractal and it looks like some vegetable like broccoli). This stretching and squashing is illustrated by 3D fractals that try to mimic the Mandelbrot set but without any conformality. The conformality is why the Julia sets are sensible (mathematicians have proven theorems about these sets) for any complex polynomial of degree 2 or greater.
For the quaternions, it is well-known that the dot product and the cross product operations on 3 dimensional space can be described in terms of the quaternionic multiplication operation between purely imaginary quaternions.
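To make this concrete: if $u,v$ are vectors in 3 dimensional space regarded as purely imaginary quaternions, then the real part of the product $uv$ is $-u\cdot v$ and the imaginary part is $u\times v$. Here is a minimal sketch; the helper `quat_mul` and the two vectors are just for illustration.

```python
import numpy as np

def quat_mul(p, q):
    """Hamilton product of quaternions stored as (real, i, j, k)."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ])

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Embed the 3D vectors as purely imaginary quaternions (real part 0)
product = quat_mul(np.concatenate(([0.0], u)), np.concatenate(([0.0], v)))

real_part = product[0]        # equals -(u . v)
imaginary_part = product[1:]  # equals u x v
print(real_part, imaginary_part)
```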
Um. If you want to convince a mathematician like Terry Tao to be interested in AI alignment, you will need to present yourself as a reasonably competent mathematician or related expert and actually formulate an AI problem in such a way so that someone like Terry Tao would be interested in it. If you yourself are not interested in the problem, then Terry Tao will not be interested in it either.
Terry Tao is interested in random matrix theory (he wrote the book on it), and random matrix theory is somewhat related to my approach to AI interpretability and alignment. If you are going to send these problems to a mathematician, please inform me about this before you do so.
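My approach to alignment: Given matrices $A_1,\dots,A_r\in M_n(K)$ and $B_1,\dots,B_r\in M_m(K)$, define a superoperator $\Gamma(A_1,\dots,A_r;B_1,\dots,B_r):M_{n,m}(K)\rightarrow M_{n,m}(K)$ by setting
$\Gamma(A_1,\dots,A_r;B_1,\dots,B_r)(X)=A_1XB_1^*+\dots+A_rXB_r^*$, and define $\Phi(A_1,\dots,A_r)=\Gamma(A_1,\dots,A_r;A_1,\dots,A_r)$. Define the $L_2$-spectral radius of $A_1,\dots,A_r$ as $\rho_2(A_1,\dots,A_r)=\rho(\Phi(A_1,\dots,A_r))^{1/2}$. Here, $\rho$ is the usual spectral radius.
Define $\rho_{2,d}^K(A_1,\dots,A_r)=\sup\{\frac{\rho(\Gamma(A_1,\dots,A_r;X_1,\dots,X_r))}{\rho_2(X_1,\dots,X_r)}\mid X_1,\dots,X_r\in M_d(K)\}$. Here, $K$ is either the field of reals, field of complex numbers, or division ring of quaternions.
Given matrices $A_1,\dots,A_r\in M_n(K),B_1,\dots,B_r\in M_m(K)$, define
$\|(A_1,\dots,A_r)\simeq(B_1,\dots,B_r)\|=\frac{\rho(\Gamma(A_1,\dots,A_r;B_1,\dots,B_r))}{\rho_2(A_1,\dots,A_r)\rho_2(B_1,\dots,B_r)}$. The value $\|(A_1,\dots,A_r)\simeq(B_1,\dots,B_r)\|$ is always a real number in the interval $[0,1]$ that is a measure of how jointly similar the tuples $(A_1,\dots,A_r),(B_1,\dots,B_r)$ are. The motivation behind $\rho_{2,d}^K$ is that $\frac{\rho_{2,d}^K(A_1,\dots,A_r)}{\rho_2(A_1,\dots,A_r)}$ is always a real number in $[0,1]$ (well except when the denominator is zero) that measures how well $A_1,\dots,A_r$ can be approximated by $d\times d$-matrices. Informally, $\frac{\rho_{2,d}^K(A_1,\dots,A_r)}{\rho_2(A_1,\dots,A_r)}$ measures how random $A_1,\dots,A_r$ are where a lower value of $\frac{\rho_{2,d}^K(A_1,\dots,A_r)}{\rho_2(A_1,\dots,A_r)}$ indicates a lower degree of randomness.
A better theoretical understanding of $\rho_{2,d}^K$ would be great. If $X_1,\dots,X_r\in M_d(K)$ and $\frac{\rho(\Gamma(A_1,\dots,A_r;X_1,\dots,X_r))}{\rho_2(X_1,\dots,X_r)}$ is locally maximized, then we say that $(X_1,\dots,X_r)$ is an LSRDR of $(A_1,\dots,A_r)$. Said differently, $(X_1,\dots,X_r)$ is an LSRDR of $(A_1,\dots,A_r)$ if the similarity $\|(A_1,\dots,A_r)\simeq(X_1,\dots,X_r)\|$ is locally maximized.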
Here, the notion of an LSRDR is a machine learning notion that seems to be much more interpretable and much less subject to noise than many other machine learning notions. But a solid mathematical theory behind LSRDRs would help us understand not just what LSRDRs do, but the mathematical theory would help us understand why they do it.
Problems in random matrix theory concerning LSRDRs:
Suppose that are random matrices (according to some distribution). Then what are some bounds for .
Suppose that are random matrices and are non-random matrices. What can we say about the spectrum of ? My computer experiments indicate that this spectrum satisfies the circular law, and the radius of the disc for this circular law is proportional to , but a proof of this circular law would be nice.
Tensors can be naturally associated with collections of matrices. Suppose now that are the matrices associated with a random tensor. Then what are some bounds for .
P.S. By massively downvoting my posts where I talk about mathematics that is clearly applicable to AI interpretability and alignment, the people on this site are simply demonstrating that they need to do a lot of soul searching before they annoy people like Terry Tao with their lack of mathematical expertise.
P.P.S. Instead of trying to get a high profile mathematician like Terry Tao to be interested in problems, it may be better to search for a specific mathematician who is an expert in a specific area related to AI alignment since it may be easier to contact a lower profile mathematician, and a lower profile mathematician may have more specific things to say and contribute. You are lucky that Terry Tao is interested in random matrix theory, but this does not mean that Terry Tao is interested in anything in the intersection between alignment and random matrix theory. It may be better to search harder for mathematicians who are interested in your specific problems.
P.P.P.S. To get more mathematicians interested in alignment, it may be a good idea to develop AI systems that behave much more mathematically. Neural networks currently do not behave very mathematically since they look like the things that engineers would come up with instead of mathematicians.
P.P.P.P.S. I have developed the notion of an LSRDR for cryptocurrency research because I am using this to evaluate the cryptographic security of cryptographic functions.
We can use the spectral radius similarity to measure more complicated similarities between data sets.
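Suppose that $A_1,\dots,A_r$ are $n\times n$-real matrices and $B_1,\dots,B_r$ are $m\times m$-real matrices. Let $\rho(A)$ denote the spectral radius of $A$ and let $A\otimes B$ denote the tensor product of $A$ with $B$. Define the $L_2$-spectral radius by setting $\rho_2(A_1,\dots,A_r)=\rho(A_1\otimes A_1+\dots+A_r\otimes A_r)^{1/2}$. Define the $L_2$-spectral radius similarity between $A_1,\dots,A_r$ and $B_1,\dots,B_r$ as
$$\|(A_1,\dots,A_r)\simeq(B_1,\dots,B_r)\|_2=\frac{\rho(A_1\otimes B_1+\dots+A_r\otimes B_r)}{\rho_2(A_1,\dots,A_r)\rho_2(B_1,\dots,B_r)}.$$
We observe that if $C$ is invertible and $\lambda$ is a nonzero constant, then
$$\|(A_1,\dots,A_r)\simeq(\lambda CA_1C^{-1},\dots,\lambda CA_rC^{-1})\|_2=1.$$
Therefore, the $L_2$-spectral radius is able to detect and measure symmetry that is normally hidden.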
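The invariance under simultaneous conjugation and scaling can be checked numerically. The sketch below assumes that, for real matrices, the $L_2$-spectral radius is $\rho(A_1\otimes A_1+\dots+A_r\otimes A_r)^{1/2}$ and that the similarity is $\rho(A_1\otimes B_1+\dots+A_r\otimes B_r)$ divided by the product of the two $L_2$-spectral radii; the function names and the random test matrices are made up for illustration.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def l2_spectral_radius(mats):
    """Assumed definition: rho(A_1 (x) A_1 + ... + A_r (x) A_r)^(1/2)."""
    return np.sqrt(spectral_radius(sum(np.kron(A, A) for A in mats)))

def similarity(mats_a, mats_b):
    """Assumed definition: rho(sum_j A_j (x) B_j) / (rho_2(A) * rho_2(B))."""
    numerator = spectral_radius(sum(np.kron(A, B) for A, B in zip(mats_a, mats_b)))
    return numerator / (l2_spectral_radius(mats_a) * l2_spectral_radius(mats_b))

rng = np.random.default_rng(0)
As = [rng.standard_normal((3, 3)) for _ in range(4)]
Bs = [rng.standard_normal((2, 2)) for _ in range(4)]  # a different matrix size

# Conjugate every A_j by the same invertible C and rescale by a constant
C = np.eye(3) + 0.2 * rng.standard_normal((3, 3))
C_inv = np.linalg.inv(C)
conjugated = [2.5 * C @ A @ C_inv for A in As]

sim_conjugated = similarity(As, conjugated)  # should be 1 up to rounding
sim_unrelated = similarity(As, Bs)           # some value in [0, 1]
print(sim_conjugated, sim_unrelated)
```

The tuples `As` and `conjugated` look unrelated entry by entry, but the similarity still reports a value of 1.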
Example: Suppose that are vectors of possibly different dimensions. Suppose that we would like to determine how close we are to obtaining an affine transformation with for all (or a slightly different notion of similarity). We first of all should normalize these vectors to obtain vectors with mean zero and where the covariance matrix is the identity matrix (we may not need to do this depending on our notion of similarity). Then is a measure of how close we are to obtaining such an affine transformation . We may be able to apply this notion to determining the distance between machine learning models. For example, suppose that are both the first few layers in a (typically different) neural network. Suppose that is a set of data points. Then if and , then is a measure of the similarity between and .
I have actually used this example to see if there is any similarity between two different neural networks trained on the same data set. For my experiment, I chose a random collection of ordered pairs and I trained the neural networks to minimize the expected losses . In my experiment, each was a random vector of length 32 whose entries were 0′s and 1′s. In my experiment, the similarity was worse than if were just random vectors.
This simple experiment suggests that trained neural networks retain too much random or pseudorandom data and are way too messy in order for anyone to develop a good understanding or interpretation of these networks. In my personal opinion, neural networks should be avoided in favor of other AI systems, but we need to develop these alternative AI systems so that they eventually outperform neural networks. I have personally used the -spectral radius similarity to develop such non-messy AI systems including LSRDRs, but these non-neural non-messy AI systems currently do not perform as well as neural networks for most tasks. For example, I currently cannot train LSRDR-like structures to do any more NLP than just a word embedding, but I can train LSRDRs to do tasks that I have not seen neural networks perform (such as a tensor dimensionality reduction).
I am curious about your statement that all large neural networks are isomorphic or nearly isomorphic and therefore have identical loss values. This should not be too hard to test.
Let be training data sets. Let be neural networks. First train on and on . Then slowly switch the training sets, so that we eventually train both and on just . After fully training and , one should be able to train an isomorphism between the networks and (here I assume that and are designed properly so that they can produce such an isomorphism) so that the value for each node in can be perfectly computed from each node in . Furthermore, for every possible input, the neural networks should give the exact same output. If this experiment does not work, then one should be able to set up another experiment that does actually work.
I have personally trained many ML systems for my cryptocurrency research where after training two systems on the exact same data but with independent random initializations, the fitness levels are only off by a floating point error of about , and I am able to find an exact isomorphism between these systems (and sometimes they are exactly the same and I do not need to find any isomorphism). But I have designed these ML systems to satisfy these properties along with other properties, and I have not seen this with neural networks. In fact, the property of attaining the exact same fitness level is a bit fragile.
I found a Bachelor’s thesis (people should read these occasionally; I apologize for selecting a thesis from Harvard) where someone tried to find an isomorphism between 1000 small trained machine learning models, and no such isomorphism was found.
Or maybe one can find a more complicated isomorphism between neural networks since a node permutation is quite oversimplistic.
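For reference, here is what the simplest kind of isomorphism, a node permutation, looks like. This sketch uses a made-up two-layer ReLU network: permuting the hidden nodes (together with the matching rows and columns of the weights) gives a network with different weight matrices but exactly the same outputs, and the permutation can be recovered by matching rows of the first weight matrix. Independently trained networks are not expected to be related this simply.

```python
import numpy as np

def forward(W1, b1, W2, x):
    """Two-layer ReLU network: x -> W2 @ relu(W1 @ x + b1)."""
    return W2 @ np.maximum(W1 @ x + b1, 0.0)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((5, 3))
b1 = rng.standard_normal(5)
W2 = rng.standard_normal((2, 5))

# Relabel the 5 hidden nodes with a random permutation
perm = rng.permutation(5)
W1p, b1p, W2p = W1[perm], b1[perm], W2[:, perm]

# The two networks agree on every input
x = rng.standard_normal(3)
out_original = forward(W1, b1, W2, x)
out_permuted = forward(W1p, b1p, W2p, x)

# Recover the permutation by matching each row of W1p to the nearest row of W1
dists = np.linalg.norm(W1p[:, None, :] - W1[None, :, :], axis=2)
recovered = dists.argmin(axis=1)
print(np.array_equal(recovered, perm))
```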
I have made a few minor and mostly cosmetic edits to the post about the dimensionality reduction of tensors that produces so many trace free matrices and also to the post about using LSRDRs to solve a combinatorial graph theory problem.
“What’s the problem?”-Neural networks are horribly uninterpretable, so it would be nice if we could use more interpretable AI models or at least better interpretability tools. Neural networks seem to include a lot of random information, so it would be good to use AI models that do not include so much random information. Do you think that we would have more interpretable models by forsaking all mathematical theory?
“what does this get us?”-This gets us systems trained by gradient ascent that behave much more mathematically. Mathematical AI is bound to be highly interpretable.
The downvotes display a very bad attitude, and they indicate that the LW community is a community that I really do not want much to do with at worst, and at best, the LW community is a community that lacks discipline and such mathematics texts will be needed to instill such discipline. In those posts that you have looked at, I did not include any mathematical proofs (these are empirical observations, so I could not include proof), and the lack of mathematical proofs makes the text much easier to go through. I also made the texts quite short; I only included enough text to pretty much define the fitness function and then state what I have observed.
For toy examples, I just worked with random complex matrices, and I wanted these matrices to be sufficiently small so that I can make and run the code to compute with these matrices quite quickly, but these matrices need to be large enough so that I can properly observe what is going on. I do not want to make an observation about tiny matrices that do not have any relevance to what is going on in the real world.
If we want to be able to develop safer AI systems, we will need to make them much more mathematical, and people are doing a great disservice by hating the mathematics needed for developing these safer AI systems.
I would go further than this. Future architectures will not only be designed for improved performance, but they will be (hopefully) increasingly designed to optimize safety and interpretability as well, so they will likely be much different than the architectures we see today. It seems to me (this is my personal opinion based on my own research for cryptocurrency technologies, so my opinion does not match anyone else’s opinion) that non-neural network machine learning models (but which are probably still trained by moving in the direction of a vector field) or at least safer kinds of neural network architectures are needed. The best thing to do will probably be to work on alignment, interpretability, and safety for all known kinds of AI models and develop safer AI architectures. Since future systems will be designed not just for performance but for alignability, safety, and interpretability as well, we may expect these future systems to be easier to align than systems that are simply designed for performance.
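The $L_2$-spectral radius similarity is not transitive. Suppose that $A_1,\dots,A_r$ are real $n\times n$-matrices and $B_1,\dots,B_r$ are real $m\times m$-matrices. Then define $\rho_2(A_1,\dots,A_r)=\rho(A_1\otimes A_1+\dots+A_r\otimes A_r)^{1/2}$. Then the generalized Cauchy-Schwarz inequality is satisfied:
$$\rho(A_1\otimes B_1+\dots+A_r\otimes B_r)\leq\rho_2(A_1,\dots,A_r)\rho_2(B_1,\dots,B_r).$$
We therefore define the $L_2$-spectral radius similarity between $(A_1,\dots,A_r)$ and $(B_1,\dots,B_r)$ as $\|(A_1,\dots,A_r)\simeq(B_1,\dots,B_r)\|_2=\frac{\rho(A_1\otimes B_1+\dots+A_r\otimes B_r)}{\rho_2(A_1,\dots,A_r)\rho_2(B_1,\dots,B_r)}$. One should think of the $L_2$-spectral radius similarity as a generalization of the cosine similarity $\frac{\langle u,v\rangle}{\|u\|\cdot\|v\|}$ between vectors $u,v$. I have been using the $L_2$-spectral radius similarity to develop AI systems that seem to be very interpretable. The $L_2$-spectral radius similarity is not transitive.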
and
, but can take any value in the interval .
We should therefore think of the -spectral radius similarity as a sort of least upper bound of -valued equivalence relations than a -valued equivalence relation. We need to consider this as a least upper bound because matrices have multiple dimensions.
Notation: $\rho(A)=\lim_{n\rightarrow\infty}\|A^n\|^{1/n}$ is the spectral radius. The spectral radius $\rho(A)$ is the largest magnitude of an eigenvalue of the matrix $A$. Here the norm does not matter because we are taking the limit. $A\oplus B$ is the direct sum of matrices while $A\otimes B$ denotes the Kronecker product of matrices.
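This notation can be illustrated with a short sketch. It compares the limit formula $\|A^n\|^{1/n}$ with the largest eigenvalue magnitude, and it uses the standard facts that the eigenvalues of a direct sum are the eigenvalues of the two blocks together while the eigenvalues of a Kronecker product are the pairwise products. The matrices and the helper `direct_sum` are made up for illustration.

```python
import numpy as np

def spectral_radius(M):
    """Largest absolute value of an eigenvalue of M."""
    return np.max(np.abs(np.linalg.eigvals(M)))

def direct_sum(A, B):
    """Block diagonal matrix with blocks A and B."""
    n, m = A.shape[0], B.shape[0]
    out = np.zeros((n + m, n + m))
    out[:n, :n] = A
    out[n:, n:] = B
    return out

A = np.array([[0.5, 2.0],
              [0.0, 0.3]])    # eigenvalues 0.5 and 0.3
B = np.array([[0.0, 1.0],
              [-0.81, 0.0]])  # eigenvalues 0.9i and -0.9i

# Limit formula with a finite power n = 200 and the operator 2-norm;
# this is only an approximation since n is finite.
n = 200
gelfand_estimate = np.linalg.norm(np.linalg.matrix_power(A, n), 2) ** (1.0 / n)

rho_A = spectral_radius(A)
rho_B = spectral_radius(B)
rho_direct_sum = spectral_radius(direct_sum(A, B))  # max(rho_A, rho_B)
rho_kronecker = spectral_radius(np.kron(A, B))      # rho_A * rho_B
print(gelfand_estimate, rho_A, rho_direct_sum, rho_kronecker)
```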
I appreciate your input. I plan on making more posts like this one with a similar level of technical depth. Since I included a proof with this post, this post contained a bit more mathematics than usual. With that being said, others have stated that I should be aware of the mathematical prerequisites for posts like this, so I will keep the mathematical prerequisites in mind.
Here are some more technical thoughts about this.
We would all agree that the problem of machine learning interpretability is a quite difficult problem; I believe that the solution to the interpretability problem requires us not only to use better interpretability tools, but the machine learning models themselves need to be more inherently interpretable. MPO word embeddings and similar constructions have a little bit (but not too much) of difficulty since one needs to get used to different notions. For example, if we use neural networks using ReLU activation (or something like that), then one has less difficulty upfront, but when it comes time to interpret such a network, the difficulty in interpretability will increase since neural networks with ReLU activation do not seem to have the right interpretability properties, so I hesitate to interpret neural networks. And even if we do decide to interpret neural networks, the interpretability tools that we use may have a more complicated design than the networks themselves.
There are some good reasons why complex numbers and quaternions have relatively little importance in machine learning. And these reasons do not apply to constructions like MPO word embeddings.
Since equal norm tight frames are local minimizers of the frame potential, it would help to have a good understanding of the frame potential. For simplicity, it is a good idea to only look at the real case. The frame potential is a potential for a force between a collection of particles on the sphere where particles are repelled from each other (and from each other’s antipodal point) and where the force tries to make all the particles orthogonal to each other. If the number $n$ of particles is at most the dimension $d$ of the space, then it is possible to make all of the particles orthogonal to each other, and in this case, when we minimize this potential, the equal norm tight frames will simply be orthonormal bases. In the case when $n>d$, we cannot make all of the particles orthogonal to each other, but we can try to get as close as possible. Observe that unlike the Newtonian and logarithmic potential, the frame potential does not have a singularity when two particles overlap. I will leave it to you to take the gradient (at least in the real case) of the frame potential to see exactly what this force does to the particles.
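Here is a sketch of this in the real case. It assumes the usual definition of the frame potential, namely the sum of $|\langle x_i,x_j\rangle|^2$ over all pairs $i,j$ of unit vectors (the formula is not written out above, so this is an assumption). With the vectors as the rows of a matrix $X$, the gradient is $4XX^TX$, and the sketch runs plain gradient descent with renormalization onto the sphere. The step size, number of steps, and names are arbitrary choices.

```python
import numpy as np

def frame_potential(X):
    """Sum of squared inner products between all pairs of rows of X."""
    gram = X @ X.T
    return np.sum(gram ** 2)

def minimize_frame_potential(X, steps=500, lr=0.05):
    """Gradient descent on the frame potential, renormalizing rows each step."""
    for _ in range(steps):
        gradient = 4.0 * (X @ X.T) @ X  # gradient of ||X X^T||_F^2 w.r.t. X
        X = X - lr * gradient
        X = X / np.linalg.norm(X, axis=1, keepdims=True)  # back onto the sphere
    return X

# n <= d: an orthonormal basis reaches the minimum value n
fp_orthonormal = frame_potential(np.eye(3))

# n > d: three unit vectors in the plane; the minimum value is n^2 / d = 4.5
rng = np.random.default_rng(0)
X0 = rng.standard_normal((3, 2))
X0 = X0 / np.linalg.norm(X0, axis=1, keepdims=True)
X_final = minimize_frame_potential(X0)
fp_final = frame_potential(X_final)
print(fp_orthonormal, frame_potential(X0), fp_final)
```

Since the inner products only enter through their squares, replacing a vector by its negative does not change the potential, which is why each particle is also repelled from the antipodal points of the others.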
Training an MPO word embedding with the complex numbers or quaternions is actually easier in the sense that for real MPO word embeddings, one needs to use a proper initialization, but with complex and quaternionic MPO word embeddings, an improper initialization will only result in minor deficiencies in the MPO word embedding. This means that the quaternions and complex numbers are easier to work with for MPO word embeddings than the real numbers. In hindsight, the solution to the problem of real MPO word embeddings is obvious, but at the time, I thought that I must use complex or quaternionic matrices.
I like the idea of making animations, but even in the real case where things are easy to visualize, the equal norm tight frames are non-unique and they may involve many dimensions. The non-uniqueness will make it impossible to interpret the equal norm tight frames; for the same reason, it is hard to interpret what is happening with neural networks since if you retrain a neural network with a different initialization or learning rate, you will end up with a different trained network, but MPO word embeddings have much more uniqueness properties that make them easier to interpret. I have made plenty of machine learning training animations and have posted these animations on YouTube and TikTok, but it seems like in most cases, the animation still needs to be accompanied by technical details; with just an animation, the viewers can see that something is happening with the machine learning model, but they need both the animation and technical details to interpret what exactly is happening. I am afraid that most viewers just stick with the animations without going into so many technical details. I therefore try to make the animations more satisfying than informative most of the time.
If you have any questions about the notation or definitions that I have used, you should ask about it in the mathematical posts that I have made and not here. Talking about it here is unhelpful, condescending, and it just shows that you did not even attempt to read my posts. That will not win you any favors with me or with anyone who cares about decency.
Karma is not only imperfect, but Karma has absolutely no relevance whatsoever because Karma can only be as good as the community here.
P.S. Asking a question about the notation does not even signify any lack of knowledge since a knowledgeable person may ask questions about the notation because the knowledgeable person thinks that the post should not assume that the reader has that background knowledge.
P.P.S. I got downvotes, so I got enough engagement on the mathematics. The problem is the community here thinks that we should solve problems with AI without using any math for some odd reason that I cannot figure out.
I am pointing out something wrong with the community here. The name of this site is LessWrong. On this site, it is better to acknowledge wrongdoing so that the people here do not fall into traps like FTX again. If you read the article, you would know that it is better to acknowledge wrongdoing or a community weakness than to double down.
I already did that. But it seems like the people here simply do not want to get into much mathematics regardless of how closely related to interpretability it is.
P.S. If anyone wants me to apply my techniques to GPT, I would much rather see the embedding spaces as more organized objects. I cannot deal very well with words that are represented as vectors of length 4096. I would rather deal with words that are represented as 64 by 64 matrices (or with some other dimensions). If we want better interpretability, the data needs to be structured in a more organized fashion so that it is easier to apply interpretability tools to the data.
“Lesswrong has a convenient numerical proxy-metric of social status: site karma.”-As long as I get massive downvotes for talking correctly about mathematics and using it to create interpretable AI systems, we should all regard karma as a joke. Karma can only be as good as the community here.
Let’s compute some inner products and gradients.
Set up: Let denote either the field of real or the field of complex numbers. Suppose that are positive integers. Let be a sequence of positive integers with . Suppose that is an -matrix whenever . Then from the matrices , we can define a -tensor . I have been doing computer experiments where I use this tensor to approximate other tensors by minimizing the -distance. I have not seen this tensor approximation algorithm elsewhere, but perhaps someone else has produced this tensor approximation construction before. In previous shortform posts on this site, I have given evidence that the tensor dimensionality reduction behaves well, and in this post, we will focus on ways to compute with the tensors , namely the inner product of such tensors and the gradient of the inner product with respect to the matrices .
Notation: If are matrices, then let denote the superoperator defined by letting . Let .
Inner product: Here is the computation of the inner product of our tensors.
.
In particular, .
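The formulas in this post did not survive, so the following sketch works under a stated assumption: each entry of the tensor is taken to be the trace of a product of matrices, with one matrix chosen from each row of the matrix table according to the index. Under that assumption, the inner product of two such tensors can be computed without building the tensors, since $\operatorname{Tr}(X)\operatorname{Tr}(Y)=\operatorname{Tr}(X\otimes Y)$ and $(X_1\otimes Y_1)(X_2\otimes Y_2)=X_1X_2\otimes Y_1Y_2$. Only the real case is shown, and all names and sizes are hypothetical.

```python
import numpy as np
from itertools import product

def random_table(rng, dims, sizes):
    """table[k][j] is a dims[k] x dims[k+1] matrix; row k has sizes[k] matrices."""
    return [[rng.standard_normal((dims[k], dims[k + 1])) for _ in range(sizes[k])]
            for k in range(len(sizes))]

def table_to_tensor(table):
    """Assumed definition: T[i_1,...,i_n] = Tr(A_{1,i_1} ... A_{n,i_n})."""
    shape = [len(mats) for mats in table]
    T = np.empty(shape)
    for idx in product(*(range(s) for s in shape)):
        P = np.eye(table[0][0].shape[0])
        for k, j in enumerate(idx):
            P = P @ table[k][j]
        T[idx] = np.trace(P)
    return T

def inner_product_without_tensors(table_a, table_b):
    """Tr of the product over k of sum_j A_{k,j} (x) B_{k,j} (real case)."""
    size = table_a[0][0].shape[0] * table_b[0][0].shape[0]
    M = np.eye(size)
    for mats_a, mats_b in zip(table_a, table_b):
        M = M @ sum(np.kron(A, B) for A, B in zip(mats_a, mats_b))
    return np.trace(M)

rng = np.random.default_rng(0)
sizes = (2, 3, 2)  # tensor shape; first and last matrix dimension must agree
table_a = random_table(rng, dims=(2, 3, 2, 2), sizes=sizes)
table_b = random_table(rng, dims=(3, 2, 2, 3), sizes=sizes)

direct = np.sum(table_to_tensor(table_a) * table_to_tensor(table_b))
fast = inner_product_without_tensors(table_a, table_b)
print(direct, fast)
```

The direct computation visits every entry of the tensor, while the second computation only multiplies one matrix per tensor mode.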
Gradient: Observe that . We will see shortly that the cyclicity of the trace is useful for calculating the gradient. And here is my manual calculation of the gradient of the inner product of our tensors.
.
So in my research into machine learning algorithms, I have stumbled upon a dimensionality reduction algorithm for tensors, and my computer experiments have so far yielded interesting results. I am not sure that this dimensionality reduction is new, but I plan on generalizing this dimensionality reduction to more complicated constructions that I am pretty sure are new and am confident would work well.
Suppose that is either the field of real numbers or the field of complex numbers. Suppose that are positive integers and is a sequence of positive integers with . Suppose that is an -matrix whenever . Then define a tensor .
If , and is a system of matrices that minimizes the value , then is a dimensionality reduction of , and we shall let denote the tensor of reduced dimension . We shall call a matrix table to tensor dimensionality reduction of type .
Observation 1: (Sparsity) If is sparse in the sense that most entries in the tensor are zero, then the tensor will tend to have plenty of zero entries, but as expected, will be less sparse than .
Observation 2: (Repeated entries) If is sparse and and the set has small cardinality, then the tensor will contain plenty of repeated non-zero entries.
Observation 3: (Tensor decomposition) Let be a tensor. Then we can often find a matrix table to tensor dimensionality reduction of type so that is its own matrix table to tensor dimensionality reduction.
Observation 4: (Rational reduction) Suppose that is sparse and the entries in are all integers. Then the value is often a positive integer in both the case when has only integer entries and in the case when has non-integer entries.
Observation 5: (Multiple lines) Let be a fixed positive even number. Suppose that is sparse and the entries in are all of the form for some integer and . Then the entries in are often exclusively of the form as well.
Observation 6: (Rational reductions) I have observed a sparse tensor all of whose entries are integers along with matrix table to tensor dimensionality reductions of where .
This is not an exhaustive list of all the observations that I have made about the matrix table to tensor dimensionality reduction.
From these observations, one should conclude that the matrix table to tensor dimensionality reduction is a well-behaved machine learning algorithm. I hope and expect this machine learning algorithm and many similar ones to be used to both interpret the AI models that we have and will have and also to construct more interpretable and safer AI models in the future.
There are some cases where we have a complete description for the local optima for an optimization problem. This is a case of such an optimization problem.
Such optimization problems are useful for AI safety since a loss/fitness function where we have a complete description of all local or global optima is a highly interpretable loss/fitness function, and so one should consider using these loss/fitness functions to construct AI algorithms.
Theorem: Suppose that $U$ is a real, complex, or quaternionic $n\times n$-matrix that minimizes the quantity $\|U\|_2^2+\|U^{-1}\|_2^2$, where $\|\cdot\|_2$ denotes the Frobenius norm. Then $U$ is unitary.
Proof: The real case is a special case of the complex case, and by representing each $n\times n$-quaternionic matrix as a complex $2n\times 2n$-matrix, we may assume that $U$ is a complex matrix.
By the Schur decomposition, we know that $U=VTV^*$ where $V$ is a unitary matrix and $T$ is upper triangular. But we know that $\|U\|_2^2=\|T\|_2^2$. Furthermore, $U^{-1}=VT^{-1}V^*$, so $\|U^{-1}\|_2^2=\|T^{-1}\|_2^2$. Let $D$ denote the diagonal matrix whose diagonal entries are the same as those of $T$. Since $T^{-1}$ is upper triangular with the same diagonal entries as $D^{-1}$, we have $\|T\|_2^2\geq\|D\|_2^2$ and $\|T^{-1}\|_2^2\geq\|D^{-1}\|_2^2$. Furthermore, $\|T\|_2^2=\|D\|_2^2$ iff $T$ is diagonal and $\|T^{-1}\|_2^2=\|D^{-1}\|_2^2$ iff $T$ is diagonal. Therefore, since $\|U\|_2^2+\|U^{-1}\|_2^2=\|T\|_2^2+\|T^{-1}\|_2^2$ and $\|T\|_2^2+\|T^{-1}\|_2^2$ is minimized (if $T$ were not diagonal, then $VDV^*$ would give a strictly smaller value), we can conclude that $T=D$, so $T$ is a diagonal matrix. Suppose that $T$ has diagonal entries $(z_1,\dots,z_n)$. By the arithmetic-geometric mean inequality and the Cauchy-Schwarz inequality, we know that
$$\frac{1}{2}\cdot(\|(z_1,\dots,z_n)\|_2^2+\|(z_1^{-1},\dots,z_n^{-1})\|_2^2)\geq\|(|z_1|,\dots,|z_n|)\|_2\cdot\|(|z_1^{-1}|,\dots,|z_n^{-1}|)\|_2\geq\langle(|z_1|,\dots,|z_n|),(|z_1^{-1}|,\dots,|z_n^{-1}|)\rangle=n.$$
Here, the equalities hold if and only if $|z_j|=1$ for all $j$, but this implies that $U$ is unitary. Q.E.D.
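The theorem can be sanity checked numerically in the real case. This sketch assumes that the norm is the Frobenius norm, so that the quantity being minimized is $\|U\|_2^2+\|U^{-1}\|_2^2$ and its value at any orthogonal matrix is $2n$. It is only a spot check on random matrices and not a substitute for the proof; the names are made up.

```python
import numpy as np

def objective(U):
    """Squared Frobenius norm of U plus squared Frobenius norm of its inverse."""
    return np.linalg.norm(U, "fro") ** 2 + np.linalg.norm(np.linalg.inv(U), "fro") ** 2

rng = np.random.default_rng(0)
n = 4

# A random orthogonal matrix from a QR decomposition
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
value_at_orthogonal = objective(Q)  # should equal 2n

# Random invertible matrices should never do better than 2n
values_random = [objective(rng.standard_normal((n, n))) for _ in range(1000)]
print(value_at_orthogonal, min(values_random))
```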