Very cool, thanks! I agree that Dalcy’s epsilon-game picture makes arguments about ELO vs. optimality more principled
Dmitry Vaintrob
I really like this question and this analysis! I think an extension I’d do here is to restrict the “3 reasonable moves” picture by looking at proposed moves of different agents in various games. My guess is that in fact the “effective information content” in a move at high-level play is less than 1 bit per move on average. If you had a big GPU to throw at this problem, you could try to explicitly train an engine via an RL policy with a strong entropy objective and see what maximal entropy is compatible with play at different ratings (a rough sketch of what I mean is below).
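Here is that sketch (my own, with a placeholder reward and network in place of an actual engine or environment); sweeping the entropy coefficient and reading off the entropy achieved at a given rating would give the kind of bits-per-move estimate I have in mind:

```python
# Minimal sketch (not a real engine): a REINFORCE-style update with an entropy bonus.
# `reward_fn`, the tiny MLP, and all constants are placeholders, not a working chess setup.
import torch
import torch.nn as nn

n_moves, obs_dim, beta = 64, 128, 0.1   # beta = entropy-bonus strength (hypothetical value)
policy = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, n_moves))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reward_fn(obs, move):               # stand-in for "did the engine play well here"
    return torch.rand(obs.shape[0])

obs = torch.randn(32, obs_dim)          # a batch of positions (random placeholders)
dist = torch.distributions.Categorical(logits=policy(obs))
moves = dist.sample()
reward = reward_fn(obs, moves)

# Maximize reward + beta * entropy; the achieved entropy at a fixed rating is the
# quantity of interest ("effective bits per move" at that level of play).
loss = -(dist.log_prob(moves) * reward).mean() - beta * dist.entropy().mean()
opt.zero_grad(); loss.backward(); opt.step()
print(f"avg policy entropy (nats): {dist.entropy().mean().item():.3f}")
```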
SLT is a thermodynamic theory of Bayesian learning, but not the thermodynamic theory of Bayesian learning
SLT provides a rigorous mathematical framework for Bayesian learning in a certain regime, but I argue its practical applicability to real neural networks (even in a Bayesian learning/ high-level modeling context) is limited by finite-size effects and high-dimensionality. The valuable empirical work in this space is better understood as ‘thermodynamic interpretations of ML’ rather than validations of SLT proper
I’ve been having lots of conversations with people about SLT. I like SLT as a model for Bayesian learning a lot. At the same time I think that the assumptions of SLT are a model of the reality of learning (including Bayesian learning), in the same way that variants of the harmonic oscillator are a model of a physical system. There are some results that show that every physical system under some assumptions is a harmonic oscillator in a limit, but this limit doesn’t always hold.
I think the place where I am bothered by SLT rhetoric is where interesting experiments get done and interesting thermodynamic parameters get found, but instead of viewing this as results in “thermodynamic interpretation of ML”, there is an incorrect assumption that the observed phenomena are explained (fully or up to a controllable error) by an expansion around a singularity of a singular learning system.
I’m planning on writing more about this, but I’ll try to write out the key arguments to let people look at them and to see if I’m getting something wrong.
Essentially, I think it would be good to coordinate on a language for talking about Bayesian learning that doesn’t overindex on the singular learning limit, and instead uses physics terms (free energy, susceptibility, heat capacity) for the actual observed invariants. I think that in many ways SLT is moving in this direction, and I’m excited about what they have done.
First the things I agree with:
We are doing Bayesian learning (or a mild variant of “Boltzmann learning”, which allows rescaling the “size” of Bayesian updates by a scalar factor)
If we fix a neural network as a statistical system (its data distribution, architecture, loss) and take the number of data samples, n, to infinity, there is a limit where the singular learning prediction is true.
So there is always a regime where n is sufficiently large that SLT gives an exact answer to Bayesian learning. Why am I objecting to a (mathematically true) fact?
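(For reference, the SLT asymptotic I have in mind here, written schematically in my own notation, is Watanabe’s free-energy expansion:)

```latex
% Bayesian free energy after n samples; L_0 is the minimal achievable population loss,
% \lambda the learning coefficient (RLCT), and m its multiplicity.
F(n) \;=\; n L_0 \;+\; \lambda \log n \;-\; (m - 1)\log\log n \;+\; O_p(1)
\qquad \text{as } n \to \infty.
```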
Essentially, the two key issues are that
Neural networks are high-dimensional systems, and this makes errors in the SLT approximation potentially very large (even exponentially large in something like the number of parameters) at finite data.
In order to get the exact Watanabe approximation to hold, we need to assume that for infinite data the singularity is exact. Thus if we have some predicted singular loss L(w) depending on a weight choice w, then we can (at least wrt the guarantees given by SLT theory) use the singularities of L to get a prediction for the asymptotic free energy F(n) only if L is exactly the infinite-data loss. If there is any small approximation issue that relates to the architecture / data rather than to the number of samples n (e.g. the difference between the discrete Fourier transform and the continuous FT on the circle for modular addition, approximations of a polynomial by sigmoids in modular addition and other contexts, even bit complexity issues, etc.), then we can’t rely on the SLT prediction—at least theoretically. Here one might a priori expect that the singular information from an approximate model L(w) of the loss would be predictive of the singularity of the true loss L*(w), but in fact this is false: the property of being nontrivially singular at all is a measure-zero property of a (loss) function, so if there are any modeling assumptions or possible sources of error or noise, we expect the “true infinite limit” Watanabe prediction to be the same one as for a “nonsingular” loss (i.e., positive-definite Hessian at the limit). Sometimes there are small corrections (often called “gauge”) from the architecture, but these are small compared to the empirical singularity-like effects one observes from thermodynamic measurements.
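(A toy illustration of the measure-zero point, in my own notation rather than taken from the literature:)

```latex
% The two-parameter loss
L(w_1, w_2) = w_1^2\, w_2^2
% is singular at the origin, with learning coefficient \lambda = 1/2 < d/2 = 1. But an
% arbitrarily small generic perturbation
L_\varepsilon(w_1, w_2) = w_1^2 w_2^2 + \varepsilon\,(w_1^2 + w_2^2), \qquad \varepsilon > 0,
% has a positive-definite Hessian (2\varepsilon \cdot \mathrm{Id}) at its minimum, hence the
% regular value \lambda = 1: the singular asymptotic does not survive the modeling error.
```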
I want to especially harp on the fact that it takes, in the worst case, an exponential number of samples, and therefore exponential precision in the loss difference (more precisely, exponential in some power of the number of parameters), to mathematically guarantee that the SLT prediction is correct. Thus an argument shaped like “SLT is mathematically correct” is, for realistic models, true but boring.
However my pair of counterarguments isn’t strong by itself. There are lots of cases in the context of mathematical modeling of complex systems where theory says that we might (in worst-case situations) need to wait exponentially long/ get exponential errors/ etc., but in practice finite time suffices. And I think that there are interesting systems where the modeling assumptions above are correct. It’s just that you shouldn’t assume this by default.
Thus it is meaningful to ask the following question:
in what learning problems, what regimes, and at what scales can Bayesian learning be described by an SLT prediction? In other words, when can we find a function L(w) that is a “sufficiently good” approximation of the true infinite-sample loss and whose singularities “meaningfully describe the thermodynamic behavior” of Bayesian learning?
I think this is an interesting question. One might object to it by saying that this is hard to measure, since Bayesian learning is extremely hard to study in “realistic” cases. However I’d argue that we have enough examples to start studying this. And again, my sense here is that SLT predictions fail already to first order (i.e., for predicting, say, the correct asymptotic of the free energy F(n) up to O(1) rescaling), but they fail in an interesting way.
SLT failures in Bayesian learning.
I’ll write more about this later. But let me give two examples where I think this is the case.
Grokking modular addition (with MSE loss)
In this paper, grokking in (MSE) modular addition is analyzed using a technique called “mean field theory” (this is an extension of more classical work on things like the neural tangent kernel, which drops the highly restrictive assumption that the system is in a “lazy learning” regime: i.e., that the solution is a small perturbation away from a “trivial vacuum”). This is a paper about Bayesian learning, and it gives exact predictions that are confirmed by experiment (one can empirically “approximate” Bayesian learning by something called Langevin SGD). The prediction in particular implies an expansion for the SLT λ term (the “heat capacity”) and for the free energy (roughly, the stat-phys version of what is called “basin volume” in SLT). It turns out that the free energy in the relevant regime is actually explicitly not controlled by any basin around a singularity, but is rather given to first order by a high-dimensionality phenomenon (similar to critical phenomena in thermodynamics). Specifically, the first-order approximation of the free energy is compatible with the system having some fixed number of degrees of freedom per neuron (here: “row of the input weight matrix”). More precisely, this paper predicts (and experiment verifies) that to first order, the NN learns to randomly sample each “weight” row from some fixed distribution. (The resulting NN has approximately correct outputs by the central limit theorem; but if one is interested in a more realistic context with a small number of neurons compared to p, higher terms in the mean-field expansion let you predict regimes with higher and higher accuracy; the leading free-energy term will still be the same.) This term is, importantly, not even a rational number (as it would be if the SLT prediction applied), but some numerical integral associated with the neuron distribution (not dominated by any one value or any small set of moments, except in certain limiting regimes). Note that (similar to SLT) the mean-field prediction holds, in a certain idealization, for any sample number n so long as n is less than some exponential in the number of neurons.
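For readers unfamiliar with the Langevin-SGD approximation mentioned above, here is a bare-bones sketch (mine, with toy numbers and a toy quadratic loss, not the paper’s setup) of the noisy-gradient update whose stationary distribution approximates the tempered posterior:

```python
# Bare-bones SGLD / "Langevin SGD" sketch (toy numbers, not the paper's setup):
# noisy gradient steps whose stationary distribution approximates the tempered
# Bayesian posterior  p(w) ∝ exp(-beta * n * L_n(w)).
import torch

def sgld_step(w, loss_fn, lr=1e-5, beta=1.0, n=1000):
    """One Langevin step on the tempered log-posterior beta * n * L_n(w)."""
    grad, = torch.autograd.grad(loss_fn(w), w)
    noise = torch.randn_like(w) * (2 * lr) ** 0.5
    return (w - lr * beta * n * grad + noise).detach().requires_grad_()

# toy quadratic "loss landscape" just so the sketch runs end to end
w = torch.zeros(10, requires_grad=True)
for _ in range(2000):
    w = sgld_step(w, lambda p: (p ** 2).sum())
print("typical sampled coordinate scale:", w.std().item())  # ≈ (2*beta*n)**-0.5 for this toy loss
```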
LoRA for empirical “approximately singular” matrices from ML
For our second example, we look at a two-layer “deep(ish) linear network” that tries to model a fixed target matrix M as a product W_{out} W_{in} (here we replace the ReLU with a trivial nonlinearity, and take W_{in}, W_{out} to be the trainable parameters and M to be the fixed “target”). There is an exact formula for the Bayesian learning prediction of this network in terms of the singular values of M (“singular” is a good term here :). The Bayesian learning dynamics are entirely controlled by the “entropy” or “volume” function Vol, which is roughly the volume of the “space of pairs of matrices (W_{in}, W_{out}) of bounded size whose product is M” (the “bounded size” is a secret parameter here, that is related to the “number of training points” parameter n in SLT).
If the (min of) input/output width is larger than the “hidden layer” width, the function Vol has singularities, and SLT in this case makes a nontrivial prediction: namely, that if some fraction of the singular values s_i is zero (note we can assume WLOG that these are the last values, since singular values are unordered), then the function Vol has a singularity. SLT now implies that if the “target” matrix M has lower-than-full rank (i.e., some of the s_i are exactly 0), then in the high-data limit we have an exact asymptotic on the free energy function F(n) (as a function of dataset size n).
Here in order for the SLT asymptotic to be true, we want to assume that the smallest k singular values of M are exactly zero and the data size parameter n is much larger than the inverse (square) of the smallest nonzero singular value (this is called the “spectral gap” assumption). If we want some kind of exact or asymptotic formula for the free energy that makes valid predictions for n without the spectral gap assumption, or without the assumption that the small singular values are exactly 0, we can simply plug the (easily computable) “true” singular values into the (known) exact formula—or a known expansion whose validity (/error bound) we can mathematically verify—and see what we get.
Ultimately this depends on what deep linear models one would want to model “in practice” in the context of ML. This might at first seem like a silly problem: ML is explicitly nonlinear and any deep linear model is just a toy.
However, this is not entirely true: in some cases for a realistic neural net, one is interested in doing “LoRA” decomposition. LoRA means different things in different contexts, but one standard context is decomposing a weight matrix M that appears in some layer of the model into a product W_{out} W_{in} of two matrices, with W_{in} and W_{out} having lower rank. In general in such a context, one would then train all the weights of the model together (and again, typically LoRA is used in a slightly modified context where a low-rank product somehow supplements rather than replaces an intermediate layer). Nevertheless, we can always model “some part” of LoRA learning as learning an approximate factorization of a matrix M into a product W_{out} W_{in} (this corresponds to “freezing” all weight layers except those that define M, and modeling the loss as an approximation loss on M -- sketchy, but I’d guess a reasonable “directional” guess for asymptotics in a suitable regime).
Once we have reduced to this problem, we can now write down the exact Bayesian prediction and the SLT prediction for various values of “dataset size” n. Both predictions depend only on the singular values of the matrix M (this can be seen using symmetry). Now empirically, it turns out that one can approximate the “bulk” of these singular values relatively well, up to a constant, by a power law s_i ∝ i^(-α) for some exponent α on the order of 0.3-0.5 (see here for example). Here the “bulk” captures most of the singular values, and it’s in some sense “pretty singular”: for the majority of indices i, the value s_i is pretty small, i.e. “close to a singularity”.
Thus it makes sense to extrapolate this regime and ask whether, for a matrix whose singular values follow such a power law, there is some regime of values of n where an SLT prediction (assuming that all s_i below some cutoff are exactly zero and keeping only the singular contributions from these) actually describes the true value of the free energy to first order. In other words, we’re replacing the “real” model that approximately factorizes M by an idealized “singular” model that approximately factorizes a truncated matrix M′, whose singular values are those of M above the cutoff and exactly zero below it.
Unsurprisingly, the answer is a strong “no”: both the prediction that the “approximately singular” values s_i below the cutoff are “essentially zero” and the prediction that the larger s_i above the cutoff are large enough to impose a “reasonable gap” completely fail, and even the power law in the asymptotic from the SLT prediction fully fails to capture the power law of the true Bayesian learning prediction in this case. Here again, note that the key issue is the high-dimensionality of the system. (Also note here that I’m blackboxing a lot of math: happy to discuss it more in comments etc., and I’m also planning to write out a more careful version of this later.)
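As a quick sanity check on the “no gap, tail not negligible” point (my own toy numbers, not taken from any paper), one can just look at an idealized power-law spectrum directly:

```python
# Toy check: a power-law spectrum s_i ~ i^(-alpha) has no spectral gap at any cutoff k,
# and the "approximately zero" tail still carries a large fraction of the spectrum --
# the two assumptions the SLT idealization needs. Numbers are made up but representative
# of the alpha ≈ 0.3-0.5 range discussed above.
import numpy as np

alpha, d = 0.4, 4096
s = np.arange(1, d + 1) ** (-alpha)

for k in (64, 256, 1024):
    gap = s[k - 1] / s[k]            # ratio across the would-be cutoff: ~1.0004-1.006
    tail = s[k:].sum() / s.sum()     # mass of the discarded tail: roughly 0.5-0.9
    print(f"k={k:5d}  gap ratio={gap:.4f}  tail fraction={tail:.2f}")
```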
Discussion
One can try to salvage an SLT prediction in the LoRA example (which I think is particularly damning) in a few ways. Maybe:
The important contribution comes from the “non-power law” part of the tail
The assumption that LoRA rank reduction is a good model of Bayesian learning more generally is flawed
etc.
However I think that together with the known results about modular addition, these failure modes show that in a meaningful way, realistic models do not have a good “singular model” with strong predictive power, even if we are only trying to make predictions about Bayesian learning. The regime where the SLT prediction succeeds certainly exists (this is a rigorous mathematical statement after all), but it requires too large a sample number n and imposes too strong of a regularity assumption on the “true” infinite-data loss landscape (essentially, an assumption that “small” and “large” phenomena are cleanly separated by a large gap) for it to even approximately give valid predictions in regimes we care about. Again, the “dominant” brunt of my intuition here is that this failure is related to the fact that the predictions of SLT assume that the number of parameters is fixed (i.e. O(1)) as the number of samples goes to infinity—but in reality, the number of parameters is quite high, and the regime where SLT predictions hold exactly is in some sense exponential (or very large) compared to parameter count, and thus never attained (and essentially uninteresting).
Having said this, I want to point out that SLT has a number of really good results that I am deeply excited about. I think they both have good theoretical results for toy models or strongly-asymptotic (but still interesting) regimes where the SLT approximation is exact, which may directionally give good intuition about realistic models (in the same way that the harmonic oscillator gives very deep intuition about quantum and statistical mechanics, despite not all systems being reducible to it—note in particular that SLT can be understood as a “vastly generalized harmonic oscillator”, with the essential property being the existence of a strongly position-localized—though singular—semiclassical approximation; ignore this if these words are meaningless to you).
But also people who label their work as “singular learning theory” have really good empirical results where the measurements being made are strictly thermodynamic, and which continue to work in “explicitly non-SLT” contexts such as mean field theory, which capture the high-dimensionality and have totally different asymptotics than what is predicted in the SLT regime. (Note that mean-field is, of course, itself an approximation/ toy model!)
For example: Timaeus (the organization that “does SLT”) has an elegant result by Garrett Baker et al. that measures the effects of changes in training distribution on susceptibilities; Nina Panickssery and I have a paper on the “lambda-hat estimator” (heat capacity) tracking the description length of the algorithm learned by modular addition in generalizing vs. memorizing models; there is work on saddles; Timaeus has produced work on the relationship of attention heads and a certain thermodynamic quantity. An upcoming paper by Timaeus that I’m very excited about looks at a susceptibility-like (“conjugate variable”) metric on inputs that generalizes influence functions.
All of these results are (I think) very good results directionally linking statistical mechanics invariants with interesting learning behaviors. Also, none of them are “properly SLT results”: they simply assume standard links between information theory, thermodynamics, and Bayesian learning (some of them developed in the learning context by Sumio Watanabe, who first discovered them in the context of his singular models). None of them assume that one is in a regime where the SLT “approximation from singularities” even approximately holds—and I suspect that if one were to dig even a little in most of these cases (e.g. look at larger-scale dependence on temperature, relationships between single-neuron statistics, etc.), then one would see that we are in a regime where in fact, in a strong sense, any “singular” toy model would fail to predict the behavior of interest in any regime or with any number of terms of the “singular” asymptotic expansion. (Essentially this belief is due to the fact that the lambda-hat estimator does not even approximately asymptote in the way that SLT predicts in the regimes of interest, and because of an intuition I have that I’m hoping to write down later, that the SLT approach really fails to track inherently high-dimensional phenomena like grokking.)
I would be happy to be wrong about the failure of the SLT regime, and I retain hope that with sufficient processing/ renormalization, there is some sense in which the thermodynamic behavior of learning is extensively related to singularities in some suitably re-interpreted and lower-dimensional set of “relevant” variables which are not the weights.
Also I am excited about the work that Timaeus is doing and think that the “thermodynamics-aware” approach to learning that they try to follow is significantly underexplored. I just want to point out that there are very deep directions here that decouple from the assumptions of SLT and instead lean into high-dimensionality of the loss landscape, and which produce useful work that can be missed if one’s model of Bayesian learning theory is “everything is singularities”.
This is fascinating! If there’s nothing else going on with your prompting, this looks like an incredibly hacky mid-inference intervention. My guess would be that OpenAI applied some hasty patch against a sycophancy steering vector, and this vector caught both actual sycophantic behaviors and descriptions of sycophantic behaviors in LLMs (I’d guess “sycophancy” as a word isn’t so much the issue as the LLM behavior connotation). Presumably the patch they used activates at a later token in the word “sycophancy” in an AI context. This is incredibly low-tech and unsophisticated—like much worse than the stories of repairing Apollo missions with duct tape. Even a really basic finetuning would not exhibit this behavior (otoh, I suppose stuff like this works for humans, where people will sometimes redirect mid-sentence).
FWIW, I wasn’t able to reconstruct this exact behavior (working in an incognito window with a fresh chatgpt instance), but it did suspiciously avoid talking about sycophancy and when I asked about sycophancy specifically, it got stuck in inference and returned an error
I think it can be a problem if you recommend a book and expect the other person to have a social obligation to read it (and needs to make an effortful excuse or pay social capital if it’s not read). It might be hard to fully get rid of this, but I think the utility comparison that should be made is “social friction from someone not following a book recommendation” vs. “utility to the other person from you recommending a book based on knowledge of the book and the person’s preferences/interests”. I suspect that in most contexts this is both an EV-positive exchange and the person correctly decides not to read/finish the book. Maybe a good social norm would be to not get upset if someone doesn’t read your book rec, and also to not feel pressured to read a book that was recommended if you started it/ read a summary and decided it’s not for you
Very cool and well-presented—thanks for taking the time to write this down. I thought about this question at some point and ended up deciding that the compressed sensing picture isn’t very well shaped for this, but didn’t have a complete argument for this—it’s nice to have confirmation
On the friendship fallacy and Owen Barfield
I just finished reading the book “The Fellowship: The Literary Lives of the Inklings”, by Philip and Carol Zaleski. It’s a book about an intellectually appealing and socially cohesive group of writers in Oxford who met weekly and critiqued each other’s work, which included JRR Tolkien and CS Lewis. The book is very centered on Christianity (the writers also write Christian apologetics), but this works well, as understanding either Lewis or Tolkien or the Inklings in general without the lens of their deeply held thoughtful Christianity is about as silly as trying to analyze the Lion King without reading Hamlet.
But there is a core character in the book who is treated sympathetically and who I really hate: Owen Barfield, the “founding” Inkling. From his youth, he is a follower of Rudolf Steiner and a devoted Anthroposophist (a particularly benign group of Christian Occultists). Barfield was Lewis’s friend, existing always in his shadow (Lewis was very famous in his lifetime as a philosopher and Christian apologist, a kind of Jordan Peterson of his time if you imagine Jordan Peterson had brains and real literary/academic credentials). He worked in a law firm and consistently saw himself as a thwarted philosopher/writer/poet, and he found recognition late in life after he wrote a Lewis biography and after his woo-adjacent ideas became more popular in the 60s.
Throughout his life, Barfield created a personal philosophy of “all the things I like/ think are interesting are kind of the same thing”, and he was very sad when people he liked disapproved of, or failed to identify as “sort of the same thing” the different things that he mixed into his philosophy. While he generally is a bit of an “intellectual klutz”, his fundamental failure is the “Friendship Fallacy”: the idea of treating ideas as friends, as something deserving of loyalty. When he encounters different ideas he likes, he “wants them to get along,” and when ideas fail to convince skeptics or produce results or interface with reality (or indeed, with faith), he simply fails to impose any kind of falsifiability requirement and treats this as a loyalty test he must pass. He totally lacks the kind of internal courage needed to kill one’s darlings (whether philosophical or literary) and to treat his own ideas with skepticism and view towards falsification—perhaps the core trait of a good thinker (Feynman’s “You must not fool yourself—and you are the easiest person to fool”).
Interestingly, I don’t extend this antipathy to the Christianity of the group’s other famous members. Unlike Barfield, Tolkien and Chesterton largely succeed (imo) in separating the domains of the literary, the psychological, and the religious. They don’t pretend to be scientific authorities or predict things “in the world”. Tolkien in particular is very anti-progress and a bit of a luddite, but in my understanding his work as a linguist is very good for his time. In fact, it’s funny that his deeply Christian mentality created one of the most “atheist nerd”-like behaviors of creating thoroughly crafted fictional languages of fantasy cultures. I’ve been surprised to learn from reading a couple of his biographies that his linguistic worldbuilding in fact preceded his fantasy work: he designed Elvish before writing any work in his canon, and wrote the work to flesh out the mythology behind expressions and poems. He famously said about his work “The making of language and mythology are related functions”. In fact, he viewed the work of producing plausible cultures and languages—in my view an admirable (though non-academic) kind of secular scholarship analogous to studying alternative physical systems, etc. -- as an explicitly Christian task of “subcreation”, a sort of worship-by-imitation of God.
It’s a bit hard to exactly formulate a razor between the kind of “lazy scientism” of Barfield and various other forms of “pseudoscientific woo” and the serious and purely mystical/ inspirational deep religiosity of people like Tolkien (and to a lesser extent Lewis—another interesting thing I learned was that he started out as a devoted atheist in a world where this was actually socially fraught, and was converted through a philosophical struggle involving Barfield and Tolkien in particular). But maybe the idea of a “philosophy without struggle”: a tendency towards confirmation and a total lack of earnest self-questioning, goes a part of the way towards explaining this distinction. Another part is the difference between a purely metaphysical personal religion and a more woo idea of a religion that “makes predictions about the world”. I think the thing that really took me aback a bit was the level of academic embrace of Barfield late in his life, not just as a Lewis biographer but as a respected academic philosopher with honorary professorships and the works—a confirmation (if more were ever needed) that lazy pseudointellectualism and confirmation bias are very much not incompatible with academic success. Another theme that I think is interesting is the fact that Lewis and Tolkien were at times genuinely interested in and even somewhat inspired by his ideas (though they had no time for occultism or 60s-esque woo). The extent to which this happened is hard to gauge (he outlived them and wrote a lot about how he influenced them in his biographies/reminiscences, and this was then picked up by scholars). But unquestionably, this did occur to some extent. And whether or not you class Tolkien/Lewis as “valuable thinkers”, the history of science and philosophy does seem to abound with examples of clear and robust thinkers whose good ideas were to some extent inspired by charismatic charlatans and woo.
Below are my personal notes on Barfield that I wrote after reading the book.
I despise Barfield. Not in the visceral sense that the first syllable of his name may (Anthroposophically) evoke. Indeed I identify with the underdog/late-bloomer shape of his biography, with his striving towards a higher calling. I readily adopt the book’s sympathy towards him as a literary character with fortunes tied to an idea deeply espoused, a thwarted writer with some modicum of undiscovered talent. My antipathy isn’t even in the specifics of what he espouses: a mild but virulently wrong view of science and philosophy adjacent to all the stupid of my parents’ generation of “anthroposophy” (Atlantis, Consciousness and Quantum Mechanics, anti-Evolutionism, Vibes). But I despise him as one of a Fundamental Mistake. That of confusing science and personality. Being loyal to a scientific or philosophical discipline isn’t like being loyal to a person: if it’s consistently fucking up and you need to make excuses for its behavior to all your reasonable smart friends, you’re not being a good friend but rather a bad scientist. Barfield is almost an archetype of Bad Science if you project out the crazy/dogmatic/ political/ evil-Nazi component. He really is a nice man. But within his mild-mannered Christian friendliness which I respect, he is inflexible and unscientific. He doesn’t update. He glows when people endorse his preferred view (Anthroposophy and Steiner) and sadly laments when they disagree with him—because he can’t help but feel like “there’s something there”. He wants to seamlessly draw parallels between all the nice things he and other nice people believe. He draws lines of identification back and forth between all the things he likes (Coleridge <> Himself <> Quantum Mechanics <> Anthroposophy <> Steiner <> Religion <> Consciousness <> Complementary dualism/”polarity”). He has “nothing but symbols” in his brain, and the symbols in his brain aren’t strong enough to notice that they fail to signify. A person without significance, with a philosophy without significance, possessed of a brain without the capacity to grasp the concept of what it means to signify. The first of these is a tragedy (people should matter) and his late-found fame mediated through famous friends is a sweet story, maybe one he even deserves as the first-mover of the Inklings, the reason for the Lewis-Tolkien friendship, etc. The second is a neutral: theories that fail to achieve significance “in their lifetime” may be bunk but may have value: Greek Atomism, various prescient ideas about physics and computers (Babbage/ Lovelace), etc. But the third is a profound personal failing, and it’s only through luck and through (mostly well-placed) trust in much smarter and more rigorous friends that he avoided attaching this vapid form of mentation to something truly vile: Nazism (which he very briefly flirted with, charmed by its interest in magic and the occult), various fundamentalisms (including an anti-evolutionary fundamentalism: his friends believed in evolution but he didn’t really buy it “on vibes”; he was never a fundamentalist), Communism, etc.
Thanks for this post. I would argue that part of an explanation here could also be economic: modernity brings specialization and a move from the artisan economy of objects as uncommon, expensive, multipurpose, and with a narrow user base (illuminated manuscripts, decorative furniture) to a more utilitarian and targeted economy. Early artisans need to compete for a small number of rich clients by being the most impressive, artistic, etc., whereas more modern suppliers follow more traditional laws of supply and demand and track more costs (cost-effectiveness, readability and reader’s time vs. beauty and remarkableness). And consumers similarly can decouple their needs: art as separate from furniture and architecture, poetry and drama as separate from information and literature. I think another aspect of this shift, that I’m sad we’ve lost, is the old multipurpose scientific/philosophical treatises with illustrations or poems (my favorite being de Rerum Natura, though you could argue that Nietzsche and Wagner tried to revive this with their attempts at Gesamtkunstwerke).
I’m managing to get verve and probity, but having issues with wiles
I really liked the post—I was confused by the meaning and purpose of the no-coincidence principle when I was at ARC, and this post clarifies it well. I like that this is asking for something that is weaker than a proof (or a probabilistic weakening of a proof), since [as in the example of using the Riemann hypothesis] incompleteness in general leads you to expect true results that give “surprising” families of circuits which are not provable by logic. I can also see Paul’s point of how this statement is sort of like P vs. BPP but not quite.
More specifically, this feels like a sort of 2nd-order boolean/polynomial hierarchy statement whose first-order version is P vs. BPP. Are there analogues of this for other orders?
Looks like a conspiracy of pigeons posing as LW commenters has downvoted your post
Thanks!
I haven’t grokked your loss scales explanation (the “interpretability insights” section) yet though—I probably need to read your other post first.
Not saying anything deep here. The point is just that you might have two cartoon pictures:
every correctly classified input is either the result of a memorizing circuit or of a single coherent generalizing circuit behavior. If you remove a single generalizing circuit, your accuracy will degrade additively.
a correctly classified input is the result of a “combined” circuit consisting of multiple parallel generalizing “subprocesses” giving independent predictions, and if you remove any of these subprocesses, your accuracy will degrade multiplicatively.
A lot of ML work only thinks about picture #1 (which is the natural picture to look at if you only have one generalizing circuit and every other circuit is a memorization). But the thing I’m saying is that picture #2 also occurs, and in some sense is “the info-theoretic default” (though both occur simultaneously—this is also related to the ideas in this post)
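Schematically (my notation, just restating the two cartoons):

```latex
% Picture 1 (additive): disjoint circuits each handle their own slice of inputs,
\mathrm{acc} \;=\; \sum_i a_i, \qquad \text{so removing circuit } j \text{ costs } a_j .
% Picture 2 (multiplicative): k parallel subprocesses must all succeed,
\mathrm{acc} \;\approx\; \prod_{i=1}^{k} p_i, \qquad \text{so removing subprocess } j
\text{ (dropping it to chance level } p_0\text{) rescales accuracy by } p_0 / p_j .
```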
Thanks for the questions!
You first introduce the SLT argument that tells us which loss scale to choose (the “Watanabe scale”, derived from the Watanabe critical temperature).
Sorry, I think the context of the Watanabe scale is a bit confusing. I’m saying that in fact it’s the wrong scale to use as a “natural scale”. The Watanabe scale depends only on the number of training datapoints, and doesn’t notice any other properties of your NN or your phenomenon of interest.
Roughly, the Watanabe scale is the scale on which loss improves if you memorize a single datapoint (so memorizing improves accuracy by 1/n with n = #(training set) and, in a suitable operationalization, improves loss by a corresponding amount of order 1/n up to logarithmic factors, and this is the Watanabe scale).
It’s used in SLT roughly because it’s the minimal temperature scale where “memorization doesn’t count as relevant”, and so relevant measurements become independent of the n-point sample. However in most interp experiments, the realistic loss reconstruction is much rougher (i.e., further from optimal loss) than the 1/n scale where memorization becomes an issue (even if you conceptualize #(training set) as some small synthetic training set that you were running the experiment on).
For your second question: again, what I wrote is confusing and I really want to rewrite it more clearly later. I tried to clarify what I think you’re asking about in this shortform. Roughly, the point here is that to avoid having your results messed up by spurious behaviors, you might want to degrade as much as possible while still observing the effect of your experiment. The idea is that if you found any degradation that wasn’t explicitly designed with your experiment in mind (i.e., is natural), but where you see your experimental results hold, then you have “found a phenomenon”. The hope is that if you look at the roughest such scale, you might kill enough confounders and interactions to make your result be “clean” (or at least cleaner): so for example optimistically you might hope to explain all the loss of the degraded model at the degradation scale you chose (whereas at other scales, there are a bunch of other effects improving the loss on the dataset you’re looking at that you’re not capturing in the explanation).
The question now is, when degrading, what order you want to “kill confounders” in to optimally purify the effect you’re considering. The “natural degradation” idea seems like a good place to look since it kills the “small but annoying” confounders: things like memorization, weird specific connotations of the test sentences you used for your experiment, etc. Another reasonable place to look is training checkpoints, as these correspond to killing “hard to learn” effects. Ideally you’d perform several kinds of degradation to “maximally purify” your effect. Here the “natural scales” (loss on the level of, e.g., Claude 1 or BERT) are much too fine for most modern experiments, and I’m envisioning something much rougher.
The intuition here comes from physics. Like if you want to study properties of a hydrogen atom that you don’t see either in water or in hydrogen gas, a natural thing to do is to heat up hydrogen gas to extreme temperatures where the molecules degrade but the atoms are still present, now in “pure” form. Of course not all phenomena can be purified in this way (some are confounded by effects both at higher and at lower temperature, etc.).
Thanks! Yes the temperature picture is the direction I’m going in. I had heard the term “rate distortion”, but didn’t realize the connection with this picture. Might have to change the language for my next post
This seems overstated
In some sense this is the definition of the complexity of an ML algorithm; more precisely, the direct analog in Bayesian learning of the information-theoretic complexity measurements (“entropy” or “Solomonoff complexity”) is the free energy (I’m writing a distillation on this but it is a standard result). The relevant question then becomes whether the “SGLD” sampling techniques used in SLT for measuring the free energy (or technically its derivative) actually converge to reasonable values in polynomial time. This is checked pretty extensively in this paper for example.
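(For concreteness, the quantity these SGLD runs estimate is, as I understand it, the local learning coefficient at a trained point w*, schematically:)

```latex
% Local learning coefficient estimator at w^*: n samples, inverse temperature
% \beta^* = 1/\log n, expectation over SGLD samples localized near w^*.
\hat{\lambda}(w^*) \;=\; n\,\beta^*\,\Big( \mathbb{E}^{\beta^*}_{w \mid w^*}\big[\, L_n(w) \,\big] \;-\; L_n(w^*) \Big).
```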
A possibly more interesting question is whether notions of complexity in interpretations of programs agree with the inherent complexity as measured by free energy. The place I’m aware of where this is operationalized and checked is our project with Nina on modular addition: here we do have a clear understanding of the platonic complexity, and the local learning coefficient does a very good job of asymptotically capturing it with very good precision (both for memorizing and generalizing algorithms, where the complexity difference is very significant).
Citation? [for Apollo]
Look at this paper (note I haven’t read it yet). I think their LIB work is also promising (at least it separates circuits of small algorithms)
Thanks for the reference, and thanks for providing an informed point of view here. I would love to have more of a debate here, and would quite like being wrong as I like tropical geometry.
First, about your concrete question:
As I understand it, here the notion of “density of polygons’ is used as a kind of proxy for the derivative of a PL function?
Density is a proxy for the second derivative: indeed, the closer a function is to linear, the easier it is to approximate it by a linear function. I think a similar idea occurs in 3D graphics, in mesh optimization, where you can improve performance by reducing the number of cells in flatter domains (I don’t understand this field, but this is done in this paper according to some curvature-related energy functional). The question of “derivative change when crossing walls” seems similar. In general, glancing at the paper you sent, it looks like polyhedral currents are a locally polynomial PL generalization of currents of ordinary functions (and it seems that there is some interesting connection made to intersection theory/analogues of Chow theory, though I don’t have nearly enough background to read this part carefully). Since the purpose of PL functions in ML is to approximate some (approximately smooth, but fractally messy and stochastic) “true classification”, I don’t see why one wouldn’t just use ordinary currents here (currents on a PL manifold can be made sense of after smoothing, or in a distribution-valued sense, etc.).
In general, I think the central crux between us is whether or not this is true:
tropical geometry might be relevant to ML, for the simple reason that the functions coming up in ML with ReLU activation are PL
I’m not sure I agree with this argument. The use of PL functions is by no means central to ML theory, and is an incidental aspect of early algorithms. The most efficient activation functions for most problems tend to not be ReLUs, though the question of activation functions is often somewhat moot due to the universal approximation theorem (and the fact that, in practice, at least for shallow NNs anything implementable by one reasonable activation tends to be easily implementable, with similar macroscopic properties, by any other). So the reason that PL functions come up is that they’re “good enough to approximate any function” (and also “asymptotic linearity” seems genuinely useful to avoid some explosion behaviors). But by the same token, you might expect people who think deeply about polynomial functions to be good at doing analysis because of the Stone-Weierstrass theorem.
More concretely, I think there are two core “type mismatches” between tropical geometry and the kinds of questions that appear in ML:
Algebraic geometry in general (including tropical geometry) isn’t good at dealing with deep compositions of functions, and especially approximate compositions.
(More specific to TG): the polytopes that appear in neural nets are as I explained inherently random (the typical interpretation we have of even combinatorial algorithms like modular addition is that the PL functions produce some random sharding of some polynomial function). This is a very strange thing to consider from the point of view of a tropical geometer: like as an algebraic geometer, it’s hard for me to imagine a case where “this polynomial has degree approximately 5… it might be 4 or 6, but the difference between them is small”. I simply can’t think of any behavior that is at all meaningful from an AG-like perspective where the questions of fan combinatorics and degrees of polynomials are replaced by questions of approximate equality.
I can see myself changing my view if I see some nontrivial concrete prediction or idea that tropical geometry can provide in this context. I think a “relaxed” form of this question (where I genuinely haven’t looked at the literature) is whether tropical geometry has ever been useful (either in proving something or at least in reconceptualizing something in an interesting way) in linear programming. I think if I see a convincing affirmative answer to this relaxed question, I would be a little more sympathetic here. However, the type signature here really does seem off to me.
If I understand correctly, you want a way of thinking about a reference class of programs that has some specific, perhaps interpretability-relevant or compression-related properties in common with the deterministic program you’re studying?
I think in this case I’d actually say the tempered Bayesian posterior by itself isn’t enough, since even if you work locally in a basin, it might not preserve the specific features you want. In this case I’d probably still start with the tempered Bayesian posterior, but then also condition on the specific properties/explicit features/ etc. that you want to preserve. (I might be misunderstanding your comment though)
Statistical localization in disordered systems, and dreaming of more realistic interpretability endpoints
[epistemic status: half fever dream, half something I think is an important point to get across. Note that the physics I discuss is not my field though close to my interests. I have not carefully engaged with it or read the relevant papers—I am likely to be wrong about the statements made and the language used.]
A frequent discussion I get into in the context of AI is “what is an endpoint for interpretability”. I get into this argument from two sides:
arguing with interpretability purists, who say that the only way to get robust safety from interpretability is to mathematically prove that behaviors are safe and/or no deception is going on.
arguing with interpretability skeptics, who say that the only way to get robust safety from interpretability is to prove that behaviors are safe and/or no deception is going on.
My typical response to this is that no, you’re being silly: imagine discussing any other phenomenon in this way: “the only way to show that the sun will rise tomorrow is to completely model the sun on the level of subatomic particles and prove that they will not spontaneously explode”. Or asking a bridge safety expert to model every single particle and provably lower-bound the probability of them losing structural coherence in a way not observed by bulk models.
But there’s a more fundamental intuition here, that I started developing when I started trying to learn statistical physics. There are a few lossy ways of expressing it. One is to talk about renormalization: how the assumption of renormalizability of systems is a “theorem” in statistical mechanics, but is not (and probably never will be) proven mathematically (in some sense, it feels much more like a “truly new flavor of axiom” than even complexity-theoretic things like P vs. NP). But that’s still not it. There is a more general intuition, that’s hard to get across (in particular for someone who, like me, is only a dabbler in the subject) -- that some genuinely incredibly complex and information-laden systems have some “strong locality” properties, which are (insofar as the physical meaning of the word holds meaning) both provable and very robust to changing and expanding the context.
For a while, I thought that this is just a vibe—a way to guide thinking, but not something that can be operationalized in a way that may significantly convince people without a similar intuition.
However, recently I’ve become more hopeful that an “explicitly formalizable” notion of robust interpretability may fall out of this language in a somewhat natural way.
This is closely related to recent discussions and writeups we’ve been doing with Lauren Greenspan on scale and renormalization in (statistical) QFT and connections to ML.
One direction to operationalize this is through the notion of “localization” in statistical physics, and in particular “Anderson localization”. The idea (if I understand it correctly) is that in certain disordered systems (think of a semiconductor, which is an “ordered” crystal with a disordered system of “impurity atoms” sprinkled inside), you can prove a kind of screening property: that from the point of view of the localized dynamics near a particular spin, you can provably ignore spins far away from the point you’re studying (or rather, replace them by an “ordered” field that modifies the local dynamics in a fully controllable way). This idea of local interactions being “screened” from far-away details is ubiquitous. In a very large and very robust class of systems, interactions are purely local, except for mediation by a small number of hierarchical “smooth” couplings that see only high-level summary statistics of the “non-local” spins and treat them as a background—and moreover, these “locality” properties are provable (insofar as we assume the extra “axioms” of thermodynamics), assuming some (once again, hierarchical and robustly adjustable) assumptions of independence. There are a number of related principles here that (if I understand correctly) get used in similar contexts, sometimes interchangeably: one I liked is “local perturbations perturb locally” (“LPPL”) from this paper.
Note that in the above paragraph I did something I generally disapprove of: I am trying to extract and verbalize “vibes” from science that I don’t understand on a concrete level, and I am almost certainly getting a bunch of things wrong. But I don’t know of another way of gesturing in a “look, there’s something here and it’s worth looking into” way without doing this to some extent.
Now AI systems, just like semiconductors, are statistical systems with a lot of disorder. In particular, in a standard operationalization (as e.g. in PDLT), we can conceptualize neural nets as a field theory. There is a “vacuum theory” that depends only on the architecture, and then adding new datapoints corresponds to adding particles. PDLT only studies a certain perturbative picture here, but it seems plausible that these techniques may extend to non-perturbative scales (and hope for this is a big part of the reason that Lauren and I have been thinking and writing about renormalization). In a “dream” version of such an extension, the datapoints would form a kind of disordered system, with ordered components, hierarchical relationships, and some assumption of inherent randomness outside of these relationships. A great aspect of “numerical” QFT, such as gets applied in condensed matter models, is that you don’t need a really great model of the hierarchical relationships: sometimes you can just play around and turn on a handful of extra parameters until you find something that works. (Again, at the moment this is an imprecise interpretation of things I have not deeply engaged with.)
Of course doing this makes some assumptions—but the assumptions are on the level of the data (i.e. particles), not the weights/ model internals (i.e., fields—the place where we are worried about misalignment, etc.). And if you grant these assumptions and write down a “localization theorem” result, then plausibly the kind of statement you will get is something along the lines of the following:
“the way this LLM is completing this sentence is a combination of a sophisticated collection of hierarchical relationships, but I know that the behavior here is equivalent to behaviors on other similar sentences up to small (provably) low-complexity perturbations”.
More generally, the kind of information this kind of picture would give is a kind of “local provably robust interpretability”—where the text completion behavior of a model is provably (under suitable “disordered system” assumptions) reducible to a collection of several local circuits that depend on understandable phenomena at a few different scales. A guiding “complexity intuition” for me here is provided by the nontrivial but tractable grammar-task diagrams in the paper by Marks et al. (see pages 25-27, and note that the shape of these diagrams is more or less straight-up typical of the shape of a non-renormalized interaction diagram you see before you start applying renormalization to simplify a statistical system).
An important caveat here is that in physical models of this type (and in pictures that include renormalization more generally), one does not make—or assume—any “fundamentality” assumptions. In many cases a number of alternative (but equivalent, once the “screening” is factored in) pictures exist, with various levels of granularity, elegance, etc. (This already can be seen in the 2D Ising model—a simple magnet model—where the same behaviors can be understood either in a combinatorial “spin-to-spin interaction” way, which mirrors the “fundamental interpretability” desires of mechinterp, or through this “recursive screening out” model that is more renormalization-flavored; the results are the same (to a very high level of precision), even when looking at very localized effects involving collections of a few spins.) So the question of whether an interpretation is “fundamental” or uses the “right latents” is to a large extent obviated here; the world of thermodynamics is much more anarchical and democratic than the world of mathematical formalism and “elegant proof”, at least in this context.
Having handwavily described a putative model, I want to quickly say that I don’t actually believe in this model. There are a bunch of things I probably got wrong, there are a bunch of other, better tools to use, and so on. But the point is not the model: it’s that this kind of stuff exists. There exist languages that show that arbitrarily complex, arbitrarily expressive behaviors are provably reducible to local interactions, where behaviors can be understood as clusters of hierarchical interactions that treat all but a few parts of the system at every point as “screened out noise”.
I think that if models like this are possible, then a solution to “the interpretability component to safety” is possible in this framework. If you have provably localized behaviors then for example you have a good idea where to look for deception: e.g., deception cannot occur on the level of “very low-level” local interactions, as they are too simple to express the necessary reasoning, and perhaps it can be carefully operationalized and tracked in the higher-level interactions.
As you’ve no doubt noticed, this whole picture is splotchy and vague. It may be completely wrong. But there also may be something in this direction that works. I’m hoping to think more about this, and very interested in hearing people’s criticisms and thoughts.
I like the analogy of a LARP. Characters in a book don’t have reputation or human-like brain states that they honestly try to represent—but a good book can contain interesting, believable characters with consistent motivation, etc. I once participated in a well-organized fantasy LARP in graduate school. I was bad at it but it was a pretty interesting experience. In particular people who are good are able to act in character and express thoughts that “the character would be having” which are not identical to the logic and outlook of the player (I was bad at this, but other players could do it I think). In my case, I noticed that the character imports a bit of your values, which you sometimes break in-game if it feels appropriate. You also use your cognition to further the character’s cognition, while rationalizing their thinking in-game. It obviously feels different from real life: it’s explicitly a setting where you are allowed and encouraged to break your principles (like you are allowed to lie in a game of werewolf, etc.) and you understand that this is low-stakes, and so don’t engage the full mechanism of “trying as hard as possible” (to be a good person, to achieve good worlds, etc.). But also, there’s a sense in which a LARP seems “Turing-complete” for lack of a better word. For example in this LARP, the magical characters (not mine) collaboratively solved a logic puzzle to reverse engineer a partially known magic system and became able to cast powerful spells. I could also imagine modeling arbitrarily complex interactions and relationships in an extended LARP. There would probably always be some processing cost to add the extra modeling steps, but I can’t see how this would impose any hard constraints on some measure of “what is achievable” in such a setting.
I don’t see hard reasons for why e.g. a village of advanced LLMs could not have equal or greater capability than a group of smart humans playing a LARP. I’m not saying I see evidence they do—I just don’t know of convincing systematic obstructions. I agree that modern LLMs seem to not be able to do some things humans could do even in a LARP (some kind of theory of mind, explaining a consistent thinking trace that makes sense to a person upon reflection, etc.) but again a priori this might just be a skill issue.
So I wonder in the factorization “LLM can potentially get as good as humans in a LARP” + “sufficiently many smart humans in a long enough LARP are ‘Turing complete up to constant factors’ ” (in the sense of in principle being able to achieve, without breaking character, any intellectual outcome that non-LARP humans could do), which part would you disagree with?