To add: I think the other use of "pure state" comes from this context. Here, if you have a system of commuting operators and take a joint eigenspace, the (normalized) projector onto it is mixed, but it is pure exactly when the joint eigenvalue determines a 1D subspace; I think this terminology then gets carried over to wave functions as well.
One person's "Occam's razor" may be description length, another's may be elegance, and a third person's may be "avoiding having too much info inside your system" (as some anti-MW people argue). I think discussions like "what's real" need to be done thoughtfully, otherwise people tend to argue past each other and come off overconfident or underinformed.
To be fair, I did use language like this so I shouldn't be talking, but I used it tongue-in-cheek, and the real motivation given in the above is not "the DM is a more fundamental notion" but "DM lets you make concrete the very suggestive analogy between quantum phase and probability", which you would probably agree with. For what it's worth, there are "different layers of theory" (often scale-dependent), like classical vs. quantum vs. relativity, etc., where I think it's silly to talk about "ontological truth". But these theories are local conceptual optima sitting above a graveyard of "outdated" theories that are strictly conceptually inferior to the ones that replaced them: examples are heliocentrism (and Ptolemy's epicycles), the ether, etc.
Interestingly, I would agree with you (with somewhat low confidence) that on this question there is a consensus among physicists that one picture is simply "more correct", in the sense of giving theoretically and conceptually more elegant/precise explanations. Except your sign is wrong: this is the density matrix picture (the wavefunction picture is genuinely understood as "not the right theory", but is still taught and still used in many contexts where it doesn't cause issues).
I also think that there are two separate things that you can discuss.
Should you think of thermodynamics, probability, and things like thermal baths as fundamental to your theory or incidental epistemological crutches to model the world at limited information?
Assuming you are studying a “non-thermodynamic system with complete information”, where all dynamics is invertible over long timescales, should you use wave functions or density matrices?
Note that for #1, you should not think of a density matrix as a probability distribution on quantum states (see the discussion with Optimization Process in the comments); this is a bad intuition pump. Instead, the thing that replaces probability distributions in quantum mechanics is the density matrix itself.
I think a charitable interpretation of your criticism would be a criticism of #1, i.e. of putting limited-info dynamics (quantum thermodynamics) as primary to "invertible dynamics". Here there is a debate to be had.
I think there is not really a debate in #2: even in invertible QM (no probability), you need to use density matrices if you want to study different subsystems (e.g. when modeling systems living in an infinite but non-thermodynamic universe you need this language, since restricting a wavefunction to a subsystem makes it mixed). There's also a transposed discussion, which I don't really understand, of all of this in field theory: when do you have fields vs. operators vs. other more complicated stuff, and there is some interesting relationship to how you conceptualize "boundaries"; but this is not what we're discussing. So you really can't get away from using density matrices even in a nice invertible universe, as soon as you want to relate systems to subsystems.
For question #1 it is reasonable (though I don't know how productive) to discuss what is "primary". I think (but here I am really out of my depth) that people who study very "fundamental" quantum phenomena increasingly use a picture with a thermal bath (e.g. I vaguely remember this happening in some lectures here). At the same time, it's reasonable to say that "invertible" QM phenomena are primary and statistical phenomena are ontological epiphenomena on top of this. While this may be a philosophical debate, I don't think it's a physical one, since the two pictures are theoretically interchangeable (as I mentioned, there is a canonical way to get thermodynamics from unitary QM as a certain "optimal lower bound on information dynamics", appropriately understood).
Still, as soon as you introduce the notion of measurement, you cannot get away from thermodynamics. Measurement is an inherently information-destroying operation, and iiuc can only be put “into theory” (rather than being an arbitrary add-on that professors tell you about) using the thermodynamic picture with nonunitary operators on density matrices.
Thanks, you're right. I have seen "pure state" referring to a basis vector (e.g. in quantum computation), but in QTD your definition is definitely correct. I don't like the term "pointer variable"; is there a different term you like?
Yeah, this also bothered me. The notion of "probability distribution over quantum states" is not a good notion: the identity matrix $I$ is both $|0\rangle \langle 0|+|1\rangle \langle 1|$ and $|a\rangle \langle a|+|b\rangle \langle b|$ for any other orthonormal basis $\{|a\rangle, |b\rangle\}$, and the fact that these two "ensembles" should be treated as equivalent seems totally arbitrary from the distribution-over-states point of view. The point is that density matrix mechanics is the notion of probability for quantum states, and can be formalized as such (dynamics of informational lower bounds given observations). I was sort of getting at this with the long "explaining probability to an alien" footnote, but I don't think it landed (and I also don't have the right background to make it precise)
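A quick numerical check of this point (a minimal sketch of mine, not anything from the original discussion): the two decompositions produce literally the same matrix, so nothing downstream can distinguish the "ensembles" they supposedly represent.

```python
import numpy as np

# Two different orthonormal bases of C^2: the standard basis and a rotated one.
zero, one = np.array([1, 0], dtype=complex), np.array([0, 1], dtype=complex)
theta = 0.7
a = np.cos(theta) * zero + np.sin(theta) * one
b = -np.sin(theta) * zero + np.cos(theta) * one

proj = lambda v: np.outer(v, v.conj())   # |v><v|

# |0><0| + |1><1| and |a><a| + |b><b| are the same matrix (the identity).
print(np.allclose(proj(zero) + proj(one), proj(a) + proj(b)))  # True
print(np.allclose(proj(zero) + proj(one), np.eye(2)))          # True
```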
I’ve found our Agent Smith :) If you are serious, I’m not sure what you mean. Like there is no ontology in physics—every picture you make is just grasping at pieces of whatever theory of everything you eventually develop
I like this! Something I would add at some point before unitarity is that there is another type of universe that we almost inhabit, where your vectors of states have real nonnegative coefficients that sum to 1, and your evolution matrices are Markovian (i.e., have nonnegative coefficients and preserve the sum of coordinates). In a certain sense it's weird to say in such a universe that "the universe is .3 of this particle being in state 1 and .7 of it being in state 2", but if we interpret this as a probability, we have lived experience of this.
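A minimal sketch of the "Markovian universe" I mean here (toy numbers, just for illustration): the state is a probability vector and a time step is multiplication by a column-stochastic matrix, so nonnegativity and the total sum of 1 are preserved automatically.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.random((3, 3))
M /= M.sum(axis=0)              # nonnegative entries, columns sum to 1 (Markovian)

p = np.array([0.3, 0.7, 0.0])   # ".3 of being in state 1 and .7 of being in state 2"
for _ in range(5):
    p = M @ p                   # evolution preserves the probability interpretation
print(p, p.sum())               # still nonnegative, still sums to 1
```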
Something that I like to point out, which clicked for me at some point and serves as a good intuition pump, is that for many systems that have both a statistical and a quantum analogue, there is actually an interpolating collection of linear dynamics problems like you described that exactly interpolates between quantum and statistical. There's a little bit of weirdness here, BTW, since there's a nonlinearity ("squaring the norm") that you need to go from quantum to classical systems. The reason for this actually has to do with density matrices.
There's a whole post to be written on this, but the basic point is that "we've been lied to": when you're introduced to QM and see a wavefunction $|\psi\rangle$, this actually doesn't correspond to any linear projection/disentanglement/etc. of the "multiverse state". What instead is being linearly extracted from the "multiverse state" is the outer product matrix $|\psi\rangle\langle\psi|$, which is the complex-valued matrix that projects to the 1-dimensional space spanned by the wave function. Now the correction of the "lie" is that the multiverse state itself should be thought of as a matrix. When you do this, the new dynamics acts on the space of matrices. And you see that the quantum probabilities are now real-valued linear invariants of this state (to see this: the operation of taking the outer product of a vector with itself is quadratic, so the "squared norm" operators are now just linear projections that happen to take real values). In this picture, finding the probability of a measurement has exactly the same type signature as measuring the "probability of an event" in the statistical picture: namely, it is a linear function of the "multiverse vector" (which is just a probability distribution on states in the "statistical universe picture"). Now the evolution of the projection matrix still comes from a linear evolution on your "corrected" vector space of matrix states (in terms of your evolution matrix $U$, it takes the matrix $M$ to $UMU^\dagger$, and of course each coefficient of the new matrix is linear in the old matrix). So this new dynamics is exactly analogous to probability dynamics, with the exception that your matrices are non-Markovian (indeed, on the level of matrices they are also unitary, or at least orthogonal) and you make an assumption on your initial "vector" that, when viewed as a matrix, it is a rank-1 complex projection matrix, i.e. has the form $|\psi\rangle\langle\psi|$. (In fact, if you drop this assumption of being rank-1 and look instead at the linear subspace of matrices these generate, namely Hermitian matrices, then you also get reasonable quantum mechanics, and many problems in QM in fact force you to make this generalization.)
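Here is a minimal numpy sketch of the "corrected" picture (my own illustration with a random state and a random unitary, not anything from the post): the state is the matrix $M = |\psi\rangle\langle\psi|$, evolution is $M \mapsto UMU^\dagger$, and the measurement probabilities are linear functionals of $M$ that agree with the usual squared amplitudes.

```python
import numpy as np

rng = np.random.default_rng(1)

# A random wavefunction |psi> and the "multiverse state" as a matrix M = |psi><psi|.
psi = rng.normal(size=4) + 1j * rng.normal(size=4)
psi /= np.linalg.norm(psi)
M = np.outer(psi, psi.conj())

# One step of unitary evolution (QR of a random complex matrix gives a unitary U).
Z = rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4))
U, _ = np.linalg.qr(Z)

# Wavefunction picture: probabilities are quadratic in the state.
p_quadratic = np.abs(U @ psi) ** 2

# Matrix picture: evolution is M -> U M U^dagger, and each outcome probability
# is a linear functional of the evolved matrix (its k-th diagonal entry).
M_evolved = U @ M @ U.conj().T
p_linear = np.real(np.diag(M_evolved))

print(np.allclose(p_quadratic, p_linear))  # True: same numbers, linear type signature
```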
The elves care, Alex. The elves care.
Why I’m in AI sequence: 2020 Journal entry about gpt3
I moved from math academia to full-time AI safety a year ago—in this I’m in the same boat as Adam Shai, whose reflection post on the topic I recommend you read instead of this.
In making the decision, I went through a lot of thinking and (attempts at) learning about AI before that. A lot of my thinking had been about whether a pure math academic can make a positive difference in AI, and examples that I thought counterindicated this—I finally decided this might be a good idea after talking to my sister Lizka extensively and doing MATS in Summer of 2023. I’m thinking of doing a more detailed post about my decision and thinking later, in case there are other academics thinking about making this transition (and feel free to reach out in pm’s in this case!).
But one thing I have started to forget is how scary and visceral AI risk felt when I was making the decision. I’m both glad and a little sad that the urgency is less visceral and more theoretical now. AI is “a part of the world”, not an alien feature: part of the “setting” in the Venkat Rao post that was part of my internal lexicon at the time.
For now, in order to fill a gap in my constantly flagging daily writing schedule, I'll share a meandering entry from 2020 about how I thought about positive AI futures. I don't endorse a lot of it; much is simplistic and low-context, or alternatively commonplace in these circles, though some of it holds up. Reading it back, it's interesting that the thing I found most worthwhile as a first attempt at orienting my thinking was fleshing out "positive futures" and what they might entail. Two big directional updates I've had since are thinking harder about "human alignment" and "human takeover", and trying to temper the predictions that assume a singularitarian "first-past-the-post" AGI in favor of the messier "AI-is-kinda-AGI" world that we will likely end up in.
journal entry
7/19/2020 [...] I’m also being paranoid about GPT-3.
Let’s think. Will the world end, and if so, when? No one knows, obviously. GPT-3 is a good text generation bot. It can figure out a lot about semantics, mood, style, even a little about humor. It’s probably not going to take over the world yet. But how far away are we from AGI?
GPT-3 makes me think, “less than a decade”. There’s a possibility it will be soon (within the year). I’d assign that probability 10%. It felt like 20% when I first saw its text, but seeing Sam Altman’s remark and thinking a little harder, I don’t think it’s quite realistic for it to go AGI without a significant extra step or two. I think that I’d give it order of 50% within the decade. So it’s a little like living with a potentially fatal disease, with a prognosis of 10 years. Now we have no idea what AGI will be like. It will most likely either be very weird and deadly or revolutionary and good, though disappointing in some ways. I think there’s not much we can do about the weird and deadly scenarios. Humans have lived in sociopathic times (see Venkat’s notes on his 14th Century Europe book). It would probably be shorter and deadlier than the plague; various “human zoo” scenarios may be pleasant to experience (after all zoo animals are happier in general than in the wild, at least from the point of view of basic needs), but harrowing to imagine. In any case, it’s not worth speculating on this.
What would a good outcome look like? Obviously, no one knows. It’s very hard to predict our interaction with a super-human intelligence. But here are some pretty standard “decent” scenarios: (1) After a brief period of a pro-social AI piloted by a team of decent people, we end up with a world much like ours but with AI capabilities curbed for a long period of time [...]. If it were up to me I would design this world with certain “guard rail”-like changes: to me this would be a “Foundation”-style society somewhere in New Zealand (or on the bottom of the ocean perhaps? the moon?) consisting of people screened for decency, intelligence, etc. (but with serious diversity and variance built in), and with control of the world’s nukes, with the responsibility of imposing very basic non-interference and freedom of immigration criteria on the world’s societies (i.e., making the “archipelago” dream a reality, basically). So enforcing no torture, disincentivizing violent conflict, imposing various controls to make sure people can move from country to country and are exposed to the basic existence of a variety of experiences in the world, but allowing for culturally alien or disgusting practices in any given country: such as Russian homophobia, strict Islamic law, unpleasant-seeming (for Western Europeans) traditions in certain tribal cultures, etc. This combined with some sort of non-interventionist altruistic push. In this sci-fi scenario the Foundation-like culture would have de facto monopoly of the digital world (but use it sparingly) and also a system of safe nuclear power plants sufficient to provide the world’s power (but turned on carefully and slowly, to prevent economic jolts), but to carefully and “incontrovertibly” turn most of the proceeds into a universal basic income for the entire world population. Obviously this would have to be carefully thought out first by a community of intelligent and altruistic people with clear rules of debate/decision. —The above was written extremely sleepy. [...]
(2) (Unlikely) AI becomes integrated with (at first, decent and intelligent later, all interested) humans via some kind of mind-machine interface, or alternatively a faithful human modeling in silica. Via a very careful and considered transition (in some sense “adiabatic”, i.e. designed so as not to lose any of our human ethos and meaning that can possibly be recovered safely) we become machines, with a good and meaningful (not wireheaded, other than by considered choice) world left for the hold-outs who chose to remain purely human.
(3) The “Her” scenario: AI takes off on its own, because of human carelessness or desperation. It develops in a way that cherishes and almost venerates humans, and puts effort into making a good, meaningful existence for humans (meaningful and good in sort of the above adiabatic sense, i.e. meaningful via a set of clearly desirable stages of progress from step to step, without hidden agendas, and carefully and thoughtfully avoiding creating or simulating, in an appropriate sense, anything that would be considered a moral horror by locally reasonable intelligences at any point in the journey). AI continues its own existence, either self-organized to facilitate this meaningful existence of humans or doing its own thing, in a clearly separated and “transcendent” world, genuinely giving humans a meaningful amount of self-determination, while also setting up guardrails to prevent horrors and also perhaps eliminating or mitigating some of the more mundane woes of existence (something like cancer, etc.) without turning us into wireheads.
(4) [A little less than ideal by my book, but probably more likely than the others]: The "garden of plenty" scenario. AI takes care of all human needs and jobs, and leaves all humans free to live a nevertheless potentially fulfilling existence, like aristocrats or Victorians but less classist, socializing, learning, reading, etc., with the realization that all they are doing is a hobby: perhaps "human-generated knowledge" would be a sort of sport, or analog of organic produce (homeopathically better, but via a game that makes everyone who plays it genuinely better in certain subtle ways). Perhaps AI will make certain "safe" types of art, craft and knowledge (maybe math! Here I'm obviously being very biased about my work's meaning not becoming fully automated) purely the domain of humans, to give us a sense of self-determination. Perhaps humans are guided through a sort of accelerated development over a few generations to get to the no. 2 scenario.
(5) There is something between numbers 3 and 4 above, less ideal than all of the above but likely, where AI quickly becomes an equal player to humans in the domain of meaning-generation, and sort of fills up space with itself while leaving a vaguely better (maybe number-4-like) Earth to humans. Perhaps it imposes a time limit on humans (enforced via a fertility cap, hopefully with the understanding that humans can raise AI babies with a genuine sense of filial consciousness, complete with bizarre scenes of trying to explain the crazy world of AI to their parents), after which the human project becomes the AI project, probably essentially incomprehensible to us.
There's a sense I have that while I'm partial to scenarios 1 and 2 (I want humans to retain the monopoly on meaning-generation and to be able to feel empowered and important), this will be seen as old-fashioned and almost dangerous by certain of my peers because of the lack of emphasis on harm-prevention, a stable future, etc. I think this is part of the very serious debate, so far abstract and fun but, as AI gets better, perhaps turning heated and loud, about whether comfort or meaning is the more important goal of the human project (and both sides will get weird). I am firmly on the side of meaning, with a strict underpinning of retaining bodily and psychological integrity in all the object-level and meta-level senses (except I guess I'm ok with moving to the cloud eventually? Adiabatic is the word for me). Perhaps my point of view only lands on the side I think it does within the weird group of futurists and rationalists that I mostly read when reading about AI: probably the generic human who thinks about AI is horrified by all of the above scenarios and is just desperately hoping it will go away on its own, or has some really idiosyncratic mix of the above or other ideas which seem obviously preferable to them.
Yeah, I agree that it would be even more interesting to look at various complexity parameters. The inspiration here of course is physics: isolating a particle/effective particle (like a neutron in a nucleus) or an interaction between a fixed set of particles, by putting it in a regime where other interactions and groupings drop out. The go-to for a physicist is temperature: you can isolate a neutron by putting the nucleus in a very high-temperature environment like a collider, where the constituent baryons separate. This (as well as the behavior wrt generality) is the main reason I suggested the "natural degradation" from SLT, as it samples from the tempered distribution and is the most direct analog of varying temperature (putting stuff in a collider). But you can vary other hyperparameters as well. Probably an even more interesting thing to do is to simultaneously do two things with "opposite" behaviors, which I think is what you're suggesting above. A cartoon notion of the memorization-generalization "scale" is that if you have low complexity coming from low parameter count/depth or low training time (the latter often behaves similarly to low data diversity), you get simpler, "more memorization-y" circuits (I'm planning to talk more about this later in a "learning stories" series); from work on grokking, leap complexity, etc., people expect later solutions to generalize better. So if you combine this with the tempering "natural degradation" above, you might be able to get rid of behaviors both above and below a range of interest.
You’re right that tempering is not a binary on/off switch. Because of the nature of tempering, you do expect exponential decay of “inefficient” circuits as your temperature gets higher than the “characteristic temp.” of the circuit (this is analogous to how localized particles tend to have exponentially less coupling as they get separated), so it’s not completely unreasonable to “fully turn off” a class of behaviors. But something special in physics that probably doesn’t happen in AI is that the temperature scales relevant for different forces have very high separation (many orders of magnitude), so scales separate very clearly. In AI, I agree that as you described, tempering will only “partially” turn off many of the behaviors you want to clean up. It’s plausible that for simple circuits there is enough of a separation of characteristic temperature between the circuit and its interactions with other circuits that something approaching the behavior in physics is possible, but for most phenomena I’d guess that your “things decay more messily” picture is more likely.
Thanks! Are you saying there is a better way to find citations than a random walk through the literature? :)
I didn't realize that the pictures above limit to literal pieces of sin and cos curves (and Lissajous curves more generally). I suspect this is a statement about the singular value decomposition of the "sum" matrix S of upper-triangular 1's?
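A quick way to eyeball this guess (a throwaway sketch of mine, not a proof): take the SVD of the upper-triangular all-ones matrix and plot the leading singular vectors, which do look like pieces of sinusoids.

```python
import numpy as np
import matplotlib.pyplot as plt

n = 200
S = np.triu(np.ones((n, n)))      # the "sum" matrix of upper-triangular 1's
U, s, Vt = np.linalg.svd(S)

# The leading singular vectors look like pieces of cosine/sine curves.
for k in range(3):
    plt.plot(U[:, k], label=f"singular vector {k} (sigma = {s[k]:.1f})")
plt.legend()
plt.show()
```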
The “developmental clock” observation is neat! Never heard of it before. Is it a qualitative “parametrization of progress” thing or are there phase transition phenomena that happen specifically around the midpoint?
Do the images load now?
Hmm, I’m not sure how what you’re describing (learn on a bunch of examples of (query, well-thought-out guess)) is different from other forms of supervised learning.
Based on the paper Adam shared, it seems that part of the “amortizing” picture is that instead of simple supervised learning you look at examples of the form (context1, many examples from context1), (context2, many examples from context2), etc., in order to get good at quickly performing inference on new contexts.
It sounds like in the Paul Christiano example, you’re assuming access to some internal reasoning components (like activations or chain-of-thought) to set up a student-teacher context. Is this equivalent to the other picture I mentioned?
I’m also curious about what you said about o3 (and maybe have a related confusion about this). I certainly believe that NN’s, including RL models, learn by parallel heuristics (there’s a lot of interp and theory work that suggests this), but I don’t know any special properties of o3 that make it particularly supportive of this point of view
Thanks! I spent a bit of time understanding the stochastic inverse paper, though haven’t yet fully grokked it. My understanding here is that you’re trying to learn the conditional probabilities in a Bayes net from samples. The “non-amortized” way to do this for them is to choose a (non-unique) maximal inverse factorization that satisfies some d-separation condition, then guess the conditional probabilities on the latent-generating process by just observing frequencies of conditional events—but of course this is very inefficient, in particular because the inverse factorization isn’t a general Bayes net, but must satisfy a bunch of consistency conditions; and then you can learn a generative model for these consistency conditions by a NN and then perform some MCMC sampling on this learned prior.
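To make the "non-amortized" half of this concrete (a toy sketch of my own, far simpler than the paper's setup): for one fixed net you can invert it by just counting conditional frequencies; the amortized move is to instead train a model across many such contexts to output these posteriors directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny Bayes net: latent z ~ Bernoulli(0.3); observation x | z ~ Bernoulli(0.9 if z else 0.2).
def sample(n):
    z = rng.random(n) < 0.3
    x = rng.random(n) < np.where(z, 0.9, 0.2)
    return z, x

# "Non-amortized" inversion for this one fixed net: estimate P(z = 1 | x = 1)
# by simply counting conditional frequencies among the samples.
z, x = sample(100_000)
print(z[x].mean())   # ~0.66, the exact posterior being 0.27 / (0.27 + 0.14)
```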
So is the “moral” you want to take away here then that by exploring a diversity of tasks (corresponding to learning this generative prior on inverse Bayes nets) a NN can significantly improve its performance on single-shot prediction tasks?
FWIW, I like John’s description above (and probably object much less than baseline to humorously confrontational language in research contexts :). I agree that for most math contexts, using the standard definitions with morphism sets and composition mappings is easier to prove things with, but I think the intuition described here is great and often in better agreement with how mathematicians intuit about category-theoretic constructions than the explicit formalism.
This phenomenon exists, but is strongly context-dependent. Areas of math adjacent to abstract algebra are actually extremely good at updating conceptualizations when new and better ones arrive. This is for a combination of two related reasons: first, abstract algebra is significantly concerned with finding "conceptual local optima" of ways of presenting standard formal constructions, and these are inherently stable and require changing infrequently; second, when a new and better formalism is found, it tends to be so powerfully useful that papers that use the old formalism (in contexts where the new formalism is more natural) quickly become outdated; this happened twice in living memory, once with the formalism of schemes replacing other points of view in algebraic geometry and once with higher category theory replacing clunkier conceptualizations of homological algebra and other homotopical methods in algebra. This is different from fields like AI or neuroscience, where oftentimes using more compute, or finding a more carefully tailored subproblem, is competitive or better than "using the optimal formalism". That said, niceness of conceptualizations depends on context and taste, and there do exist contexts where "more classical" or "less universal" characterizations are preferable to the "consensus conceptual optimum".
This is very nice! So the way I understand what you linked is this: the class of perturbative expansions in the "Edgeworth expansion" picture I was distilling is that the order-$d$ approximation for the probability distribution associated to the sum variable $S_n$ above is $\phi(t)\,P_d(t, n^{-1/2})$, where $\phi$ is the probability distribution associated with a Gaussian and $P_d$ is a polynomial in $t$ and the perturbative parameter $n^{-1/2}$. The paper you linked says that a related natural thing to do is to take the Fourier transform, which will be the product of the Gaussian pdf and a different polynomial in the Fourier parameter $t$ and the inverse perturbation parameter $\sqrt{n}$. You can then look at the leading terms, which will be (maybe up to some fixed scaling) a polynomial in $t$, and this gives some kind of "leading" Edgeworth contribution.
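For concreteness, here is a minimal numerical sketch of the kind of expansion being discussed (mine, using the standard first-order Edgeworth term for a sum of exponentials; nothing specific to the linked paper): the Gaussian density gets multiplied by a polynomial correction of size $n^{-1/2}$.

```python
import numpy as np
from scipy.stats import norm, expon

rng = np.random.default_rng(0)
n = 30                                   # number of i.i.d. Exp(1) summands
gamma = expon.stats(moments="s")         # skewness of one summand (equals 2)

# Standardized sum S_n: mean n, variance n.
samples = (rng.exponential(size=(200_000, n)).sum(axis=1) - n) / np.sqrt(n)

xs = np.linspace(-3, 3, 61)
gaussian = norm.pdf(xs)
# First Edgeworth correction: phi(x) * (1 + gamma / (6 sqrt(n)) * (x^3 - 3x)).
edgeworth = gaussian * (1 + gamma / (6 * np.sqrt(n)) * (xs ** 3 - 3 * xs))

hist, _ = np.histogram(samples, bins=np.linspace(-3.05, 3.05, 62), density=True)
# The corrected density typically tracks the histogram better than the plain Gaussian.
print(np.abs(hist - gaussian).max(), np.abs(hist - edgeworth).max())
```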
Here this can be interpreted as a stationary phase formula, but you can only get “perturbative” theories, i.e. the relevant critical set will be nonsingular (and everything is expressed as a Feynman diagram with edges decorated by the inverse Hessian). But you’re saying that if you take this idea and apply it to different interesting sequences of random variables (not sum variables, but other natural asymptotic limits of other random processes), you can get singular stationary phase (i.e. the Watanabe expansion). Is there an easy way to describe the simplest case that gives an interesting Watanabe expansion?
Thanks for asking! I said in a later shortform that I was trying to do too many things in this post, with only vague relationships between them, and I’m planning to split it into pieces in the future.
Your 1-3 are mostly correct. I’d comment as follows:
(and also kind of 3) That advice of using the tempered local Bayesian posterior (I like the term; let's shorten it to TLBP) is mostly aimed at non-SLT researchers (but may apply also to some SLT experiments). The suggestion is simpler than to compute expectations. Rather, it's just to run a single experiment at a weight sampled from the TLBP. This is analogous to tuning a precision dial on your NN to noise away all circuits for which the quotient (usefulness)/(description length) is bounded above by 1/t (where usefulness is measured in reduction of loss). At t = 0 you're adding no noise, and at t = ∞ you're fully noising it.
This is interesting to do in interp experiments for two general reasons:
You can see whether the behavior your experiment finds is general or spurious. The higher the temperature range it persists over, the more general it is in the sense of usefulness/description length (and all else being equal, the more important your result is).
If you are hoping to say that a behavior you found, e.g. a circuit, is “natural from the circuit’s point of view” (i.e., plausibly occurs in some kind of optimal weight- or activation-level description of your model), you need to make sure your experiment isn’t just putting together bits of other circuits in an ad-hoc way and calling it a circuit. One way to see this, that works 0% of the time, is to notice that turning this circuit on or off affects the output on exactly the context/ structure you care about, and has absolutely no effect at all on performance elsewhere. This never works because our interp isn’t at a level where we can perform uber-precise targeted interventions, and whenever we do something to a network in an experiment, this always significantly affects loss on unrelated inputs. By having a tunable precision parameter (as given by the TLBP for example), you have more freedom to find such “clean” effects that only do what you want and don’t affect loss otherwise. In general, in an imprecise sense, you expect each “true” circuit to have some “temperature of entanglement” with the rest of the model, and if this circuit is important enough to survive tempering to this temperature of entanglement, you expect to see much cleaner and nicer results in the resulting tempered model.
In the above context, you rarely want to use the Watanabe temperature or any other temperature that only depends on the number of samples n, since it’s much too low in most cases. Instead, you’re either looking for a characteristic temperature associated with an experiment or circuit (which in general will not depend on n much), or fishing for behaviors that you hope are “significantly general”. Here the characteristic temperature associated with the level of generality that “is not literally memorizing” is the Watanabe temperature or very similar, but it is probably more interesting to consider larger scales.
(maybe more related to your question 1): Above, I explained why I think performing experiments at TLBP weight values is useful for "general interp". I also explain that you sometimes have a natural "characteristic temperature" for the TLBP that is independent of sample number (e.g. meaningful at infinite samples), which is the difference between the loss of the network you're studying and a SOTA NN, which you think of as the "true optimal loss". In large-sample (highly underparameterized) cases, this is probably a better characteristic temperature than the Watanabe temperature, including for notions of effective parameter count: indeed, insofar as your NN is "an imperfect approximation of an optimal NN", the noise inherent in this imperfection is on this scale (and not the Watanabe scale). Of course there are issues with this PoV, as less expressive NN's are rarely well-conceptualized as TLBP samples (insofar as they find a subset of a "perfect NN's" circuits, they find the easily learnable ones rather than the maximally general ones). However, it's still reasonable to think of this as a first stab at the inherent noise scale associated to an underparametrized model, and to think of the effective parameter count at this scale (i.e., free energy / log temperature) as a better approximation of some "inherent" parameter count.
Why you should try degrading NN behavior in experiments.
I got some feedback on the post I wrote yesterday that seems right. The post is trying to do too many things, and not properly explaining what it is doing, why this is reasonable, and how the different parts are related.
I want to try to fix this, since I think the main piece of advice in this post is important, but gets lost in all the mess.
This main point is:
experimentalists should in many cases run an experiment on multiple neural nets with a variable complexity dial that allows some “natural” degradations of the NN’s performance, and certain dials are better than others depending on context.
I am eventually planning to split the post into a few parts, one of which explains this more carefully. When I do this I will replace the current version of the post with just a discussion of the "koan" itself: i.e., nitpicks about work that isn't careful about thinking about the scale at which it is performing interpretability.
For now I want to give a quick reductive take on what I hope to be the main takeaway of this discussion. Namely, why I think “interpretability on degraded networks” is important for better interpretability.
Basically: when ML experiments modify a neural net to identify or induce a particular behavior, this always degrades performance. Now there are two hypotheses for what is going on:
- You are messily pulling your NN in the direction of a particular behavior, and confusing this spurious messy phenomenon with finding a "genuine" phenomenon from the program's point of view.
- You are messily pulling your NN in the direction of a particular behavior, but also singling out a few "real" internal circuits of the NN that are carrying out this behavior.
Because of how many parameters you have to play with and the polysemanticity of everything in a NN, it's genuinely hard to tell these two cases apart. You might find stuff that "looks" like a core circuit but is actually just bits of other circuits combined together, which your circuit-fitting experiment makes look like a coherent behavior; any nice properties of the resulting behavior that make it seem like an "authentic" circuit are just artefacts of the way you set up the experiment.
Now the idea behind running this experiment at “natural” degradations of network performance is to try to separate out these two possibilities more cleanly. Namely, an ideal outcome is that in running your experiment on some class of natural degradation of your neural net, you find a regime such that
the intervention you are running no longer significantly affects the (naturally degraded) performance
the observed effect still takes place.
Then what you've done is effectively "cleaned up" your experiment such that you are still probably finding interpretable behaviors in the original neural net (since a good degradation is likely to contain a subset of circuits/behaviors of your original net and not many "new" behaviors), in a way that sufficiently reduces the complexity that the behavior you're seeking is no longer "entangled" with a bunch of other behaviors; this should significantly update you that the behavior is indeed "natural" and not spurious.
This is of course a very small, idealized sketch. But the basic idea behind looking at neural nets with degraded performance is to “squeeze” the complexity in a controlled way to suitably match the complexity of the circuit (and how it’s embedded in the rest of the network/how it interacts with other circuits). If you then have a circuit of “the correct complexity” that explains a behavior, there is in some sense no “complexity room” for other sneaky phenomena to confound it.
In the post, the natural degradation I suggested is the physics-inspired "SGLD sampling" process, which in some sense tries to add a maximal amount of noise to your NN while only having a limited impact on performance (measured by loss); this has a bias toward keeping "generally useful" circuits and interactions and noising away more inessential/memorize-y circuits. Other interventions with different properties are "just adding random noise" (either to weights or activations) to suitably reduce performance, or looking at earlier training checkpoints. I suspect that different degradations (or combinations thereof) are appropriate for isolating the relevant complexity of different experiments.
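For readers who want something concrete: here is a very rough PyTorch-style sketch of the kind of SGLD "natural degradation" I have in mind (my own toy code with made-up hyperparameters beta, gamma, lr; not a reference implementation and not tuned for any real model).

```python
import torch

def sgld_temper(model, loss_fn, data_loader, beta, gamma=1.0, lr=1e-5, n_steps=200):
    """Noisy gradient steps on beta * loss, plus a quadratic pull toward the
    starting weights (a localizing prior). Smaller beta means higher temperature,
    so more of the marginal / memorize-y circuits get noised away."""
    params = [p for p in model.parameters() if p.requires_grad]
    anchor = [p.detach().clone() for p in params]
    for (x, y), _ in zip(data_loader, range(n_steps)):
        loss = loss_fn(model(x), y)
        grads = torch.autograd.grad(loss, params)
        with torch.no_grad():
            for p, g, p0 in zip(params, grads, anchor):
                drift = beta * g + gamma * (p - p0)
                p.add_(-lr * drift + (2 * lr) ** 0.5 * torch.randn_like(p))
    return model   # weights are now (roughly) one sample from a tempered local posterior
```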
Thanks so much for this! Will edit
Thanks for the questions!
Yes, “QFT” stands for “Statistical field theory” :). We thought that this would be more recognizable to people (and also, at least to some extent, statistical is a special case of quantum). We aren’t making any quantum proposals.
We’re following (part of) this community, and interested in understanding and connecting the different parts better. Most papers in the “reference class” we have looked at come from (a variant of) this approach. (The authors usually don’t assume Gaussian inputs or outputs, but just high width compared to depth and number of datapoints—this does make them “NTK-like”, or at least perturbatively Gaussian, in a suitable sense).
Neither of us thinks that you should think of AI as being in this regime. One of the key issues here is that Gaussian models can not model any regularities of the data beyond correlational ones (and it’s a big accident that MNIST is learnable by Gaussian methods). But we hope that what AIs learn can largely be well-described by a hierarchical collection of different regimes where the “difference”, suitably operationalized, between the simpler interpretation and the more complicated one is well-modeled by a QFT-like theory (in a reference class that includes perturbatively Gaussian models but is not limited to them). In particular one thing that we’d expect to occur in certain operationalizations of this picture is that once you have some coarse interpretation that correctly captures all generalizing behaviors (but may need to be perturbed/suitably denoised to get good loss), the last and finest emergent layer will be exactly something in the perturbatively Gaussian regime.
Note that I think I'm more bullish about this picture and Lauren is more nuanced (maybe she'll comment about this). But we both think that it is likely that having a good understanding of perturbatively Gaussian renormalization would be useful for "patching in the holes", as it were, of other interpretability schemes. A low-hanging fruit here is that whenever you have a discrete feature-level interpretation of a model, instead of just directly measuring the reconstruction loss you should at minimum model the difference (model minus interpretation) as a perturbative Gaussian (corresponding to assuming the difference has "no regularity beyond correlation information").
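A minimal sketch of that low-hanging fruit (my own, with hypothetical function names and no claim that this is the right operationalization): fit a Gaussian to the model-minus-interpretation residual and use samples from it as the correction term, rather than scoring the interpretation by raw reconstruction loss alone.

```python
import numpy as np

def fit_gaussian_residual(model_acts, interp_acts):
    """Fit a Gaussian to the residual (model minus interpretation) over a batch,
    i.e. treat whatever the interpretation misses as having no structure beyond
    its correlations."""
    resid = model_acts - interp_acts          # shape (n_samples, d)
    return resid.mean(axis=0), np.cov(resid, rowvar=False)

def corrected_interpretation(interp_acts, mu, cov, rng):
    """Interpretation output plus a Gaussian-sampled correction term."""
    noise = rng.multivariate_normal(mu, cov, size=interp_acts.shape[0])
    return interp_acts + noise
```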
We don’t want to assume homogeneity, and this is mostly covered by 2b-c above. I think the main point we want to get across is that it’s important and promising to try to go beyond the “homogeneity” picture—and to try to test this in some experiments. I think physics has a good track record here. Not on the level of tigers, but for solid-state models like semiconductors. In this case you have:
The “standard model” only has several-particle interactions (corresponding to the “small-data limit”).
By applying RG techniques to a regular metallic lattice (with initial interactions from the standard model), you end up with a good new universality class of QFT’s (this now contains new particles like phonons and excitons which are dictated by the RG analysis at suitable scales). You can be very careful and figure out the renormalization coupling parameters in this class exactly, but much more realistically and easily you just get them from applying a couple of measurements. On an NN level, “many particles arranged into a metallic pattern” corresponds to some highly regular structure in the data (again, we think “particles” here should correspond to datapoints, at least in the current RLTC paradigm).
The regular metal gives you a "background" theory, and now we view impurities as a discrete random-feature theory on top of this background. Physicists can still run RG on this theory by zooming out and treating the impurities as noise, but in fact you can also understand the theory on a fine-grained level near an impurity by a more careful form of renormalization, where you view the nearest several impurities as discrete sources and only coarse-grain far-away impurities as statistical noise. At least for me, the big hope is that this last move is also possible for ML systems. In other words, when you are interpreting a particular behavior of a neural net, you can model it as a linear combination of a few messy discrete local circuits that apply in this context (like the complicated diagram from Marks et al. below) plus a correctly renormalized background theory associated to all other circuits (plus corrections from other layers, plus ...)