You might enjoy
which explains the role that the resulting problem (representing homology classes of manifolds by submanifolds/cobordisms) played in inspiring the work of René Thom on cobordism, stable homotopy theory, singularity theory...
Here are two more closely related results in the same circle of ideas. The first one gives a description (a kind of fusion of Dold-Thom and Eilenberg-Steenrod) of homology purely internal to homotopy theory, and the second explains how homological algebra falls out of infinity-category theory:
Consider functors E:S_* --> S_* from the infinity-category of pointed spaces to itself which commute with filtered colimits, carry pushout squares to pullback squares, send the one-point space to itself, and send the 0-sphere (aka two points!) to a discrete space. Then A=E(S^0) has a natural structure of abelian group, E is (an infinity-categorical version of) the Dold-Thom functor, and it satisfies pi_n E(X)=H_n(X,A) (reduced homology). In particular, E(S^n) is an Eilenberg-MacLane space K(A,n).
The category of functors E:S_* --> S_* satisfying all the properties above except the one about E(S^0) being discrete is a model for the infinity-category Sp of spectra, i.e. the “stabilization” (in a precise categorical sense) of the infinity-category of spaces. From this perspective, the functors from the previous point are called the Eilenberg-MacLane spectra HA. Moreover, the infinity-category of spectra has a symmetric monoidal structure (the “smash product”), HR is naturally an algebra object for this structure whenever R is a ring, and it makes sense to talk about the infinity-category LMod(HR) of left HR-modules in Sp. Then LMod(HR) is equivalent (essentially by a stable version of the Dold-Kan correspondence) to the derived infinity-category of left R-modules D(R). In other words, for homotopy theorists, (chain complexes, quasi-isomorphisms) are just a funny point-set model for HR-module spectra!
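In compact form (my shorthand for the two points above, writing $\tilde{H}_n$ for reduced homology and $A = E(S^0)$):

$$
\pi_n E(X)\cong \tilde{H}_n(X;A), \qquad E(S^n)\simeq K(A,n), \qquad \mathrm{LMod}(HR)\simeq \mathcal{D}(R).
$$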
All of this is discussed in the first chapter of Lurie’s Higher Algebra, except the last point which is not completely spelled out because monoidal structures and modules are only introduced later on.
I should point out that this perspective is largely a reformulation of the results you already mentioned, and in itself it certainly does not bring new computational techniques for singular homology. However, it shows that 1) homological algebra comes out “structurally” from homotopy theory, which itself comes out “structurally” from infinity-category theory, and 2) homological algebra (including in more sophisticated contexts than just abelian groups, e.g. dg-categories), homotopy theory, sheaf theory… can be combined inside a common flexible categorical framework, which elegantly subsumes previous point-set-level techniques like model categories.
All the frames you are mentioning are good for intuition. I would say the deepest one is 4. and that everything falls into place cleanly once you formulate things in the language of infinity-category theory (at the price of a lot of technicalities to establish the “right” language). For example,
singular homology with coefficients in A can be characterised as the unique colimit-preserving infinity-functor from the infinity-category of spaces/homotopy types/infinity-groupoids/anima to the derived infinity-category of abelian groups which sends a one-point space (equivalently, any contractible space) to A[0].
The derived infinity-category of abelian groups is itself in some sense the “(stable presentable) Z-linearization” of the infinity-category of spaces, although this is more tricky to state precisely and I won’t try to do this here.
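In a formula (the notation is mine: $\mathcal{S}$ for the infinity-category of spaces, $\mathcal{D}(\mathbb{Z})$ for the derived infinity-category of abelian groups), the characterisation says that there is a unique colimit-preserving functor

$$
C_\bullet(-;A)\colon \mathcal{S} \longrightarrow \mathcal{D}(\mathbb{Z}) \quad\text{with}\quad C_\bullet(\mathrm{pt};A)\simeq A[0], \qquad \text{and then}\quad H_n(X;A)\cong \pi_n\, C_\bullet(X;A).
$$

Uniqueness is automatic because the infinity-category of spaces is freely generated under colimits by the point.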
Which formal properties of the KL-divergence do the proofs of your result use? It could be useful to make them all explicit to help generalize to other divergences or metrics between probability distributions.
Well, I can certainly empathize with the feeling that compromising on a core part of your identity is threatening ;-)
More seriously, what you are describing as empathy seems to be asking the question:
“What if my mind was transported into their bodies?”
rather than
“What if I was (like) them, including all the relevant psychological and emotional factors?”
The latter question should lead to feelings of disgust iff the target experiences feelings of disgust.
Of course, empathy is all the more difficult when the person you are trying to empathize with is very different from you. Being an outlier can clearly make this harder. But unless you have never experienced any flavour of learned helplessness/procrastination/akrasia, you have the necessary ingredients to extrapolate.
Historically, commutative algebra came out of algebraic number theory, and the rings involved (Z, Z_p, number rings, p-adic local rings, ...) are all, in the modern terminology, Dedekind domains.
Dedekind domains are not always principal, and this was the reason why mathematicians started studying ideals in the first place. However, the structure of finitely generated modules over Dedekind domains is still essentially determined by ideals (or rather fractional ideals), reflecting to some degree the fact that their geometry is simple (1-dim regular Noetherian domains).
This could explain why there was a period when ring theory developed around ideals while the need for modules had not yet become clear?
Modules are just much more flexible than ideals. Two major advantages:
Richer geometry. An ideal corresponds to a closed subscheme of Spec(R), while a module corresponds to a quasicoherent sheaf. An element x of M is a global section of the associated sheaf, and the ideal Ann(x) cuts out the support of that section. This leads to a nice geometric picture of associated primes and primary decomposition which explains how finitely generated modules are built out of modules R/P with P a prime ideal (I am not an algebraist at heart, so for me the only way to remember the statement of primary decomposition is to translate from geometry 😅)
Richer (homological) algebra. Modules form an abelian category in which ideals do not play an especially prominent role (unless one looks at the monoidal structure, but let’s not go there). The corresponding homological algebra (coherent sheaf cohomology, derived categories) is the core engine of modern algebraic geometry.
BTW the geometric perspective might sound abstract (and setting it up rigorously definitely is!) but it is in many ways more concrete than the purely algebraic one. For instance, a quasicoherent sheaf is, in first approximation, a collection of vector spaces (over varying “residue fields”) glued together in a nice way over the topological space Spec(R), and this goes a long way towards clarifying how and when questions about modules can be reduced to ordinary linear algebra over fields.
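To give a toy example of that last picture (the example is mine): take R = Z and M = Z/6 ⊕ Z. The fibers of the associated sheaf on Spec(Z), i.e. the vector spaces M ⊗ κ(p) over the residue fields of the various points, are

$$
M\otimes_{\mathbb{Z}}\mathbb{Q}\cong \mathbb{Q},\qquad M\otimes_{\mathbb{Z}}\mathbb{F}_2\cong \mathbb{F}_2^{2},\qquad M\otimes_{\mathbb{Z}}\mathbb{F}_3\cong \mathbb{F}_3^{2},\qquad M\otimes_{\mathbb{Z}}\mathbb{F}_p\cong \mathbb{F}_p\quad (p\geq 5).
$$

The fiber dimension jumps exactly at the primes 2 and 3, where the torsion part Z/6 is supported; away from those two points the sheaf is a line bundle.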
Some of my favourite topics in pure mathematics! Two quick general remarks:
I don’t draw such a strong qualitative distinction between the theory of group actions (in particular linear representations) and the theory of modules. They are both ways to study an object by having it act on auxiliary structures/geometry. Because there are in general fewer tools to study group actions than modules, a lot of pure mathematics is dedicated to linearizing the former into the latter in various ways.
There is another perspective on modules over commutative rings which is central to algebraic geometry: modules are a specific type of sheaves which generalize vector bundles. More precisely, a module over a commutative ring R is equivalent to a “quasicoherent sheaf” on the affine scheme Spec(R), and finitely generated projective modules correspond in this way to vector bundles over Spec(R). Once you internalise this equivalence, most of the basic theory of modules in commutative algebra becomes geometrically intuitive, and this is the basis for many further developments in algebraic geometry.
There is another interesting connection between computation and bounded treewidth: the control flow graphs of programs written in languages “without goto instructions” have uniformly bounded treewidth (e.g. <7 for goto-free C programs). This is due to Thorup (1998):
https://www.sciencedirect.com/science/article/pii/S0890540197926973
Combined with graph algorithms for bounded-treewidth graphs, this has apparently been used in the analysis of compiler optimization and program verification problems; see the recent reference:
https://dl.acm.org/doi/abs/10.1145/3622807
which also proves a similar bound for pathwidth.
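As a toy illustration of what bounded treewidth of a control flow graph looks like in practice, here is a small sketch (mine, not from either paper; the miniature CFG below is hypothetical, and networkx's min-degree heuristic only returns an upper bound on the treewidth):

```python
import networkx as nx
from networkx.algorithms.approximation import treewidth_min_degree

# Control flow graph (taken undirected, as in treewidth statements) of a tiny
# structured program:  A; while B: (if C: D else: E); F
cfg = nx.Graph()
cfg.add_edges_from([
    ("A", "B"),                          # entry into the loop header
    ("B", "C"), ("C", "D"), ("C", "E"),  # branch inside the loop body
    ("D", "B"), ("E", "B"),              # back edges to the loop header
    ("B", "F"),                          # loop exit
])

width, _decomposition = treewidth_min_degree(cfg)
print("treewidth upper bound:", width)   # 2 for this graph, well below Thorup's bound for goto-free C
```

Thorup's theorem is the statement that, no matter how large the goto-free program, this number stays below a universal constant, which is what makes the bounded-treewidth graph algorithms applicable.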
Nice!
I would add the following, which is implicit in the presentation: this phenomenon of real representations is not specific to finite groups. Real irreducible representations of a group are always neatly divided into three types: real, complex or quaternionic. This is [Schur's lemma](https://ncatlab.org/nlab/show/Schur%27s+lemma#statement) together with the fact that the real division algebras are exactly R, C and the quaternions H.
(Should ML interpretability people care about infinite groups to begin with, unlike mathematicians, who love them all? For one, models as well as datasets can exhibit (exact or approximate) continuous symmetries, and these symmetries can be understood mathematically as actions of matrix Lie groups such as the group GL_n of all invertible matrices or the group O_n of n-dimensional rotations. Sometimes these actions are linear, so they are themselves representations, and sometimes they can be studied by linearizing them. Using representation theory to study more general geometric group actions is one of those great tricks of mathematics which reduce complicated problems to linear algebra.)
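To make the trichotomy concrete, here are simple examples of each type (standard examples, not specific to this post). The type of a real irreducible representation V of a group G is detected by its algebra of equivariant endomorphisms:

$$
\mathrm{End}_{\mathbb{R}[G]}(V) \;\cong\; \mathbb{R},\ \mathbb{C},\ \text{or}\ \mathbb{H}.
$$

The trivial representation of any group is of real type; the rotation action of SO_2 on R^2 is of complex type, since its commutant is generated by the rotation by 90 degrees, whose square is minus the identity; and the action of the unit quaternions (a.k.a. SU_2) on H = R^4 by left multiplication is of quaternionic type, its commutant being the algebra of right multiplications.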
On 1., you should consider that, for people who don’t know much about QFT and its relationship with SFT (like, say, me 18 months ago), it is not at all obvious that QFT can be applied beyond quantum systems!
In my case, the first time I read about “QFT for deep learning” I dismissed it automatically because I assumed it would involve some far-fetched analogies with quantum mechanics.
> but in fact you can also understand the theory on a fine-grained level near an impurity by a more careful form of renormalization, where you view the nearest several impurities as discrete sources and only coarsegrain far-away impurities as statistical noise.
Where could I read about this?
Thanks a lot for writing this! Some clarifying questions:
In this context, is QFT roughly a shorthand for “statistical field theory, studied via the mathematical methods of Euclidean QFT”? Or do you expect intuitions from specifically quantum phenomena to play a role?
There is a community of statistical physicists who use techniques from statistical mechanics of disordered systems and phase transitions to study ML theory, mostly for simple systems (linear models, shallow networks) and simple data distributions (Gaussian data, student-teacher model with a similarly simple teacher). What do you think of this approach? How does it relate to what you have in mind?
Would this approach, at least when applied to the whole network, rely on an assumption that trained DNNs inherit from their initialization a relatively high level of “homogeneity” and relatively limited differentiation, compared, say, to biological organisms? For instance, as a silly thought experiment, suppose you had the same view into a tiger as you have into a DNN: something like all the chemical-level data as a collection of time series indexed by (spatially randomized) voxels, and you want to understand the behaviour of the tiger as a function of the environment. How would you expect a QFT-based approach to proceed? What observables would it consider first? Would it be able to go beyond the global thermodynamics of the tiger and say something about cell and tissue differentiation? How would it “put the tiger back together”? (Those are not gotcha questions; I don’t really know if any existing interpretability method would get far in this setting!)
For sufficiently nice regular, 1-dimensional Bayesian models, Edgeworth-type asymptotic expansions for the Bayesian posterior have been derived in
Q: How can I use LaTeX in these comments? I tried to follow https://www.lesswrong.com/tag/guide-to-the-lesswrong-editor#LaTeX but it does not seem to render.
Here is the simplest case I know, which is a sum of dependent identically distributed variables. In physical terms, it is about the magnetisation of the 1d Curie-Weiss (= mean-field Ising) model. I follow the notation of the paper https://arxiv.org/abs/1409.2849 for ease of reference; this is roughly Theorem 8 + Theorem 10:
Let $M_n=\sum_{i=1}^n \sigma(i)$ be the sum of n dependent Bernoulli random variables $\sigma(i)\in\{\pm 1\}$, where the joint distribution is given by
$$
\mathbb{P}(\sigma)\propto \exp\Big(\frac{\beta}{2n}M_n^2\Big)
$$
Then:
- When $\beta=1$, the fluctuations of $M_n$ are very large and we have an anomalous CLT: $\frac{M_n}{n^{3/4}}$ converges in law to the probability distribution with density proportional to $\exp(-\frac{x^4}{12})$.
- When $\beta<1$, $M_n$ satisfies a normal CLT: $\frac{M_n}{n^{1/2}}$ converges to a Gaussian.
- When $\beta>1$, $M_n$ does not satisfy a limit theorem (there are two lowest-energy configurations).
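Here is a quick numerical illustration of these scalings (my own sketch, assuming numpy and scipy; it uses the exact distribution of $M_n$, so no sampling is needed):

```python
import numpy as np
from scipy.special import gammaln

def std_magnetization(n, beta):
    """Standard deviation of M_n under P(sigma) proportional to exp(beta * M_n^2 / (2n))."""
    k = np.arange(n + 1)                 # number of +1 spins
    m = 2.0 * k - n                      # corresponding values of M_n
    logw = (gammaln(n + 1) - gammaln(k + 1) - gammaln(n - k + 1)  # log binomial(n, k)
            + beta * m**2 / (2 * n))
    p = np.exp(logw - logw.max())
    p /= p.sum()
    return np.sqrt(np.sum(p * m**2))     # E[M_n] = 0 by symmetry

for beta in (0.5, 1.0):
    s1, s2 = std_magnetization(5_000, beta), std_magnetization(20_000, beta)
    print(f"beta={beta}: std(M_n) grows like n^{np.log(s2 / s1) / np.log(4):.2f}")
# prints an exponent close to 0.5 for beta=0.5 and close to 0.75 for beta=1
```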
In statistical mechanics, the result above is an old theorem of Ellis-Newman from 1978; the Meliot-Nikeghbali paper puts it into a more systematic probabilistic framework, and proves finer results about the fluctuations (Theorems 16 and 17).
The physical intuition is that $\beta=1$ is the critical inverse temperature at which the 1d Curie-Weiss model goes through a continuous phase transition. In general, one should expect such anomalous CLTs in the thermodynamic limit of continuous phase transitions in statistical mechanics, with the shape of the CLT controlled by the Taylor expansion of the microcanonical entropy around the critical parameters. Indeed Ellis and his collaborators have worked out a number of such cases for various mean-field models (which according to Meliot-Nikeghbali also fit in their mod-Gaussian framework). It is of course very difficult to prove such results rigorously outside of mean-field models, since even proving that there is a phase transition is often out of reach.
A limitation of the Curie-Weiss result is that it is 1d and so the “singularity” is pretty limited. The Meliot-Nikeghbali paper has 2d and 3d generalisations where the singularities are a bit more interesting: see Theorem 11 and Equations (10) and (11). And here is another recent example from the stat mech literature
https://link.springer.com/article/10.1007/s10955-016-1667-9
You were actually asking about Edgeworth expansions rather than just the CLT. It may be that with this method of producing anomalous CLTs, starting with a nice mod-Gaussian convergent sequence and doing a change of measure, one could write down further terms in the expansion? I haven’t thought about this.
Since the main result of SLT is roughly speaking an “anomalous CLT for the Bayesian posterior”, I would love to use the results above to think of singular Bayesian statistical models as “at a continuous phase transition” (probably with quenched disorder to be more physically accurate), with the tuning to criticality coming from a combination of structure in data and hyperparameter tuning, but I don’t really know what to do with this analogy!
I mentioned samples and expectations for the TLBP because it seems possible (and suggested by the role of degeneracies in SLT) that different samples can correspond to qualitatively different degradations of the model. Cartoon picture: besides the robust circuit X of interest, there are “fragile” circuits A and B, and most samples at a given loss scale degrade either A or B but not both.
I agree that there is no strong reason to overindex on the Watanabe temperature, which is derived from an idealised situation: global Bayesian inference, degeneracies exactly at the optimal parameters, “relatively finite variance”, etc. The scale you propose seems quite natural but I will let LLC-practitioners comment on that.
Is the following a fair summary of the thread ~up to “Natural degradation” from the SLT perspective?
- Current SLT-inspired approaches are right to consider samples of the “tempered local Bayesian posterior” provided by SGLD as natural degradations of the model.
- However, they mostly use those samples (at a fixed Watanabe temperature) only to compute the expectation of the loss and the resulting LLC, because that is what is theoretically grounded by Watanabe’s work.
- You suggest instead to compute, using those sampled weights, the expectations of more complicated observables derived from other interpretability methods, and to interpret those expectations using the “natural scale” heuristics laid out in the post.
A closely related perspective on fluctuations of sequences of random variables has been studied recently in pure probability theory under the name of “mod-Gaussian convergence” (and more generally “mod-phi convergence”). Mod-Gaussian convergence of a sequence of RVs or random vectors is just the right amount of control over the characteristic functions—or in a useful variant, the whole complex Laplace transforms—to imply a clean description of the fluctuations at various scales (CLT, Edgeworth expansion, “normality zone”, local CLT, moderate deviations, sharp large deviations,...). Unsurprisingly, the theory is full of cumulants.
Here is a nice introduction with applications to statistical mechanics models:
https://arxiv.org/abs/1409.2849
and the book with the general theory (which I still have to read!):
https://link.springer.com/book/10.1007/978-3-319-46822-8
This leads for instance to a clean approach to some “anomalous” CLTs with non-Gaussian limit laws (not for the mod-Gaussian convergent sequences themselves but for modified versions thereof) for some stat mech models at continuous phase transitions; see Theorems 8 and 11 in the first reference above. As far as I know, those theorems are the simplest instances of an “SLT-like” phenomenon in probability theory!
Very nice!
Conversely, it may be possible to identify practical situations where some of these aphorisms are sub-optimal, which could help point out the limitations of applying AIT to real agents?