Been there, done that survey...
I’m curious about the results.
The example about stacks in 1.2 has a certain irony in context. This calls for a small mathematical aside:
A stack is a sophisticated type of geometric structure which is increasingly used in algebraic geometry and algebraic topology (and is spreading to some corners of differential geometry) to make sense of geometric intuitions and notions about “spaces” which occur “naturally” but fall squarely outside the traditional geometric categories (manifolds, schemes, etc.).
See www.ams.org/notices/200304/what-is.pdf for a very short introduction focusing on the basic example of the moduli of elliptic curves.
The upshot of this vague outlook is that in the relevant fields, everything of interest is a stack (or a more exotic beast like a derived stack), precisely because the notion has been designed to be as general and flexible as possible! So asking someone working on stacks for a good example of something which is not a stack is bound to create a short moment of confusion.
Even if you do not care for stacks (and I wouldn’t hold it against you), if you are interested in open source/Internet-based scientific projects, it is worth having a look at the web page of the Stacks project (http://stacks.math.columbia.edu/), a collaborative, fully hyperlinked textbook on the topic, which is steadily growing towards the 3,500-page mark.
Thanks a lot for writing this! Some clarifying questions:
In this context, is QFT roughly a shorthand for “statistical field theory, studied via the mathematical methods of Euclidean QFT”? Or do you expect intuitions from specifically quantum phenomena to play a role?
There is a community of statistical physicists who use techniques from statistical mechanics of disordered systems and phase transitions to study ML theory, mostly for simple systems (linear models, shallow networks) and simple data distributions (Gaussian data, student-teacher model with a similarly simple teacher). What do you think of this approach? How does it relate to what you have in mind?
Would this approach, at least when applied to the whole network, rely on an assumption that trained DNNs inherit from their initialization a relatively high level of “homogeneity” and relatively limited differentiation, compared say to biological organisms? For instance, as a silly thought experiment, suppose you had the same view into a tiger as you have into a DNN: something like all the chemical-level data as a collection of time-series indexed by (spatially randomized) voxels, and you want to understand the behaviour of the tiger as a function of the environment. How would you expect a QFT-based approach to proceed? What observables would it encode first? Would it be able to go beyond the global thermodynamics of the tiger and say something about cell and tissue differentiation? How would it “put the tiger back together”? (Those are not gotcha questions—I don’t really know if any existing interpretability method would get far in this setting!)
As far as major scientific facts go, I am surprised that evolution has yet to be mentioned. Let me try:
“All the complexity of Life on Earth comes from a single origin by the following process: organisms carry the plan to reproduce and make copies of themselves, this plan changes slightly and randomly over time, and the modified plans which lead to better survival and reproduction tend to outcompete the others and to become dominant.”
A closely related perspective on fluctuations of sequences of random variables has been studied recently in pure probability theory under the name of “mod-Gaussian convergence” (and more generally “mod-phi convergence”). Mod-Gaussian convergence of a sequence of RVs or random vectors is just the right amount of control over the characteristic functions—or in a useful variant, the whole complex Laplace transforms—to imply a clean description of the fluctuations at various scales (CLT, Edgeworth expansion, “normality zone”, local CLT, moderate deviations, sharp large deviations,...). Unsurprisingly, the theory is full of cumulants.
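Roughly (this is my paraphrase of the definition from the references below, omitting details about the domain of z and the mode of convergence): a sequence $(X_n)$ converges in the mod-Gaussian sense with parameters $t_n \to +\infty$ and limiting function $\psi$ if, locally uniformly in $z$,
$$ e^{-t_n z^2/2}\,\mathbb{E}\big[e^{z X_n}\big] \;\longrightarrow\; \psi(z). $$
The point is that the single function $\psi$ then encodes the corrections to Gaussian behaviour at the various scales listed above.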
Here is a nice introduction with applications to statistical mechanics models:
https://arxiv.org/abs/1409.2849
and the book with the general theory (which I still have to read!)
https://link.springer.com/book/10.1007/978-3-319-46822-8
This leads for instance to a clean approach to some “anomalous” CLTs with non-Gaussian limit laws (not for the mod-Gaussian convergent sequences themselves but for modified versions thereof) for some stat mech models at continuous phase transitions; see Theorems 8 and 11 in the first reference above. As far as I know, those theorems are the simplest “SLT-like” phenomena in probability theory!
“De notre naissance à notre mort, nous sommes un cortège d’autres qui sont reliés par un fil ténu.”
Jean Cocteau
(“From our birth to our death, we are a procession of others whom a fine thread connects.”)
Some of my favourite topics in pure mathematics! Two quick general remarks:
I don’t draw such a strong qualitative distinction between the theory of group actions, and in particular linear representations, and the theory of modules. They are both ways to study an object by having it act on auxiliary structures/geometry. Because there are in general fewer tools to study group actions than modules, a lot of pure mathematics is dedicated to linearizing the former into the latter in various ways.
There is another perspective on modules over commutative rings which is central to algebraic geometry: modules are a specific type of sheaf which generalizes vector bundles. More precisely, a module over a commutative ring R is equivalent to a “quasicoherent sheaf” on the affine scheme Spec(R), and finitely generated projective modules correspond in this way to vector bundles over Spec(R). Once you internalise this equivalence, most of the basic theory of modules in commutative algebra becomes geometrically intuitive, and this is the basis for many further developments in algebraic geometry.
Modules are just much more flexible than ideals. Two major advantages:
Richer geometry. An ideal corresponds to a closed subscheme of Spec(R), while a module corresponds to a quasicoherent sheaf. An element x of M is a global section of the associated sheaf, and the ideal Ann(x) corresponds to the vanishing locus of that section. This leads to a nice geometric picture of associated primes and primary decomposition which explains how finitely generated modules are built out of modules R/P with P a prime ideal (I am not an algebraist at heart, so for me the only way to remember the statement of primary decomposition is to translate from geometry 😅)
Richer (homological) algebra. Modules form an abelian category in which ideals do not play an especially prominent role (unless one looks at the monoidal structure, but let’s not go there). The corresponding homological algebra (coherent sheaf cohomology, derived categories) is the core engine of modern algebraic geometry.
BTW the geometric perspective might sound abstract (and setting it up rigorously definitely is!) but it is in many ways more concrete than the purely algebraic one. For instance, a quasicoherent sheaf is, to a first approximation, a collection of vector spaces (over varying “residue fields”) glued together in a nice way over the topological space Spec(R), and this clarifies a lot about how and when questions about modules can be reduced to ordinary linear algebra over fields.
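As a toy illustration of that last point (my own example, nothing deep): take $R=\mathbb{Z}$ and $M=\mathbb{Z}\oplus\mathbb{Z}/p$. The fibres of the associated sheaf over the points of $\mathrm{Spec}(\mathbb{Z})$ are
$$ M\otimes_{\mathbb{Z}}\mathbb{Q}\cong\mathbb{Q},\qquad M\otimes_{\mathbb{Z}}\mathbb{F}_q\cong\mathbb{F}_q\ (q\neq p),\qquad M\otimes_{\mathbb{Z}}\mathbb{F}_p\cong\mathbb{F}_p^{\,2}, $$
so the fibre dimension jumps exactly at $p$, and the jump records the torsion part of $M$; for a finitely generated module over $\mathbb{Z}$, the module is projective (i.e. a vector bundle) precisely when no such jump occurs.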
I am a mathematician who is using category theory all the time in my work in algebraic geometry, so I am exactly the wrong audience for this write-up!
I think that talking about “bad definitions” and “confusing presentation” is needlessly confrontational. I would rather say that the traditional presentation of category theory is perfectly adapted to its original purpose, which is to organise and to clarify complicated structures (algebraic, topological, geometric, …) in pure mathematics. There the basic examples of categories are things like the category of groups, rings, vector spaces, topological spaces, manifolds, schemes, etc. and the notion of morphism, i.e. “structure-preserving map”, is completely natural.
As category theory is applied more broadly in computer science and the theory of networks and processes, it is great that new perspectives on the basic concepts are developed, but I think they should be thought of as complementary to the traditional view, which is extremely powerful in its domain of application.
Here are two more closely related results in the same circle of ideas. The first one gives a description (a kind of fusion of Dold-Thom and Eilenberg-Steenrod) of homology purely internal to homotopy theory, and the second explains how homological algebra falls out of infinity-category theory:
Consider functors E:S_* --> S_* from the infinity-category of pointed spaces to itself which commute with filtered colimits, carry pushout squares to pullback squares, send the one-point space to itself, and send the 0-sphere (aka two points!) to a discrete space. Then A=E(S^0) has a natural abelian group structure, E is (an infinity-categorical version of) the Dold-Thom functor, and pi_n E(X)=H_n(X,A) (reduced homology). In particular, E(S^n) is an Eilenberg-MacLane space K(A,n).
The category of functors E:S_* --> S_* satisfying all the properties above except the one about E(S^0) being discrete is a model for the infinity-category Sp of spectra, i.e. the “stabilization” (in a precise categorical sense) of the infinity-category of spaces. From this perspective, the functors from the previous point are called the Eilenberg-MacLane spectra HA. Moreover, the infinity-category of spectra has a symmetric monoidal structure (the “smash product”), HR is naturally an algebra object for this structure whenever R is a ring, and it makes sense to talk about the infinity-category LMod(HR) of left HR-modules in Sp. Then LMod(HR) is equivalent (essentially by a stable version of the Dold-Kan correspondence) to the derived infinity-category of left R-modules D(R). In other words, for homotopy theorists, (chain complexes, quasi-isomorphisms) are just a funny point-set model for HR-module spectra!
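To summarise both points in one line (the notation is mine, and I am sweeping standard identifications under the rug):
$$ \pi_n E(X)\;\cong\;\widetilde H_n(X;A),\qquad E(S^n)\simeq K(A,n),\qquad \mathrm{LMod}_{HR}(\mathrm{Sp})\;\simeq\;D(R). $$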
All of this is discussed in the first chapter of Lurie’s Higher Algebra, except the last point which is not completely spelled out because monoidal structures and modules are only introduced later on.
I should point out that this perspective is largely a reformulation of the results you already mentioned, and in itself it certainly does not bring new computational techniques for singular homology. However, it shows that 1) homological algebra comes out “structurally” from homotopy theory, which itself comes out “structurally” from infinity-category theory, and 2) homological algebra (including in more sophisticated contexts than just abelian groups, e.g. dg-categories), homotopy theory, sheaf theory… can be combined inside of a common flexible categorical framework, which elegantly subsumes previous point-set level techniques like model categories.
All the frames you are mentioning are good for intuition. I would say the deepest one is 4. and that everything falls into place cleanly once you formulate things in the language of infinity-category theory (at the price of a lot of technicalities to establish the “right” language). For example,
singular homology with coefficients in A can be characterised as the unique colimit-preserving infinity-functor from the infinity-category of spaces/homotopy types/infinity-groupoids/anima to the derived infinity-category of abelian groups which sends a one-point space (equivalently any contractible space) to A[0].
The derived infinity-category of abelian groups is itself in some sense the “(stable presentable) Z-linearization” of the infinity-category of spaces, although this is trickier to state precisely and I won’t try to do so here.
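In symbols (with my notation $\mathcal S$ for the infinity-category of spaces and $D(\mathbb Z)$ for the derived infinity-category of abelian groups, and modulo standard identifications): the chains functor
$$ C_*(-;A)\colon \mathcal S \longrightarrow D(\mathbb Z),\qquad H_n(X;A)\;\cong\;H_n\big(C_*(X;A)\big), $$
is, up to equivalence, the unique colimit-preserving functor with $C_*(\mathrm{pt};A)\simeq A[0]$.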
For me, the strongest argument in favor of evolutionary psychology is how well it works for explaining the social behaviours of non-human animals. I think this is important background material for understanding where evolutionary psychologists come from. I recommend paging through the following textbooks:
An Introduction to Behavioural Ecology, Krebs and Davies
Animal Behavior, Alcock
(Disclaimer: I have only read Alcock, but Krebs and Davies is supposed to be stronger and better organized from a theoretical point of view—Alcock has wonderful examples.)
Of course, human social behaviour is orders of magnitude more diverse and complicated than that of any other species—and even for other primates, one already needs to adopt the point of view of sociology and social psychology to get a good picture. But the premise that culture somehow freed us from all this background of behavioural adaptations is very strange, especially given the tendency of the evolutionary process to recycle everything in sight into new shapes and patterns.
There is another interesting connection between computation and bounded treewidth: the control flow graphs of programs written in languages “without goto instructions” have uniformly bounded treewidth (e.g. <7 for goto-free C programs). This is due to Thorup (1998):
https://www.sciencedirect.com/science/article/pii/S0890540197926973
Combined with graph algorithms for bounded-treewidth graphs, this has apparently been used in the analysis of compiler optimization and program verification problems; see this recent reference:
https://dl.acm.org/doi/abs/10.1145/3622807
which also proves a similar bound for pathwidth.
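As a toy illustration of the claim (not a proof; the tiny program, node names, and graph are my own made-up example), one can build the control flow graph of a small structured program and ask networkx for an upper bound on its treewidth via its min-degree heuristic:

```python
# Toy check: the CFG of a small goto-free program has small treewidth.
# The heuristic below only gives an upper bound on the true treewidth.
import networkx as nx
from networkx.algorithms import approximation as approx

# Control flow graph of:  while (c) { if (d) { A } else { B } }  followed by exit
cfg = nx.DiGraph()
cfg.add_edges_from([
    ("while_test", "if_test"),                 # enter the loop body
    ("while_test", "exit"),                    # leave the loop
    ("if_test", "A"), ("if_test", "B"),        # the two branches
    ("A", "while_test"), ("B", "while_test"),  # back edges to the loop test
])

# Treewidth is a property of the underlying undirected graph.
width, _decomposition = approx.treewidth_min_degree(cfg.to_undirected())
print("treewidth upper bound:", width)  # small, comfortably below Thorup's bound for C
```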
Is the following a fair summary of the thread ~up to “Natural degradation” from the SLT perspective?
Current SLT-inspired approaches are right to consider samples of the “tempered local Bayesian posterior” provided by SGLD as natural degradations of the model.
However they mostly only use those samples (at a fixed Watanabe temperature) to compute the expectation of the loss and the resulting LLC, because that is theoretically grounded by Watanabe’s work.
You suggest instead to compute, using those sampled weights, the expectations of more complicated observables derived from other interpretability methods, and to interpret those expectations using the “natural scale” heuristics laid out in the post.
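A minimal sketch of that last point, in case it makes the summary concrete (my own illustration with made-up function names; it omits the localization term around the trained weights and the careful temperature/step-size scaling used for proper LLC estimation, and uses mini-batch gradients):

```python
import torch

def sgld_observable_expectations(model, loss_fn, data_loader, observable_fns,
                                 n_steps=200, step_size=1e-5, beta=1.0):
    """Run a crude SGLD chain around the trained weights and average arbitrary
    observables over the sampled weights, not just the loss.
    observable_fns: dict name -> function(model) -> float, e.g. a probe accuracy
    or any other interpretability-derived quantity."""
    running = {name: 0.0 for name in observable_fns}
    data_iter = iter(data_loader)
    for _ in range(n_steps):
        try:
            x, y = next(data_iter)
        except StopIteration:
            data_iter = iter(data_loader)
            x, y = next(data_iter)
        loss = loss_fn(model(x), y)        # mini-batch estimate of the loss
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                # one Langevin step targeting (roughly) exp(-beta * loss)
                noise = torch.randn_like(p) * (2 * step_size / beta) ** 0.5
                p.add_(-step_size * p.grad + noise)
            # accumulate the observables at the sampled weights (ignoring burn-in)
            for name, fn in observable_fns.items():
                running[name] += fn(model) / n_steps
    return running
```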
Which formal properties of the KL-divergence do the proofs of your result use? It could be useful to make them all explicit to help generalize to other divergences or metrics between probability distributions.
Historically, commutative algebra came out of algebraic number theory, and the rings involved (Z, Z_p, number rings, p-adic local rings, …) are all, in the modern terminology, Dedekind domains.
Dedekind domains are not always principal, and this was the reason why mathematicians started studying ideals in the first place. However, the structure of finitely generated modules over Dedekind domains is still essentially determined by ideals (or rather fractional ideals), reflecting to some degree the fact that their geometry is simple (they are 1-dimensional regular Noetherian domains).
This could explain why there was a period during which ring theory developed around ideals but the need for modules had not yet become clear?
An especially important example of macro choice that deserves some thought is the choice of a professional activity. See 80000 Hours:
You might enjoy
which explains the role that the resulting problem (representing homology classes of manifolds by submanifolds/cobordisms) played in inspiring the work of René Thom on cobordism, stable homotopy theory, singularity theory...
Well, I can certainly empathize with the feeling that compromising on a core part of your identity is threatening ;-)
More seriously, what you are describing as empathy seems to be asking the question:
“What if my mind was transported into their bodies?”
rather than
“What if I was (like) them, including all the relevant psychological and emotional factors?”
The latter question should lead to feelings of disgust iff the target experiences feelings of disgust.
Of course, empathy is all the more difficult when the person you are trying to empathize with is very different from you. Being an outlier can clearly make this harder. But unless you have never experienced any flavour of learned helplessness/procrastination/akrasia, you have the necessary ingredients to extrapolate.