Experiments to understand Singular Learning Theory’s Free Energy & Local Learning Coefficient (LLC)
This post summarizes a collection of Singular Learning Theory experiments I ran during the fall of 2025. It assumes some familiarity with SLT, namely its famous asymptotic relating the free energy Fn to the local learning coefficient (LLC) λ. For a more in-depth description of these experiments, please see the arXiv preprint here. Please feel free to email me with any questions or feedback.
Brief Motivation
Singular Learning Theory (SLT) provides an elegant framework for understanding phase transitions in modern neural networks through a physics-inspired Bayesian learning perspective. In particular, SLT introduces two meaningful quantities: the model free energy Fn and the local learning coefficient (LLC) λ, a measure of model complexity. In my opinion, the formal definitions of these two quantities are significantly more complex and unintuitive than Statistical Learning Theory's trademark quantities such as VC dimension or Rademacher complexity. Given SLT's rise in AI safety research, better understanding the behavior of these two quantities is increasingly important.
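For reference, the asymptotic in question is Watanabe's free-energy expansion. Stated informally, around a local minimum w* of the population loss,

$$F_n \approx n L_n(w^*) + \lambda \log n,$$

up to lower-order terms, where Ln is the empirical loss (negative log-likelihood). The key point is that λ, rather than the raw parameter count, controls the log n complexity penalty.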
This work heavily leaned on Nina Panickssery and Dmitry Vaintrob’s experiments in their hackathon project.
Q1: Arrhenius Reaction-Rate Theory for Fn
Our first experiment tests the behavior of the free energy, taking inspiration from Arrhenius reaction-rate theory. In particular, that theory tells us that the time it takes for reactants in a chemical reaction to transform into products is exponential in the energy barrier. Now suppose we have a phase transition (e.g. grokking) starting at time i and ending at time j. We posit the following hypothesis on the time-to-transition ri→j:
$$r_{i\to j} \;\propto\; \exp\!\left(\beta_{\mathrm{eff}}\,\Delta F_{i\to j}\right), \qquad r_{i\to j} := j - i, \qquad \Delta F_{i\to j} := F_i - F_j$$
where ΔFi→j<0 and βeff is an effective inverse temperature depending on global hyperparameters (learning rate, batch size, etc.). We test this on two experimental setups, detailed below. For all experiments, we used the SGLD-MCMC approach of Lau et al. (2023) to estimate the local learning coefficient.
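As a rough illustration, here is a minimal sketch of that style of estimator, not the exact implementation we used; the step size eps, localization strength gamma, and chain length are illustrative placeholders:

```python
import numpy as np

def estimate_llc(loss, grad_loss, w_star, n, eps=1e-5, gamma=100.0,
                 n_steps=2_000, seed=0):
    """Sketch of an SGLD-based LLC estimate in the style of Lau et al. (2023):
    sample from a posterior localized at w_star, then use
    lambda_hat = n * beta * (E[L_n(w)] - L_n(w_star)) with beta = 1/log(n)."""
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)
    w, losses = w_star.copy(), []
    for _ in range(n_steps):
        # Langevin step on U(w) = n*beta*L_n(w) + (gamma/2)*||w - w_star||^2.
        drift = n * beta * grad_loss(w) + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + rng.normal(0.0, np.sqrt(eps), size=w.shape)
        losses.append(loss(w))
    return n * beta * (np.mean(losses) - loss(w_star))

# Toy check on a regular model, L(w) = ||w||^2 / 2 at w* = 0, whose true LLC is d/2.
d = 4
llc = estimate_llc(lambda w: 0.5 * w @ w, lambda w: w, np.zeros(d), n=10_000)
print(f"estimated LLC ~ {llc:.2f}  (regular-model value: {d/2})")
```

In practice the gradient is a minibatch estimate and a burn-in period is discarded; the sketch above skips both for brevity.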
Experiment 1: Grokking Modulo Arithmetic
For this experimental setup, we use the same neural network architecture presented in these experiments with p=53. Put simply, the task is for the neural network to learn the function f : Z_53 × Z_53 → Z_53 given by (a,b) ↦ a+b mod 53. We trained 500 models (each on 40% of the 53² = 2809 possible input pairs) and, for every model that grokked, estimated its change in free energy ΔFi→j and time-to-grok ri→j. In total, around 34% (168/500) of the models we trained grokked.
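For concreteness, here is a minimal sketch of the dataset construction (the architecture and training loop follow the linked experiments and are omitted):

```python
import numpy as np

p = 53
# All p^2 = 2809 input pairs, labeled by modular addition.
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

# 40% of pairs for training; the held-out pairs are used to detect grokking.
rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
n_train = int(0.4 * len(pairs))
X_train, y_train = pairs[idx[:n_train]], labels[idx[:n_train]]
X_test, y_test = pairs[idx[n_train:]], labels[idx[n_train:]]
```

The grokked models yielded the following plot: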
(Note the log scale on the horizontal axis.) While the linear fit is weak, it does match the expected downward trend. Encouraged by this, we moved on to testing the same hypothesis on another experimental setup.
Experiment 2: Anthropic’s Toy Models of Superposition (TMS)
This setup was largely similar to the previous one, except that a single model run can now have multiple phase transitions (as opposed to grokking, where there is just one). As such, we trained 60 models and, for each run, used a custom method to detect all transitions present (the number of which varied per run). We report a plot analogous to the first experiment's:
These results were clearly not what we were expecting. Two things to note:
These results were extremely sensitive to the transition-detection method. A more rudimentary detection method that did not involve time-based smoothing reported a downward slope instead.
We could not come up with a good explanation for the three vertical clusters. One conjecture is that the peculiarities of the TMS setup (e.g. its extremely low parameter count) force nearly fixed transition times. This, too, was highly sensitive to the transition-detection method.
Given how variable this experiment was and due to time constraints, we unfortunately had to move on to the second research question.
Q2: LLC λ and problem difficulty
The purpose of this research question was to build intuition for the local learning coefficient and how its measure of complexity scales with problem difficulty across a few simple hypothesis classes. It was again inspired by these experiments.
Experiment One: Univariate Polynomial Regressors
Consider some instance space X ⊂ R and define a polynomial regressor of degree d to be a function on X of the form ∑_{i=0}^{d} a_i x^i. For a given degree d and instance space X, we generated a dataset of 500 samples from a degree-d polynomial on X and trained 10 polynomial regressors to convergence. We then measured the LLC across these models to obtain estimated means and confidence intervals.
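A minimal sketch of one such run; the random degree-d target and plain gradient descent here are assumptions, since the exact data-generating process and optimizer are not specified above:

```python
import numpy as np

def run_poly_experiment(d, x_lo=-1.0, x_hi=1.0, n=500, lr=1e-2,
                        steps=50_000, seed=0):
    """Train a degree-d polynomial regressor on data from a random degree-d target."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(x_lo, x_hi, size=n)        # instance space X (|x| <= 1, per footnote)
    V = np.vander(x, d + 1, increasing=True)   # features [1, x, ..., x^d]
    y = V @ rng.normal(size=d + 1)             # random degree-d target polynomial
    a = 0.01 * rng.normal(size=d + 1)          # trainable coefficients
    for _ in range(steps):
        a -= lr * V.T @ (V @ a - y) / n        # full-batch MSE gradient step
    return a

coeffs = run_poly_experiment(d=5)
```

Our results are presented in the plot below for various choices of instance space: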
While seemingly simple, these results are actually meaningful. First observe that polynomial regressors are regular models, so SLT predicts their LLC to scale linearly with degree: λ = d/2. The theory is thus an overestimate of the actual empirical results. A reason for this is that the constrained instance space X induces singularities; in math, this refers to a case where ∃ f, g ∈ Polyreg(d) such that f ≠ g but f|X = g|X. These singularities lower the LLCs (as demonstrated by the sublinear behavior above). Moreover, more constrained domains yield more singularities and lower LLCs, which is also consistent with the plot above.
This matters because singularities enable generalization. The upshot of these results is that on practically constrained domains (e.g. images with pixel values in 0–255), models may have more generalization power than we think.
Experiment Two: Low-Rank Neural Nets
We define a simple linear neural network as f(x) = W2W1x, where W2 ∈ R^{d×r} and W1 ∈ R^{r×d} for constants r ≤ d. Then rank(W2W1) ≤ r, and moreover the model is a priori full of singularities: for any invertible G ∈ R^{r×r}, we have W2W1 = (W2G)(G⁻¹W1). For our experiments, we fix d := 100 and vary the rank r as the problem-difficulty knob.
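This gauge symmetry is easy to check numerically; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 100, 10
W2, W1 = rng.normal(size=(d, r)), rng.normal(size=(r, d))
G = rng.normal(size=(r, r))  # a random square matrix is almost surely invertible

# (W2 G)(G^{-1} W1) computes exactly the same function as W2 W1.
assert np.allclose(W2 @ W1, (W2 @ G) @ np.linalg.solve(G, W1))
```

Varying r yields the following plot: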
A posteriori, this result matches SLT expectations closely. Specifically, it is a standard fact from algebraic geometry that rank-r matrices form a smooth manifold of dimension r(2d−r); equivalently, the 2dr parameters of (W1, W2) minus the r² dimensions of the gauge symmetry above leave 2dr − r² = r(2d−r) effective directions. Hence the prediction λ_r = (1/2)·r(2d−r) = −0.5r² + 100r closely matches the quadratic fit we report in the plot above.
Experiment Three: Low-Rank Autoencoders
This final experiment is very similar to the previous one, again using rank as the problem-difficulty knob. Specifically, we generated each sample as x = Az for z ∼ N(0, I_r) and A ∈ R^{d×r}, where r ≤ d := 100.
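A minimal sketch of this data-generating process (whether A is fixed per run is an assumption; the autoencoder architecture is not specified above and is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 100, 10, 10_000
A = rng.normal(size=(d, r))   # mixing matrix (assumed fixed per run)
Z = rng.normal(size=(n, r))   # latent codes z ~ N(0, I_r)
X = Z @ A.T                   # samples x = A z, lying on a rank-r subspace
```

We present the plot below: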
This result does not align with SLT's expectations: recall that regular models typically exhibit LLCs that scale linearly with parameter count. Here, however, an autoencoder (which definitely has some symmetries, and so should be singular) exhibits exactly this linear scaling in the LLC. We were unfortunately unable to come up with a compelling explanation for why this occurred (perhaps there is some connection to non-linear PCA?).
Conclusion and Future Directions
These results are a mixed bag: some align with SLT expectations and others require further thought to understand. Overall, we still believe this work has value as a starting point for further investigation, and we would be grateful to hear any perspectives and interpretations you may have. We close with a few open directions for research:
Free-energy barriers and reaction rates. Develop a principled method to estimate the free-energy barrier(s) between two subsets of weight space. This would enable a more direct test of the Arrhenius-style reaction-rate hypothesis presented before.
Constrained domains induce singularities. Identify other standard deep-learning settings where natural domain constraints cause distinct parameterizations to yield the same function behavior. Then replicate the polynomial-regressor study of Q2, Experiment 1. This could clarify when such constraints meaningfully lower the LLC, which oftentimes enables generalization.
Comparing LLC across memorization-vs-generalization architectures. For a fixed task family (e.g. image classification), compare the LLC of a heavily over-parameterized memorization network against a more compact generalizing network, and see how the difference scales with problem difficulty (e.g. adding more classes).
Thank you! If you have any questions, please do not hesitate to email me.
For example, see the LLC Definition (Definition 1) provided in this paper: https://arxiv.org/abs/2308.12108.
All instance spaces had to have magnitude ≤ 1 for numerical stability, as we tested regressors with degrees up to 10³.