Experiments to understand Singular Learning Theory’s Free Energy & Local Learning Coefficient (LLC)
This post summarizes a collection of Singular Learning Theory experiments I ran during the fall of 2025. It assumes some familiarity with SLT, namely its famous asymptotic relating the free energy Fn to the local learning coefficient (LLC) λ. For a more in-depth description of these experiments, please see the arXiv preprint here. Please feel free to email me with any questions or feedback.
Brief Motivation
Singular Learning Theory (SLT) provides an elegant framework for understanding phase transitions in modern neural networks through a physics-inspired Bayesian learning perspective. In particular, SLT introduces two meaningful quantities: the model free energy Fn and the local learning coefficient (LLC) λ, a measure of model complexity. In my opinion, the formal definitions of these two quantities are significantly more complex and unintuitive than Statistical Learning Theory's trademark quantities such as VC dimension or Rademacher complexity. Given SLT's rise in AI safety research, better understanding the behavior of these two quantities is increasingly important.
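For reference, the asymptotic in question is Watanabe's free-energy expansion. Stated informally, around a local minimum w* of the population loss,

$$F_n \approx n L_n(w^*) + \lambda \log n,$$

up to lower-order terms, where Ln is the empirical loss (negative log-likelihood). The key point is that λ, rather than the raw parameter count, controls the log n complexity penalty.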
This work heavily leaned on Nina Panickssery and Dmitry Vaintrob’s experiments in their hackathon project.
Q1: Arrhenius Reaction-Rate Theory for Fn
Our first experiment tests the behavior of the free energy, taking inspiration from Arrhenius reaction-rate theory. In particular, that theory tells us that the time it takes for reactants in a chemical reaction to transform into products is exponential in the energy barrier. Now suppose we have a phase transition (e.g. grokking) starting at time i and ending at time j. We posit the following hypothesis on the time-to-transition ri→j:
$$r_{i\to j} \;\propto\; \exp\!\left(\beta_{\mathrm{eff}}\,\Delta F_{i\to j}\right), \qquad r_{i\to j} := j - i, \qquad \Delta F_{i\to j} := F_i - F_j$$
where ΔFi→j<0 and βeff is an effective inverse temperature depending on global hyperparameters (learning rate, batch size, etc.). We test this on two experimental setups, detailed below. For all experiments, we used the SGLD-MCMC approach of Lau et al. (2023) to estimate the local learning coefficient.
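As a rough illustration, here is a minimal sketch of that style of estimator, not the exact implementation we used; the step size eps, localization strength gamma, and chain length are illustrative placeholders:

```python
import numpy as np

def estimate_llc(loss, grad_loss, w_star, n, eps=1e-5, gamma=100.0,
                 n_steps=2_000, seed=0):
    """Sketch of an SGLD-based LLC estimate in the style of Lau et al. (2023):
    sample from a posterior localized at w_star, then use
    lambda_hat = n * beta * (E[L_n(w)] - L_n(w_star)) with beta = 1/log(n)."""
    rng = np.random.default_rng(seed)
    beta = 1.0 / np.log(n)
    w, losses = w_star.copy(), []
    for _ in range(n_steps):
        # Langevin step on U(w) = n*beta*L_n(w) + (gamma/2)*||w - w_star||^2.
        drift = n * beta * grad_loss(w) + gamma * (w - w_star)
        w = w - 0.5 * eps * drift + rng.normal(0.0, np.sqrt(eps), size=w.shape)
        losses.append(loss(w))
    return n * beta * (np.mean(losses) - loss(w_star))

# Toy check on a regular model, L(w) = ||w||^2 / 2 at w* = 0, whose true LLC is d/2.
d = 4
llc = estimate_llc(lambda w: 0.5 * w @ w, lambda w: w, np.zeros(d), n=10_000)
print(f"estimated LLC ~ {llc:.2f}  (regular-model value: {d/2})")
```

In practice the gradient is a minibatch estimate and a burn-in period is discarded; the sketch above skips both for brevity.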
Experiment 1: Grokking Modulo Arithmetic
For this experimental setup, we use the same neural network architecture presented in these experiments with p=53. Put simply, the task is for the neural network to learn the function f : Z_53 × Z_53 → Z_53 given by (a,b) ↦ a+b mod 53. We trained 500 models (each on 40% of the 53² = 2809 possible input pairs) and, for every model that grokked, estimated its change in free energy ΔFi→j and time-to-grok ri→j. In total, around 34% (168/500) of the models we trained grokked.
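For concreteness, here is a minimal sketch of the dataset construction (the architecture and training loop follow the linked experiments and are omitted):

```python
import numpy as np

p = 53
# All p^2 = 2809 input pairs, labeled by modular addition.
pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

# 40% of pairs for training; the held-out pairs are used to detect grokking.
rng = np.random.default_rng(0)
idx = rng.permutation(len(pairs))
n_train = int(0.4 * len(pairs))
X_train, y_train = pairs[idx[:n_train]], labels[idx[:n_train]]
X_test, y_test = pairs[idx[n_train:]], labels[idx[n_train:]]
```

The grokked models yielded the following plot: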
(Note the log scale on the horizontal axis.) While the linear fit is weak, it does match the expected downward trend. Encouraged by this, we moved on to testing the same hypothesis on another experimental setup.
Experiment 2: Anthropic’s Toy Models of Superposition (TMS)
This setup was largely similar to the previous one, except that a single model run can now have multiple phase transitions (as opposed to grokking, where there is just one). As such, we trained 60 models and, for each run, used a custom method to detect all transitions present (the number of which varied per run). We report a plot analogous to the first experiment's:
These results were clearly not what we were expecting. Two things to note:
These results were extremely sensitive to the transition-detection method. A more rudimentary detection method that did not involve time-based smoothing reported a downward slope instead.
We could not come up with a good explanation for the three vertical clusters. One conjecture is that the peculiarities of the TMS setup (e.g. its extremely low parameter count) force nearly fixed transition times. This, too, was highly sensitive to the transition-detection method.
Given how variable this experiment was and due to time constraints, we unfortunately had to move on to the second research question.
Q2: LLC λ and problem difficulty
The purpose of this research question was to build intuition for the local learning coefficient and how its measure of complexity scales with problem difficulty across a few simple hypothesis classes. It was again inspired by these experiments.
Experiment One: Univariate Polynomial Regressors
Consider some instance space X ⊂ R and define a polynomial regressor of degree d to be a function on X of the form ∑_{i=0}^{d} a_i x^i. For a given degree d and instance space X, we generated a dataset of 500 samples from a degree-d polynomial on X and trained 10 polynomial regressors to convergence. We then measured the LLC across these models to obtain estimated means and confidence intervals.
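A minimal sketch of one such run; the random degree-d target and plain gradient descent here are assumptions, since the exact data-generating process and optimizer are not specified above:

```python
import numpy as np

def run_poly_experiment(d, x_lo=-1.0, x_hi=1.0, n=500, lr=1e-2,
                        steps=50_000, seed=0):
    """Train a degree-d polynomial regressor on data from a random degree-d target."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(x_lo, x_hi, size=n)        # instance space X (|x| <= 1, per footnote)
    V = np.vander(x, d + 1, increasing=True)   # features [1, x, ..., x^d]
    y = V @ rng.normal(size=d + 1)             # random degree-d target polynomial
    a = 0.01 * rng.normal(size=d + 1)          # trainable coefficients
    for _ in range(steps):
        a -= lr * V.T @ (V @ a - y) / n        # full-batch MSE gradient step
    return a

coeffs = run_poly_experiment(d=5)
```

Our results are presented in the plot below for various choices of instance space: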
While seemingly simple, these results are actually meaningful. First observe that polynomial regressors are regular models, so SLT predicts their LLC to scale linearly with degree: λ = d/2. The theory is thus an overestimate of the actual empirical results. A reason for this is that the constrained instance space X induces singularities; in math, this refers to a case where ∃ f, g ∈ Polyreg(d) such that f ≠ g but f|X = g|X. These singularities lower the LLCs (as demonstrated by the sublinear behavior above). Moreover, more constrained domains yield more singularities and lower LLCs, which is also consistent with the plot above.
This matters because singularities enable generalization. The upshot of these results is that on practically constrained domains (e.g. images with pixel values in 0–255), models may have more generalization power than we think.
Experiment Two: Low-Rank Neural Nets
We define a simple linear neural network as f(x) = W2W1x, where W2 ∈ R^{d×r} and W1 ∈ R^{r×d} for constants r ≤ d. Then rank(W2W1) ≤ r, and moreover the model is a priori full of singularities: for any invertible G ∈ R^{r×r}, we have W2W1 = (W2G)(G⁻¹W1). For our experiments, we fix d := 100 and vary the rank r as the problem-difficulty knob.
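This gauge symmetry is easy to check numerically; a quick sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 100, 10
W2, W1 = rng.normal(size=(d, r)), rng.normal(size=(r, d))
G = rng.normal(size=(r, r))  # a random square matrix is almost surely invertible

# (W2 G)(G^{-1} W1) computes exactly the same function as W2 W1.
assert np.allclose(W2 @ W1, (W2 @ G) @ np.linalg.solve(G, W1))
```

Varying r yields the following plot: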
A posteriori, this result matches SLT expectations closely. Specifically, it is a standard fact from algebraic geometry that rank-r matrices form a smooth manifold of dimension r(2d−r); equivalently, the 2dr parameters of (W1, W2) minus the r² dimensions of the gauge symmetry above leave 2dr − r² = r(2d−r) effective directions. Hence the prediction λ_r = (1/2)·r(2d−r) = −0.5r² + 100r closely matches the quadratic fit we report in the plot above.
Experiment Three: Low-Rank Autoencoders
This final experiment is very similar to the previous one, again using rank as the problem-difficulty knob. Specifically, we generated each sample as x = Az for z ∼ N(0, I_r) and A ∈ R^{d×r}, where r ≤ d := 100.
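A minimal sketch of this data-generating process (whether A is fixed per run is an assumption; the autoencoder architecture is not specified above and is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 100, 10, 10_000
A = rng.normal(size=(d, r))   # mixing matrix (assumed fixed per run)
Z = rng.normal(size=(n, r))   # latent codes z ~ N(0, I_r)
X = Z @ A.T                   # samples x = A z, lying on a rank-r subspace
```

We present the plot below: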
This result does not align with SLT's expectations: recall that regular models typically exhibit LLCs that scale linearly with parameter count. Here, however, an autoencoder (which definitely has some symmetries, and so should be singular) exhibits exactly this linear scaling in the LLC. We were unfortunately unable to come up with a compelling explanation for why this occurred (perhaps there is some connection to non-linear PCA?).
Conclusion and Future Directions
These results are a mixed bag: some align with SLT expectations and others require further thought to understand. Overall, we still believe this work has value as a starting point for further investigation, and we would be grateful to hear any perspectives and interpretations you may have. We close with a few open directions for research:
Free-energy barriers and reaction rates. Develop a principled method to estimate the free-energy barrier(s) between two subsets of weight space. This would enable a more direct test of the Arrhenius-style reaction-rate hypothesis presented before.
Constrained domains induce singularities. Identify other standard deep-learning settings where natural domain constraints cause distinct parameterizations to yield the same function behavior. Then replicate the polynomial-regressor study of Q2, Experiment 1. This could clarify when such constraints meaningfully lower the LLC, which oftentimes enables generalization.
Comparing LLC across memorization-vs-generalization architectures. For a fixed task family (e.g. image classification), compare the LLC of a heavily over-parameterized memorization network against a more compact generalizing network, and see how the difference scales with problem difficulty (e.g. adding more classes).
Thank you! If you have any questions, please do not hesitate to email me.
For example, see the LLC Definition (Definition 1) provided in this paper: https://arxiv.org/abs/2308.12108.
All instance spaces had to have magnitude ≤ 1 for numerical stability, as we tested regressors with degrees up to 10³.