Thank you for writing this post, Dmitry. I’ve only skimmed it so far, but it clearly merits a deeper dive.
I will now describe a powerful, central circle of ideas I’ve been obsessed with for the past year, which I suspect is very close to the way you are thinking.
Free energy functionals
There is a very powerful, very central idea whose simplicity is somehow lost in physics obscurantism, which I will call, for lack of a better word, ‘tempered free energy functionals’.
Let us be given a loss function $L$ [physicists will prefer to think of this as an energy function/Hamiltonian]. The idea is that one considers a functional $F_{L,\beta}: \Delta(\Omega) \to \mathbb{R}$ taking a distribution $p$ and sending it to $F_{L,\beta}(p) = L(p) - \beta^{-1} H(p)$, where $\beta > 0$ is the inherent coolness or inverse temperature.
We are now interested in minimizers of this functional. The functional will typically be convex (e.g. if $L(p) = \mathrm{KL}(q \| p)$, the KL-divergence, or $L(p) = N L_N(p)$, the empirical loss at $N$ data points), so it has a unique minimizer. This minimizer is the tempered Bayesian posterior/Boltzmann distribution at inverse temperature $\beta$.
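To spell out why (a standard calculation, in my own notation, for the case where $L$ is an expected pointwise loss $L(p) = \mathbb{E}_p[\ell]$): adding a Lagrange multiplier $\lambda$ for normalization and setting the first variation to zero gives

$$\frac{\delta}{\delta p}\Big[\textstyle\int \ell\, p + \beta^{-1}\int p \log p + \lambda\big(\int p - 1\big)\Big] = \ell + \beta^{-1}(\log p + 1) + \lambda = 0 \quad\Longrightarrow\quad p_\beta(w) = \frac{e^{-\beta \ell(w)}}{Z(\beta)}.$$

Plugging $p_\beta$ back in gives $F_{L,\beta}(p_\beta) = -\beta^{-1}\log Z(\beta)$, the familiar free energy.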
I find the physics terminology inherently confusing. So instead of the mysterious word temperature, just think of $\beta$ as a variable that controls the tradeoff between loss and inherent simplicity bias/noise. In other words, $\beta$ controls the inherent noise.
SLT of course describes the value of the free energy functional at this minimizer, as a function of $N$, through the Watanabe free energy formula.
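Concretely, for readers who haven’t seen it, the expansion in question (Watanabe’s free energy formula, stated here from memory at $\beta = 1$) is

$$F_N = N L_N(w_0) + \lambda \log N - (m-1)\log\log N + O_p(1),$$

where $w_0$ is an optimal parameter, $\lambda$ is the learning coefficient (RLCT), and $m$ is its multiplicity.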
Another piece of the story is that the [continuum limit of] stochastic gradient Langevin dynamics at a given noise level is equivalently gradient descent on the free energy functional [at the given noise level, in the Wasserstein metric]; this is essentially the Jordan-Kinderlehrer-Otto theorem.
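Here is a minimal sketch of the discrete-time picture, assuming plain unadjusted Langevin dynamics on a toy quadratic loss (the function names and the toy loss are my own illustration, not anything from the post):

```python
import numpy as np

def langevin_step(w, grad_loss, beta, step_size, rng):
    """One step of unadjusted Langevin dynamics.

    The stationary distribution of the continuum limit is the
    Boltzmann/tempered posterior p(w) ~ exp(-beta * loss(w)).
    """
    noise = rng.standard_normal(w.shape)
    return w - step_size * grad_loss(w) + np.sqrt(2 * step_size / beta) * noise

# Toy example: loss(w) = w^2 / 2, so the stationary law is
# Gaussian with variance 1 / beta. We check that empirically.
rng = np.random.default_rng(0)
beta, step_size = 4.0, 1e-2
grad_loss = lambda w: w

w = np.zeros(1000)  # 1000 independent chains
for _ in range(5000):
    w = langevin_step(w, grad_loss, beta, step_size, rng)

print(np.var(w))  # ~ 1 / beta = 0.25
```

Turning the noise level $\beta^{-1}$ up or down moves the stationary law along the family of tempered posteriors, which is exactly the knob described above.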
Rate-distortion theory
Instead of a free energy functional, it is better to think of it as a complexity-accuracy functional.
This is the basic idea of rate-distortion theory. I note that there is a very important but little-known, purely algorithmic version of this theory. See here for an expansive breakdown of more of these ideas.
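To make the analogy concrete (standard rate-distortion theory, in my notation): the rate-distortion function and its Lagrangian relaxation

$$R(D) = \min_{p(\hat{x}|x):\, \mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X}), \qquad \mathcal{L}_\beta = I(X;\hat{X}) + \beta\, \mathbb{E}[d(X,\hat{X})],$$

have exactly the shape of the tempered functional above: mutual information plays the role of the entropy/complexity term, expected distortion plays the role of the loss, and sweeping $\beta$ traces out the complexity-accuracy curve.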
Working in this generality, it can be shown that every phase transition diagram is possible. There are also connections with Natural Abstractions/sufficient statistics and time complexity.
Thanks! Yes, the temperature picture is the direction I’m going in. I had heard the term “rate distortion”, but didn’t realize the connection with this picture. Might have to change the language for my next post.