TLDR: The simplicity bias in Bayesian statistics is not just a bias towards short description length.

The folklore relating the simplicity bias in Bayesian statistics to description length is incomplete: while it is true that the fewer parameters you use the better, the true complexity measure which appears in the mathematical theory of Bayesian statistics (that is, singular learning theory) is more exotic. The content of this complexity measure remains quite mysterious, but in this note we point out that in a particular setting it includes a bias towards runtime error-correction. This suggests caution when reasoning about the role of inductive biases in neural network training.

Acknowledgements. Thanks to Jesse Hoogland, Liam Carroll, Rumi Salazar and Simon Pepin Lehalleur for comments.

1. Background

1.1 Relevance to Deep Learning

Consider the problem of solving an ordinary differential equation. A constructive proof involves actually writing down a solution, or an algorithm that in finite time will produce a solution. The Picard-Lindelöf theorem proves that a solution to a broad class of initial value problems exists, but the proof is not constructive: it sets up a contraction mapping on a complete metric space and appeals to the Banach fixed point theorem.

While the Picard-Lindelöf theorem uniquely characterises the solution as the fixed point of a contraction mapping, and gives an iterative process for approximating the solution, it does not construct the solution. However a construction is not necessary for many of the applications of Picard-Lindelöf (in differential geometry, topology and many parts of analysis). This mode of reasoning about mathematical objects, where it suffices to have characterised^[1] them by (universal) properties, is pervasive in modern mathematics (in the above example, the characterising property is the differential equation, or its associated contraction mapping). However this may seem quite alien to a computer scientist or programmer, who for historical reasons tend to think that there is only one mode of reasoning about mathematical objects, and that is centred on the study of a construction.

In an era where programs are increasingly the product of gradient descent rather than human construction, this attitude is untenable. We may have to accept a mode of reasoning about learned programs, based on understanding the nature of the problems to which they are a solution and the iterative processes that produce them. To understand the implicit algorithms learned by neural networks, it may be necessary from this perspective to understand

the computational structures latent in the data distribution, and
the inductive biases of neural network training.

We do not currently have a good understanding of these matters. If we understood these inductive biases better, it could conceivably help us in the context of AI alignment to answer questions like “how likely is deceptive alignment”, “how likely is consequentialism”, and “what goals are instrumentally convergent”?

This note is about the inductive biases of the Bayesian learning process (conditioned on more samples, the posterior increasingly localises around true parameters). Since Bayesian statistics is both fundamental and theoretically tractable, this seems potentially useful for understanding the inductive biases of neural network training. However it is worth noting that the relation between these is not understood at present.

1.2 Singular Learning Theory

The asymptotic expansion of the Bayesian free energy, or “free energy formula″, proven by Watanabe in Singular Learning Theory (SLT) introduces the learning coefficient $λ$ as a measure of complexity that balances model accuracy in the process of model selection.

In models that are regular or minimally singular the learning coefficient is easy to understand: it is the effective number of parameters in the model, in a suitable sense. More precisely if a parameter is irrelevant, in the sense that any variation does not change the prediction, this leads to a decrease of $\frac{1}{2}$ in the learning coefficient and thus $2 λ$ agrees with an effective parameter count in minimally singular models.^[2] Therefore, in these cases, the tradeoff between model accuracy and complexity embodied in the minimisation of the free energy is familiar.

However, in more degenerate models such as neural networks, this connection between the learning coefficient and parameter counting breaks down. In particular, the learning coefficient depends on the data distribution. This is not an obstacle to theoretically deriving or empirically estimating the learning coefficient, but it does mean that we may lack good intuitions for what this (positive, rational) number is measuring.

1.3 Minimum Description Length

There is a circle of ideas containing Occam’s razor, Kolmogorov complexity and the Minimum Description Length (MDL) which strongly informs the intuitions in the machine learning community about the meaning of “simplicity” and its role in determining the inductive biases of neural network training. However it is important to note that the mathematical hypotheses required for the attractive coherence among these ideas do not apply to neural networks.

A good reference for the classical treatment of the MDL is

Grünwald, P. D. (2007). The Minimum Description Length Principle, MIT press.

For the relation of the MDL to Bayesian model selection and the asymptotic expansion of the free energy (in the minimally singular case) see

Balasubramanian, V. (1997). Statistical inference, Occam’s razor, and statistical mechanics on the space of probability distributions, Neural Computation, 9(2), 349-368.

There has been some attempt in recent years to address the fact that the classical treatment of MDL is not applicable to singular models like neural networks:

Dwevedi, R., Singh, C., Yu, B., Wainwright, M., (2023), Revisiting minimum description length complexity in overparametrized models.

It seems that using SLT one could give a generally correct treatment of MDL. However, until such results are established, one should not presume any fundamental theoretical connection between the simplicity bias of Bayesian statistics of neural networks and any simple notion of “description length”.

2. Turing Machines With Errors

In this section we give an example where the complexity measure in the free energy asymptotic for singular statistical models (the learning coefficient) is sensitive to algorithmic structure that goes beyond the number of effective parameters.

The example is derived from a literature that has attempted to bridge semantics of logic with statistical learning theory, based on ground-breaking work of Ehrhard-Regnier on differential linear logic. We have tried to make the presentation here self-contained, but the reader can find more information in the following references:

Clift, J., (2017), Turing Machines and Differential Linear Logic, MSc thesis.
Clift, J., Murfet, D., Wallbridge, J. (2021), Geometry of Program Synthesis.
Waring, T., (2021), Geometric Perspectives on Program Synthesis and Semantics, MSc thesis.

We consider the problem of predicting the outputs of a computable generating process by finding codes for a Universal Turing Machine (UTM). We have in mind a standard kind of UTM $U$ with a description tape (where we write the code which specifies the machine to be simulated), a state tape (where the state of the simulated machine is written) and a work tape (containing the state of the tape of the simulated machine). Some input sequence $x \in Σ^{*}$ is written to the work tape, a code $c$ is written to the description tape, an initial state is written to the state tape, and then the UTM proceeds to simulate the Turing Machine $M$ with code $c$ until it halts with the output $M (x)$ on the work tape, if it halts.

We can consider this process as actually instantiated in a machine or, more abstractly, an automata. The role of error in such processes is very interesting and leads ultimately to the problem of run-time error correction in modern computing: every step of communication or processing in a computer involves some probability of error and, although small, the large number of steps and size of the messages involved means that some error correction may be necessary for meaningful computation to take place.

We suppose that some computable process in the environment is outputting strings $y \in Σ^{*}$ according to some true distribution $q (y | x)$ , with each symbol having some small error from the output $T (x)$ of a TM $T$ . The problem of statistical inference is to figure out which TM it is, from a given set of input-output pairs $D_{n} = {(x_{i}, y_{i})}_{i = 1}^{n}$ . Of course there will be (infinitely) many TMs consistent with the set of samples $D_{n}$ , so we expect that the result of statistical inference will be a probability distribution over codes. This is what we call the Bayesian posterior $p (c | D_{n})$ . We might bias this towards shorter machines by putting a prior $φ (c)$ over the space of models (codes).

This is an old problem, but we consider it from a new angle. Rather than considering a discrete space of codes with a special prior, we extend our set of allowed models of the generating process beyond TMs (as represented by their codes) to include codes with errors. In just the same way that each “organ″ of the automata^[3] in von Neumann’s work may have some probability of error, we allow each symbol $c_{i}$ of the code $c$ of $M$ to have some probability of error when it is read by the UTM. The specification of the allowed distribution of errors for each symbol in $c$ defines a point in the space $W$ of codes-with-errors, which we take as the parameter space of our new extended statistical model.

Given one of these codes-with-errors $w \in W$ and an input sequence $x \in Σ^{*}$ the contents of the work tape of $U$ after some given number of steps depends on the errors encountered during the execution of the UTM; that is, it is a probability distribution $p (y | x, w)$ . This is our model. Those parameters $w \in W$ which lead to final work tapes close to the true output $y$ , in the sense that the KL divergence between the two probability distributions is small, are more highly preferred by the Bayesian posterior $p (w | D_{n})$ , which is now defined on the extended space of models $W$ .

2.1 Error as a Probe of Structure

Of course, if the generating process $T$ is computable then its distribution of outputs $q (y | x)$ can be realised by a model $p (y | x, w)$ where $w$ has no errors (for example, the code of $T$ ). So what is there to be gained by considering this larger space of models?

The distribution over outputs that results from a given distribution of possible errors in reading a symbol $c_{i}$ of the code $c$ reflects the way that the symbol is used in the computation. While this is clear informally (to understand how something works, try perturbing one of its components and see what happens) this can be given a formal interpretation in the context of differential linear logic, where there is a connection between this sensitivity analysis and the Taylor series expansion of a proof. The simplest example is that if the introduction of errors in $c_{i}$ does not affect the output distribution at all then it is reasonable to conclude that $c_{i}$ is irrelevant to the computation. This already suggests a natural connection between the local learning coefficient $λ (c)$ and the program length of $c$ whereby the posterior $p (w | D_{n})$ tends to concentrate around shorter programs which predict the dataset $D_{n}$ to the same degree of accuracy. We will turn to more interesting examples soon.

The upshot is that given a TM $M$ with code $c$ the way that the probability distribution $p (y | x, w)$ varies when we vary $w$ near $c \in W$ (noting that $p (M (x) | x, c) = 1$ ) encodes a lot of information about the structure of the algorithm $M$ when executed on the input $x$ , via the effects of perturbing each one of the bits in the code $c$ . This variation in turn is reflected in the local geometry of the function

K (w) = KL (q (x, y) | | p (x, y | w))

near $c$ , of which the local learning coefficient $λ (c)$ is a scalar invariant. The conclusion is that the algebraic geometry of the function germ $(K, c)$ should be expected to reflect the internal structure of the TM $M$ to some degree. At the moment it remains unclear how strong this connection between geometry of $K$ and internal structure of $M$ actually is.

2.2 The Example

We fix an input $x \in Σ^{*}$ for which the true output is $T (x) \in Σ$ . For simplicity we assume this is a single symbol and that we judge models $p (y | x, w)$ on their predictions for the contents of this single tape square. The contribution of the data sample $x$ to the KL divergence $K (w)$ can be shown to be equivalent^[4] to $\int q (x) H (x, w) d x$ where

H (x, w) = \sum σ \in Σ (δ_{σ = T (x)} - p (σ | x, w))^{2}

which is a polynomial in the coefficients specifying the probability distributions that make up $w$ .

Let us now consider the local behaviour of $K$ (equivalently $H$ ) at a point of parameter space which represents a TM $M$ with code $c$ . We fix a single symbol of this code and a perturbation of this code in some direction $u$ . That is, we consider a parameter $w$ which agrees with $c$ except that in the $i$ th position we replace the symbol $c_{i}$ by the distribution $c_{i} + ε u$ . For concreteness let us take the original symbol of the code to be $c_{i} = 0 -$ and take $u$ to be the formal linear combination of symbols $1 - - 0 -$ so that for small $ε$ , $c_{i} + ε u$ is a probability distribution representing a small chance of reading a $1 -$ instead of a $0 -$ . Then we can expand $H (x, w)$ in powers of $ε$

H (x, w) = h_{0} + h_{1} ε + h_{2} ε^{2} + \dots

where each $h_{i}$ is a scalar (since we assume there is no uncertainty in the other positions of $c$ ). Taking $ε \to 0$ we obtain $H (x, c)$ which we may assume is zero; that is, we assume that $M$ gives the correct answer $M (x) = T (x)$ for the input $x$ . Thus $h_{0} = 0$ and we have

H (x, w) = \sum i > 0 h_{i} ε^{i} .

We are interested in the order of this polynomial, that is the least $i$ such that $h_{i} \neq 0$ .

To go further we have to think in more detail about how the UTM $U$ computes the probability $p (σ | x, w)$ that is, how the uncertainty in the code $w$ is propagated to uncertainty in the output. Since we run the simulated machine for some fixed number of steps, the UTM itself is run for some number of steps, and thus makes use of exactly $l$ samples from the $i$ th symbol for some $l$ that is independent of $x$ . The possible trajectories of $U$ are thus dictated by the sequences of possible results $μ = (μ_{1}, \dots, μ_{l}) \in Σ^{l}$ and to each we assign a probability

P (μ) = P (μ_{1}) \dots P (μ_{l})

In our example, the distribution in $w$ at the $i$ th position is

c_{i} + ε u = 0 - + ε (1 - - 0 -) = (1 - ε) 0 - + ε 1 -

so that we need only consider sequences where samples $0 -, 1 -$ appear, and

P (μ) = (1 - ε)^{a} ε^{b} = [a \sum j = 0 (\frac{a}{j}) (- 1)^{j} ε^{j}] ε^{b} = a \sum j = 0 (\frac{a}{j}) (- 1)^{j} ε^{b + j}

where $a$ is the number of times $0 -$ is sampled in $μ$ .

One can compute that when you run the UTM $U$ as above, with some probability of error each time it reads the symbol $c_{i}$ given by

p (σ | x, w) = \sum μ δ_{σ = U (μ, x)} P (μ)

where the sum is over all executions of $U$ making use of a sequence of samples $μ$ that, on input $x$ , leads to the output symbol $σ$ . Note that these executions do not involve any uncertainty. To compute $U (μ, x)$ we run the UTM as usual, but intervene in any step where the UTM goes to read from the $i$ th symbol of the code, so that on the $r$ th read for $1 \leq r \leq l$ the UTM sees $μ_{r}$ .

Combining the above equations

\begin{matrix} h_{1} & = \frac{\partial}{\partial ε} H (x, w) {∣ ∣}_{ε = 0} = \sum σ \in Σ \frac{\partial}{\partial ε} (δ_{σ = T (x)} - p (σ | x, w))^{2} {∣ ∣}_{ε = 0} = - \sum σ \in Σ 2 (δ_{σ = T (x)} - p (σ | x, w)) {∣ ∣}_{ε = 0} \frac{\partial}{\partial ε} p (σ | x, w) {∣ ∣}_{ε = 0} \end{matrix}

which is zero since when $ε = 0$ we have $w = c$ and $p (T (x) | x, c) = 1$ so the first factor in summand vanishes. Thus $h_{1} = 0$ . For $t \geq 2$

\begin{matrix} h_{t} & = \frac{1}{t!} \frac{\partial^{t}}{\partial ε^{t}} H (x, w) {∣ ∣}_{ε = 0} = \frac{1}{t!} \sum σ \in Σ \frac{\partial^{t}}{\partial ε^{t}} (δ_{σ = T (x)} - p (σ | x, w))^{2} {∣ ∣}_{ε = 0} = - \frac{2}{t!} \sum σ \in Σ \frac{\partial^{t - 1}}{\partial ε^{t - 1}} {(δ_{σ = T (x)} - p (σ | x, w)) \frac{\partial}{\partial ε} p (σ | x, w)} {∣ ∣}_{ε = 0} = \frac{2}{t!} \sum σ \in Σ t - 1 \sum s = 1 (\frac{t - 1}{s}) {\frac{\partial^{s}}{\partial ε^{s}} p (σ | x, w) \frac{\partial^{t - s}}{\partial ε^{t - s}} p (σ | x, w)} {∣ ∣}_{ε = 0} . \end{matrix}

For $s \geq 1$

\begin{matrix} \frac{\partial^{s}}{\partial ε^{s}} p (σ | x, w) {∣ ∣}_{ε = 0} & = \sum μ δ_{σ = U (μ, x)} \frac{\partial^{s}}{\partial ε^{s}} P (μ) {∣ ∣}_{ε = 0} = \sum μ δ_{σ = U (μ, x)} a \sum j = 0 (\frac{a}{j}) (- 1)^{j} \frac{\partial^{s}}{\partial ε^{s}} ε^{b + j} {∣ ∣}_{ε = 0} . \end{matrix}

Recall that $b = b (μ)$ is $l - a (μ)$ where $a (μ)$ is the number of $0 -$ ‘s appearing in the sequence $μ$ . That is, $b (μ)$ is the number of $1 -$ ’s that are sampled, or the number of errors. The derivative is nonzero only when $s = b + j$ which for $0 \leq j \leq a$ is only possible (given that $s \geq 1$ and $0 \leq b (μ) \leq l$ ) is only possible if $b \leq s \leq l$ . That is, the only “paths″ or sequences of samples $μ$ that contribute to the $s$ -fold derivative are ones which contain $\leq s$ errors.

\frac{\partial^{s}}{\partial ε^{s}} p (σ | x, w) {∣ ∣}_{ε = 0} = s! \sum b (μ) \leq s (- 1)^{s - b} δ_{σ = U (μ, x)} (\frac{a}{s - b}) .

We can use this formula to compute the coefficient $h_{t}$ .

Definition. We say that the Turing Machine $M$ with code $c$ is robust to $s$ errors on input $x$ in position $i$ if, for any execution path $μ$ of the UTM $U$ initialised with $c$ on the description tape and $x$ on the work tape involving $b (μ) \leq s$ errors, $U (μ, x) = M (x)$ .

Lemma. If $M$ is robust to $s$ errors on input $x$ in position $i$ then $h_{t} = 0$ for $t \leq s + 1$ . That is, the order of $H (x, w)$ in $ε$ is strictly greater than $s + 1$ .

Proof. We set $A_{σ, j} = \frac{\partial^{j}}{\partial ε^{j}} p (σ | x, w) {∣ ∣}_{ε = 0}$ . If $M$ is robust to $s$ errors in the stated sense, then it is clear from the above that $A_{σ, j} = 0$ when $σ \neq M (x)$ and $j \leq s$ since it is never true that $U (μ, x) = σ$ when $μ$ is such that $b (μ) \leq j$ . In the case where $σ = M (x)$ and $j \leq s$

\begin{matrix} A_{σ, j} & = j! \sum b \leq j \sum b (μ) = b (- 1)^{j - b} (\frac{a}{j - b}) = j! \sum b \leq j (- 1)^{j - b} (\frac{l}{b}) (\frac{l - b}{j - b}) = \frac{l!}{(l - j)!} \sum b \leq j (- 1)^{j - b} \frac{j!}{b! (j - b)!} = \frac{l!}{(l - j)!} \sum b \leq s (- 1)^{j - b} (\frac{j}{b}) = 0 . \end{matrix}

Hence for any $σ \in Σ$ and any $j \leq s$ we have $A_{σ, j} = 0$ . Thus for $t \leq s + 1$

\begin{matrix} h_{t} & = \frac{2}{t!} \sum σ \in Σ t - 1 \sum j = 1 (\frac{t - 1}{j}) A_{σ, j} A_{σ, t - j} = \frac{2}{t!} t - 1 \sum j = 1 (\frac{t - 1}{j}) A_{M (x), j} A_{M (x), t - j} = 0 \end{matrix}

since every summand involves a term $A_{M (x), j}$ with $j \leq s$ . $□$

Example. Since any TM is robust to zero errors, we always have $h_{1} = 0$ . If $M$ is robust to one error on input $x$ then $h_{2} = 0$ , so the polynomial $H (x, w)$ is at least cubic in $ε$ .

If $M$ is robust to $s$ errors on every input $x$ then $H (w) = \int H (x, w) d x$ has order $> s + 1$ in $ε$ , from which we infer that $K (w)$ has order $> s + 1$ in $ε$ locally at $c$ . If $0 -, 1 -$ are the only allowed symbols in the code at this position then the local learning coefficient satisfies $λ (c) \leq \frac{1}{s + 2}$ .

2.3 Summary

The example illustrates one simple way in which the structure of a TM $M$ with code $c$ (in this case, the capability to recover from read errors in its specification during run-time) influences the geometry of the function germ $(K, c)$ . We expect there are many further forms of degeneracy, which for example involve sophisticated interactions between bits of the code.

In a minimally singular model every used parameter costs $\frac{1}{2}$ (in the sense that it increases $λ$ by that amount) and so our intuition might suggest that in the current setting every used bit in the code should cost $\frac{1}{2}$ . It is true that every used bit costs at most $\frac{1}{2}$ but we have just seen that not all bits are created equal: a bit which is “error-corrected” in the sense that when the UTM executes $c$ it is robust to $\leq s$ errors in that bit, only costs $\frac{1}{s + 2} < \frac{1}{2}$ . The principle of minimisation of free energy therefore suggests that, all else being equal, the Bayesian posterior will prefer codes where the bits are error-corrected in this way. That is, two codes of equal length and matching the outputs of the generating process equally well, may nonetheless have neighbourhoods assigned different probability by the posterior, if one of them has error-correction and the other does not. Thus “simplicity″ (in the sense of $λ$ ) includes robustness to errors.

We note that it is straightforward to provide a TM with run-time error-correction at the cost of execution time, by running the program multiple times and applying a majority vote to determine the answer. More sophisticated schemes are possible, and there is a literature that has worked out various ways of doing this.

3. Conclusion

The relation between description length and simplicity biases in Bayesian statistics is well-known, but is a phenomena that is confined to regular models and this class of models does not include neural networks. We do not yet possess conceptually simple general intuitions about the inductive biases of Bayesian statistics. In this note we have exhibited one case in which the simplicity bias is more exotic.

3.1 Questions and Open Problems

The analysis in this note is preliminary. The set of people working on ideas like this is small, and if you have relevant background in mathematics or ML you could probably figure out something useful. Notes:

It is not known whether the inductive bias of neural network training contains a preference for run-time error-correction. The phenomenon of “backup heads” observed in transformers seems like a good candidate. Can you think of others?
It seems that in deep RL, policies which take effective control of the environment might be able to achieve a kind of run-time error-correction which allows them to simplify their policies and thus minimise the free energy. This might lead to a connection between simplicity in Bayesian statistics and the emergence of goal-directedness.
Simon Pepin Lehalleur (henceforth SPLH) asks “It would be remarkable if some related fact or downstream consequence of this observation had not been observed somewhere in the literature on error-correction and information theory?” There is an extensive literature on error-correction in naturally occurring computational systems. Interesting observations can be found in “Noisy dynamical systems evolve error correcting codes and modularity” by T. McCourt et al, “Resource savings from fault-tolerant circuit design” by A. K. Tan and I. L. Chuang, “Biological error correction codes generate fault-tolerant neural networks” by A. Zlokapa et al. I have only a superficial familiarity with the literature, but it seems to me one could make a career out of bringing modern mathematical techniques to bear on this field.
SPLH suggests the modest open problem of proving this is a consequence of SLT :)

^
Simon Pepin Lehalleur says: “defined implicitly” is another common way to get at the same idea. Statistics is in some sense all about implicit definitions (“Statistics is the inverse problem to probability”).
^
In the case where the KL divergence or loss function is locally, after a change of variables, a sum of squares $\sum_{i = 1}^{d^{'}} w_{i}^{2}$ in $R^{d}$ then since changing each of the $w_{i}$ for $1 \leq i \leq d^{'}$ increases the loss, we refer to these parameters as “relevant” whereas changing each of the $w_{i}$ for $i > d^{'}$ does not change the loss so we refer to these parameters as “irrelevant”. More precisely, there is a finite range within which these parameters can be varied without changing the loss.
^
By an “organ” von Neumann means a fundamental unit, from which more complicated computations can be built.
^
In the sense that there exist $c_{1}, c_{2} > 0$ such that with $H (w) = \int H (x, w) d x$ we have $c_{1} H (w) \leq K (w) \leq c_{2} H (w)$ . In particular the local learning coefficients of $K$ and $H$ agree.

Simple versus Short: Higher-order degeneracy and error-correction