From SLT to AIT: NN generalisation out-of-distribution

Lucius Bushnaq4 Sep 2025 15:20 UTC

LW: 114 AF: 34

Singular Learning Theory Distributional Shifts Solomonoff induction AI

TL;DR: This post derives an upper bound on the prediction error of Bayesian learning on neural networks. Unlike the bound from vanilla Singular Learning Theory (SLT), this bound also holds for out-of-distribution generalization, not just for in-distribution generalization. Along the way, it shows some connections between SLT and Algorithmic Information Theory (AIT).

Written at Goodfire AI

Introduction

Singular Learning Theory (SLT) describes Bayesian learning on neural networks. But it currently has some limitations. One of these limitations is that it assumes model training data are drawn independently and identically (IID) from some distribution, making it difficult to use SLT to describe out-of-distribution (OOD) generalization. For example, if we train a model to classify pictures of animals taken outdoors, it’s very difficult to use vanilla SLT to reason about whether this model will generalize to correctly classify pictures of animals taken indoors. Vanilla SLT also assumes that the model has been trained on an asymptotically infinite number of data points.^[1]

Another theory about Bayesian learning is Algorithmic Information Theory (AIT), which describes Solomonoff induction (SI) [1,2].^[2] Solomonoff induction is Bayesian learning over an (infinite) ensemble of programs. It is, unfortunately, uncomputable. But it does not require an IID sampling assumption, does not rely on asymptotically infinite data, and does describe OOD generalization. AIT is also older, more developed, and more established than SLT. A lot of agent foundations work builds on the AIT framework.

So if we could derive an AIT-style description of Bayesian learning on neural networks, that description might tell us things about neural network generalization out-of-distribution.

Here, we:

Derive error bounds for a computationally bounded Solomonoff induction (SI).
1. First, we import Solomonoff induction to the learning-theoretic setting of data coming in input-label pairs $(x, y)$ , with the inductor trying to predict the labels $y$ conditional on the inputs $x$ and show that the standard error bound for SI still holds there.
2. Then, we define a bounded, computable form of induction over programs. We show that it still satisfies a weaker SI-style error bound. We also show that this bounded induction is still somewhat invariant to our choice of universal Turing machine.
Derive a bound on the expected prediction error of Bayesian learning on neural networks.
1. Specifically, we derive an error bound for Bayesian learning over parametrized functions, starting from a uniform prior over real-valued parameters. In contrast to SLT-style error bounds, we do not need to assume that our data are drawn IID. Hence the bound will also hold under distributional shifts.
2. We show how this error bound relates to the notion of $ϵ$ -volumes that define the learning coefficient in Singular Learning Theory.

The first part is mainly supposed to build intuition for the second, and help people see how the AIT picture and Bayesian learning on neural networks connect.^[3]

I also hope that this post might help readers used to thinking about probability from a more E. T. Jaynes-esque/native-Bayesian point of view a little with grokking Singular Learning Theory, which is usually described in a pretty stat-mech-esque/second-language-Bayesian manner.

Prediction error bounds for a computationally bounded Solomonoff induction

Claim 1: We can import Solomonoff induction into the learning-theoretic setting

We consider a prediction setting with inputs $x_{n} \in {0, 1}^{*}$ (finite binary strings) and labels $y_{n} \in {0, 1}$ . We assume that the labels were generated by a probabilistic program $μ$ ^[4], i.e., $P_{μ} (y_{n} ∣ x_{n})$ is set by $μ (x_{n})$ . We are given $N \geq 1$ such input-label pairs ${(x_{n}, y_{n})}_{n = 1}^{N}$ .

Our goal is to construct an inductor that predicts whether $y_{n}$ is $1$ given $x_{n}$ , using Bayesian updating on past input-label pairs $(x_{1}, y_{1}), \dots, (x_{n - 1}, y_{n - 1})$ that it has already seen. To do this, we will just slightly modify the standard Solomonoff setup for probabilistic programs.

Let $M_{1}$ be a plain^[5] monotone^[6] universal Turing machine (UTM) with a binary alphabet, one-way read-only input and output tapes, and a two-way work tape. We define the set of all programs $p$ of length $T \in N$ bits, and the set of augmented programs $p_{n} := (p_{read}, x_{n}, p)$ , where $p_{read}$ is some ‘read-in’ program that first copies $x_{n}$ from the input tape to the first $L_{in}$ cells of the work tape, then sets the read head to the first bit of the program string $p$ and finally resets the UTM’s internal state to its starting configuration.

Our inductor then makes its predictions using the ensemble of augmented programs:

P_{M_{1}} (y_{n} ∣ x_{n}, D_{< n}) = \sum p P_{M_{1}} (y_{n} ∣ p_{n}) P (p ∣ D_{< n}),

where $P_{M_{1}} (y_{n} = 1 ∣ p_{n}) = M_{1} (p_{n}) = M_{1} (p_{read}, x_{n}, p)$ is the output bit string of our UTM on input string $p_{n}$ interpreted as the probability that $y_{n} = 1$ ^[7], $P_{M_{1}} (y_{n} = 0 ∣ p_{n}) = 1 - P_{M_{1}} (y_{n} = 1 ∣ p_{n})$ and $P (p ∣ D_{< n})$ is the probability we assign to program $p$ after updating on the past prediction performance of $p_{n - 1}, \dots, p_{1}$ on the input-label pairs $(x_{n - 1}, y_{n - 1}), \dots, (x_{1}, y_{1})$ . For our starting prior $P (p ∣ D_{< 1}) = P (p)$ , we take the uniform distribution $P (p) = 2^{- T}$ over our set of programs $p$ .

Note that this setup does not make any assumptions about how the inputs $x_{n}$ are sampled. This is in contrast to the Singular Learning Theory (SLT) setup, which needs to assume that the inputs $x_{n}$ are sampled IID from some underlying distribution. This assumption makes it difficult to use SLT to say anything about the generalization behavior of a trained network under distributional shifts, such as switching from the training distribution to a very different deployment distribution. We do not have that problem here.

Claim

So long as the program that generated the data $μ$ has a minimum length less than or equal to $T$ on our UTM $M_{1}$ , $C (μ, M_{1}) \leq T$ , the conventional error bound for Solomonoff induction still applies in this setting. i.e., the total expected prediction error of our inductor will be bounded by

\begin{matrix} N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{M_{1}} (\cdot ∣ x_{n}, D_{< n})) \leq C (μ, M_{1}) + N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n})), \\ (1.1) \end{matrix}

for all $N \in N_{> 0}$ . In other words, the inductor’s total expected prediction error over the first $N$ data it sees, scored in bits of cross-entropy

N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{M_{1}} (\cdot ∣ x_{n}, D_{< n})) := - N \sum n = 1 \sum y_{n} P_{μ} (y_{n} ∣ x_{n}) {log}_{2} P_{M_{1}} (y_{n} ∣ x_{n}, D_{< n})

is bounded to stay below the Kolmogorov complexity of the data-generating process $μ$ on $M_{1}$ , $C (μ, M_{1})$ , meaning the minimum description length of the program $μ$ on the machine $M_{1}$ , plus the Shannon entropy $H (P_{μ} (\cdot ∣ x_{n}))$ of $μ$ on each data point $x_{n}$ summed:^[8]

N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n})) := - N \sum n = 1 \sum y_{n} P_{μ} (y_{n} ∣ x_{n}) {log}_{2} P_{μ} (y_{n} ∣ x_{n}) .

High-level proof summary

The proof works analogously to the derivation of the bound in the conventional SI setup found in most textbooks. The only differences are that

We have not taken program length $T$ to infinity, as is done in the universal prior. We just assume that it is set to some number larger than $C (μ, M_{1})$ .
$M_{1}$ is a plain UTM rather than a prefix-free UTM, and thus our starting prior over programs is just uniform, rather than falling off exponentially with program length. The degeneracy of program implementations ensures that simpler programs are exponentially preferred; we don’t need to insert a term favoring simpler programs in the prior by hand.
Our programs $p_{n} = (p_{read}, x_{n}, p)$ start with a prefix $(p_{read}, x_{n})$ , where $x_{n}$ changes between data points.

None of these changes affect the structure of the proof much. We rearrange the expression for $\sum_{n} H (P_{μ} (\cdot ∣ x_{n}), P_{M_{1}} (\cdot ∣ x_{n}, D_{< n}))$ to make its dependence on the prior $P (p)$ manifest, then use the existing result that the implementation of program $μ$ must be assigned probability $\geq 2^{- C (μ, M_{1})}$ in the prior due to degeneracy of implementation. I.e., there are at least $2^{T - C (μ, M_{1})}$ programs of length $T$ that give the same output as $μ$ on all data, since we can just append $T - C (μ, M_{1})$ arbitrary ‘garbage bits’ to the shortest implementation of $μ$ to increase its length to $T$ without affecting the computation. So, the total probability mass in $P (p)$ assigned to programs that output the same probability as $μ$ on all data must be at least $\frac{2^{T - C (μ, M_{1})}}{2^{T}} = 2^{- C (μ, M_{1})}$ .

Proof

First, we rearrange our expression for the total cross-entropy:

\begin{matrix} N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{M_{1}} (\cdot ∣ x_{n}, D_{< n})) = - N \sum n = 1 \sum y_{n} P_{μ} (y_{n} ∣ x_{n}) {log}_{2} (P_{M_{1}} (y_{n} ∣ x_{n}, D_{< n})) = - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) {log}_{2} (\sum p P_{M_{1}} (y_{N} ∣ p_{N}) P (p ∣ D_{< N})) - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) N - 1 \sum n = 1 {log}_{2} (P_{M_{1}} (y_{n} ∣ x_{n}, D_{< n})) = - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) {log}_{2} (\sum p \frac{\prod_{n = 1}^{N} P_{M_{1}} (y_{n} ∣ p_{n})}{\prod_{n = 1}^{N - 1} P_{M_{1}} (y_{n} ∣ x_{n}, D_{< n})} P (p)) - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) N - 1 \sum n = 1 {log}_{2} (P_{M_{1}} (y_{n} ∣ x_{n}, D_{< n})) = - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) {log}_{2} (\sum p P_{M_{1}} (y_{N} ∣ p_{N}) \dots P_{M_{1}} (y_{1} ∣ p_{1}) P (p)) \end{matrix}

In the second equality, we used the assumption that our inductor adjusts its probabilities according to incoming evidence using Bayesian updating.

According to result 3.8.1 in this book, the semimeasure for programs on a monotone, plain UTM matching the outputs of a program $μ$ must be $\geq 2^{- C (μ, M_{1})}$ . Inserting this bound yields:

\begin{matrix} N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{M_{1}} (\cdot ∣ x_{n}, D_{< n})) = - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) {log}_{2} (\sum p P_{M_{1}} (y_{N} ∣ p_{N}) \dots P_{M_{1}} (y_{1} ∣ p_{1}) P (p)) \leq - \sum y_{N}, \dots, y_{1} P_{μ} (y_{n} ∣ x_{n}) \dots P_{μ} (y_{1} ∣ x_{1}) {log}_{2} (2^{- C (μ, M_{1})} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1})) = C (μ, M_{1}) + N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n})) . \end{matrix}

Claim 2: A bounded induction still efficiently predicts efficiently predictable data

Setup

Now, we’re going to modify the setup from the previous section, bounding our induction to make it computable. Instead of running programs $(p_{read}, x_{n}, p)$ , where $p$ is drawn from the set of all bit strings of length $T$ , we restrict ourselves to programs of length $T$ that require a maximum of $s$ space and a maximum of $t$ time steps.

Specifically, we construct a new bounded UTM $M_{2}$ from our UTM $M_{1}$ . $M_{2}$ only has $s + 1$ cells of work tape available. If the head moves to the last cell of the work tape, $M_{2}$ immediately halts. Further, $M_{2}$ always halts itself after $t$ execution steps if it has not halted yet. This bounded induction will no longer be optimal in the sense of having an SI-style bound on its total error that depends only on the entropy and Kolmogorov complexity of the data-generating process $μ$ , because $μ$ might not be computable in time $t$ and space $s$ .

Claim

Our bounded induction still has a weaker upper bound on its total expected prediction error: If there is some program $μ^{*}$ that is an “efficient predictor” of $μ$ , in the sense that $μ^{*}$ approximates the probability assignments of $μ$ up to some error we deem ‘acceptable’, our induction will predict the data as well as this efficient predictor, up to an offset equal to the efficient predictor’s description length.

Specifically, for any program $μ^{*}$ included in the set of programs our bounded induction runs over, we have the bound:^[9]

\begin{matrix} \begin{matrix} N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{M_{2}} (\cdot ∣ x_{n}, D_{< n})) \leq & C (μ^{*}, M_{2}) + N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{μ^{*}} (\cdot ∣ x_{n})), \end{matrix} \\ (1.2) \end{matrix}

for all $N \in N_{> 0}$ . So, for any program $μ^{*}$ with minimum description length $\leq T$ on $M_{2}$ that is computable in time $t$ and space $s$ , the computationally bounded induction will have its total prediction error bounded by the prediction error of $μ^{*}$ plus the minimum description length of $μ^{*}$ on $M_{2}$ .

Proof

Proceeds completely analogously to the proof for Claim 1, except that we substitute in $P_{μ^{*}} (y_{n} ∣ x_{n})$ instead of $P_{μ} (y_{n} ∣ x_{n})$ . So we won’t write it out again.

Because

\begin{matrix} N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{μ^{*}} (\cdot ∣ x_{n})) = N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n})) + N \sum n = 1 D_{KL} (P_{μ} (\cdot ∣ x_{n}), P_{μ^{*}} (\cdot ∣ x_{n})), \end{matrix}

the only difference between this bound and the bound for the computationally unbounded induction from Equation (1.1) is an additional term $\sum_{n = 1}^{N} D_{KL} (P_{μ} (\cdot ∣ x_{n}), P_{μ^{*}} (\cdot ∣ x_{n}))$ , the KL divergence between the data-generating process $μ$ and the program $μ^{*}$ , summed over all input-label pairs.

Why would we suppose that a program $μ^{*}$ that gets low prediction error with limited compute exists? Well, for some problems, an efficient predictor probably doesn’t exist. But empirically, we can do pretty well on many prediction problems we encounter in the real world using an amount of compute that’s practically feasible to obtain. Why that is, I don’t know. Perhaps it is somehow related to why a lot of the world seems to abstract well.

Claim 3: The bounded induction is still somewhat invariant under our choice of UTM

Given any program $μ^{*}$ that can run on a plain, monotone, bounded UTM $M_{3}$ in space $s$ and time $t$ , we can run it on some other plain, monotone, bounded UTM $M_{2}$ in space $O (s)$ and time $O (t {log}_{2} (t))$ .^[10] Thus, the Kolmogorov complexity $C (μ^{*}, M_{3})$ of $μ^{*}$ on $M_{3}$ is bounded by

\begin{matrix} C (μ^{*}, M_{3}) \leq C (μ^{*}, M_{2}) + C (M_{2}, M_{3}) \end{matrix}

provided $M_{3}$ has sufficient computational resources available, because we can just simulate $M_{2}$ on $M_{3}$ and then run $μ^{*}$ on the simulated $M_{2}$ .

Therefore, our total prediction error for bounded induction on any other plain monotone UTM $M_{3}$ bounded to $s^{'} \geq c_{s} s$ space and $t^{'} \geq c_{t} t {log}_{2} (t)$ time, where $c_{s}, c_{t} \geq 1$ are the constant prefactors for simulating $M_{2}$ on $M_{3}$ , will still be bounded by

\begin{matrix} \begin{matrix} N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{M_{3}} (\cdot ∣ x_{n}, D_{< n})) \leq & C (μ^{*}, M_{2}) + C (M_{2}, M_{3}) + N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{μ^{*}} (\cdot ∣ x_{n})) . \end{matrix} \\ (1.3) \end{matrix}

for all $N \in N_{> 0}$ . So, our choice of UTM matters somewhat, since we might need more space and execution time on a different UTM to implement an equally efficient predictor. However, converting to a different UTM only increases the required execution time from $t$ to $O (t {log}_{2} (t))$ and the required space from $s$ to $O (s)$ . So, provided $M_{3}$ has sufficient space $s^{'}$ and time $t^{'}$ , the additional prediction error we get is just $C (M_{2}, M_{3})$ , which is typically small for most $M_{2}, M_{3}$ we might usually consider using.

Prediction error bound for Bayesian learning on neural networks

Claim 4: We can obtain a similar prediction error bound for Bayesian learning on neural networks

Now, we will derive a similar style of bound for an inductor using Bayesian learning on neural networks.

Setup

In this setting, our inputs $x$ are finite-dimensional vectors of real numbers $x \in X \subseteq R^{d_{in}}$ . The labels are still binary, $y \in {0, 1}$ .^[11] The inductor now makes its predictions using functions $f (w)$ , which take vectors $x_{n} \in X$ as input and output probabilities for the data labels $y_{n}$ . The functions are all part of some space of functions $f \in F$ . The functions are parametrized by a parameter-function map $M : R^{D} \to F$ with $D \in N$ . We call $D$ the number of neural network parameters. We refer to the probabilities the function $f (w)$ outputs for input vector $x_{n}$ as $f (y_{n} ∣ x_{n}, w)$ . As in the previous sections, we assume that the data labels $y_{n}$ were generated from the inputs $x_{n}$ by some probabilistic program $μ$ . Note that we do not make any assumptions about the inputs $x_{n}$ being drawn IID.

The inductor starts with a uniform prior over some finite hypervolume $W$ in parameter space: $W = (- \frac{L}{2}, \frac{L}{2})^{D} \subset R^{D}$ . The inductor makes its predictions as

P_{f} (y_{n} ∣ x_{n}, D_{< n}) = \int d w f (y_{n} ∣ x_{n}, w) P (w ∣ D_{< n}) .

where $P (w ∣ D_{< n})$ is the probability density the inductor places on parameter configuration $w$ after updating on data-label pairs $(x_{1}, y_{1}, \dots, x_{n - 1}, y_{n - 1})$ .

Claim

The total expected prediction error of this induction measured in bits of cross-entropy is upper bounded by

\begin{matrix} \begin{matrix} N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{f} (\cdot ∣ x_{n}, D_{< n})) \leq & C (w^{*}, ϵ, f) + N ϵ + N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), f (\cdot ∣ x_{n}, w^{*})), \end{matrix} \\ (2.1) \end{matrix}

for all $N \in N_{> 0}$ , all parameter configurations $w^{*} \in W$ , and all $ϵ \in R_{\geq 0}$ . Here, $C (w^{*}, ϵ, f)$ is the logarithm of the ratio between the volume of the whole prior $V_{W} := \int_{W} d w = L^{D}$ , and the volume taken up by the set $W_{ϵ, w^{*}}$ of parameter configurations that are mapped to functions which match the log-probabilities of $f (y_{n} ∣ x_{n}, w^{*})$ up to tolerance $ϵ$ :

\begin{matrix} C (w^{*}, ϵ, f) := - {log}_{2} ⎛ ⎜ ⎝ \frac{\int_{W_{ϵ, w^{*}}} d w}{V_{W}} ⎞ ⎟ ⎠ W_{ϵ, w^{*}} := {w \in W ∣ | {log}_{2} f (y ∣ x, w) - {log}_{2} f (y ∣ x, w^{*}) | \leq ϵ \forall (x, y) \in X \times {0, 1}} \end{matrix}

In other words, if there is a parameter configuration $w^{*} \in (- \frac{L}{2}, \frac{L}{2})^{D}$ that is mapped to a function $f (w^{*})$ that is an efficient predictor of the data, in the sense that $f (w^{*})$ approximates the probability assignments of $μ$ up to some error, the inductor will predict the data as well as this efficient predictor $f (w^{*})$ , plus a tolerance $ϵ \in R_{\geq 0}$ we can freely choose, up to an offset equal to the cost in bits of specifying the predictor $f (w^{*})$ in the parameter space to $ϵ$ precision. Intuitively, the reason we don’t just always choose $ϵ = 0$ in this bound is that our parameters are real numbers, so describing the location of $f (w^{*})$ in parameter space exactly could sometimes take infinitely many bits.

High-level proof summary

The proof works analogously to the proof for the error bound in Equation (1.1), except that sums over programs are replaced by integrals over parameters, and the semimeasure of programs in the prior matching the outputs of efficient predictor $μ^{*}$ on all possible input bit strings $x \in {0, 1}^{*}$ is replaced by the volume of $W_{ϵ, w^{*}}$ , the set of parameter configurations matching the log-probabilities output by function $f (w^{*})$ to $ϵ$ precision.

Proof

As before, we start by rearranging our expression for the total cross-entropy:

\begin{matrix} N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{f} (\cdot ∣ x_{n}, D_{< n})) = - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) N \sum n = 1 {log}_{2} (\int_{W} d w f (y_{n} ∣ x_{n}, w) P (w ∣ D_{< n})) = - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) {log}_{2} (\frac{1}{V_{W}} \int_{W} d w N \prod n = 1 f (y_{n} ∣ x_{n}, w)) \end{matrix}

In the second equality, we used the assumption that our inductor adjusts its probabilities over parameter configurations in response to incoming evidence using Bayesian updating.

Now, we restrict the integral $\int_{W} d w$ to points in $W_{ϵ, w^{*}}$ , then use the bound $| {log}_{2} f (y ∣ x, w) - {log}_{2} f (y ∣ x, w^{*}) | \leq ϵ$ we have for points in this set:

\begin{matrix} N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{f} (\cdot ∣ x_{n}, D_{< n})) = - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) {log}_{2} (\frac{1}{V_{W}} \int_{W} d w N \prod n = 1 f (y_{n} ∣ x_{n}, w)) \leq - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) {log}_{2} (\frac{1}{V_{W}} \int_{W_{ϵ, w^{*}}} d w N \prod n = 1 f (y_{n} ∣ x_{n}, w)) \leq - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) {log}_{2} (\frac{1}{V_{W}} \int_{W_{ϵ, w^{*}}} d w 2^{\sum_{n = 1}^{N} {log}_{2} (f (y_{n} ∣ x_{n}, w^{*})) - N ϵ}) = - \sum y_{N}, \dots, y_{1} P_{μ} (y_{N} ∣ x_{N}) \dots P_{μ} (y_{1} ∣ x_{1}) {log}_{2} ⎛ ⎜ ⎝ \frac{\int_{W_{ϵ, w^{*}}} d w}{V_{W}} 2^{\sum_{n = 1}^{N} {log}_{2} (f (y_{n} ∣ x_{n}, w^{*}))} 2^{- N ϵ} ⎞ ⎟ ⎠ = - {log}_{2} ⎛ ⎜ ⎝ \frac{\int_{W_{ϵ, w^{*}}} d w}{V_{W}} ⎞ ⎟ ⎠ + N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), f (\cdot ∣ x_{n}, w^{*})) + N ϵ = C (w^{*}, ϵ, f) + N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), f (\cdot ∣ x_{n}, w^{*})) + N ϵ \end{matrix}

Comments

The bound holds for any $w^{*} \in W$ and any $ϵ \in R_{\geq 0}$ , though the tightest bound will of course come from the $w^{*}$ and $ϵ$ that balance the tradeoff between predictive accuracy and description length best.
The larger $N$ is, the more predictive accuracy matters relative to description length, and the smaller we might want to set $ϵ$ if we want to obtain the tightest bound.
If we have a more specific question about generalization error, we can get a tighter bound by restricting $W_{ϵ, w^{*}}$ to range over a more limited set of possible input/output pairs. For example, if we’re confident an image classifier will never receive input images above a certain brightness, we can exclude all images above that brightness level from the set $X$ in the definition of $W_{ϵ, w^{*}}$ , which might make $C (w^{*}, ϵ, f)$ smaller and the upper bound tighter.
This derivation assumes a uniform prior over parameters for simplicity, but it goes through analogously for other priors such as Gaussians.

Relating $C (w^{*}, ϵ, f)$ to SLT quantities

$C (w^{*}, ϵ, f)$ is a pretty similar quantity to the $ϵ$ -ball that defines the learning coefficient in Singular Learning Theory. There, the prediction error in-distribution includes the logarithm of a term of the form^[12]

\begin{matrix} V (ϵ) := \frac{\int_{W_{ϵ}} d w}{V_{W}} W_{ϵ} := {w \in W ∣ L (w) < ϵ} L (w) := \int d x q (x) \sum y D_{KL} (P_{μ} (y ∣ x), f (y ∣ x, w)) \end{matrix}

where $q (x)$ is the ‘training distribution’ over inputs $x$ .

Comparing $- {log}_{2} V (ϵ)$ to $C (w^{*}, ϵ, f)$ :

The uniform distribution over all inputs $x$ over the whole domain $X$ plays the same role for the definition of $C (w^{*}, ϵ, f)$ that the ‘training distribution’ does for $V (ϵ)$ .
In $C (w^{*}, ϵ, f)$ , the outputs $f (y ∣ x, w^{*})$ of the reference function $f (w^{*})$ play the role that the ‘data labels’ $P_{μ} (y ∣ x)$ do in $V (ϵ)$ .
$C (w^{*}, ϵ, f)$ cares about the supremum of divergences over the distribution, $V (ϵ)$ cares about the average. $V (ϵ)$ is defined as the integral over the set of parameter configurations that diverge from the data labels by less than $ϵ$ on average over the training data, as scored by KL-divergence, whereas $C (w^{*}, ϵ, f)$ is defined as the volume of the set of points that diverge from $f (w^{*})$ by epsilon or less on all data points, as scored by log‑probability difference.

These differences are the price of describing prediction error in general instead of ‘in-distribution’. In SLT, we assume that our data $x_{n}$ are drawn IID from some known distribution $q (x)$ . In that case, average error over that distribution is enough to bound the inductor’s expected average error on future data. Without this assumption of IID sampling, we need to classify functions $f (w)$ based on how closely their outputs align with our reference $f (w^{*})$ on all possible inputs $x$ . Or at least, all inputs $x$ we think the inductor might possibly encounter.

In SLT, $ϵ$ is also set to $ϵ \propto \frac{1}{N}$ in the expression for the prediction error instead of being left free as in Equation (2.1), because the theory is primarily concerned with the infinite data limit $N \to \infty$ , where describing the solution to exact precision is ‘worth it’ and yields a tighter bound. To put that slightly more quantitatively: One term for the prediction error in SLT scales as $N ϵ$ , while $- {log}_{2} V (ϵ)$ can be shown to only scale as $- λ {log}_{2} ϵ + o ({log}_{2} ϵ)$ ^[13], where $λ$ is a scalar number called the learning coefficient. So, in the vanilla SLT setup, letting $ϵ$ go to zero for $N \to \infty$ as $ϵ \propto \frac{1}{N}$ yields a tighter bound for the prediction error than setting it to some finite non-zero value. But the bound in this post is supposed to work for any amount of data $N$ , so we don’t set $ϵ$ to any particular value.

Open problems and questions

Section TL;DR: What kind of neural network architectures have a prior equivalent to that of a bounded universal Turing machine? How can we estimate the description length $C (w^{*}, ϵ, f)$ in practice?

How do the priors actually relate to each other?

So, we can sort of think about Bayesian learning on neural networks and compute-bounded Solomonoff induction using a similar set of mathematical tools. But how similar are these induction processes really? That is, how much does a uniform prior over parameter configurations on some neural network architecture resemble a uniform prior over programs on some compute-bounded UTM?

Another way of asking that question is whether we can get something like Equation (1.3) for switching between a bounded UTM and a neural network, and vice versa. So, can we get something like:

Conjecture 1: For any parameter vector $w^{*}$ parametrizing some function $f (w^{*}) \in F$ , the prediction error of an inductor using bounded UTM $M_{2}$ is bounded by

\begin{matrix} N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{M_{2}} (\cdot ∣ x_{n}, D_{< n})) \leq & C (w^{*}, ϵ, f) + C (f, ~ ϵ, M_{2}) + N (ϵ + ~ ϵ) + N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), f (\cdot ∣ x_{n}, w^{*})), \end{matrix}

where $C (f, ~ ϵ, M_{2})$ is the description length of the neural network architecture $f$ on UTM $M_{2}$ to precision $~ ϵ$ .^[14] Provided $M_{2}$ has enough time and memory to simulate $f (w^{*})$ , of course.

Conjecture 2 (Likely false for arbitrary NN architectures): For any program $μ^{*}$ on bounded UTM $M_{2}$ , the prediction error of an inductor using neural network architecture $f$ is bounded by

$\begin{matrix} N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{f} (\cdot ∣ x_{n}, D_{< n})) \leq & C (μ^{*}, M_{2}) + C (M_{2}, ϵ, f) + N ϵ + N \sum n = 1 H (P_{μ} (\cdot ∣ x_{n}), P_{μ^{*}} (\cdot ∣ x_{n})), \end{matrix}$

where $C (M_{2}, ϵ, f)$ is the description length in bits of bounded UTM $M_{2}$ on neural network architecture $f$ to precision $ϵ$ .

If both of these conjectures were true, doing Bayesian learning using neural networks instead of an ensemble of programs would be almost no more of a change than doing Bayesian learning in one programming language instead of another.^[15]

My guess is that something vaguely like Conjecture 1 is true. With sufficient compute, we can use a UTM to simulate any neural network architecture to some arbitrary desired floating point precision. But I think Conjecture 2 is generally false, because not every neural network architecture can simulate a bounded UTM. Remember, our definition of ‘neural network’ here was basically just ‘a space of functions $F$ parametrized by some parameter-function map $M : R^{D} \to F$ ’. A family of polynomials would satisfy that definition. And I doubt a uniform prior over polynomials acts much like a uniform prior over compute-bounded programs.

But maybe Conjecture 2 still holds for some more restricted set of neural network architectures? Like, say, architectures capable of simulating a bounded UTM. RNNs and transformers run in chain-of-thought mode can do this, for example.^[16] I originally thought that this requirement for the architecture would be sufficient for Conjecture 2 to hold. But proving the conjecture for transformers in chain-of-thought mode turned out to be trickier than I thought.^[17] Maybe you just need to be a bit cleverer about how to do the UTM simulation on the transformer than I was so far. Or maybe, being able to simulate a bounded UTM actually isn’t a sufficient requirement for Conjecture 2, and the architecture needs to have some additional property. In that case, it’d be interesting to find out what that property is.

What does $C (w^{*}, ϵ, f)$ look like in practice?

For example, does $C (w^{*}, ϵ, f)$ behave like $V (ϵ)$ in vanilla SLT in the limit $ϵ \to 0$ ? That is, for small enough $ϵ$ and some reasonable conditions on the NN architecture^[18] do we often get something like

C (w^{*}, ϵ, f) \propto - λ^{'} {log}_{2} ϵ

where $λ^{'}$ would then be some number analogous to the learning coefficient $λ$ in vanilla SLT? Can we use stochastic gradient Langevin dynamics (SGLD) sampling to estimate $C (w^{*}, ϵ, f)$ on neural networks in real life, the same way people use it to determine the learning coefficient? If a solution can generalize out-of-distribution, it can certainly generalize in-distribution, but the reverse is generally not true. So, presumably $C (w^{*}, ϵ, f)$ would tend to be smaller than $- {log}_{2} V (ϵ)$ . But how much smaller does it tend to be in practice? In other words, how much does OOD generalization error usually differ from in-distribution generalization error?

Acknowledgments

Thanks to Aram Ebtekar for providing many useful citations. Thanks to Lee Sharkey for helping me come up with a title. Thanks to Cole Wyeth for spotting an unnecessary condition in my original definition of the bounded UTM. And thanks to Aram Ebtekar as well as Mathias Dellago, Alexander Gietelink Oldenziel, Daniel Murfet, David Quarel, Kaarel Hänni, John Wentworth, Dmitry Vaintrob, Jake Mendel, Linda Linsefors, and Scott Aaronson for various discussions and ideas that were causally upstream of my writing this.

^
Though in practice, using the formulas for models trained on finite amounts of data does seem to (maybe?) work alright a lot of the time. In our experience so far at least. Provided we aren’t screwing up things without noticing, which tends to happen a lot in research.
^
It also does other things, but this is maybe its most famous result, and the one that matters most for this post.
^
I expect that basically nothing in the first part is actually novel, though I haven’t managed to track down citations for every result in it in the exact form given here. If you think you know your AIT and don’t need to read this part, you might be right. But I’d advise some caution before skipping it. I wrote it partially because many AIT-knowledgeable people I talked to about this topic turned out not to know these things.^[19]
^
As in a program that outputs probabilities as strings. Those probabilities are then used to sample the labels. Note that the program can output probabilities $1.0$ and $0$ , so the labels $y_{n}$ can be deterministic functions of the inputs $x_{n}$ , they just don’t need to be.
^
So, not prefix-free. Lots of standard AIT results assume a prefix-free UTM. This post does not. AIT-savvy skim readers beware. If you compare the theorems here to results phrased for the setting of prefix-free programs, you might become pretty confused.
^
See e.g. the start of section three here for more on the definition of monotone Turing machines.
^
As in the standard setup for Solomonoff induction with probabilistic programs, $M_{1} (p_{n})$ can keep running and printing more output digits indefinitely, representing the output probabilities with ever higher precision.
^
So, if $μ$ is deterministic and only outputs probabilities $0$ or $1$ , this second term will vanish.
^
See also, for example, theorem 1 this paper, which shows almost the same thing in an RL setting.
^
See the proof here for example.
^
The derivation should work for other kinds of settings as well though, such as next token prediction in language modelling. I’m just sticking to a specific simple case to make the math easier to follow.
^
For the special case of uniform prior, discrete outputs $y$ , and our loss function being cross-entropy error measured in bits.
^
SLT convention usually considers prediction error scored in nats, I’m translating to bits here.
^
For some to-be-determined definition of ‘precision’ that makes the conjecture work. Basically, we’d want some measure of the floating‑point precision the UTM uses to approximate the real-number-valued parameters and network inputs.
^
Still only “almost”, because of the extra $ϵ$ terms. Those don’t show up when converting between UTMs.
^
The results in those links come with some annoying caveats, so if you want to jump on this you might want to use different constructions to do the UTM emulation.
^
Basically, making unused NN parameters not interfere with the computation no matter what values they’re set to is kind of tough. As a consequence, I keep ending up with prefactors $> 1.0$ in front of the $C (μ^{*}, M_{2})$ term.
^
e.g., analyticity.
^
Not judging. I also thought myself somewhat AIT-knowledgeable, but I did not realize some of what it had to say about degeneracy until fairly recently. I spent years being very confused about how learning works because of this.

What links here?

Lucius Bushnaq4 Sep 2025 15:20 UTC

LW: 114 AF: 34

8 comments14 min readLW link

Singular Learning Theory Distributional Shifts Solomonoff induction AI

testingthewaters 7 Sep 2025 16:57 UTC
4 points
0
First of all, this is great stuff and very clearly written.
~~With regards to this:~~
~~Now, we’re going to modify the setup from the previous section, bounding our induction to make it computable. Instead of running programs~~ $(p_{read}, x_{n}, p)$ ~~, where~~ $p$ ~~is drawn from the set of all bit strings of length~~ $T$ ~~, we restrict ourselves to programs of length~~ $T$ ~~that require a maximum of~~ $s$ ~~space and a maximum of~~ $t$ ~~time steps.~~
~~Specifically, we construct a new bounded UTM~~ $M_{2}$ ~~from our UTM~~ $M_{1}$ . $M_{2}$ ~~only has~~ $s + 1$ ~~cells of work tape available. If the head moves to the last cell of the work tape,~~ $M_{2}$ ~~immediately halts. Further,~~ $M_{2}$ ~~always halts itself after~~ $t$ execution steps if it has not halted yet. This bounded induction will no longer be optimal in the sense of having an SI-style bound on its total error that depends only on the entropy and Kolmogorov complexity of the data-generating process $μ$ ~~, because~~ $μ$ ~~might not be computable in time~~ $t$ ~~and space~~ $s$ .
~~And the open question~~
~~What kind of neural network architectures have a prior equivalent to that of a bounded universal Turing machine?~~
~~It seems intuitive to me that what you’re describing is an RNN. Recalling Andrei Karpathy’s writing on the~~ ~~unreasonable effectiveness of RNNs~~:
Moreover, as we’ll see in a bit, RNNs combine the input vector with their state vector with a fixed (but learned) function to produce a new state vector. This can in programming terms be interpreted as running a fixed program with certain inputs and some internal variables. Viewed this way, RNNs essentially describe programs. In fact, it is known that ~~RNNs are Turing-Complete~~ in the sense that they can to simulate arbitrary programs (with proper weights). But similar to universal approximation theorems for neural nets you shouldn’t read too much into this. In fact, forget I said anything.
~~If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs.~~
~~More directly: Instead of bit-string programs with length~~ T~~, we have the parameters of the RNN. The “work tape” with fixed length~~ s~~+1 can be compared to the hidden state vector of the RNN, which also has a fixed length. The enforced halting condition is similar to running an RNN for~~ t ~~steps.~~
Edited to add: I see that you also noted the ability of RNNs to simulate UTMs with regards to conjecture 2. Sorry for the somewhat redundant comment. However, you noted that you didn’t succeed in proving conjecture 2 wrt transformers and chain of thought. Did you succeed for RNNs?
- Lucius Bushnaq 7 Sep 2025 17:22 UTC
  4 points
  0
  Parent
  The same could be said of transformers run in chain of thought mode. But I tried deriving conjecture 2 for those, and didn’t quite succeed.
  The trouble is that you need to store the programs $p$ in the RNN/transformer weights, and do it in a way that doesn’t ‘waste’ degrees of freedom. Suppose for example that we try to store the code for the programs in the MLPs, using one ReLU neuron to encode each bit via query/key lookups. Then, if we have more neurons than we need because the program $p$ is short, we have a lot of freedom in choosing the weights and biases of those unnecessary neurons. For example, we could set their biases to some very negative value to ensure the neurons never fire, and then set their input and output weights to pretty much any values. So long as the weights stay small enough to not overwhelm the bias, the computation of the network won’t be affected by this, since the ReLUs never fire.
  
  The problem is that this isn’t enough freedom. To get $C (μ^{*}, M_{2})$ in the formula without a prefactor $> 1.0$ , we’d need the biases and weights for those neurons to be completely free, able to take any value in $W$ .
  
  EDIT: Wrote the comment before your edit. No, I haven’t tried it for RNNs.
  - testingthewaters 7 Sep 2025 17:36 UTC
    4 points
    2
    Parent
    Just to clarify, if there was some way to “fill up” the weights of the RNN/Transformer such that no weights are “free”, that would satisfy the conditions that you’ve set up? As in, if the encoding of $p$ takes up every parameter, that would qualify as a valid solution for the purposes of $C (μ^{*}, M_{2})$ ? (Forgive me, I’m still very new to AIT and the associated work, and my knowledge is still very poor)
    - Lucius Bushnaq 7 Sep 2025 17:54 UTC
      4 points
      0
      Parent
      If every bit of every weight were somehow used to store one bit of $p$ , excepting those weights used to simulate the UTM, that should suffice to derive the conjecture, yes.^[1]
      
      I think that’s maybe even harder than what I tried to do though. It’s theoretically fine if our scheme is kind of inefficient in terms of how much code it can store in a given number of parameters, so long as the leftover parameter description bits are free to vary.
      ^
      There’d be some extra trickiness in that under these definitions, the parameters are technically real numbers and thus have infinity bits of storage capacity, though in real life they’re of course actually finite precision floating point numbers.
      - testingthewaters 7 Sep 2025 18:11 UTC
        2 points
        0
        Parent
        
        the parameters are technically real numbers and thus have infinity bits of storage capacity
        
        Yeah, a lot of the stuff about ARNNs having superturing computing capacity rely on the Analog part of the (A)RNN and the weights being therefore real valued. RNNs with rational weights are strictly turing complete I think. (cf. this article)
        
        If every bit of every weight were somehow used to store one bit of p, excepting those weights used to simulate the UTM, that should suffice to derive the conjecture, yes
        
        It’s theoretically fine if our scheme is kind of inefficient in terms of how much code it can store in a given number of parameters
        
        Assuming that the bits to parameters encoding can be relaxed, there’s some literature about redundant computations in neural networks. If the feature vectors in a weight matrix aren’t linearly independent, for example, the same computation can be “spread” over many linearly dependent features, with the result that there are no free parameters but the total amount of computational work is the same.
        
        Otherwise, I don’t think it’ll be easy find a way for the remaining parameters to be fully free to vary, since it is a densely connected network. It might be interesting for you to look into the lottery ticket hypothesis literature, though even there I don’t think the remaining non-ticket weights can be set to any value.
        Lucius Bushnaq 7 Sep 2025 18:36 UTC
        6 points
        0
        Parent
        Assuming that the bits to parameters encoding can be relaxed, there’s some literature about redundant computations in neural networks. If the feature vectors in a weight matrix aren’t linearly independent, for example, the same computation can be “spread” over many linearly dependent features, with the result that there are no free parameters but the total amount of computational work is the same.
        There’s a few other cases like this where we know how various specific forms of simplicity in the computation map onto freedom in the parameters. But those are not enough in this case. We need more freedom than that.
Paul W 20 Sep 2025 7:43 UTC
3 points
0
Hi! Interesting post!
Just to make sure I got it right: in Claim 1, if you substract $\sum_n H(P_{\mu}(\cdot | x_n))$ on both sides of inequality (1.1), you get that the sum of the KL-divergences $\sum_n KL(P_{\mu}(\cdot | x_n), P_{M_1}(\cdot | x_n, D_{<n})$ is smaller than the constant $C(\mu, M_1)$, right?
Then, dividing by N on both sides, you get that the average of said KL divergences goes to 0 as N goes to infinity, at least with rate 1/N, is that correct?
- Lucius Bushnaq 20 Sep 2025 8:46 UTC
  3 points
  1
  Parent
  Yes, subtracting $\sum_{n} H (P_{μ} (\cdot | x_{n}))$ from inequality (1.1) does yield $\sum_{n = 1}^{N} D_{KL} (P_{μ} (\cdot | x_{n}), P_{M_{1}} (\cdot ∣ x_{n}, D_{< n})) \leq C (μ, M_{1})$ . So, since the total KL divergence summed over the first $N$ data points is bounded by the same constant for any $N$ , and KL-divergences are never negative, $D_{KL} (P_{μ} (\cdot | x_{n}), P_{M_{1}} (\cdot ∣ x_{n}, D_{< n}))$ must go to zero for large $n$ fast enough for the sum to not diverge to infinity, which implies it has to go to zero faster than 1/n.
  Though note that in real life, where $N$ is finite, $D_{KL} (P_{μ} (\cdot | x_{n}), P_{M_{1}} (\cdot ∣ x_{n}, D_{< n}))$ can still go to zero very unevenly; it doesn’t have to be monotonic.
  
  For example, you might have $D_{KL} (P_{μ} (\cdot | x_{n}), P_{M_{1}} (\cdot ∣ x_{n}, D_{< n})) = 0$ from $n = 10^{3}$ to $n = 10^{6}$ , then suddenly see a small upward spike at $n = 10^{6} + 1$ . A way this might happen is if the first $10^{6}$ data points the inductor receives come from one data distribution, and the subsequent data points are drawn from a very different distribution. If there is a program $μ^{'}$ that is shorter than $μ$ (so $C (μ^{'}, M_{1}) < C (μ, M_{1})$ ) and that can predict the data labels for the first distribution but not the second distribution, whereas $μ$ can predict both distributions, the inductor would favour $μ^{'}$ over $μ$ and assign it higher probability until it starts seeing data from the second distribution. It might make up to $C (μ^{'}, M_{1})$ bits of prediction error early on before its posterior becomes largely dominated by predictions that match $μ^{'}$ at $n = 10^{3}$ . After that, the KL-divergence would go to zero for a while because everything is getting predicted accurately. Then, at $n = 10^{6} + 1$ , when we switch to the second data distribution, the KL-divergence would go up again for while, until the inductor has added another $\leq C (μ, M_{1}) - C (μ^{'}, M_{1})$ bits of prediction error to the total KL-divergence. From then on the inductor would make predictions that match $μ$ and so the KL-divergence would go back down to zero again and this time stay zero permanently.

From SLT to AIT: NN generalisation out-of-distribution

Introduction

Prediction error bounds for a computationally bounded Solomonoff induction

Claim 1: We can import Solomonoff induction into the learning-theoretic setting

Claim

High-level proof summary

Claim 2: A bounded induction still efficiently predicts efficiently predictable data

Setup

Claim

Claim 3: The bounded induction is still somewhat invariant under our choice of UTM

Prediction error bound for Bayesian learning on neural networks

Claim 4: We can obtain a similar prediction error bound for Bayesian learning on neural networks

Setup

Claim

High-level proof summary

Comments

Relating C(w∗,ϵ,f) to SLT quantities

Open problems and questions

How do the priors actually relate to each other?

What does C(w∗,ϵ,f) look like in practice?

Acknowledgments

Relating $C (w^{*}, ϵ, f)$ to SLT quantities

What does $C (w^{*}, ϵ, f)$ look like in practice?