From SLT to AIT: NN generalisation out-of-distribution
TL;DR: This post derives an upper bound on the prediction error of Bayesian learning on neural networks. Unlike the bound from vanilla Singular Learning Theory (SLT), this bound also holds for out-of-distribution generalization, not just for in-distribution generalization. Along the way, it shows some connections between SLT and Algorithmic Information Theory (AIT).
Written at Goodfire AI
Introduction
Singular Learning Theory (SLT) describes Bayesian learning on neural networks. But it currently has some limitations. One of these limitations is that it assumes model training data are drawn independently and identically distributed (IID) from some distribution, making it difficult to use SLT to describe out-of-distribution (OOD) generalization. For example, if we train a model to classify pictures of animals taken outdoors, it’s very difficult to use vanilla SLT to reason about whether this model will generalize to correctly classify pictures of animals taken indoors. Vanilla SLT also assumes that the model has been trained on an asymptotically infinite number of data points.[1]
Another theory about Bayesian learning is Algorithmic Information Theory (AIT), which describes Solomonoff induction (SI) [1,2].[2] Solomonoff induction is Bayesian learning over an (infinite) ensemble of programs. It is, unfortunately, uncomputable. But it does not require an IID sampling assumption, does not rely on asymptotically infinite data, and does describe OOD generalization. AIT is also older, more developed, and more established than SLT. A lot of agent foundations work builds on the AIT framework.
So if we could derive an AIT-style description of Bayesian learning on neural networks, that description might tell us things about neural network generalization out-of-distribution.
Here, we:
Derive error bounds for a computationally bounded Solomonoff induction (SI).
First, we import Solomonoff induction into the learning-theoretic setting of data coming in input-label pairs , with the inductor trying to predict the labels conditional on the inputs, and show that the standard error bound for SI still holds there.
Then, we define a bounded, computable form of induction over programs. We show that it still satisfies a weaker SI-style error bound. We also show that this bounded induction is still somewhat invariant to our choice of universal Turing machine.
Derive a bound on the expected prediction error of Bayesian learning on neural networks.
Specifically, we derive an error bound for Bayesian learning over parametrized functions, starting from a uniform prior over real-valued parameters. In contrast to SLT-style error bounds, we do not need to assume that our data are drawn IID. Hence the bound will also hold under distributional shifts.
We show how this error bound relates to the notion of -volumes that define the learning coefficient in Singular Learning Theory.
The first part is mainly supposed to build intuition for the second, and help people see how the AIT picture and Bayesian learning on neural networks connect.[3]
I also hope that this post might help readers who are used to thinking about probability from a more E. T. Jaynes-esque/native-Bayesian point of view get a better grip on Singular Learning Theory, which is usually described in a pretty stat-mech-esque/second-language-Bayesian manner.
Prediction error bounds for a computationally bounded Solomonoff induction
Claim 1: We can import Solomonoff induction into the learning-theoretic setting
We consider a prediction setting with inputs (finite binary strings) and labels . We assume that the labels were generated by a probabilistic program [4], i.e., is set by . We are given such input-label pairs .
Our goal is to construct an inductor that predicts whether is given , using Bayesian updating on past input-label pairs that it has already seen. To do this, we will just slightly modify the standard Solomonoff setup for probabilistic programs.
Let be a plain[5] monotone[6] universal Turing machine (UTM) with a binary alphabet, one-way read-only input and output tapes, and a two-way work tape. We define the set of all programs of length bits, and the set of augmented programs , where is some ‘read-in’ program that first copies from the input tape to the first cells of the work tape, then sets the read head to the first bit of the program string and finally resets the UTM’s internal state to its starting configuration.
Our inductor then makes its predictions using the ensemble of augmented programs:
where is the output bit string of our UTM on input string interpreted as the probability that [7], and is the probability we assign to program after updating on the past prediction performance of on the input-label pairs . For our starting prior , we take the uniform distribution over our set of programs .
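Concretely, in one possible notation (the symbols here are my own choices for illustration: $T$ for the program-length bound, $q_p(x) \in [0,1]$ for the probability that augmented program $p$ assigns to the label being $1$ on input $x$, $D_k$ for the first $k$ input-label pairs, and $w_k$ for the posterior weights), the predictor and its Bayesian update look like:

```latex
\begin{align*}
P\left(y_{k+1}=1 \mid x_{k+1}, D_k\right)
  &= \sum_{p \in \{0,1\}^T} w_k(p)\, q_p(x_{k+1}), \\
w_k(p) &\propto w_0(p) \prod_{i=1}^{k} q_p(x_i)^{\,y_i} \left(1-q_p(x_i)\right)^{1-y_i},
\qquad w_0(p) = 2^{-T}.
\end{align*}
```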
Note that this setup does not make any assumptions about how the inputs are sampled. This is in contrast to the Singular Learning Theory (SLT) setup, which needs to assume that the inputs are sampled IID from some underlying distribution. This assumption makes it difficult to use SLT to say anything about the generalization behavior of a trained network under distributional shifts, such as switching from the training distribution to a very different deployment distribution. We do not have that problem here.
Claim
So long as the program that generated the data has a minimum length less than or equal to on our UTM , , the conventional error bound for Solomonoff induction still applies in this setting. That is, the total expected prediction error of our inductor will be bounded by
for all . In other words, the inductor’s total expected prediction error over the first data it sees, scored in bits of cross-entropy
is bounded to stay below the Kolmogorov complexity of the data-generating process on , , meaning the minimum description length of the program on the machine , plus the Shannon entropy of on each data point, summed over the data:[8]
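In the illustrative notation above, writing $K_M(\mu)$ for the minimum description length of $\mu$ on $M$ and $H(\mu(\cdot \mid x_i))$ for the Shannon entropy of the label distribution on data point $i$, the bound has the shape (my rendering; the original conventions may differ slightly):

```latex
\sum_{i=1}^{n} \mathbb{E}_{y_i \sim \mu(\cdot \mid x_i)}
  \left[ -\log_2 P\left(y_i \mid x_i, D_{i-1}\right) \right]
\;\le\;
K_M(\mu) \;+\; \sum_{i=1}^{n} H\left(\mu(\cdot \mid x_i)\right).
```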
High-level proof summary
The proof works analogously to the derivation of the bound in the conventional SI setup found in most textbooks. The only differences are that
We have not taken program length to infinity, as is done in the universal prior. We just assume that it is set to some number larger than .
is a plain UTM rather than a prefix-free UTM, and thus our starting prior over programs is just uniform, rather than falling off exponentially with program length. The degeneracy of program implementations ensures that simpler programs are exponentially preferred; we don’t need to insert a term favoring simpler programs in the prior by hand.
Our programs start with a prefix , where changes between data points.
None of these changes affect the structure of the proof much. We rearrange the expression for to make its dependence on the prior manifest, then use the existing result that the implementation of program must be assigned probability in the prior due to degeneracy of implementation. I.e., there are at least programs of length that give the same output as on all data, since we can just append arbitrary ‘garbage bits’ to the shortest implementation of to increase its length to without affecting the computation. So, the total probability mass in assigned to programs that output the same probability as on all data must be at least .
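The counting step, in the same illustrative notation (writing $p \equiv \mu$ for augmented programs whose outputs match those of $\mu$ on all inputs):

```latex
\#\left\{\, p \in \{0,1\}^T : p \equiv \mu \,\right\} \;\ge\; 2^{\,T - K_M(\mu)}
\quad\Longrightarrow\quad
\sum_{p \,\equiv\, \mu} w_0(p) \;\ge\; 2^{-T} \cdot 2^{\,T - K_M(\mu)} \;=\; 2^{-K_M(\mu)}.
```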
Proof
First, we rearrange our expression for the total cross-entropy:
In the second equality, we used the assumption that our inductor adjusts its probabilities according to incoming evidence using Bayesian updating.
According to result 3.8.1 in this book, the semimeasure for programs on a monotone, plain UTM matching the outputs of a program must be . Inserting this bound yields:
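A sketch of the chain of steps in the illustrative notation from above, using the prior-mass lower bound just described and leaving the expectation over labels drawn from $\mu$ implicit until the end:

```latex
\begin{align*}
\sum_{i=1}^{n} -\log_2 P\left(y_i \mid x_i, D_{i-1}\right)
  &= -\log_2 \prod_{i=1}^{n} P\left(y_i \mid x_i, D_{i-1}\right)
   \;=\; -\log_2 \sum_{p} w_0(p) \prod_{i=1}^{n} q_p(x_i)^{\,y_i}\left(1-q_p(x_i)\right)^{1-y_i} \\
  &\le\; -\log_2\Big( 2^{-K_M(\mu)} \prod_{i=1}^{n} \mu(y_i \mid x_i) \Big)
   \;=\; K_M(\mu) \;+\; \sum_{i=1}^{n} -\log_2 \mu(y_i \mid x_i),
\end{align*}
```

and taking the expectation over $y_i \sim \mu(\cdot \mid x_i)$ on both sides turns the final sum into the entropy term.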
Claim 2: A bounded induction still efficiently predicts efficiently predictable data
Setup
Now, we’re going to modify the setup from the previous section, bounding our induction to make it computable. Instead of running programs , where is drawn from the set of all bit strings of length , we restrict ourselves to programs of length that require a maximum of space and a maximum of time steps.
Specifically, we construct a new bounded UTM from our UTM . only has cells of work tape available. If the head moves to the last cell of the work tape, immediately halts. Further, always halts itself after execution steps if it has not halted yet. This bounded induction will no longer be optimal in the sense of having an SI-style bound on its total error that depends only on the entropy and Kolmogorov complexity of the data-generating process , because might not be computable in time and space .
Claim
Our bounded induction still has a weaker upper bound on its total expected prediction error: If there is some program that is an “efficient predictor” of , in the sense that approximates the probability assignments of up to some error we deem ‘acceptable’, our induction will predict the data as well as this efficient predictor, up to an offset equal to the efficient predictor’s description length.
Specifically, for any program included in the set of programs our bounded induction runs over, we have the bound:[9]
for all . So, for any program with minimum description length on that is computable in time and space , the computationally bounded induction will have its total prediction error bounded by the prediction error of plus the minimum description length of on .
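Written out in the illustrative notation, with $M'$ the bounded machine, $P'$ the bounded inductor's predictive distribution, $q_p(y \mid x)$ the probability $p$ assigns to label $y$ on input $x$, and $K_{M'}(p)$ the minimum description length of $p$ among programs that actually run within the space and time budgets, the claim has the shape:

```latex
\sum_{i=1}^{n} \mathbb{E}_{y_i \sim \mu(\cdot \mid x_i)}\left[ -\log_2 P'\left(y_i \mid x_i, D_{i-1}\right) \right]
\;\le\;
K_{M'}(p) \;+\; \sum_{i=1}^{n} \mathbb{E}_{y_i \sim \mu(\cdot \mid x_i)}\left[ -\log_2 q_p(y_i \mid x_i) \right].
```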
Proof
Proceeds completely analogously to the proof for Claim 1, except that we substitute in instead of . So we won’t write it out again.
Because
the only difference between this bound and the bound for the computationally unbounded induction from Equation (1.1) is an additional term , the KL divergence between the data-generating process and the program , summed over all input-label pairs.
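Explicitly, the decomposition in question is the standard split of cross-entropy into entropy plus KL divergence (same notation as above, with $p^*$ the efficient predictor and $q_{p^*}(\cdot \mid x)$ the Bernoulli distribution it outputs on input $x$):

```latex
\sum_{i=1}^{n} \mathbb{E}_{y \sim \mu(\cdot \mid x_i)}\left[ -\log_2 q_{p^*}(y \mid x_i) \right]
\;=\;
\sum_{i=1}^{n} H\left(\mu(\cdot \mid x_i)\right)
\;+\;
\sum_{i=1}^{n} D_{\mathrm{KL}}\left(\mu(\cdot \mid x_i)\,\middle\|\, q_{p^*}(\cdot \mid x_i)\right).
```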
Why would we suppose that a program that gets low prediction error with limited compute exists? Well, for some problems, an efficient predictor probably doesn’t exist. But empirically, we can do pretty well on many prediction problems we encounter in the real world using an amount of compute that’s practically feasible to obtain. Why that is, I don’t know. Perhaps it is somehow related to why a lot of the world seems to abstract well.
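To build intuition for the mechanism behind these bounds (though not for the UTM construction itself), here is a minimal numerical sketch: a Bayesian mixture with a uniform prior over a small, made-up family of predictors. Its cumulative log loss never exceeds that of the best predictor in the family by more than $\log_2 N$ bits, which is the finite-family analogue of the description-length term above. All predictor and data choices below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up family of N "programs": each maps an input x in [0, 1]
# to a probability that the label is 1.
predictors = [
    lambda x: 0.5 * np.ones_like(x),          # always uncertain
    lambda x: np.clip(x, 0.01, 0.99),          # label tracks x
    lambda x: np.clip(1 - x, 0.01, 0.99),      # label tracks 1 - x
    lambda x: np.clip(x**2, 0.01, 0.99),       # some other hypothesis
]
N = len(predictors)

# Data-generating process (plays the role of the true program): y ~ Bernoulli(x).
n = 2000
xs = rng.random(n)
ys = (rng.random(n) < xs).astype(float)

log_weights = np.full(N, -np.log2(N))  # uniform prior, in bits
mixture_loss = 0.0
expert_losses = np.zeros(N)

for x, y in zip(xs, ys):
    probs = np.array([p(np.array(x)) for p in predictors]).ravel()
    like = np.where(y == 1.0, probs, 1.0 - probs)       # per-predictor probability of y
    weights = 2.0 ** (log_weights - log_weights.max())
    weights /= weights.sum()
    mix_prob = float(weights @ like)                     # mixture's probability of y
    mixture_loss += -np.log2(mix_prob)
    expert_losses += -np.log2(like)
    log_weights += np.log2(like)                         # Bayesian update

best = expert_losses.min()
print(f"mixture loss: {mixture_loss:.1f} bits, best predictor: {best:.1f} bits")
print(f"regret: {mixture_loss - best:.2f} bits (bound: {np.log2(N):.2f} bits)")
```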
Claim 3: The bounded induction is still somewhat invariant under our choice of UTM
Given any program that can run on a plain, monotone, bounded UTM in space and time , we can run it on some other plain, monotone, bounded UTM in space and time .[10] Thus, the Kolmogorov complexity of on is bounded by
provided has sufficient computational resources available, because we can just simulate on and then run on the simulated .
Therefore, our total prediction error for bounded induction on any other plain monotone UTM bounded to space and time, where are the constant prefactors for simulating on , will still be bounded by
for all . So, our choice of UTM matters somewhat, since we might need more space and execution time on a different UTM to implement an equally efficient predictor. However, converting to a different UTM only increases the required execution time from to and the required space from to . So, provided has sufficient space and time , the additional prediction error we get is just , which is typically small for most we might consider using.
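Schematically, with $c_{M_1 \to M_2}$ denoting the length in bits of a simulator for $M_1$ written for $M_2$ (my notation):

```latex
K_{M_2}(p) \;\le\; K_{M_1}(p) + c_{M_1 \to M_2}
\quad\Longrightarrow\quad
\sum_{i=1}^{n} \mathbb{E}\left[ -\log_2 P'_{M_2}\left(y_i \mid x_i, D_{i-1}\right) \right]
\;\le\;
\sum_{i=1}^{n} \mathbb{E}\left[ -\log_2 q_p(y_i \mid x_i) \right]
+ K_{M_1}(p) + c_{M_1 \to M_2},
```

provided the bounded machine built from $M_2$ has enough space and time left over to run the simulation.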
Prediction error bound for Bayesian learning on neural networks
Claim 4: We can obtain a similar prediction error bound for Bayesian learning on neural networks
Now, we will derive a similar style of bound for an inductor using Bayesian learning on neural networks.
Setup
In this setting, our inputs are finite-dimensional vectors of real numbers . The labels are still binary, .[11] The inductor now makes its predictions using functions , which take vectors as input and output probabilities for the data labels . The functions are all part of some space of functions . The functions are parametrized by a parameter-function map with . We call the number of neural network parameters. We refer to the probabilities the function outputs for input vector as . As in the previous sections, we assume that the data labels were generated from the inputs by some probabilistic program . Note that we do not make any assumptions about the inputs being drawn IID.
The inductor starts with a uniform prior over some finite hypervolume in parameter space: . The inductor makes its predictions as
where is the probability density the inductor places on parameter configuration after updating on data-label pairs .
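In symbols (again my own notational choices: $W \subset \mathbb{R}^d$ for the support of the prior, $f_\theta(x) \in [0,1]$ for the probability the function $f_\theta$ assigns to label $1$ on input $x$, and $\rho_k$ for the posterior density):

```latex
\begin{align*}
P\left(y_{k+1}=1 \mid x_{k+1}, D_k\right)
  &= \int_{W} \rho_k(\theta)\, f_\theta(x_{k+1})\, d\theta, \\
\rho_k(\theta) &\propto \rho_0(\theta) \prod_{i=1}^{k} f_\theta(x_i)^{\,y_i}\left(1-f_\theta(x_i)\right)^{1-y_i},
\qquad \rho_0(\theta) = \frac{\mathbf{1}\left[\theta \in W\right]}{\mathrm{vol}(W)}.
\end{align*}
```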
Claim
The total expected prediction error of this induction measured in bits of cross-entropy is upper bounded by
for all , all parameter configurations , and all . Here, is the logarithm of the ratio between the volume of the whole prior , and the volume taken up by the set of parameter configurations that are mapped to functions which match the log-probabilities of up to tolerance :
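Rendered in that notation, with $\theta^*$ the reference parameter configuration, $\epsilon$ the tolerance in bits, and $f_\theta(y \mid x)$ shorthand for the probability $f_\theta$ assigns to label $y$ on input $x$, the bound and the volume ratio look roughly like:

```latex
\begin{align*}
\sum_{i=1}^{n} \mathbb{E}_{y_i \sim \mu(\cdot \mid x_i)}\left[ -\log_2 P\left(y_i \mid x_i, D_{i-1}\right) \right]
  &\le\; \log_2 \frac{\mathrm{vol}(W)}{\mathrm{vol}\left(W_\epsilon(\theta^*)\right)}
   \;+\; \sum_{i=1}^{n} \mathbb{E}_{y_i \sim \mu(\cdot \mid x_i)}\left[ -\log_2 f_{\theta^*}(y_i \mid x_i) \right]
   \;+\; n\epsilon, \\
W_\epsilon(\theta^*)
  &= \left\{ \theta \in W \;:\; \left| \log_2 f_\theta(y \mid x) - \log_2 f_{\theta^*}(y \mid x) \right| \le \epsilon
     \;\text{ for all inputs } x \text{ and labels } y \right\}.
\end{align*}
```

(Here the quantifier over inputs $x$ ranges over whatever input domain we care about, as discussed in the comments below.)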
In other words, if there is a parameter configuration that is mapped to a function that is an efficient predictor of the data, in the sense that approximates the probability assignments of up to some error, the inductor will predict the data as well as this efficient predictor , plus a tolerance we can freely choose, up to an offset equal to the cost in bits of specifying the predictor in the parameter space to precision. Intuitively, the reason we don’t just always choose in this bound is that our parameters are real numbers, so describing the location of in parameter space exactly could sometimes take infinitely many bits.
High-level proof summary
The proof works analogously to the proof for the error bound in Equation (1.1), except that sums over programs are replaced by integrals over parameters, and the semimeasure of programs in the prior matching the outputs of efficient predictor on all possible input bit strings is replaced by the volume of , the set of parameter configurations matching the log-probabilities output by function to precision.
Proof
As before, we start by rearranging our expression for the total cross-entropy:
In the second equality, we used the assumption that our inductor adjusts its probabilities over parameter configurations in response to incoming evidence using Bayesian updating.
Now, we restrict the integral to points in , then use the bound we have for points in this set:
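A sketch of that restriction step, in the same notation:

```latex
-\log_2 \int_{W} \rho_0(\theta) \prod_{i=1}^{n} f_\theta(y_i \mid x_i)\, d\theta
\;\le\;
-\log_2\left( \frac{\mathrm{vol}\left(W_\epsilon(\theta^*)\right)}{\mathrm{vol}(W)}\; 2^{-n\epsilon} \prod_{i=1}^{n} f_{\theta^*}(y_i \mid x_i) \right)
\;=\;
\log_2 \frac{\mathrm{vol}(W)}{\mathrm{vol}\left(W_\epsilon(\theta^*)\right)}
\;+\; n\epsilon
\;+\; \sum_{i=1}^{n} -\log_2 f_{\theta^*}(y_i \mid x_i).
```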
Comments
The bound holds for any and any , though the tightest bound will of course come from the and that balance the tradeoff between predictive accuracy and description length best.
The larger is, the more predictive accuracy matters relative to description length, and the smaller we might want to set if we want to obtain the tightest bound.
If we have a more specific question about generalization error, we can get a tighter bound by restricting to range over a more limited set of possible input/output pairs. For example, if we’re confident an image classifier will never receive input images above a certain brightness, we can exclude all images above that brightness level from the set in the definition of , which might make smaller and the upper bound tighter.
This derivation assumes a uniform prior over parameters for simplicity, but it goes through analogously for other priors such as Gaussians.
Relating to SLT quantities
is a pretty similar quantity to the -ball that defines the learning coefficient in Singular Learning Theory. There, the prediction error in-distribution includes the logarithm of a term of the form[12]
where is the ‘training distribution’ over inputs .
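In the notation used above, and going by the description in this section (uniform-prior special case, per the footnote), the SLT-side set is roughly of the form, with $q_{\mathrm{train}}$ standing for the training distribution over inputs:

```latex
V_\epsilon^{\mathrm{SLT}} \;=\; \mathrm{vol}\left( \left\{ \theta \in W \;:\;
\mathbb{E}_{x \sim q_{\mathrm{train}}}\left[ D_{\mathrm{KL}}\left( \mu(\cdot \mid x) \,\middle\|\, f_\theta(\cdot \mid x) \right) \right] \le \epsilon \right\} \right).
```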
Comparing to :
The uniform distribution over all inputs over the whole domain plays the same role for the definition of that the ‘training distribution’ does for .
In , the outputs of the reference function play the role that the ‘data labels’ do in .
cares about the supremum of divergences over the distribution, cares about the average. is defined as the integral over the set of parameter configurations that diverge from the data labels by less than on average over the training data, as scored by KL-divergence, whereas is defined as the volume of the set of points that diverge from by epsilon or less on all data points, as scored by log‑probability difference.
These differences are the price of describing prediction error in general instead of ‘in-distribution’. In SLT, we assume that our data are drawn IID from some known distribution . In that case, average error over that distribution is enough to bound the inductor’s expected average error on future data. Without this assumption of IID sampling, we need to classify functions based on how closely their outputs align with our reference on all possible inputs . Or at least, all inputs we think the inductor might possibly encounter.
In SLT, is also set to in the expression for the prediction error instead of being left free as in Equation (2.1), because the theory is primarily concerned with the infinite data limit , where describing the solution to exact precision is ‘worth it’ and yields a tighter bound. To put that slightly more quantitatively: One term for the prediction error in SLT scales as , while can be shown to only scale as [13], where is a scalar number called the learning coefficient. So, in the vanilla SLT setup, letting go to zero for as yields a tighter bound for the prediction error than setting it to some finite non-zero value. But the bound in this post is supposed to work for any amount of data , so we don’t set to any particular value.
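To make that tradeoff concrete, suppose, purely as a hypothetical scaling, that the volume term behaves like $\log_2\left(\mathrm{vol}(W)/\mathrm{vol}(W_\epsilon)\right) \approx \tilde\lambda \log_2(1/\epsilon) + \mathrm{const}$ for small $\epsilon$, with $\tilde\lambda$ some constant. The $\epsilon$-dependent part of the bound is then roughly

```latex
g(\epsilon) \;=\; \tilde\lambda \log_2\frac{1}{\epsilon} \;+\; n\epsilon,
\qquad
g'(\epsilon) \;=\; -\frac{\tilde\lambda}{\epsilon \ln 2} + n \;=\; 0
\;\;\Longrightarrow\;\;
\epsilon^* \;=\; \frac{\tilde\lambda}{n \ln 2},
\qquad
g(\epsilon^*) \;=\; \tilde\lambda \log_2\frac{n \ln 2}{\tilde\lambda} \;+\; \frac{\tilde\lambda}{\ln 2},
```

which grows like $\tilde\lambda \log_2 n$, matching the familiar SLT-style scaling, and shows why the best choice of $\epsilon$ shrinks roughly like $1/n$ as the dataset grows.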
Open problems and questions
Section TL;DR: What kind of neural network architectures have a prior equivalent to that of a bounded universal Turing machine? How can we estimate the description length in practice?
How do the priors actually relate to each other?
So, we can sort of think about Bayesian learning on neural networks and compute-bounded Solomonoff induction using a similar set of mathematical tools. But how similar are these induction processes really? That is, how much does a uniform prior over parameter configurations on some neural network architecture resemble a uniform prior over programs on some compute-bounded UTM?
Another way of asking that question is whether we can get something like Equation (1.3) for switching between a bounded UTM and a neural network, and vice versa. So, can we get something like:
Conjecture 1: For any parameter vector parametrizing some function , the prediction error of an inductor using bounded UTM is bounded by
where is the description length of the neural network architecture on UTM to precision .[14] Provided has enough time and memory to simulate , of course.
Conjecture 2 (Likely false for arbitrary NN architectures): For any program on bounded UTM , the prediction error of an inductor using neural network architecture is bounded by
where is the description length in bits of bounded UTM on neural network architecture to precision .
If both of these conjectures were true, doing Bayesian learning using neural networks instead of an ensemble of programs would be almost no more of a change than doing Bayesian learning in one programming language instead of another.[15]
My guess is that something vaguely like Conjecture 1 is true. With sufficient compute, we can use a UTM to simulate any neural network architecture to some arbitrary desired floating point precision. But I think Conjecture 2 is generally false, because not every neural network architecture can simulate a bounded UTM. Remember, our definition of ‘neural network’ here was basically just ‘a space of functions parametrized by some parameter-function map ’. A family of polynomials would satisfy that definition. And I doubt a uniform prior over polynomials acts much like a uniform prior over compute-bounded programs.
But maybe Conjecture 2 still holds for some more restricted set of neural network architectures? Like, say, architectures capable of simulating a bounded UTM. RNNs and transformers run in chain-of-thought mode can do this, for example.[16] I originally thought that this condition would be sufficient. But proving the conjecture for transformers in chain-of-thought mode turned out to be trickier than I thought.[17] Maybe you just need to be a bit cleverer about how to do the UTM simulation than I was so far. Or maybe, being able to simulate a bounded UTM actually isn’t a sufficient requirement for Conjecture 2, and the architecture needs to have some additional property. In that case, it’d be interesting to find out what that property is.
What does look like in practice?
For example, does behave like in vanilla SLT in the limit ? That is, for small enough and some reasonable conditions on the NN architecture[18] do we often get something like
where would then be some number analogous to the learning coefficient in vanilla SLT? Can we use stochastic gradient Langevin dynamics (SGLD) sampling to estimate on neural networks in real life, the same way people use it to determine the learning coefficient? If a solution can generalize out-of-distribution, it can certainly generalize in-distribution, but the reverse is generally not true. So, presumably would tend to be smaller than . But how much smaller does it tend to be in practice? In other words, how much does OOD generalization error usually differ from in-distribution generalization error?
Acknowledgments
Thanks to Aram Ebtekar for providing many useful citations. Thanks to Lee Sharkey for helping me come up with a title. Thanks to Cole Wyeth for spotting an unnecessary condition in my original definition of the bounded UTM. And thanks to Aram Ebtekar as well as Mathias Dellago, Alexander Gietelink Oldenziel, Daniel Murfet, David Quarel, Kaarel Hänni, John Wentworth, Dmitry Vaintrob, Jake Mendel, Linda Linsefors, and Scott Aaronson for various discussions and ideas that were causally upstream of my writing this.
1. ^
Though in practice, using the formulas for models trained on finite amounts of data does seem to (maybe?) work alright a lot of the time. In our experience so far at least. Provided we aren’t screwing up things without noticing, which tends to happen a lot in research.
2. ^
It also does other things, but this is maybe its most famous result, and the one that matters most for this post.
3. ^
I expect that basically nothing in the first part is actually novel, though I haven’t managed to track down citations for every result in it in the exact form given here. If you think you know your AIT and don’t need to read this part, you might be right. But I’d advise some caution before skipping it. I wrote it partially because many AIT-knowledgeable people I talked to about this topic turned out not to know these things.[19]
4. ^
As in a program that outputs probabilities as strings. Those probabilities are then used to sample the labels. Note that the program can output probabilities and , so the labels can be deterministic functions of the inputs ; they just don’t need to be.
5. ^
So, not prefix-free. Lots of standard AIT results assume a prefix-free UTM. This post does not. AIT-savvy skim readers beware. If you compare the theorems here to results phrased for the setting of prefix-free programs, you might become pretty confused.
6. ^
See e.g. the start of section three here for more on the definition of monotone Turing machines.
7. ^
As in the standard setup for Solomonoff induction with probabilistic programs, can keep running and printing more output digits indefinitely, representing the output probabilities with ever higher precision.
8. ^
So, if is a deterministic program, this second term will vanish.
9. ^
See also, for example, theorem 1 of this paper, which shows almost the same thing in an RL setting.
10. ^
See the proof here for example.
11. ^
The derivation should work for other kinds of settings as well though, such as next token prediction in language modelling. I’m just sticking to a specific simple case to make the math easier to follow.
12. ^
For the special case of uniform prior, discrete outputs , and our loss function being cross-entropy error measured in bits.
13. ^
SLT convention usually considers prediction error scored in nats, I’m translating to bits here.
14. ^
For some to-be-determined definition of ‘precision’ that makes the conjecture work. Basically, we’d want some measure of the floating‑point precision the UTM uses to approximate the real-number-valued parameters and network inputs.
15. ^
Still only “almost”, because of the extra terms. Those don’t show up when converting between UTMs.
16. ^
The results in those links come with some annoying caveats, so if you want to jump on this you might want to use different constructions to do the UTM emulation.
17. ^
Basically, making unused NN parameters not interfere with the computation no matter what values they’re set to is kind of tough. As a consequence, I keep ending up with prefactors in front of the term.
18. ^
e.g., analyticity.
19. ^
Not judging. I also thought myself somewhat AIT-knowledgeable, but I did not realize some of what it had to say about degeneracy until fairly recently. I spent years being very confused about how learning works because of this.
Comment thread

First of all, this is great stuff and very clearly written.
With regards to this, and the open question: it seems intuitive to me that what you’re describing is an RNN, recalling Andrej Karpathy’s writing on the unreasonable effectiveness of RNNs. More directly: instead of bit-string programs with length T, we have the parameters of the RNN. The “work tape” with fixed length s+1 can be compared to the hidden state vector of the RNN, which also has a fixed length. The enforced halting condition is similar to running an RNN for t steps. Edited to add: I see that you also noted the ability of RNNs to simulate UTMs with regards to Conjecture 2. Sorry for the somewhat redundant comment. However, you noted that you didn’t succeed in proving Conjecture 2 for transformers and chain of thought. Did you succeed for RNNs?
The same could be said of transformers run in chain of thought mode. But I tried deriving conjecture 2 for those, and didn’t quite succeed.
The trouble is that you need to store the programs p in the RNN/transformer weights, and do it in a way that doesn’t ‘waste’ degrees of freedom. Suppose for example that we try to store the code for the programs in the MLPs, using one ReLU neuron to encode each bit via query/key lookups. Then, if we have more neurons than we need because the program p is short, we have a lot of freedom in choosing the weights and biases of those unnecessary neurons. For example, we could set their biases to some very negative value to ensure the neurons never fire, and then set their input and output weights to pretty much any values. So long as the weights stay small enough to not overwhelm the bias, the computation of the network won’t be affected by this, since the ReLUs never fire.
The problem is that this isn’t enough freedom. To get C(μ∗,M2) in the formula without a prefactor >1.0, we’d need the biases and weights for those neurons to be completely free, able to take any value in W.
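A minimal numeric illustration of the dead-neuron freedom described above (sizes and bounds made up): with a sufficiently negative bias, a ReLU unit outputs zero for any bounded incoming activations, so its other weights can be varied freely without changing the network's function, but only within a bounded range, which is where the prefactor problem comes from.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu_unit(x, w_in, b, w_out):
    return w_out * np.maximum(0.0, w_in @ x + b)

x = rng.uniform(-1.0, 1.0, size=8)   # bounded input activations
b = -100.0                            # very negative bias: the neuron never fires...
for _ in range(5):
    w_in = rng.uniform(-1.0, 1.0, size=8)   # ...so these weights are free to vary
    w_out = rng.uniform(-1.0, 1.0)           # ...as is this one
    print(relu_unit(x, w_in, b, w_out))      # always exactly 0.0
```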
EDIT: Wrote the comment before your edit. No, I haven’t tried it for RNNs.
Just to clarify, if there was some way to “fill up” the weights of the RNN/Transformer such that no weights are “free”, that would satisfy the conditions that you’ve set up? As in, if the encoding of p takes up every parameter, that would qualify as a valid solution for the purposes of C(μ∗,M2)? (Forgive me, I’m still very new to AIT and the associated work, and my knowledge is still very poor)
If every bit of every weight were somehow used to store one bit of p, excepting those weights used to simulate the UTM, that should suffice to derive the conjecture, yes.[1]
I think that’s maybe even harder than what I tried to do though. It’s theoretically fine if our scheme is kind of inefficient in terms of how much code it can store in a given number of parameters, so long as the leftover parameter description bits are free to vary.
There’d be some extra trickiness in that under these definitions, the parameters are technically real numbers and thus have infinity bits of storage capacity, though in real life they’re of course actually finite precision floating point numbers.
Yeah, a lot of the stuff about ARNNs having super-Turing computing capacity relies on the analog part of the (A)RNN and the weights therefore being real-valued. RNNs with rational weights are strictly Turing complete, I think. (cf. this article)
Assuming that the bits to parameters encoding can be relaxed, there’s some literature about redundant computations in neural networks. If the feature vectors in a weight matrix aren’t linearly independent, for example, the same computation can be “spread” over many linearly dependent features, with the result that there are no free parameters but the total amount of computational work is the same.
Otherwise, I don’t think it’ll be easy to find a way for the remaining parameters to be fully free to vary, since it is a densely connected network. It might be interesting for you to look into the lottery ticket hypothesis literature, though even there I don’t think the remaining non-ticket weights can be set to any value.
There’s a few other cases like this where we know how various specific forms of simplicity in the computation map onto freedom in the parameters. But those are not enough in this case. We need more freedom than that.