First of all, this is great stuff and very clearly written.
With regards to this:
Now, we’re going to modify the setup from the previous section, bounding our induction to make it computable. Instead of running programs $(p_{\text{read}}, x_n, p)$, where p is drawn from the set of all bit strings of length T, we restrict ourselves to programs of length T that require a maximum of s space and a maximum of t time steps.
Specifically, we construct a new bounded UTM $M_2$ from our UTM $M_1$. $M_2$ only has $s+1$ cells of work tape available. If the head moves to the last cell of the work tape, $M_2$ immediately halts. Further, $M_2$ always halts itself after t execution steps if it has not halted yet. This bounded induction will no longer be optimal in the sense of having an SI-style bound on its total error that depends only on the entropy and Kolmogorov complexity of the data-generating process μ, because μ might not be computable in time t and space s.
And the open question
What kind of neural network architectures have a prior equivalent to that of a bounded universal Turing machine?
It seems intuitive to me that what you’re describing is an RNN. Recalling Andrej Karpathy’s writing on the unreasonable effectiveness of RNNs:
Moreover, as we’ll see in a bit, RNNs combine the input vector with their state vector with a fixed (but learned) function to produce a new state vector. This can in programming terms be interpreted as running a fixed program with certain inputs and some internal variables. Viewed this way, RNNs essentially describe programs. In fact, it is known that RNNs are Turing-Complete in the sense that they can simulate arbitrary programs (with proper weights). But similar to universal approximation theorems for neural nets, you shouldn’t read too much into this. In fact, forget I said anything.
If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs.
More directly: Instead of bit-string programs with length T, we have the parameters of the RNN. The “work tape” with fixed length $s+1$ can be compared to the hidden state vector of the RNN, which also has a fixed length. The enforced halting condition is similar to running an RNN for t steps.
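To make the analogy concrete, here’s a minimal numpy sketch (purely illustrative; the state size, step count, and the tanh update are placeholders I made up, not anything from your post): the fixed-size hidden state stands in for $M_2$’s bounded work tape, the fixed parameters stand in for the program p, and cutting the recurrence off after t steps stands in for the forced halt.

```python
import numpy as np

# Illustrative sketch of the analogy (sizes and update rule are made up).
s, t = 16, 100            # space bound -> hidden-state size, time bound -> step count
d_in, d_out = 8, 2

rng = np.random.default_rng(0)
# The RNN's fixed parameters play the role of the program p.
W_h = rng.normal(scale=0.1, size=(s + 1, s + 1))
W_x = rng.normal(scale=0.1, size=(s + 1, d_in))
W_o = rng.normal(scale=0.1, size=(d_out, s + 1))

def run_bounded(x):
    """Recurrent update on a fixed-size state, cut off after t steps.

    The (s+1)-dimensional hidden state is the analogue of M2's bounded work tape;
    stopping the loop at t steps is the analogue of M2's forced halt.
    """
    h = np.zeros(s + 1)
    for _ in range(t):
        h = np.tanh(W_h @ h + W_x @ x)
    return W_o @ h

print(run_bounded(rng.normal(size=d_in)))
```

Training would then be optimization over these fixed parameters, i.e. over “programs” in Karpathy’s sense.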
Edited to add: I see that you also noted the ability of RNNs to simulate UTMs with regards to conjecture 2. Sorry for the somewhat redundant comment. However, you noted that you didn’t succeed in proving conjecture 2 wrt transformers and chain of thought. Did you succeed for RNNs?
The same could be said of transformers run in chain of thought mode. But I tried deriving conjecture 2 for those, and didn’t quite succeed.
The trouble is that you need to store the programs p in the RNN/transformer weights, and do it in a way that doesn’t ‘waste’ degrees of freedom. Suppose for example that we try to store the code for the programs in the MLPs, using one ReLU neuron to encode each bit via query/key lookups. Then, if we have more neurons than we need because the program p is short, we have a lot of freedom in choosing the weights and biases of those unnecessary neurons. For example, we could set their biases to some very negative value to ensure the neurons never fire, and then set their input and output weights to pretty much any values. So long as the weights stay small enough to not overwhelm the bias, the computation of the network won’t be affected by this, since the ReLUs never fire.
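To spell out that dead-neuron freedom with a toy numpy check (sizes and the bias value are made up): once a neuron’s bias sits far enough below zero relative to the range of its pre-activation on the inputs we care about, its input and output weights can be set to almost anything without changing the network’s output.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

d = 10
x = rng.uniform(-1.0, 1.0, size=(1000, d))       # bounded inputs

# A neuron that actually encodes part of the program:
w_used, b_used, a_used = rng.normal(size=d), 0.3, 1.0

# An "unnecessary" neuron: very negative bias, arbitrary (but not huge) weights.
# Its ReLU never fires on these inputs, so its weights are free to vary.
w_dead, b_dead, a_dead = rng.normal(size=d), -100.0, rng.normal()

out_without = a_used * relu(x @ w_used + b_used)
out_with = out_without + a_dead * relu(x @ w_dead + b_dead)

print(np.allclose(out_without, out_with))        # True: the dead neuron contributes nothing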
The problem is that this isn’t enough freedom. To get $C(\mu^*, M_2)$ in the formula without a prefactor >1.0, we’d need the biases and weights for those neurons to be completely free, able to take any value in W.
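Roughly, the counting I have in mind (in my own loose notation, not the post’s): say the network has N parameters, each ranging over a finite set W, and that simulating the UTM plus storing p pins down $N - n_{\text{free}}$ of them, leaving $n_{\text{free}}$ genuinely unconstrained. Under a uniform prior over parameter settings,

$$P(\text{parameters implement } \mu^*) \;=\; \frac{|W|^{\,n_{\text{free}}}}{|W|^{N}} \;=\; 2^{-(N - n_{\text{free}})\log_2 |W|},$$

which gives a bound of the form $2^{-C(\mu^*, M_2)}$ only if the pinned-down parameters cost about $C(\mu^*, M_2)$ bits in total. If the “unnecessary” neurons are instead merely restricted to some subset of their possible values (e.g. “small enough not to overwhelm the bias”), the numerator shrinks for every such neuron, and the exponent picks up an overhead that scales with the number of unused neurons rather than with $C(\mu^*, M_2)$, which is the sort of thing that spoils getting the bound with prefactor 1.0.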
EDIT: Wrote the comment before your edit. No, I haven’t tried it for RNNs.
Just to clarify: if there were some way to “fill up” the weights of the RNN/transformer such that no weights are “free”, would that satisfy the conditions that you’ve set up? As in, if the encoding of p takes up every parameter, would that qualify as a valid solution for the purposes of $C(\mu^*, M_2)$? (Forgive me, I’m still very new to AIT and the associated work, and my knowledge is still very poor.)
If every bit of every weight were somehow used to store one bit of p, excepting those weights used to simulate the UTM, that should suffice to derive the conjecture, yes.[1]
I think that’s maybe even harder than what I tried to do though. It’s theoretically fine if our scheme is kind of inefficient in terms of how much code it can store in a given number of parameters, so long as the leftover parameter description bits are free to vary.
There’d be some extra trickiness in that, under these definitions, the parameters are technically real numbers and thus have infinitely many bits of storage capacity, though in real life they’re of course actually finite-precision floating-point numbers.
the parameters are technically real numbers and thus have infinitely many bits of storage capacity
Yeah, a lot of the stuff about ARNNs having super-Turing computing capacity relies on the analog part of the (A)RNN, and the weights therefore being real-valued. RNNs with rational weights are strictly Turing-complete, I think (cf. this article).
If every bit of every weight were somehow used to store one bit of p, excepting those weights used to simulate the UTM, that should suffice to derive the conjecture, yes
It’s theoretically fine if our scheme is kind of inefficient in terms of how much code it can store in a given number of parameters
Assuming that the bits-to-parameters encoding can be relaxed, there’s some literature about redundant computations in neural networks. If the feature vectors in a weight matrix aren’t linearly independent, for example, the same computation can be “spread” over many linearly dependent features, with the result that there are no free parameters but the total amount of computational work is the same.
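For concreteness, here’s a toy numpy version of that “spreading” (my own construction, just to illustrate the point, not taken from that literature): a single ReLU feature is rewritten as several hidden neurons whose weight rows are all positive multiples of the same direction, with output weights chosen so the contributions sum to the original. Every parameter now participates in the computation, but the function computed is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

d = 5
v = rng.normal(size=d)        # a single feature direction
a = 2.0                       # output weight of the "compact" network

def small_net(x):
    # one hidden ReLU neuron: a * relu(v . x)
    return a * relu(v @ x)

# Spread the same feature over 4 linearly dependent hidden neurons.
c = np.array([0.5, 1.0, 2.0, 4.0])       # positive scales; every row of W1 is parallel to v
W1 = c[:, None] * v[None, :]
w2 = rng.normal(size=4)
w2[-1] = (a - w2[:3] @ c[:3]) / c[-1]    # make sum_i w2_i * c_i equal a

def big_net(x):
    # relu(c_i * (v . x)) == c_i * relu(v . x) since c_i > 0, so this equals small_net(x)
    return w2 @ relu(W1 @ x)

xs = rng.normal(size=(1000, d))
print(np.allclose([small_net(x) for x in xs], [big_net(x) for x in xs]))   # True
```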
Otherwise, I don’t think it’ll be easy to find a way for the remaining parameters to be fully free to vary, since it is a densely connected network. It might be interesting for you to look into the lottery ticket hypothesis literature, though even there I don’t think the remaining non-ticket weights can be set to any value.
Assuming that the bits-to-parameters encoding can be relaxed, there’s some literature about redundant computations in neural networks. If the feature vectors in a weight matrix aren’t linearly independent, for example, the same computation can be “spread” over many linearly dependent features, with the result that there are no free parameters but the total amount of computational work is the same.
There are a few other cases like this where we know how various specific forms of simplicity in the computation map onto freedom in the parameters. But those are not enough in this case. We need more freedom than that.