The same could be said of transformers run in chain of thought mode. But I tried deriving conjecture 2 for those, and didn’t quite succeed.
The trouble is that you need to store the programs p in the RNN/transformer weights, and do it in a way that doesn’t ‘waste’ degrees of freedom. Suppose for example that we try to store the code for the programs in the MLPs, using one ReLU neuron to encode each bit via query/key lookups. Then, if we have more neurons than we need because the program p is short, we have a lot of freedom in choosing the weights and biases of those unnecessary neurons. For example, we could set their biases to some very negative value to ensure the neurons never fire, and then set their input and output weights to pretty much any values. So long as the weights stay small enough to not overwhelm the bias, the computation of the network won’t be affected by this, since the ReLUs never fire.
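To make the "silenced spare neurons" picture concrete, here is a toy numpy sketch (just an illustration of the freedom argument, not the actual bit-storage construction; all sizes and names are made up). A few hidden neurons get a large negative bias, and their input/output weights can then be resampled without changing the network's output.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 8, 16, 4
n_used = 10                       # neurons that actually encode program bits
n_spare = d_hidden - n_used       # leftover neurons we'd like to be "free"

W_in = rng.normal(size=(d_hidden, d_in))
b = rng.normal(size=d_hidden)
W_out = rng.normal(size=(d_out, d_hidden))

# Silence the spare neurons: with inputs and weights of order 1, a bias of
# -100 keeps their pre-activations negative, so their ReLU outputs are 0.
b[n_used:] = -100.0

def mlp(x, W_in, b, W_out):
    h = np.maximum(W_in @ x + b, 0.0)   # ReLU hidden layer
    return W_out @ h

x = rng.normal(size=d_in)
y_ref = mlp(x, W_in, b, W_out)

# Resample the spare neurons' input and output weights arbitrarily (kept
# small enough that the -100 bias still dominates their pre-activations).
W_in2, W_out2 = W_in.copy(), W_out.copy()
W_in2[n_used:, :] = rng.normal(size=(n_spare, d_in))
W_out2[:, n_used:] = rng.normal(size=(d_out, n_spare))

print(np.allclose(y_ref, mlp(x, W_in2, b, W_out2)))  # True: they never fire
```

The catch described above is that this freedom is bounded (the perturbations must stay small relative to the bias), rather than the weights being able to take any value in W.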
The problem is that this isn’t enough freedom. To get C(μ∗,M2) in the formula without a prefactor >1.0, we’d need the biases and weights for those neurons to be completely free, able to take any value in W.
EDIT: Wrote the comment before your edit. No, I haven’t tried it for RNNs.
Just to clarify, if there were some way to “fill up” the weights of the RNN/Transformer such that no weights are “free”, that would satisfy the conditions that you’ve set up? As in, if the encoding of p takes up every parameter, that would qualify as a valid solution for the purposes of C(μ∗,M2)? (Forgive me, I’m still very new to AIT and the associated work, and my knowledge is still very poor)
If every bit of every weight were somehow used to store one bit of p, excepting those weights used to simulate the UTM, that should suffice to derive the conjecture, yes.[1]
I think that’s maybe even harder than what I tried to do though. It’s theoretically fine if our scheme is kind of inefficient in terms of how much code it can store in a given number of parameters, so long as the leftover parameter description bits are free to vary.
There’d be some extra trickiness in that, under these definitions, the parameters are technically real numbers and thus have infinitely many bits of storage capacity, though in real life they’re of course finite-precision floating point numbers.
the parameters are technically real numbers and thus have infinitely many bits of storage capacity
Yeah, a lot of the stuff about ARNNs having super-Turing computing capacity relies on the analog part of the (A)RNN, and hence on the weights being real-valued. RNNs with rational weights are strictly Turing-complete, I think. (cf. this article)
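To illustrate why real- or rational-valued weights carry unbounded information (and finite-precision floats don't), here's a hedged toy sketch of a base-4-style encoding in the spirit of those constructions (illustrative only, not the exact scheme from the article): any finite bit string can be packed into the fractional expansion of a single exact number and read back out.

```python
from fractions import Fraction

def encode(bits):
    """Pack bits into a rational in [0, 1) using base-4 digits 1 and 3."""
    w = Fraction(0)
    for i, bit in enumerate(bits, start=1):
        w += Fraction(2 * bit + 1, 4 ** i)
    return w

def decode(w, n):
    """Recover n bits from the base-4 expansion of w."""
    bits = []
    for _ in range(n):
        w *= 4
        digit = int(w)            # 1 or 3 by construction
        bits.append((digit - 1) // 2)
        w -= digit
    return bits

program = [1, 0, 1, 1, 0, 0, 1]   # an arbitrary (hypothetical) bit string
w = encode(program)
assert decode(w, len(program)) == program
# A finite-precision float32 weight, by contrast, stores at most 32 bits.
```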
If every bit of every weight were somehow used to store one bit of p, excepting those weights used to simulate the UTM, that should suffice to derive the conjecture, yes
It’s theoretically fine if our scheme is kind of inefficient in terms of how much code it can store in a given number of parameters
Assuming that the bits-to-parameters encoding can be relaxed, there’s some literature on redundant computations in neural networks. If the feature vectors in a weight matrix aren’t linearly independent, for example, the same computation can be “spread” over many linearly dependent features, with the result that there are no free parameters but the total amount of computational work is the same.
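As a toy numpy illustration of that "spreading" trick (all names and sizes made up): a single ReLU feature is replaced by several positively scaled copies whose output weights are rescaled so their contributions sum back to the original, leaving the function unchanged while every parameter is "in use".

```python
import numpy as np

rng = np.random.default_rng(1)
d_in = 6

# Original single ReLU feature: u * relu(v @ x + b)
v = rng.normal(size=d_in)
b = rng.normal()
u = rng.normal()

def original(x):
    return u * np.maximum(v @ x + b, 0.0)

# "Spread" it over k linearly dependent copies: a_i > 0 scales the input
# side of copy i, and c_i / a_i rescales its output weight, with sum(c_i) = 1,
# so the copies' contributions add back up to the original feature.
k = 4
a = rng.uniform(0.5, 2.0, size=k)     # positive input-side scales
c = rng.uniform(0.1, 1.0, size=k)
c /= c.sum()                          # output-side mixture summing to 1

V = np.outer(a, v)                    # k rows, all scalar multiples of v
B = a * b
U = (c / a) * u

def spread(x):
    return U @ np.maximum(V @ x + B, 0.0)

x = rng.normal(size=d_in)
print(np.allclose(original(x), spread(x)))  # True: same function, no free params
```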
Otherwise, I don’t think it’ll be easy to find a way for the remaining parameters to be fully free to vary, since it is a densely connected network. It might be interesting for you to look into the lottery ticket hypothesis literature, though even there I don’t think the remaining non-ticket weights can be set to any value.
Assuming that the bits-to-parameters encoding can be relaxed, there’s some literature on redundant computations in neural networks. If the feature vectors in a weight matrix aren’t linearly independent, for example, the same computation can be “spread” over many linearly dependent features, with the result that there are no free parameters but the total amount of computational work is the same.
There are a few other cases like this where we know how various specific forms of simplicity in the computation map onto freedom in the parameters. But those are not enough in this case. We need more freedom than that.