bandwidth might not be better in this case; it isn’t in all cases
A several-thousand-dimensional vector can contain far more information than an integer between 1 and ~200K. The implementation is likely painful, but I can't see a world where the optimal bandwidth, given a good implementation of both, is lower.
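A quick back-of-envelope comparison makes the gap concrete. The numbers here are illustrative assumptions, not from the thread: a ~200K-token vocabulary and a 4096-dimensional float16 hidden vector.

```python
import math

# Information capacity of one sampled token vs. one hidden vector.
# Assumptions (hypothetical, for illustration): vocab of ~200K tokens,
# hidden state of 4096 dimensions at float16 precision.
vocab_size = 200_000
token_bits = math.log2(vocab_size)   # ~17.6 bits per discrete token

dims = 4096
bits_per_dim = 16                    # float16
vector_bits = dims * bits_per_dim    # 65,536 bits per vector

print(f"token: {token_bits:.1f} bits, vector: {vector_bits} bits")
print(f"ratio: {vector_bits / token_bits:.0f}x")
```

Even granting that most of those vector bits aren't usable information, the raw channel capacity differs by orders of magnitude.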
The transformer already has thousands of dimensions available through attention, no? How much does removing the tokenization buy you on top of that? I agree it buys you some, but it seems unclear how much.
A lot, because the only thing that is recurrent is the text/vector CoT. The residual stream is very rich, but the number of sequential computation steps is bounded by the number of layers; without some recurrence, intermediate information can't be sent back to the beginning.
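The depth argument can be sketched numerically. The layer and step counts below are hypothetical, chosen only to show how CoT multiplies sequential depth:

```python
# Sequential-depth sketch (hypothetical numbers): within one forward
# pass, a transformer with L layers performs at most L sequential
# steps of computation. Emitting T CoT steps and feeding each one
# back in as input multiplies the available sequential depth by T.
layers = 80        # assumed layer count
cot_steps = 1000   # assumed number of CoT tokens (or vectors)

depth_single_pass = layers               # bounded by architecture
depth_with_cot = layers * cot_steps      # recurrence through the CoT

print(depth_single_pass, depth_with_cot)
```

The point is that the CoT channel, whether text or neuralese, is the only loop that carries intermediate results back to layer 1.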
But there are systems that work better with lower bandwidth or have deliberately lower bandwidth, like autoencoders.
I understand that the bandwidth is certainly higher for one than the other, but this might not be an advantage in this circumstance, or it could be an advantage in some respects but a greater disadvantage in others.
The point of an autoencoder is to form good representations, not to perform well. I'm struggling to think of any other examples where low bandwidth is good that aren't just implementation issues (and, again, in current systems text CoT > neuralese, so evidently low bandwidth can be good).