I think one lesson of superposition research is that neural nets are the compressed version. The world is really complicated, and NNs that try to model it are incentivized to try to squeeze as much in as they can.
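A minimal sketch of the geometric fact the superposition point leans on (my own illustration, not from the comment; the numbers are arbitrary): in d dimensions you can pack far more than d nearly-orthogonal "feature" directions, at the cost of a little interference, which is exactly the sense in which a network can squeeze more of the world in than it has neurons.

```python
# Toy illustration: many more random unit "features" than dimensions,
# yet pairwise overlaps (interference) stay small.
import numpy as np

rng = np.random.default_rng(0)
d, n = 128, 1024                      # 1024 features squeezed into 128 dimensions

features = rng.standard_normal((n, d))
features /= np.linalg.norm(features, axis=1, keepdims=True)   # unit vectors

# Interference between features = off-diagonal cosine similarities.
overlaps = features @ features.T
np.fill_diagonal(overlaps, 0.0)

print(f"{n} features in {d} dims")
print(f"max |interference| : {np.abs(overlaps).max():.3f}")
print(f"mean |interference|: {np.abs(overlaps).mean():.3f}")
```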
It’s also worth noting that LLMs are not learning directly from the raw input stream but from a compressed form of that data (LLMs learn on compressed data): the models are fed tokenized text, and the tokenizer acts as a compressor. This benefits the models by giving them a more information-rich context.
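A quick way to see the "tokenizer as compressor" point, assuming the open-source `tiktoken` library as a stand-in for any BPE tokenizer (my example, not something mentioned above): the model sees far fewer tokens than raw characters, so a fixed context window holds more underlying text.

```python
# Compare raw character count to token count for the same text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # a BPE vocabulary used by several OpenAI models

text = ("The world is really complicated, and neural networks that try to "
        "model it are incentivized to squeeze as much in as they can. ") * 10

tokens = enc.encode(text)
print(f"characters : {len(text)}")
print(f"tokens     : {len(tokens)}")
print(f"chars/token: {len(text) / len(tokens):.2f}")   # roughly 4 chars per token for English
```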
Would you say that tokenization is part of the architecture?
And, in your wildest moments, would you say that language is also part of the architecture :)? I mean, the latent space is probably mapping either a) brain states or b) world states, right? Is everything between latent spaces architecture?
Thanks!
I guess I also wrote a hot take about this.