So when trying to work with language data vs. image data, an interesting assumption of the ML vision research community clashes with an assumption of the language research community. For a language model, you represent the logits as a tensor with shape [batch_size, sequence_length, vocab_size]. For each position in the sequence, there is a likelihood value for each possible token at that position.
In vision models, the assumption is that the data will be in the form [batch_size, color_channels, pixel_position]. Pixel position can be represented as two dimensions (height and width) or flattened to one.
See the difference? Sequence position comes first (immediately after the batch dimension), while pixel position comes second (after the color channels). Why? Because a color channel has a particular meaning, and thus it is intuitive for a researcher working with vision data to think about the ‘red channel’ as a thing they might want to separate out and view. What if we thought of 2nd-most-probable tokens the same way? Is it meaningful to read a sequence of all the 1st-most-probable tokens, then a sequence of all the 2nd-most-probable tokens? You could compare the semantic meaning, and the vibe, of the two. But this distinction doesn’t feel as natural for language logits as it does for color channels.
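To make the layout difference concrete, here is a minimal PyTorch sketch (the batch, sequence, and image sizes are illustrative placeholders) contrasting slicing out the red channel of an image batch with reading off the 1st- and 2nd-most-probable tokens at each sequence position:

```python
import torch

# Vision layout: the channel dimension comes right after batch.
images = torch.rand(2, 3, 224, 224)        # [batch_size, color_channels, height, width]
red_channel = images[:, 0]                 # [batch_size, height, width] -- a natural slice

# Language layout: the channel-like vocab dimension comes last.
logits = torch.randn(2, 32, 100_000)       # [batch_size, sequence_length, vocab_size]
top2 = logits.topk(k=2, dim=-1).indices    # [batch_size, sequence_length, 2]
most_probable = top2[..., 0]               # the 1st-most-probable token at every position
second_most_probable = top2[..., 1]        # the 2nd-most-probable token at every position
```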
Somewhat of an oversimplification below, but:
In a vision model, at each position you are trying to transform points in a continuous 3-dimensional space (RGB) to and from the model representation. That is, to embed a pixel you go $c^3 \to \mathbb{R}^{d_\text{model}}$, and to unembed you go $\mathbb{R}^{d_\text{model}} \to c^3$, where $c \in \mathbb{R}$ and $0 \le c < 2^{\text{color\_depth\_in\_bits}}$.
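As a small sketch of that embed/unembed pair (the linear layers and the d_model value are illustrative stand-ins, not any particular model's weights):

```python
import torch
import torch.nn as nn

d_model = 512                            # illustrative model width

embed_pixel = nn.Linear(3, d_model)      # c^3 -> R^{d_model}
unembed_pixel = nn.Linear(d_model, 3)    # R^{d_model} -> c^3

pixel = torch.tensor([0.2, 0.7, 0.1])    # one (R, G, B) point in a continuous space
hidden = embed_pixel(pixel)              # shape [d_model]
rgb_hat = unembed_pixel(hidden)          # shape [3]
```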
In a language model, you are trying to transform 100,000-dimensional categorical data to and from the model representation. That is, to embed a token you go $t \to \mathbb{R}^{d_\text{model}}$, and to unembed you go $\mathbb{R}^{d_\text{model}} \to \mathbb{R}^{d_\text{vocab}}$, where $t \in \mathbb{Z}$ and $0 \le t < d_\text{vocab}$. For embedding, you can think of it as a one-hot encoding $t \to \mathbb{R}^{d_\text{vocab}}$ followed by a linear map $\mathbb{R}^{d_\text{vocab}} \to \mathbb{R}^{d_\text{model}}$, though in practice you just index into a tensor of shape (d_vocab, d_model), because one-hot encoding and then multiplying is a waste of memory and compute. So you can think of a language model as having 100,000 “channels”, which encode “the token is the” / “the token is Bob” / “the token is |”.
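A small sketch of that equivalence (the vocab size and d_model below are placeholder values, kept modest so the example is cheap to run): indexing into the (d_vocab, d_model) table gives the same result as one-hot encoding followed by a matrix multiply, without ever materializing the one-hot vectors.

```python
import torch
import torch.nn.functional as F

d_vocab, d_model = 100_000, 64                  # placeholder sizes
W_embed = torch.randn(d_vocab, d_model)         # embedding table of shape (d_vocab, d_model)

tokens = torch.tensor([17, 42, 99_999])         # integer token ids, 0 <= t < d_vocab

# What you actually do: index into the table.
embedded = W_embed[tokens]                      # [3, d_model]

# What it is conceptually: one-hot encode, then matrix-multiply.
one_hot = F.one_hot(tokens, num_classes=d_vocab).float()   # [3, d_vocab]
embedded_conceptually = one_hot @ W_embed                  # [3, d_model]

assert torch.allclose(embedded, embedded_conceptually)
```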
Yeah, I was playing around with using a VAE to compress the logits output from a language transformer. I did indeed settle on treating the vocab size (e.g. 100,000) as the ‘channels’.
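Concretely, that choice just amounts to permuting the logits into a channels-first layout before a convolutional encoder sees them. A minimal sketch of that reshaping step, assuming a hypothetical Conv1d first layer rather than whatever encoder was actually used:

```python
import torch
import torch.nn as nn

batch_size, seq_len, d_vocab = 2, 64, 100_000

logits = torch.randn(batch_size, seq_len, d_vocab)   # [batch_size, sequence_length, vocab_size]
x = logits.permute(0, 2, 1)                          # [batch_size, vocab_size, sequence_length]

# Hypothetical first encoder layer: treat the vocab dimension as input
# channels, the way Conv1d expects channels-first data.
encoder_in = nn.Conv1d(in_channels=d_vocab, out_channels=64, kernel_size=1)
z = encoder_in(x)                                    # [batch_size, 64, sequence_length]
```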
The computer vision researchers just chose the wrong standard. Even the images they train on come in [pixel_position, color_channels] format.
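This is easy to check: common image loaders hand you height × width × channels arrays, and the channels-first layout only appears once a framework like PyTorch converts them. For example:

```python
import numpy as np
from PIL import Image
from torchvision import transforms

# Stand-in image; a real file loaded with Image.open(...) behaves the same way.
img = Image.new("RGB", (640, 480))

print(np.asarray(img).shape)             # (480, 640, 3)  -- [pixel_position, color_channels]
print(transforms.ToTensor()(img).shape)  # torch.Size([3, 480, 640])  -- [color_channels, pixel_position]
```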