So when trying to work with language data vs. image data, an interesting assumption of the ML vision research community clashes with an assumption of the language research community. For a language model, you represent the logits as a tensor with shape [batch_size, sequence_length, vocab_size]. For each position in the sequence, there is a likelihood value for each possible token at that position.
In vision models, the assumption is that the data will be in the form [batch_size, color_channels, pixel_position]. Pixel position can be represented as two dimensions (height and width) or flattened to one.
See the difference? Sequence position comes first (immediately after the batch dimension), while pixel position comes second (after the color channels). Why? Because a color channel has a particular meaning, and thus it is intuitive for a researcher working with vision data to think about the ‘red channel’ as a thing they might want to separate out and view. What if we thought of 2nd-most-probable tokens the same way? Is it meaningful to read a sequence of all the 1st-most-probable tokens, then a sequence of all the 2nd-most-probable tokens? You could compare the semantic meaning, and the vibe, of the two. But this distinction doesn’t feel as natural for language logits as it does for color channels.
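To make the layout difference concrete, here is a minimal PyTorch sketch (the batch, sequence, and image sizes are illustrative placeholders) contrasting slicing out the red channel of an image batch with reading off the 1st- and 2nd-most-probable tokens at each sequence position:

```python
import torch

# Vision layout: the channel dimension comes right after batch.
images = torch.rand(2, 3, 224, 224)        # [batch_size, color_channels, height, width]
red_channel = images[:, 0]                 # [batch_size, height, width] -- a natural slice

# Language layout: the channel-like vocab dimension comes last.
logits = torch.randn(2, 32, 100_000)       # [batch_size, sequence_length, vocab_size]
top2 = logits.topk(k=2, dim=-1).indices    # [batch_size, sequence_length, 2]
most_probable = top2[..., 0]               # the 1st-most-probable token at every position
second_most_probable = top2[..., 1]        # the 2nd-most-probable token at every position
```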
Somewhat of an oversimplification below, but:
In a vision model, at each position you are trying to transform points in a continuous 3-dimensional space (RGB) to and from the model representation. That is, to embed a pixel you go $c^3 \to \mathbb{R}^{d_\text{model}}$, and to unembed you go $\mathbb{R}^{d_\text{model}} \to c^3$, where $c \in \mathbb{R}$ and $0 \le c < 2^{\text{color\_depth\_in\_bits}}$.
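As a small sketch of that embed/unembed pair (the linear layers and the d_model value are illustrative stand-ins, not any particular model's weights):

```python
import torch
import torch.nn as nn

d_model = 512                            # illustrative model width

embed_pixel = nn.Linear(3, d_model)      # c^3 -> R^{d_model}
unembed_pixel = nn.Linear(d_model, 3)    # R^{d_model} -> c^3

pixel = torch.tensor([0.2, 0.7, 0.1])    # one (R, G, B) point in a continuous space
hidden = embed_pixel(pixel)              # shape [d_model]
rgb_hat = unembed_pixel(hidden)          # shape [3]
```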
In a language model, you are trying to transform 100,000-dimensional categorical data to and from the model representation. That is, to embed a token you go $t \to \mathbb{R}^{d_\text{model}}$, and to unembed you go $\mathbb{R}^{d_\text{model}} \to \mathbb{R}^{d_\text{vocab}}$, where $t \in \mathbb{Z}$ and $0 \le t < d_\text{vocab}$. For embedding, you can think of it as a one-hot encoding $t \to \mathbb{R}^{d_\text{vocab}}$ followed by a linear map $\mathbb{R}^{d_\text{vocab}} \to \mathbb{R}^{d_\text{model}}$, though in practice you just index into a tensor of shape (d_vocab, d_model), because one-hot encoding and then multiplying is a waste of memory and compute. So you can think of a language model as having 100,000 “channels”, which encode “the token is the” / “the token is Bob” / “the token is |”.
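A small sketch of that equivalence (the vocab size and d_model below are placeholder values, kept modest so the example is cheap to run): indexing into the (d_vocab, d_model) table gives the same result as one-hot encoding followed by a matrix multiply, without ever materializing the one-hot vectors.

```python
import torch
import torch.nn.functional as F

d_vocab, d_model = 100_000, 64                  # placeholder sizes
W_embed = torch.randn(d_vocab, d_model)         # embedding table of shape (d_vocab, d_model)

tokens = torch.tensor([17, 42, 99_999])         # integer token ids, 0 <= t < d_vocab

# What you actually do: index into the table.
embedded = W_embed[tokens]                      # [3, d_model]

# What it is conceptually: one-hot encode, then matrix-multiply.
one_hot = F.one_hot(tokens, num_classes=d_vocab).float()   # [3, d_vocab]
embedded_conceptually = one_hot @ W_embed                  # [3, d_model]

assert torch.allclose(embedded, embedded_conceptually)
```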
Yeah, I was playing around with using a VAE to compress the logits output from a language transformer. I did indeed settle on treating the vocab size (e.g. 100,000) as the ‘channels’.
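Concretely, that choice just amounts to permuting the logits into a channels-first layout before a convolutional encoder sees them. A minimal sketch of that reshaping step, assuming a hypothetical Conv1d first layer rather than whatever encoder was actually used:

```python
import torch
import torch.nn as nn

batch_size, seq_len, d_vocab = 2, 64, 100_000

logits = torch.randn(batch_size, seq_len, d_vocab)   # [batch_size, sequence_length, vocab_size]
x = logits.permute(0, 2, 1)                          # [batch_size, vocab_size, sequence_length]

# Hypothetical first encoder layer: treat the vocab dimension as input
# channels, the way Conv1d expects channels-first data.
encoder_in = nn.Conv1d(in_channels=d_vocab, out_channels=64, kernel_size=1)
z = encoder_in(x)                                    # [batch_size, 64, sequence_length]
```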
The computer vision researchers just chose the wrong standard. Even the images they train on come in [pixel_position, color_channels] format.
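This is easy to check: common image loaders hand you height × width × channels arrays, and the channels-first layout only appears once a framework like PyTorch converts them. For example:

```python
import numpy as np
from PIL import Image
from torchvision import transforms

# Stand-in image; a real file loaded with Image.open(...) behaves the same way.
img = Image.new("RGB", (640, 480))

print(np.asarray(img).shape)             # (480, 640, 3)  -- [pixel_position, color_channels]
print(transforms.ToTensor()(img).shape)  # torch.Size([3, 480, 640])  -- [color_channels, pixel_position]
```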