Umm...the vision? How did they even train it?
Assuming they did it like Gato:
• Images are first transformed into sequences of non-overlapping 16 × 16 patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized between [−1, 1] and divided by the square-root of the patch size (i.e. √16 = 4).
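For concreteness, a minimal NumPy sketch of that preprocessing as the paper describes it (the function name, the uint8 input convention, and the output layout are my own assumptions, not the paper's):

```python
import numpy as np

def patchify_gato_style(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an HxWxC uint8 image into non-overlapping patches in raster
    order, normalize pixels to [-1, 1], and divide by sqrt(patch_size)."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    x = image.astype(np.float32) / 127.5 - 1.0   # uint8 [0, 255] -> [-1, 1]
    x /= np.sqrt(patch_size)                     # i.e. divide by 4 for 16x16 patches
    # Cut into a raster-order sequence of patches: (num_patches, p, p, c).
    x = x.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size, patch_size, c)

patches = patchify_gato_style(np.zeros((224, 224, 3), dtype=np.uint8))
print(patches.shape)  # (196, 16, 16, 3)
```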
There are many easy ways to incorporate vision. Vision+text models are a dime a dozen these days—as I said, this currently looks like ‘DALL-E 1 but bigger’ (VQVAE tokens → token sequence → autoregressive modeling of text/image tokens). What we have seen so far doesn’t look like 3 years of progress by the best DL researchers.
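To make the 'DALL-E 1 but bigger' recipe concrete: the caption is tokenized, the image is compressed into a grid of discrete VQVAE codes, and the two are concatenated into one sequence for a decoder-only transformer to model autoregressively. A toy sketch, with vocab and grid sizes as illustrative assumptions rather than DALL-E's exact numbers:

```python
import torch

# Illustrative vocab sizes: text tokens and discrete VQVAE image codes
# are merged into one flat vocabulary by offsetting the image codes.
TEXT_VOCAB, IMAGE_VOCAB = 50_000, 8_192

def build_joint_sequence(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Concatenate caption tokens and (offset) image codes into the single
    sequence a decoder-only transformer is trained on autoregressively."""
    return torch.cat([text_ids, image_codes + TEXT_VOCAB], dim=-1)

text_ids = torch.randint(0, TEXT_VOCAB, (1, 64))        # tokenized caption
image_codes = torch.randint(0, IMAGE_VOCAB, (1, 1024))  # flattened 32x32 VQVAE grid
seq = build_joint_sequence(text_ids, image_codes)
print(seq.shape)  # torch.Size([1, 1088]); next-token prediction over this sequence
```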
OpenAI has transitioned from being a purely research company to an engineering one. GPT-3 was still research, after all, and it was trained with a relatively small amount of compute. After that, they had to build infrastructure to serve the models via API, and new supercomputing infrastructure to efficiently train models with 100x the compute of GPT-3.
The fact that we are openly hearing rumours of GPT-5 being trained, and that nobody is denying them, suggests they will likely ship a new version every year or so from now on.
Earlier this month, PaLM-E gave a hint of one way to incorporate vision into LLMs (statement, paper), though obviously it's a different company, so GPT-4 might have taken a different approach. Choice quote from the paper:

Inputs such as images and state estimates are embedded into the same latent embedding as language tokens and processed by the self-attention layers of a Transformer-based LLM in the same way as text.
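Taken at face value, that is just a learned projection into the LLM's token-embedding space, after which image "tokens" are indistinguishable from text tokens to the self-attention layers. A minimal PyTorch sketch, with every name and dimension an illustrative assumption rather than anything from the paper:

```python
import torch
import torch.nn as nn

# All names and dimensions here are illustrative, not PaLM-E's.
D_MODEL, D_VISION, VOCAB = 4096, 1024, 32_000

token_embedding = nn.Embedding(VOCAB, D_MODEL)    # the LLM's usual token lookup
vision_projection = nn.Linear(D_VISION, D_MODEL)  # maps image features into that space

def embed_multimodal(text_ids: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
    text_emb = token_embedding(text_ids)         # (B, T_text, D_MODEL)
    image_emb = vision_projection(image_feats)   # (B, T_img, D_MODEL)
    # From here on, the self-attention layers see one sequence and treat
    # the projected image "tokens" exactly like text tokens.
    return torch.cat([image_emb, text_emb], dim=1)

seq = embed_multimodal(torch.randint(0, VOCAB, (1, 16)),   # text token ids
                       torch.randn(1, 256, D_VISION))      # e.g. ViT patch features
print(seq.shape)  # torch.Size([1, 272, 4096])
```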