Umm...the vision? How did they even train it?
Assuming they did it like Gato:
• Images are first transformed into sequences of non-overlapping 16 × 16 patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized between [−1, 1] and divided by the square-root of the patch size (i.e. √16 = 4).
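For concreteness, a minimal NumPy sketch of that preprocessing as the paper describes it (the function name, the uint8 input convention, and the output layout are my own assumptions, not the paper's):

```python
import numpy as np

def patchify_gato_style(image: np.ndarray, patch_size: int = 16) -> np.ndarray:
    """Split an HxWxC uint8 image into non-overlapping patches in raster
    order, normalize pixels to [-1, 1], and divide by sqrt(patch_size)."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    x = image.astype(np.float32) / 127.5 - 1.0   # uint8 [0, 255] -> [-1, 1]
    x /= np.sqrt(patch_size)                     # i.e. divide by 4 for 16x16 patches
    # Cut into a raster-order sequence of patches: (num_patches, p, p, c).
    x = x.reshape(h // patch_size, patch_size, w // patch_size, patch_size, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, patch_size, patch_size, c)

patches = patchify_gato_style(np.zeros((224, 224, 3), dtype=np.uint8))
print(patches.shape)  # (196, 16, 16, 3)
```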
There are many easy ways to incorporate vision. Vision+text models are a dime a dozen these days—as I said, this currently looks like ‘DALL-E 1 but bigger’ (VQVAE tokens → token sequence → autoregressive modeling of text/image tokens). What we have seen so far doesn’t look like 3 years of progress by the best DL researchers.
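To make the 'DALL-E 1 but bigger' recipe concrete: the caption is tokenized, the image is compressed into a grid of discrete VQVAE codes, and the two are concatenated into one sequence for a decoder-only transformer to model autoregressively. A toy sketch, with vocab and grid sizes as illustrative assumptions rather than DALL-E's exact numbers:

```python
import torch

# Illustrative vocab sizes: text tokens and discrete VQVAE image codes
# are merged into one flat vocabulary by offsetting the image codes.
TEXT_VOCAB, IMAGE_VOCAB = 50_000, 8_192

def build_joint_sequence(text_ids: torch.Tensor, image_codes: torch.Tensor) -> torch.Tensor:
    """Concatenate caption tokens and (offset) image codes into the single
    sequence a decoder-only transformer is trained on autoregressively."""
    return torch.cat([text_ids, image_codes + TEXT_VOCAB], dim=-1)

text_ids = torch.randint(0, TEXT_VOCAB, (1, 64))        # tokenized caption
image_codes = torch.randint(0, IMAGE_VOCAB, (1, 1024))  # flattened 32x32 VQVAE grid
seq = build_joint_sequence(text_ids, image_codes)
print(seq.shape)  # torch.Size([1, 1088]); next-token prediction over this sequence
```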
OpenAI has transitioned from being a purely research company to an engineering one. GPT-3 was still research, after all, and it was trained with a relatively small amount of compute. After that, they had to build infrastructure to serve the models via API, and new supercomputing infrastructure to efficiently train models with 100x the compute of GPT-3.
The fact that we are openly hearing rumours of GPT-5 being trained, and that nobody is denying them, suggests they will likely ship a new version every year or so from now on.
Earlier this month, PaLM-E gave a hint of one way to incorporate vision into LLMs (statement, paper), though obviously it's a different company, so GPT-4 might have taken a different approach. Choice quote from the paper:

Inputs such as images and state estimates are embedded into the same latent embedding as language tokens and processed by the self-attention layers of a Transformer-based LLM in the same way as text.
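Taken at face value, that is just a learned projection into the LLM's token-embedding space, after which image "tokens" are indistinguishable from text tokens to the self-attention layers. A minimal PyTorch sketch, with every name and dimension an illustrative assumption rather than anything from the paper:

```python
import torch
import torch.nn as nn

# All names and dimensions here are illustrative, not PaLM-E's.
D_MODEL, D_VISION, VOCAB = 4096, 1024, 32_000

token_embedding = nn.Embedding(VOCAB, D_MODEL)    # the LLM's usual token lookup
vision_projection = nn.Linear(D_VISION, D_MODEL)  # maps image features into that space

def embed_multimodal(text_ids: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
    text_emb = token_embedding(text_ids)         # (B, T_text, D_MODEL)
    image_emb = vision_projection(image_feats)   # (B, T_img, D_MODEL)
    # From here on, the self-attention layers see one sequence and treat
    # the projected image "tokens" exactly like text tokens.
    return torch.cat([image_emb, text_emb], dim=1)

seq = embed_multimodal(torch.randint(0, VOCAB, (1, 16)),   # text token ids
                       torch.randn(1, 256, D_VISION))      # e.g. ViT patch features
print(seq.shape)  # torch.Size([1, 272, 4096])
```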