This isn’t news; we’ve known that sequence predictors could model images for almost a decade now, and OpenAI did the same thing last year with less compute, but no one noticed.
Thanks for pointing this out—funnily enough, I actually read the OpenAI thing last year and thought it was cool, but then forgot about it by the time this came out! (The thing from a decade ago I hadn’t heard of)
I very definitely noticed Sparse Transformer, but what you’re missing is that while the Sparse Transformer work showed good compression performance, it was small-scale and primarily about describing the architecture and showing that it works; there is nothing in it about few-shot or transfer learning. There is no guarantee that a model is learning particularly useful representations just because it predicts pixel-by-pixel well: whatever it knows may be distributed throughout the GPT, somewhat like the problem of finding the equivalent of Gram matrices in text models (unlike semi-supervised CNNs, where by design you can expect the embedding or pre-embedding layer to distill all the knowledge into one place). And you can see in iGPT that getting the representation out is nontrivial: you can easily pick a bad layer to use as the embedding.
Personally, I felt that wasn’t really surprising either. Remember that this whole deep learning thing started with exactly what OpenAI just did: train a generative model of the data, and then fine-tune it for the relevant task.
However, I’ll admit that the fact that there’s an optimal layer to tap into, and that they showed this trick works specifically with Transformer autoregressive models, is novel to my knowledge.
Being able to accomplish something is important even if it was predicted to be possible. No one is surprised that generative models do embody a lot of useful knowledge (that’s much of the point), but it can be hard to tap into it.
The difference between GPT & iGPT for transfer learning is that GPT can be queried directly via its modality by putting in text: “Translate this into French”, “what genre of text is this?”, “tldr”, etc. On the other hand, if you queried iGPT by handing it half an image and expecting it to complete it in a useful way, there would be nothing surprising about that working; but I have a hard time thinking of how you could implement classification by image completion! You normally have to get the knowledge out a different way, through an embedding which can be fed into a linear classification layer; if you can’t do that, it’s unclear what exactly you do. It was unclear how you’d use Sparse Transformers, PixelRNN, GANs, etc. to do any of that. Now it’s clearer.
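To make the “embedding fed into a linear classification layer” route concrete, here is a minimal linear-probe sketch in plain numpy. The “features” are random stand-ins for pooled activations tapped from some chosen layer; the dimensions and training setup are illustrative assumptions, not iGPT’s actual configuration:

```python
import numpy as np

# Toy stand-in for "tap a layer, fit a linear probe": random features
# play the role of pooled activations from one chosen transformer layer.
rng = np.random.default_rng(0)
n, d = 512, 32
features = rng.normal(size=(n, d))              # pretend: layer activations
true_w = rng.normal(size=d)
labels = (features @ true_w > 0).astype(float)  # synthetic binary classes

# Logistic-regression probe trained by plain gradient descent.
w = np.zeros(d)
for _ in range(500):
    z = np.clip(features @ w, -30, 30)          # clip logits for stability
    p = 1.0 / (1.0 + np.exp(-z))
    w -= 0.1 * features.T @ (p - labels) / n

accuracy = ((features @ w > 0) == (labels == 1)).mean()
print(f"probe accuracy: {accuracy:.2f}")
```

The probe itself is trivial; the hard part iGPT demonstrates is picking *which* layer to pool from, since probe accuracy varies substantially across layers.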
As an analogous example, consider textual style transfer. You can’t do it (pre-GPT-3, anyway). Do char-RNNs and Transformers understand the difference between authors and styles and content? Are they capable of textual style transfer? I would be shocked if they weren’t. Probably, yes: after all, they can uncannily mimic authors and write plausibly about all sorts of content. But nevertheless, they lack the equivalent of the CNN Gram matrix that you can easily optimize to do style transfer with. So, no one can do it. Someone finally figuring out how to do it would be big news even if the end output is not surprising.
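For reference, the Gram matrix being alluded to is just the channel-by-channel inner product of a CNN layer’s feature maps, which captures texture/style correlations independently of spatial layout; computing it is a one-liner, which is exactly what makes CNN style transfer easy to optimize. The shapes below are arbitrary illustrative choices:

```python
import numpy as np

# Gram matrix of a CNN feature map: correlations between channels,
# used as the "style" target in neural style transfer.
rng = np.random.default_rng(0)
C, H, W = 8, 16, 16                 # arbitrary channels / height / width
feat = rng.normal(size=(C, H, W))   # pretend: one conv layer's activations

F = feat.reshape(C, H * W)          # flatten spatial dims
gram = F @ F.T / (H * W)            # (C, C), symmetric, layout-invariant

print(gram.shape)
```

Text models have no obvious analogue of this object, which is the point being made: the knowledge is probably in there, but there is no known handle to optimize against.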
How hard do you think it would be to do Image GPT but for video? That sounds like it could be pretty cool to see. Probably can be used to create some pretty trippy shit. Once it gets really good it could be used in robotics. Come to think of it, isn’t that sorta what self-driving cars need? Something that looks at a video of the various things happening around the car and predicts what’s going to happen next?
Video is just a very large image (n times bigger), so as a quick heuristic, you can say that whatever you can do with images, you can do with video, just n times more expensive… Since iGPT is already quite expensive, I don’t expect iGPT for video any more than I expect it for 512px images. With efficient attention mechanisms and hierarchy, it seems a lot more plausible; there are already RNNs that model 64px video out to 25 frames, for example. I’m not sure directly modeling video is all that useful for self-driving cars. Working at the pixel level is useful pretraining, but it’s not necessarily where you want to be for planning. (Would MuZero play Go better if we forced it to emit, from the latent space it uses for planning, a photorealistic 1024px RGB image of the Go board at every step of a rollout? Most attempts to do planning while forcing reconstruction of hypothetical states don’t show good results.)
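Incidentally, for a vanilla dense-attention Transformer the “n times more expensive” heuristic is actually optimistic, since attention cost grows quadratically with sequence length. A back-of-the-envelope sketch, assuming one token per pixel (iGPT uses a color-palette encoding that gives roughly that):

```python
# Back-of-the-envelope attention-cost arithmetic for image vs. video.
# Assumes dense O(L^2) self-attention and one token per pixel.
def tokens(px, frames=1):
    return px * px * frames

image_tokens = tokens(64)             # 4,096 tokens for one 64px frame
video_tokens = tokens(64, frames=25)  # 102,400 tokens for 25 frames

# The sequence is 25x longer, but dense attention cost scales with L^2:
ratio = (video_tokens / image_tokens) ** 2
print(f"attention cost ratio, video vs. image: {ratio:.0f}x")
```

So 25 frames costs on the order of 625x the attention compute of a single frame under dense attention, which is why efficient attention or hierarchy seems like a prerequisite.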
Right. The use case I had in mind for self-driving cars was the standard “You see someone walking by the edge of the street; are they going to step out into the street or not? It depends on, e.g., which way they are facing, whether they just dropped something into the street, etc.” That seems like something where pixel-based image prediction would be superior to, e.g., classifying the entity as a pedestrian and then adding a pedestrian token to your 3D model of your environment.