Veedrac comments on DALL-E by OpenAI

Veedrac 10 Jan 2021 3:57 UTC
4 points
I expect getting a dataset an order of magnitude larger than The Pile without significantly compromising on quality will be hard, but not impractical. Two orders of magnitude (~100 TB) would be extremely difficult, if feasible at all. But it’s not clear that this matters; per the Scaling Laws paper, dataset requirements grow more slowly than model size, and a 10 TB dataset would already be past the compute-data intersection point it describes.
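To make "grow more slowly than model size" concrete, here is a minimal sketch. It assumes the overfitting-avoidance relation from the Scaling Laws paper (Kaplan et al., 2020), where the data needed scales roughly as model size to the power 0.74. That exponent is an empirical fit quoted from the paper as I recall it, not something stated in this comment, and the function name is just illustrative.

```python
# Assumption: data needed to avoid overfitting scales as D ∝ N^0.74
# (the fit reported in Kaplan et al., 2020). Treat the exponent as approximate.
DATA_EXPONENT = 0.74


def data_growth_factor(model_growth_factor: float,
                       exponent: float = DATA_EXPONENT) -> float:
    """Return how much more data is needed when the model grows by the given factor."""
    return model_growth_factor ** exponent


if __name__ == "__main__":
    for factor in (10, 100, 1000):
        needed = data_growth_factor(factor)
        print(f"{factor}x parameters -> about {needed:.1f}x data")
```

Under that assumption the data multiplier is always smaller than the parameter multiplier, which is the sense in which a fixed, very large corpus can stay adequate while models keep growing.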
Note also that 10 TB of text is an exorbitant amount. Even if there were a model that would hit AGI with, say, a PB of text, but not with 10 TB of text, it would probably also hit AGI with 10 TB of text plus some fairly natural adjustments to its training regime to inhibit overfitting. I wouldn’t argue this all the way down to human levels of data, since the human brain has much more embedded structure than we assume for ANNs, but certainly huge models like GPT-3 start to learn new concepts in only a handful of updates, and I expect that trend of greater learning efficiency to continue.
I’m also skeptical that images, video, and such would substantially change the picture. Images are very information-sparse. Consider the amount you can learn from 1 MB of text, versus 1 MB of pixels.
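For a rough sense of scale, the sketch below works out what 1 MB buys in each medium. The constants are assumptions for illustration only: about 6 bytes per English word (five letters plus a space) and 3 bytes per uncompressed RGB pixel. It compares raw size, not how much can actually be learned from each.

```python
import math

MEGABYTE = 1_000_000   # bytes, decimal megabyte
BYTES_PER_WORD = 6     # assumption: ~5 letters plus a space
BYTES_PER_PIXEL = 3    # assumption: uncompressed 8-bit RGB


def words_in(n_bytes: int, bytes_per_word: int = BYTES_PER_WORD) -> int:
    """Approximate number of words that fit in n_bytes of plain text."""
    return n_bytes // bytes_per_word


def square_image_side(n_bytes: int, bytes_per_pixel: int = BYTES_PER_PIXEL) -> int:
    """Side length in pixels of the largest square raw image that fits in n_bytes."""
    return math.isqrt(n_bytes // bytes_per_pixel)


if __name__ == "__main__":
    side = square_image_side(MEGABYTE)
    print(f"1 MB of text:   ~{words_in(MEGABYTE):,} words")
    print(f"1 MB of pixels: one {side}x{side} raw RGB image")
```

Under these assumptions, 1 MB is on the order of a couple of novels' worth of words, or a single raw image smaller than a typical phone screenshot.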
> Correlations among these senses gives rise to understanding causality. Moreover, human brains might have evolved innate structures for things like causality, agency, objecthood, etc which don’t have to be learned.
Correlation is not causation ;). I think it’s plausible that agenthood would help progress towards some of those ideas, but that doesn’t much argue for multiple distinct senses. You can find mere correlations just fine with only one.
It’s true that even a deafblind person will have mental structures that evolved for sight and hearing, but that’s not much of an argument that those structures are needed for intelligence, and given the evidence (lack of mental impairment in deafblind people), a strong argument seems necessary.
For sure I’ll accept that you’ll want to train multimodal agents anyway, to round out their capabilities. A deafblind person might still be intellectually capable, but that doesn’t mean they can paint.