I think this is a very interesting discussion, and I enjoyed your exposition. However, the piece fails to engage with the technical details or existing literature, to its detriment.
Take your first example, “Tricking GPT-3”. GPT is not: hand someone a piece of paper and ask them to finish it. GPT is: you sit behind one-way glass watching a man at a typewriter. After every key he presses, you are given a chance to press a key on an identical typewriter of your own. If typewriter-man’s next press does not match your prediction, you get an electric shock. You predict every keystroke, even before he starts typing.
In this situation, would a human really do better? They might well begin a “proper continuation” after rule 3, only to receive a nasty shock when the typist continues “4. ”. Surely by rule 11, a rule 12 is one’s best guess? And recall that GPT in its auto-regressive generation mode experiences text in exactly the same way as when simply predicting; there is no difference in its operation, only in how we interpret that operation. So after 12 should come 13, 14… There are several other issues with the prompt, but this is the most egregious.
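To make that concrete, here’s a toy sketch of the setup. It uses a character-level bigram counter, nothing like a Transformer, and the tiny corpus is made up for illustration; the point is only that “watching the typist” (prediction) and “typing on your own” (generation) are literally the same operation:

```python
from collections import Counter, defaultdict

# Toy stand-in for the "typewriter predictor": a character-level bigram model.
# Illustrative only -- GPT is a Transformer over subword tokens, but the
# interface is identical: a distribution over the next symbol given context.
def train_bigram(corpus):
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts

def next_char_distribution(counts, prev):
    total = sum(counts[prev].values())
    return {c: n / total for c, n in counts[prev].items()} if total else {}

corpus = "1. wash\n2. rinse\n3. dry\n"   # placeholder training text
counts = train_bigram(corpus)

# Prediction mode: guess every keystroke the typist makes; a miss is a "shock".
typed = "1. wash\n2. r"
shocks = 0
for prev, actual in zip(typed, typed[1:]):
    dist = next_char_distribution(counts, prev)
    guess = max(dist, key=dist.get) if dist else "?"
    shocks += (guess != actual)

# Generation mode: the exact same predictive step, except the model's own
# guess becomes the next "keystroke". Nothing changes in the operation,
# only in how we use its output.
char, generated = "1", ["1"]
for _ in range(20):
    dist = next_char_distribution(counts, char)
    if not dist:
        break
    char = max(dist, key=dist.get)
    generated.append(char)

print(shocks, "".join(generated))
```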
As for Winograd, the problem of surface associations mimicking deeper understanding is well known. All testing today is done on WinoGrande, which is strongly debiased and even adversarially filtered (see in particular page 4, figure 1 of the WinoGrande paper). GPT-3 zero-shot scores 70%, well below the human level (94%) but also well above chance (50%). For comparison, BERT (340 million parameters) zero-shot scores 50.2%.
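For readers unfamiliar with how such scores are produced: the model never “answers a question”, it only assigns probabilities to text. Here is a rough sketch of the usual recipe, using GPT-2 from the Hugging Face transformers library as a small stand-in for GPT-3, and whole-sentence scoring rather than the partial-sentence scoring actually used in the GPT-3 paper; the example item is the classic trophy/suitcase schema:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def total_log_prob(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)   # loss = mean NLL over predicted tokens
    return -out.loss.item() * (ids.shape[1] - 1)

# Fill the pronoun slot with each candidate referent and keep whichever
# completed sentence the model assigns higher probability.
item = "The trophy doesn't fit in the suitcase because {} is too large."
candidates = ["the trophy", "the suitcase"]
scores = {c: total_log_prob(item.format(c)) for c in candidates}
print(max(scores, key=scores.get))
```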
There are also cases, like multiplication, where GPT-3 unequivocally extracts a deeper “world model”, demonstrating that extracting one is at least possible for a pure language model.
Of course, all of this is likely to be moot! Since GPT-3’s release, a primary focus of research has been multimodality, which provides just the sort of grounding you desire. It’s very difficult to argue that CLIP, for instance, doesn’t know what an avocado looks like, or that these multimodal agents from DeepMind aren’t grounded as they follow natural language instructions (video; the text at the top of the frame is the instruction the agent received).
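To make the CLIP point concrete, here’s a minimal zero-shot sketch following the usage pattern of the openai/CLIP repository; the image path and candidate captions are placeholders of my own:

```python
# pip install git+https://github.com/openai/CLIP.git
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Placeholder image path -- substitute any photo you like.
image = preprocess(Image.open("avocado.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(["a photo of an avocado",
                      "a photo of a dog",
                      "a photo of a violin"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)

# If the model "knows what an avocado looks like", the first caption should
# receive most of the probability mass for an avocado photo.
print(probs)
```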
In all, I find the grounding literature interesting but I remain unconvinced it puts any limits on the capabilities even of the simplest unimodal, unagentic models (unlike, say, the causality literature).
This does a great job of importing and translating a set of intuitions from a much more established and rigorous field. However, as with all works framing deep learning as a particular instance of some well-studied problem, it’s vital to keep the context in mind:
Despite literally thousands of papers claiming to “understand deep learning” from experts in fields as various as computational complexity, compressed sensing, causal inference, and—yes—statistical learning, NO rigorous, first-principles analysis has ever computed any aspect of any deep learning model beyond toy settings. ALL published bounds are vacuous in practice.
It’s worth exploring why, despite strong results in their own settings, and despite strong “intuitive” parallels to deep learning, this remains true. The issue is that all these intuitive arguments have holes big enough to accommodate, well, the end of the world. There are several such challenges in establishing a tight correspondence between “classical machine learning” and deep learning, but I’ll concentrate on one that has absorbed considerable effort: defining simplicity.
This notion is essential. If we consider a truly arbitrary function, there is no need for a relationship between the behavior on one input and the behavior on another; this is the No Free Lunch Theorem. If we want our theory to have content (that is, to constrain the behavior of a Deep Learning system at all), we’ll need to narrow the range of possibilities. Tools from statistical learning like the VC dimension are useless as-is due to overparameterization, as you mention. We’ll need a notion of simplicity that captures what sorts of computational structures SGD finds in practice. Maybe circuit size, or minima sharpness, or noise sensitivity… how hard could it be?
Well, no one’s managed it. To help understand why, here are two (of many) barrier cases:
Sparse parity with noise. For an input bitstring x, y is defined as the xor of the bits at a fixed, small subset of indices. E.g., if the indices are 1, 3, 9 and x is 101000001, then y is 1 xor 1 xor 1 = 1. Some small (tending to zero) measurement error is assumed. Though this problem is approximated almost perfectly by extremely small and simple boolean circuits (a log-depth tree of xor gates with inputs on the chosen subset), it is believed to require an exponential amount of computation to predict even marginally better than random! Neural networks require exponential size to learn it in practice.
Deep Learning Fails
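To be precise about the task, here’s a minimal sketch of the data-generating process; the input length, index set, and noise rate are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_bits, n_samples = 50, 10_000
secret_indices = [0, 2, 8]    # the fixed, small subset (0-indexed here)
noise_rate = 0.05             # small measurement error on the labels

X = rng.integers(0, 2, size=(n_samples, n_bits))
y = X[:, secret_indices].sum(axis=1) % 2          # xor of the chosen bits
flip = rng.random(n_samples) < noise_rate
y = np.where(flip, 1 - y, y)                      # noisy observations

# The target is computed by a tiny xor circuit over three bits, yet
# recovering it from (X, y) -- e.g. by training a neural network -- is
# believed to require resources exponential in the size of the index set.
```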
The protein folding problem. Predict the shape of a protein from its amino acid sequence. Hundreds of scientists have spent decades scouring for regularities, and failed. Generations of supercomputers have been built to attempt to simulate the subtle evolution of molecular structure, and failed. Billions of pharmaceutical dollars were invested—hundreds of billions were on the table for success. The data is noisy and multi-modal. Protein language models learn it all anyway.
Deep Learning Succeeds!
For what notion is the first problem complicated, and the second simple?
Again, without such a notion, statistical learning theory makes no prediction whatsoever about the behavior of DL systems on new examples. If a model someday outputted a sequence of actions which caused the extinction of the human race, we couldn’t object in principle, only say “so power-seeking was simpler after all”. And even with such a notion, we’d still have to prove that Gradient Descent tends to find it in practice, and face a dozen other difficulties...
Without a precise mathematical framework to which we can defer, we’re left with Empirics to help us choose between a bunch of sloppy, spineless sets of intuitions. Much less pleasant. Still, here are a few observations which push me towards Deep Learning as a “computationally general, pattern-finding process” rather than function approximation:
Neural networks optimized only for performance show surprising alignment with representations in the human brain, even exhibiting one-to-one matches between particular neurons in ANNs and neurons in living humans. This is an absolutely unprecedented level of predictivity, despite the models not being designed for such a match and taking no brain data as input.
LLMs have been found to contain rich internal structure, such as grammatical parse trees, inference-time linear models, and world models (see the probe sketch after this list). This sort of mechanistic picture is missing from any theory that considers only input/output behavior.
Small changes in loss (i.e. function-approximation accuracy) have been associated with large qualitative changes in ability and behavior, such as learning to control robotic manipulators using code, productively recursing to subagents, using tools, or solving theory-of-mind tasks.
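On the internal-structure point above, the standard tool is a linear probe: freeze the network, read out a hidden layer, and check whether a simple linear model can decode a property the network was never trained to report. A minimal sketch using GPT-2 hidden states and a toy hand-made dataset; the layer index and the eight sentences are my own placeholder choices, far too small to count as evidence of anything:

```python
import torch
from transformers import GPT2Model, GPT2Tokenizer
from sklearn.linear_model import LogisticRegression

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

# Tiny toy dataset: does the sentence's subject refer to one thing or many?
sentences = [
    ("The dog sleeps by the door", 0), ("The dogs sleep by the door", 1),
    ("A child draws on the wall", 0),  ("The children draw on the wall", 1),
    ("The car turns left", 0),         ("The cars turn left", 1),
    ("The bird sings at dawn", 0),     ("The birds sing at dawn", 1),
]

def hidden_state(text, layer=6):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1].numpy()   # last token, chosen layer

X = [hidden_state(s) for s, _ in sentences]
y = [label for _, label in sentences]

# If a linear probe separates the classes, the representation linearly encodes
# grammatical number -- structure invisible to any purely input/output theory.
probe = LogisticRegression(max_iter=1000).fit(X, y)
print(probe.score(X, y))
```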
I know I’ve written a lot, so I appreciate your reading it. To sum up:
Despite intuitive links, efforts to apply statistical learning theory to deep learning have failed, and seem to face substantial difficulties.
So we have to resort to experiment, where I feel this intuitive story doesn’t fit the data, and I’ve provided some challenge cases.