I think the appeal of symbolic and hybrid approaches is clear, and progress in this direction would absolutely transform ML capabilities. However, I believe the approach remains immature in a way that the phrase “Human-Level Reinforcement Learning” doesn’t communicate.
The paper uses classical symbolic methods and so faces that classic enemy of GOFAI: super-exponential asymptotics. In order to make the compute more manageable, the following are hard-coded into EMPA:
- Direct access to game state (unlike the neural networks, which learned from pixels)
- The existence of walls, and which objects are walls
- The 14 possible object interactions (that some objects are dangerous, that some can be pushed, that some are walls, etc.)
- Which object is the player, what type of player it is (Shooter or MovingAvatar), and which objects are the player's bullets
- The form of the objective (always some object count == 0)
- That object interactions are deterministic
- That picking up resources is good
- The physics of projectile firing: reward was directly transported from a simulation of what a fired projectile hit, obviating the need to plan over that long time horizon
- etc, etc, etc
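The scale of the problem these shortcuts sidestep is easy to illustrate. A minimal sketch, with hypothetical numbers not taken from the paper: a forward planner over raw actions must consider on the order of b^d action sequences (branching factor b, horizon d), so hard-coding a long-horizon outcome like the projectile shortcut collapses the search space by many orders of magnitude.

```python
# Illustrative only: b and d are hypothetical, not values from EMPA.
# A forward planner over raw actions faces b^d candidate action
# sequences at branching factor b and planning horizon d.

def num_sequences(b: int, d: int) -> int:
    """Count action sequences of length d with b actions per step."""
    return b ** d

b = 5  # e.g. four movement actions plus fire

# Planning all the way until a fired bullet lands:
print(num_sequences(b, 20))  # 5**20 = 95367431640625

# With the projectile's outcome hard-coded, the horizon shrinks:
print(num_sequences(b, 5))   # 5**5 = 3125
```

The exponent, not the base, is what kills you, which is why so many of the shortcuts above are about shortening horizons rather than pruning actions.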
Additionally, the entire algorithm is tuned to their own custom dataset. None of this would be feasible for Atari games, or indeed the GVGAI competition, whose video game description language they use to write their own environments. There’s a reason they don’t evaluate on any of the many existing benchmarks.
I come across a paper like this every once in a while: “The Revenge of GOFAI”. Dileep George et al’s Recursive Cortical Networks. DeepMind’s Apperception Engine. Tenenbaum’s own Omniglot solver. They have splashy titles and exciting abstracts, but look into the methods section and you’ll find a thousand bespoke and clever shortcuts: feature engineering for the modern age. It’s another form of overfitting; it doesn’t generalize. The super-exponential wall remains as sheer as ever, and these approaches simply cannot scale.
I’ll reiterate that any progress in these areas would mean substantially more powerful, more explainable models. I applaud these researchers for their work on a hard and important problem. However, I can’t consider these papers to represent progress. Instead, I find them aspirational, like the human mind itself: that our methods might someday truly be this capable, without the tricks. I’m left hoping and waiting for insight of a qualitatively different sort.
I think this is a very interesting discussion, and I enjoyed your exposition. However, the piece fails to engage with the technical details or existing literature, to its detriment.
Take your first example, “Tricking GPT-3”. GPT is not: give someone a piece of paper and ask them to finish it. GPT is: you sit behind one-way glass, watching a man at a typewriter. After every key he presses, you are given a chance to press a key on an identical typewriter of your own. If typewriter-man’s next press does not match your prediction, you get an electric shock. You always predict every keystroke, even before he starts typing.
In this situation, would a human really do better? They might well begin a “proper continuation” after rule 3, only to receive a nasty shock when the typist continues “4. ”. Surely by rule 11, a rule 12 is one’s best guess? And recall that GPT in its auto-regressive generation mode experiences text in exactly the same way as when simply predicting; there is no difference in its operation, only in how we interpret that operation. So after 12 should come 13, 14… There are several other issues with the prompt, but this is the most egregious.
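The identity between predicting and generating is worth making concrete. A toy sketch, using a hypothetical rule-based predictor as a stand-in for the model (nothing here is GPT itself): the same `predict` function serves both modes; in one we compare its output against someone else's keystrokes, in the other we feed its output back in as the next keystroke.

```python
# Toy stand-in for a language model: one function predicts the next
# character from the context. "Prediction" and "generation" invoke
# the exact same function; only what we do with its output differs.

def predict(context: str) -> str:
    # Hypothetical rule: after a digit, predict the following digit.
    last = context[-1]
    if last.isdigit():
        return str(int(last) + 1)
    return "."

# Prediction mode: score someone else's keystroke stream.
stream = "123"
guesses = [predict(stream[: i + 1]) for i in range(len(stream))]
print(guesses)  # ['2', '3', '4']

# Generation mode: feed each guess back in as the next "keystroke".
text = "1"
for _ in range(3):
    text += predict(text)
print(text)  # '1234'
```

Since the model cannot tell which mode it is in, a continuation it would predict is exactly the continuation it will generate.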
As for Winograd, the problem of surface associations mimicking deeper understanding is well known. All testing today is done on WinoGrande, which is strongly debiased and even adversarially mined (see in particular page 4, figure 1). GPT-3’s 0-shot score (70%) is well below human level (94%) but also well above chance (50%). For comparison, BERT (340 million parameters) scores 50.2% 0-shot.
There are also cases, like multiplication, where GPT-3 unequivocally extracts a deeper “world model”, demonstrating that it is at least possible to do so as a language model.
Of course, all of this is likely to be moot! Since GPT-3’s release, a primary focus of research has been multimodality, which provides just the sort of grounding you desire. It’s very difficult to argue that CLIP, for instance, doesn’t know what an avocado looks like, or that these multimodal agents from DeepMind aren’t grounded as they follow natural language instructions (video; the top text is the received instruction).
In all, I find the grounding literature interesting but I remain unconvinced it puts any limits on the capabilities even of the simplest unimodal, unagentic models (unlike, say, the causality literature).