Edited for clarity and to correct misinterpretations of central arguments.
This response considers (contra your arguments) the ways in which the transformer might be fundamentally different from the model of a neural network you may have in mind: a series of matrix multiplications by "fixed" weight matrices. That is the assumption I will first try to undermine. In doing so, I hope to lay some groundwork for an explanatory framework for neural networks with self-attention layers (for much later), or (better) to inspire transparency efforts by others, since I'm mainly writing this to provoke further thought.
That said, I do hope to make a justifiable case below that transformers can scale in the limit to an AGI-like model (to which you gave an emphatic "no"), because they do seem to exhibit the kind of behavior (few-shot learning, out-of-distribution generalization) that we would expect to scale to AGI, were improvements in these respects to continue.
I see that you are already familiar with transformers, and I will reference this description of their architecture throughout.
Epistemic Status: What follows are currently incomplete, likely fatally flawed arguments that I may correct down the line.
Caveats: It’s reasonable to dismiss transformers/GPT-N as falling into the same general class as fully connected architectures, in the sense that:
They’re data hungry, like most DNNs, at least during the pre-training phase.
They’re not explicitly replicating neocortical algorithms (such as bottom-up feedback on model predictions) that we know are important for systematic generalization.
They have extraneous inductive biases besides those in the neocortex, which hinder efficiency.
Some closer approximation of the neocortex, such as hierarchical temporal memory, is necessary for efficiently scaling to AGI.
Looking closer: How Transformers Depart “Functionally” from Most DNNs
Across two weight matrices of a fully-connected DNN, we see something like:
σ(Ax)^T B^T, for some input vector x and hidden-layer weight matrices {A, B}, where σ is an element-wise activation function; this yields just another vector of activations.
These activations are “dynamic,” but I think you would be right to say that they do not in any sense modify the weights applied to activations downstream; this is the behavior you implied is missing in the transformer but present in the neocortex.
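To make the contrast concrete, here is a minimal NumPy sketch of the fully connected case (the shapes and values are arbitrary): the weight matrices A and B are frozen after training, and nothing about any particular input x modifies the weights applied downstream.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed weight matrices: learned during training, frozen at inference.
A = rng.normal(size=(8, 4))   # first hidden layer
B = rng.normal(size=(3, 8))   # second hidden layer
A0, B0 = A.copy(), B.copy()   # snapshots, to verify nothing changes

def relu(z):
    return np.maximum(z, 0.0)

def mlp(x):
    # sigma(Ax), then B: just another vector of activations.
    return B @ relu(A @ x)

x1 = rng.normal(size=4)
x2 = rng.normal(size=4)
y1, y2 = mlp(x1), mlp(x2)

# The weights applied downstream are identical for every input.
assert np.array_equal(A, A0) and np.array_equal(B, B0)
```

The point of the snapshots is that across arbitrarily many forward passes, the transformation applied at each layer is the same matrix; only the activations vary.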
In a Transformer self-attention matrix (A=QK^T), though, we see a dynamic weight matrix:
(Skippable) As such, the entries (inner products) of this matrix are contrastive similarity scores between each query–key vector pair (q_i ∈ Q, k_j ∈ K)
(Skippable) Furthermore, A consists of n×n inner products: each row A_i of A is the ordered set S_i = {⟨q_i, k_1⟩, …, ⟨q_i, k_n⟩}, for i ≤ n
Crucially, after the row-wise softmax, each row of softmax(A) gives the coefficients of a convex combination of the rows of the value matrix V when computing softmax(A)V. This produces the output (matrix) of a self-attention layer, and it is also a different kind of matrix multiplication from the one used to compute the similarity matrix A.
Note: softmax(A) here normalizes the values of matrix A row-wise, not column-wise or over the whole matrix.
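The dynamic-weight computation above can be sketched in NumPy. This is a minimal single-head version that omits the usual 1/√d scaling and the learned projections (Q, K, V stand in for XW_Q, XW_K, XW_V): A = QKᵀ depends on the input, and each row of softmax(A) sums to 1, so each output row is a convex combination of the rows of V.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 16  # sequence length, head dimension

# Q, K, V would normally be learned projections of the input X;
# here they are random stand-ins for X W_Q, X W_K, X W_V.
Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d))

A = Q @ K.T  # n x n similarity matrix: A[i, j] = <q_i, k_j>

def softmax_rows(M):
    # Row-wise softmax, stabilized by subtracting each row's max.
    e = np.exp(M - M.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

W = softmax_rows(A)  # the "dynamic" weight matrix: input-dependent
out = W @ V          # each output row is a convex combination of rows of V

# Row-wise normalization: every row of W is a set of convex coefficients.
assert np.allclose(W.sum(axis=1), 1.0)
assert np.all(W >= 0)
```

Unlike A and B in the fully connected case, W here is recomputed from scratch for every input: change Q or K and you get a different weight matrix applied to V.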
3. Onto the counterargument:
Given what was described in (2), my main argument is that softmax(A=QK^T)V is a very different computation from the kind fully connected neural networks perform, and from what you may have envisioned.
We specifically see that:
Ordinary, fully connected (as well as convolutional, most recurrent) neural nets don’t generate “dynamic” weights that are then applied to any activations.
It is possible that the transformer explicitly conditions the self-attention matrix A_l at layer l so that the downstream layers l+1..L (given L self-attention layers) are more likely to produce the correct embedding tokens, since the transformer has only L layers in which to compute the final result.
Regardless of whether the above is happening, the transformer is being “guided” to do implicit meta-learning as part of its pre-training, because:
(a) It’s conditioning its weight matrices (A_l) on the given context (X_l) to maximize the probability of the correct autoregressive output in X_L, in a different manner from learning an ordinary, hierarchical representation upstream.
(b) As it improves the conditioning described in (a) during pre-training, it gets closer to optimal performance on some downstream, unseen task (via zero-shot learning). This is assumed on an evidential basis.
(c) I argue that such zero-shot learning on an unseen task T requires online learning on that task, where the task is described in the given context.
(d) Further, I argue that this online learning improves sample efficiency when doing gradient updates on an unseen task T, by approximately recognizing a similar task T′ from the information in the context (X_1). Sample efficiency improves because the training loss on T tracks its few-shot performance (accuracy) on T, and because the number of training steps to convergence is directly related to the training loss.
So, when actually performing such updates, a better few-shot learner will take fewer training steps. Crucially, it improves the sample efficiency of its future training in not just a “prosaic” manner of having improved its held-out test accuracy, but through (a-c) where it “learns to adapt” to an unseen task (somehow).
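As a toy illustration of the claim in (d) – not of transformers themselves – here is a sketch under the loose assumption that "recognizing a similar task T′" amounts to starting gradient descent from a better initialization: on a simple least-squares task, an initialization near the optimum reaches a fixed loss threshold in fewer steps than a distant one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

def steps_to_threshold(w0, lr=0.01, threshold=1e-3, max_steps=10_000):
    """Run gradient descent on mean squared error; count steps until loss < threshold."""
    w = w0.copy()
    for step in range(max_steps):
        residual = X @ w - y
        if np.mean(residual ** 2) < threshold:
            return step
        w -= lr * (2 / len(y)) * X.T @ residual
    return max_steps

# "Meta-learned" init: already near the true task. "Prosaic" init: far away.
near = steps_to_threshold(w_true + 0.1 * rng.normal(size=3))
far = steps_to_threshold(5.0 * rng.normal(size=3))

assert near < far  # the better-initialized learner needs fewer updates
```

This obviously proves nothing about how a transformer adapts; it only makes explicit the sense in which "fewer steps to convergence" follows from lower starting loss in (d).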
Unfortunately, I don’t know precisely what is happening in (a-d) that allows systematic meta-learning to occur, which I would need in order for the key proposition:
First, for the reason mentioned above, I think the sample efficiency is bound to be dramatically worse for training a Transformer versus training a real generative-model-centric system. And this [sample inefficiency] makes it difficult or impossible for it to learn or create concepts that humans are not already using.
to be weakened substantially. I simply think that meta-learning is indeed happening, given the demonstrated few-shot generalization to unseen tasks, and that it looks like it has something to do with the dynamic-weight behavior suggested in (a-d). However, I do not think it is enough to show that the dynamic-weights mechanism described initially is doing such-and-such contrastive learning, or that it is an overhaul of ordinary DNNs and therefore robustly solves the generative objective (even if that were the case). Someone would instead have to demonstrate that transformers are systematically performing meta-learning (hence the out-of-distribution and few-shot generalization) on task T, which I think is worthwhile to investigate given what they have accomplished experimentally.
Granted, I do believe that more closely replicating cortical algorithms is important for efficiently scaling to AGI and for explainability (I’ve read On Intelligence, Surfing Uncertainty, and several of your articles). The question, then, is whether there are multiple viable paths to efficiently-scaled, safe AGI in the sense that we can functionally (though not necessarily explicitly) replicate those algorithms.
Going by GPT-2′s BPEs [1], and based on the encoder downloaded via OpenAI’s script, there are 819 (single) tokens/embeddings that uniquely map to the numbers from 0-1000, 907 when going up to 10,000, and 912 up to 200,000 [2]. These embeddings are of course preferentially fed into the model to maximize the number of characters in the context window, thereby leveraging the statistical benefit of BPEs for language modeling. Bear in mind that the above counts exclude numeric tokens that begin with a space [3].
My point here being that, IIUC, for the language model to actually be able to manipulate individual digits, and to pick up on the elementary operations of arithmetic (e.g. carry, shift, etc.), the expected number of unique numeric tokens/embeddings might have to be limited to 10 – the base of the number system – when counting from 0 to the largest representable number [2].
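To illustrate the tokenization issue, here is a toy greedy longest-match tokenizer over a made-up vocabulary (not the actual GPT-2 merges, which are learned from byte-pair statistics, though the real encoder exhibits the same behavior): a number that happens to exist as a single token is opaque at the digit level, and adjacent numbers split in ways that do not align with place value; only a digits-only vocabulary guarantees digit-level access.

```python
# Toy illustration only: a hypothetical vocabulary with a few multi-digit
# tokens, applied by greedy longest match from left to right.
VOCAB = {"100", "10", "25", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9"}

def tokenize(s, vocab=VOCAB):
    tokens = []
    i = 0
    while i < len(s):
        # Greedy longest match against the vocabulary.
        for j in range(len(s), i, -1):
            if s[i:j] in vocab:
                tokens.append(s[i:j])
                i = j
                break
        else:
            raise ValueError(f"untokenizable character: {s[i]!r}")
    return tokens

print(tokenize("100"))   # ['100']       -> one opaque token, no digit access
print(tokenize("101"))   # ['10', '1']   -> split misaligned with place value
print(tokenize("1025"))  # ['10', '25']  -> single digits never surface

# With a digits-only vocabulary (the "limit to 10 tokens" case),
# every number decomposes into its base-10 digits.
print(tokenize("1025", set("0123456789")))  # ['1', '0', '2', '5']
```

The model never sees the digit structure inside a multi-digit token like "100", which is one plausible reason carrying and shifting are hard to learn from such a vocabulary.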
[1] From the GPT-3 paper, it was noted:
[2] More speculatively, I think this limitation makes extrapolation of certain abilities (arithmetic, algebra, coding) quite difficult without knowing whether the BPE vocabulary will be optimized for the manipulation of individual digits/characters if need be, and that it limits the generalizability of findings such as GPT-3 being unable to do math.
[3] For such tokens, there are a total 505 up to 1000. Like the other byte pairs, these may have been automatically mapped based on the distribution of n-grams in some statistical sample (and so easily overlooked).