Although I don't agree with everything on this site, I found this cluster of knowledge-related advice (learning abstractions), and the rest of the site (made by a LWer, IIRC), very interesting if not outright helpful so far; it seems to advocate that:
Forced learning/overly fast pacing (cramming) can be counterproductive, since you're no longer learning for the sake of learning (mostly true in my experience).
Abstract knowledge (math) tends to be the most useful, since it can be applied fruitfully. You can readily use those abstractions for practical things by honing intuitions about how to approach technical problems, mainly by mapping subproblems onto mathematical abstractions; once you do, those problems (coding, calculation) become harder to forget how to solve.
Being curiosity-driven is instrumentally useful (since it helps with future learning, delaying aging, etc.), and is of course rational.
Spaced repetition seems to work well for math and algorithms and is self-reinforcing if done with a curiosity-driven approach. However, instead of using specific software to "gamify" this, I personally just recall certain key principles in my head, ask myself the motivations behind certain concepts, and keep a list of summarized points/derivations/copied diagrams in a simple Notes document to review things "offline". (But I'll need to check out Anki sometime.)
That’s most of what I took away from the resources that the site offered.
Some disclaimers/reservations (strictly opinions) based on personal experiences, followed by some open questions:
I don't think the "forgetting curve" is as important as the site makes it sound, particularly when it comes to abstractions, though that curve might be about "general" knowledge, i.e. learning facts in general. The situation with abstract knowledge seems to be the opposite.
Hence, forgetting might not be as "precious" with abstractions, and might in fact impair one's ability to learn in the future. Abstractions, including lessons in rationality, are (IMO) meant to help with learning, not only with communicating/framing concepts.
It might require a fair number of object-level experiences (recallable from long-term memory) to integrate abstract knowledge meaningfully and efficiently. Otherwise that knowledge isn't grounded in experience, and we know that that's just as disadvantageous for humans as it is for AI.
Q1: It remains unclear whether there is a broader applicable scope here (in terms of other ways that knowledge itself can be used to build competence) beyond honing rationality, Bayesianism, and general mathematical knowledge. Would it make sense if there were, or weren't?
Q2: It seems important to be able to figure out (on a self-supervised, intuitive level) when a learned abstraction is interfering with learning something new or with being competent, in the sense that one has to detect whether it is being misapplied or is complicating the representation of knowledge more than it simplifies it. Appropriate and deep knowledge of the motivations behind abstractions, the situations they apply to, and their invariances would seem to help at first glance, in addition to prioritizing first-principles priors over rigid assumptions when approaching a problem.
Q3: Doing this may not suit those who aren't students or full-time autodidacts (the sort who read textbooks for fun and have a technical background). Also, I haven't come across an example of someone who prolonged their useful career, earned millions of dollars, etc., as a provable result of learning abstractions. Conversely, practitioners develop a lot of skills that directly help within a specialized economy. There still remain very obvious reasons to condense a whole bunch of mathy (and some computer-sciency) abstractions into flashcards and the like to save time.
Edited for clarity and to correct misinterpretations of central arguments.
This response considers (contra your arguments) the ways in which the transformer might be fundamentally different from the model of a NN you may be thinking of, namely a series of matrix multiplications by "fixed" weight matrices. That is the assumption I will first try to undermine. In doing so, I hope to lay some groundwork for an explanatory framework for neural networks that have self-attention layers (for much later), or (better) to inspire transparency efforts by others, since I'm mainly writing this to provoke further thought.
However, I do hope to make a justifiable case below that transformers can scale in the limit to an AGI-like model (a claim that drew an emphatic "no" from you), because they do seem to exhibit the type of behavior (i.e. few-shot learning, out-of-distribution generalization) that we'd expect to scale sufficiently toward AGI, if improvements in these respects continue.
I see that you are already familiar with transformers, and I will reference this description of their architecture throughout.
Epistemic Status: What follows are currently incomplete, likely fatally flawed arguments that I may correct down the line.
Caveats: It's reasonable to dismiss transformers/GPT-N as falling into the same general class as fully connected architectures, in the sense that:
They’re data hungry, like most DNNs, at least during the pre-training phase.
They’re not explicitly replicating neocortical algorithms (such as bottom-up feedback on model predictions) that we know are important for systematic generalization.
They have extraneous inductive biases besides those in the neocortex, which hinder efficiency.
Some closer approximation of the neocortex, such as hierarchical temporal memory, is necessary for efficiently scaling to AGI.
Looking closer: How Transformers Depart “Functionally” from Most DNNs
1. Across two weight matrices of a fully-connected DNN, we see something like:
σ(Ax)^T B^T, for some input vector x and weight matrices {A, B}, which gives just another vector of activations, where σ is an element-wise activation function.
These activations are "dynamic", but I think you would be right to say that they do not in any sense modify the weights applied to activations downstream; this is the behavior, found in the neocortex, that you implied was missing from the transformer.
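To make the contrast concrete, here is a minimal numpy sketch (my own illustration, with made-up shapes) of the fixed-weight computation above: A and B are constants of the forward pass, and nothing about the input can change which weights get applied downstream.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)           # input vector
A = rng.normal(size=(8, 4))      # first weight matrix (fixed once trained)
B = rng.normal(size=(3, 8))      # second weight matrix (fixed once trained)

h = np.maximum(A @ x, 0.0)       # sigma(Ax), with sigma = ReLU applied element-wise
y = B @ h                        # equivalent to (sigma(Ax)^T B^T)^T
print(y.shape)                   # (3,) -- just another vector of activations
```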
2. In a transformer's self-attention matrix (A = QK^T), though, we see a dynamic weight matrix:
(Skippable) As such, the values (inner products) of this matrix are contrastive similarity scores between each (q ∈ Q, k ∈ K) vector pair.
(Skippable) Furthermore, matrix A consists of n*n inner products, where each row A_i of A is the ordered tuple S_i = (<q_i, k_1>, ..., <q_i, k_n>), for q_i ∈ Q, k_j ∈ K and i, j <= n.
Crucially, after the row-wise softmax, the rows of softmax(A) (the softmaxed S_i) are the coefficients of convex combinations of the rows of the value matrix (V) when taking softmax(A)V, which computes the output (matrix) of a self-attention layer. This is also a different use of matrix multiplication from the one performed when computing the similarity matrix A.
Note: The softmax(A) function in this case normalizes the values of matrix A row-wise, not column-wise or over the whole matrix, so each row of softmax(A) is non-negative and sums to 1.
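Here is a corresponding minimal numpy sketch (again my own illustration, with made-up shapes; I also include the usual 1/sqrt(d) scaling, which the formula above omits) showing how softmax(A = QK^T)V produces an input-dependent weight matrix whose rows are convex-combination coefficients over the rows of V:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n, d = 5, 8                       # sequence length, embedding dimension
X = rng.normal(size=(n, d))       # input context (token embeddings)
W_q = rng.normal(size=(d, d))     # learned, fixed projection matrices
W_k = rng.normal(size=(d, d))
W_v = rng.normal(size=(d, d))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
A = Q @ K.T / np.sqrt(d)          # A_ij = <q_i, k_j> / sqrt(d): the dynamic weight matrix
W_dyn = softmax(A, axis=-1)       # row-wise normalization
out = W_dyn @ V                   # each output row is a convex combination of rows of V

print(np.allclose(W_dyn.sum(axis=-1), 1.0))  # True: rows sum to 1 (convex weights)
```

Unlike A and B in the earlier sketch, W_dyn is recomputed from the context X on every forward pass, which is the sense in which these weights are "dynamic".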
3. Onto the counterargument:
Given what was described in (2), I mainly argue that softmax(A = QK^T)V is a very different computation from the kind that fully connected neural networks perform, and from what you may have envisioned.
We specifically see that:
Ordinary, fully connected (as well as convolutional, most recurrent) neural nets don’t generate “dynamic” weights that are then applied to any activations.
It is possible that the transformer explicitly conditions a self-attention matrix A_l at layer l such that the downstream layers l+1, ..., L (given L self-attention layers) are more likely to produce the correct output token embeddings. This is because we're only giving the transformer L layers to compute the final result.
Regardless of whether the above is happening, the transformer is being "guided" to do implicit meta-learning as part of its pre-training, because:
(a) It's conditioning its weight matrices (A_l) on the given context (X_l) to maximize the probability of the correct autoregressive output in X_L, which differs from learning an ordinary, hierarchical representation upstream.
(b) As it improves the conditioning described in (a) during pre-training, it gets closer to optimal performance on some downstream, unseen task (via zero-shot learning). This is assumed on an evidential basis.
(c) I argue that such zero-shot learning on an unseen task T requires online learning on that task, as it is described in the given context.
(d) Further, I argue that this online learning improves sample efficiency when doing gradient updates on an unseen task T, by approximately recognizing a similar task T' from the information in the context (X_1). Sample efficiency is improved because the initial training loss on T is determined by the model's few-shot performance on T (which tracks few-shot accuracy), and because the number of training steps to convergence is directly related to that training loss.
So, when actually performing such updates, a better few-shot learner will take fewer training steps. Crucially, it improves the sample efficiency of its future training not just in the "prosaic" sense of having improved its held-out test accuracy, but through (a-c), where it "learns to adapt" to an unseen task (somehow).
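To make (d) checkable in principle, here is a rough sketch (my own operationalization, not something from the argument I'm responding to) of "sample efficiency" as gradient-steps-to-threshold; the model, loss function, and batches are placeholders.

```python
import torch

def steps_to_threshold(model, loss_fn, batches, threshold, lr=1e-4, max_steps=1000):
    """Count gradient updates needed before the training loss falls below `threshold`."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for step, (x, y) in enumerate(batches):
        if step >= max_steps:
            break
        loss = loss_fn(model(x), y)
        if loss.item() < threshold:
            return step          # a better few-shot learner should start closer to the threshold
        opt.zero_grad()
        loss.backward()
        opt.step()
    return max_steps             # did not converge within the budget
```

On this operationalization, the claim in (d) is that a model with better few-shot performance on T starts at a lower loss and therefore returns a smaller step count when fine-tuned on T.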
Unfortunately, I don't know precisely what is happening in (a-d) that allows systematic meta-learning to occur, which would be needed in order for the key proposition:
to be weakened substantially. I just think that meta-learning is indeed happening, given the demonstrated few-shot generalization to unseen tasks, and it only looks like it has something to do with the dynamic-weight-matrix behavior suggested by (a-d). However, I do not think it's enough to show that the dynamic-weights mechanism described initially is doing such-and-such contrastive learning, or to show that it's an overhaul of ordinary DNNs and therefore robustly solves the generative objective (even if that were the case). Someone would instead have to demonstrate that transformers are systematically performing meta-learning (hence out-of-distribution and few-shot generalization) on task T, which I think is worthwhile to investigate given what they have accomplished experimentally.
Granted, I do believe that more closely replicating cortical algorithms is important for efficiently scaling to AGI and for explainability (I’ve read On Intelligence, Surfing Uncertainty, and several of your articles). The question, then, is whether there are multiple viable paths to efficiently-scaled, safe AGI in the sense that we can functionally (though not necessarily explicitly) replicate those algorithms.