Your definition of general intelligence would include SGD on large neural networks. It can generalize from very few examples, learn and transform novel mathematical objects, be deployed on a wide variety of problems, and so on. It does seem like a pretty weak form of general intelligence, on the order of evolution or general function-optimization algorithms, though perhaps it's less general than evolution and less powerful than function-optimization algorithms.
If we take this connection at face value, we can maybe use SGD as a prototypical example of general intelligence, and ask: what about SGD makes it so generally capable? A few answers come to mind:
Neural networks with SGD have a pretty good (though perhaps imperfect) prior
Neural networks with SGD scale adequately with more network nodes and data
Intuitively, you would expect to only need one of these: you should be able to make up for a faulty prior with a ton of data, or make up for scarce data with a pretty good prior. Neural networks with SGD seem pretty good at both, suggesting that the two don't trade off against each other as much as a Bayesian might naively think. In particular, worlds where both components are necessary for adequate general intelligence are worlds where data are generated from a long-tailed distribution over distributions, i.e., where it's not that uncommon to encounter novel information despite already having a ton of it.
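The prior/data trade-off can be made concrete with a toy Bayesian model. The following sketch (my own illustrative example, using a Beta-Binomial model and made-up numbers, not anything from the discussion above) shows how a well-placed prior with little data and a flat prior with lots of data land on nearly the same posterior mean:

```python
# Hedged sketch: a Beta-Binomial toy model of the prior/data trade-off.
# A sharp, well-placed prior plus a little data can match a flat prior plus
# a lot of data. All numbers here are illustrative.

def posterior_mean(alpha, beta, heads, tails):
    """Posterior mean of a coin's bias under a Beta(alpha, beta) prior."""
    return (alpha + heads) / (alpha + beta + heads + tails)

# True bias of the coin is 0.7.
# Good prior (concentrated near 0.7), only 10 observations:
good_prior_little_data = posterior_mean(alpha=70, beta=30, heads=7, tails=3)

# Flat prior, 1000 observations:
flat_prior_much_data = posterior_mean(alpha=1, beta=1, heads=700, tails=300)

print(good_prior_little_data)  # 0.7
print(flat_prior_much_data)    # ~0.6996
```

The long-tailed-distribution point is the case this toy model misses: if genuinely novel coin-like problems keep arriving, neither a fixed prior nor accumulated data alone suffices, and you need both.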
SGD seems like it can be made tremendously smarter by feeding in more data and stacking more layers. If you have some SGD-like process inside your neural network, these improvements seem easy enough to develop through further training. You could imagine an AlphaGo-style process which takes in a data structure containing a problem statement and a goal, and derives a winning plan for the problem. Increasing the number of layers would be the equivalent of devoting more subspaces to the computation, and more data would correspond to replicating a similar process in the next layer or through recurrence. With current LLMs, both modifications seem difficult to implement, but people have found SGD-like structures inside transformers (and many other networks with residual connections), so even a sprinkle of SGD without scaling seems to pull its weight.
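The "SGD-like structures inside transformers" claim can be illustrated numerically. The sketch below (my own construction, assuming the in-context linear-regression setting studied in the literature on transformers implementing gradient descent) checks that an unnormalized linear-attention readout coincides with the prediction after one explicit SGD step from zero weights:

```python
import numpy as np

# Hedged sketch: a linear self-attention readout over in-context examples
# can equal one gradient-descent step on an in-context least-squares problem.
# This is a numerical illustration under simplifying assumptions (linear
# attention, weights initialized at zero), not a full construction.

rng = np.random.default_rng(0)
n, d = 16, 4
X = rng.normal(size=(n, d))      # in-context inputs
w_true = rng.normal(size=d)
y = X @ w_true                   # in-context targets
x_query = rng.normal(size=d)     # query point to predict

# One explicit GD step on L(w) = 1/2 * sum_i (x_i . w - y_i)^2 from w = 0:
lr = 0.1
w_after_step = lr * X.T @ y      # negative gradient at w = 0 is X.T @ y
pred_gd = x_query @ w_after_step

# "Linear attention": keys X, values y, query x_query, no softmax:
#   pred = lr * sum_i (x_query . x_i) * y_i
pred_attn = lr * np.sum((X @ x_query) * y)

print(np.isclose(pred_gd, pred_attn))  # True: the two computations coincide
```

The equality here is exact algebra (both expressions are lr * x_query^T X^T y); the interesting empirical finding is that trained transformers appear to converge to computations of roughly this shape.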
Your definition of general intelligence would include SGD on large neural networks
I don’t count it in, actually. In my view, the boundaries of the algorithm here aren’t “SGD + NN”, but “the training loop” as a whole, which includes the dataset and the loss/reward function. A general intelligence implemented via SGD, then, would correspond to an online training loop that can autonomously (without assistance from another generally-intelligent entity, like a human overseer) learn to navigate any environment.
I don’t think any extant training-loop setup fits this definition. They all need externally defined policy gradients. If the distribution on which they’re trained changes significantly, the policy gradient (loss/reward function) needs to be changed to suit, and that change has to come from something external to the training loop which already understands the new environment (e.g., the human overseer) and knows how the policy gradient must be adapted to keep the system on target.
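The boundary being drawn here can be made explicit in code. In the minimal sketch below (my own toy example; the names and the 1-D regression problem are illustrative, not any particular framework's API), the objective is a parameter handed to the loop from outside, so nothing inside the loop can detect or repair a mismatch between that objective and the real goal:

```python
# Hedged sketch of the boundary claim: the learner is the whole loop, and
# loss_grad is supplied from outside it. If the environment shifts so that
# this objective no longer tracks the real goal, the loop cannot notice;
# an overseer outside the loop must swap in a new objective.

def sgd_loop(w, batches, loss_grad, lr=0.1):
    """Online SGD over a data stream. The policy gradient is externally defined."""
    for x, y in batches:
        w = w - lr * loss_grad(w, x, y)
    return w

# Externally supplied objective: squared error on a 1-D linear model.
def squared_error_grad(w, x, y):
    return 2 * (w * x - y) * x

# A stream drawn from y = 3x; given the right objective, the loop fits it.
batches = [(x, 3 * x) for x in [1.0, 2.0, 0.5, 1.5] * 25]
w = sgd_loop(0.0, batches, squared_error_grad)
print(round(w, 2))  # converges to 3.0
```

On this framing, `sgd_loop` plus `squared_error_grad` together still aren't generally intelligent: only the (w, batches, loss_grad) triple as a whole learns anything, and the third element has to be chosen by something that already understands the environment.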
(LLMs trained via SSL are a degenerate case: in their case the prediction gradient = the policy gradient. They also can’t autonomously generalize to generating new classes of text without first being shown a carefully curated dataset of such texts. They’re not an exception.)
I’m skeptical that locating the hyperparameters you mention is an AGI-complete task.