… believes that AI progress will (probably) be gradual, smooth, and relatively predictable, with each advance increasing capabilities by a little, receiving widespread economic use, and being adopted by multiple actors before it is compounded by the next advance
… believes that AI progress will (probably) be erratic and involve sudden capability jumps
The question of whether there is a jump specifically at the autonomous research threshold (let’s call that “AGI”) is muddled by the discussion of what happens prior to that threshold. The reasons for the jump there in particular are very different from reasons for jumps elsewhere, and it doesn’t seem relevant to discuss presence or absence of such jumps elsewhere in connection to the jump at this particular threshold.
I expect gradual improvement all the way to AGI, then technical feasibility of a jump from that particular level to superintelligence in a matter of months, if the AGI is allowed to do its thing. But the reasons for expecting gradual improvement prior to AGI and for expecting a jump after AGI seem unrelated. There are convergent scaling laws that different architectures seem to share in quantitative detail, and their practical application is always constrained by slowly changing available hardware and investment, so sudden jumps are unlikely for long stretches of time, possibly up to AGI. And then there is the serial speed advantage of AIs, which accelerates technological history across the board. It doesn’t influence progress prior to AIs becoming autonomously competent at research, but then suddenly gets to influence it, using existing hardware and investment more efficiently to extract much more competence out of them.
Figuring out how to generate much higher quality general data (as in RL and self-play) is a wildcard that might disrupt gradual improvement before AGI, but at that point it’s probably also sufficient to reach AGI, given how capable existing systems are that only use natural data. So the distinction is mostly in the difficulty of stopping at a system capable of autonomous research but not yet significantly more competent than humans, which is important for plans that want to bootstrap defense against misaligned AI. It’s still gradual predictable improvement followed by a jump to superintelligence (if this particular jump is not interrupted).
Scaling laws are an important phenomenon and probably deeply tied to the nature of intelligence.
I do take issue with the assertion that scaling laws imply slow takeoff. One key takeaway of the modern ML revolution is that the specific details of architectures-in-the-narrow-sense* are mostly not that important, and that compute and data dominate.
The natural implication is that scaling laws are a function of the data distribution, and mostly not of the architecture. Just because we see a ‘smooth, slow’ scaling law on text data doesn’t mean that this will generalize to other domains, situations, or horizons. In fact, I think we should mostly expect this not to be the case.
*I think the jump from “architectures-in-the-narrow-sense don’t matter” to “architectures-in-the-broad-sense don’t matter” is often made. I think this is obviously not supported by the evidence we have so far (despite many claims to the contrary) and is likely wrong.
Even architectures-in-the-narrow-sense don’t show overarching scaling laws at current scales, right? IIRC the separate curves for MLPs, LSTMs and transformers do not currently match up into one larger curve. See e.g. figure 7 here.
So a sudden capability jump due to a new architecture outperforming transformers the way transformers outperform MLPs at equal compute cost seems to be very much in the cards?
I intuitively agree that current scaling laws seem like they might be related in some way to a deep bound on how much you can do with a given amount of data and compute, since different architectures do show qualitatively similar behavior even if the y-axes don’t match up. But I see nothing to suggest that any current architectures are actually operating anywhere close to that bound.
Is it true that scaling laws are independent of architecture? I don’t know much about scaling laws but that seems surely wrong to me.
E.g. how does RNN scaling compare to transformer scaling?
The relevant laws describe how a target perplexity determines the compute and data needed to reach it in a training run that tries to use as little compute as possible and is otherwise unconstrained on data. The claim is that this differs surprisingly little across different architectures. This is different from what historical trends in algorithmic progress measure, since those results are mostly not unconstrained on data (which also needs to come from sufficiently similar distributions to compare architectures), and they fail to get through the initial stretch of questionable scaling at low compute.
It’s still probably mostly a selection effect, but see Mamba’s scaling laws (Figure 4 in the paper), where the dependence of FLOPs on perplexity only ranges about 6x across GPT-3, LLaMA, Mamba, Hyena, and RWKV. Also, the graphs for different architectures tend not to intersect, suggesting some “compute multiplier” property that captures how efficient one architecture is compared to another across a wide range of compute. The question is whether any of these compute multipliers change significantly at greater scale, once you clear the first 1e20 FLOPs or so.
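To make the “compute multiplier” reading concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the power-law form, the constants `a`, `b`, `irreducible`, and the function names are made up, not taken from the Mamba paper or fitted to any real data. It only shows that if two architectures differ by a constant factor in effective compute, their curves never intersect and the FLOPs ratio at equal loss is the same at every loss level, whereas a difference in exponent makes the multiplier drift with scale.

```python
import math

# Hypothetical toy law (not fitted to anything):
#   loss(C) = irreducible + a * (multiplier * C) ** (-b)
# where loss is log-perplexity and C is training compute in FLOPs.

def loss_at_compute(compute_flops, multiplier=1.0, a=20.0, b=0.05, irreducible=1.7):
    """Loss reached with the given compute under the assumed power law."""
    return irreducible + a * (multiplier * compute_flops) ** (-b)

def compute_for_loss(target_loss, multiplier=1.0, a=20.0, b=0.05, irreducible=1.7):
    """Invert the assumed law: FLOPs needed to reach target_loss."""
    if target_loss <= irreducible:
        raise ValueError("target loss is at or below the irreducible loss")
    return ((target_loss - irreducible) / a) ** (-1.0 / b) / multiplier

def effective_multiplier(target_loss, baseline_kwargs, other_kwargs):
    """How many times less compute 'other' needs than 'baseline' at equal loss."""
    return (compute_for_loss(target_loss, **baseline_kwargs)
            / compute_for_loss(target_loss, **other_kwargs))

if __name__ == "__main__":
    for target in (4.0, 3.0, 2.5):
        # Same exponent, constant 6x multiplier: the ratio does not depend on the loss.
        constant = effective_multiplier(target, {}, {"multiplier": 6.0})
        # Slightly different exponent: the ratio changes as the target loss drops.
        drifting = effective_multiplier(target, {}, {"b": 0.055})
        print(f"loss={target}: constant={constant:.2f}, drifting={drifting:.3g}")
```

In this toy picture, the open question above is whether real architectures look like the first case (a fixed horizontal shift on a log-compute axis) or the second (a gap that widens or closes as compute grows).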
Hence generation of higher quality data is a plausible way of disrupting the way scaling laws govern slow takeoff. What this data needs to provide is general cognitive competence that therefore applies to the physical world, but that competence doesn’t need to involve initial familiarity with the human world.
So it could be formal proofs on a reasonable distribution of topics, or a superscaled RL system in an environment that sufficiently elicits general reasoning. If the backbone of a dataset shapes representations towards competence, it might transfer to other areas. Thus we get an alien mind that mostly uses natural data as a tool to speak good English and anticipate popular opinion, not as the essential fabric of its own nature.
In the current not-knowing-what-we-are-doing regime, I’m guessing the safer AGIs are scaffolded natural data LLMs, or failing that, model-based RL systems that develop in contact with the human world or human data. Model-free RL that relies on a synthetic environment to generate enough data risks growing up more alien. It’s less clear with reasoning that originates in synthetic data for math and is grounded in the physical world through natural data being a fraction of the datasets for all models in the system (as a kind of multimodality). Such admixing of natural data might even be sufficient to make a model-free RL system less alien.