Let me restate some of my points, which will hopefully make my position clearer. Perhaps you can say which part you disagree with:
Language models are probability distributions over finite sequences of text.
The “true distribution” of internet text refers to a probability distribution over sequences of text that you would find on the internet (including sequences found on other internets elsewhere in the multiverse, which is just meant as an abstraction).
A language model is “better” than another language model to the extent that the cross-entropy between the true distribution and the model is lower.
A human who writes a sequence of text is likely to write something with a relatively high log probability under the true distribution. This is because, in a quite literal sense, the true distribution is just the distribution over what humans actually write.
A current SOTA model, by contrast, is likely to write something with an extremely low log probability, most likely because it will write something that lacks long-term coherence and is recognizably inhuman, and thus won’t be something that would ever appear in the true distribution (or, if it appears, it appears exceedingly rarely).
The last two points provide strong evidence that humans are actually better at the long-sequence task than SOTA models, even though they’re worse at the next character task.
Intuitively, this is because the SOTA model loses a gigantic amount of log probability when it generates whole sequences that no human would ever write. This doesn’t happen on the next character prediction task because you don’t need a very good understanding of long-term coherence to predict the vast majority of next-characters, and this effect dominates the effect from a lack of long-term coherence in the next-character task.
It is true (and I didn’t think of this before) that the human’s cross entropy score will probably be really high purely because they won’t even think to have any probability on some types of sequences that appear in the true distribution. I still don’t think this makes them worse than SOTA language models, because the SOTA model will also have ~0 probability on nearly all actual sequences. However…
Even if you aren’t convinced by my last argument, I can simply modify what I mean by the “true distribution” to mean the “true distribution of texts that are in the reference class of things we care about”. There’s absolutely no reason to say the true distribution has to be “everything on the internet” as opposed to “all books” or even “articles written by Rohin” if that’s what we’re actually trying to model.
Thus, I don’t accept one of your premises. I expect current language models to be better than you at next-character prediction on the empirical distribution of Rohin articles, but worse than you at whole sequence prediction for Rohin articles, for reasons you seem to already accept.
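The cross-entropy comparison in my third point above can be made concrete with a toy sketch. Everything here is illustrative: the "sequences" and probabilities are made up, and real language models assign probabilities over astronomically many sequences, not three.

```python
import math

# Toy "true" distribution over three possible sequences.
true_dist = {"seq_a": 0.5, "seq_b": 0.3, "seq_c": 0.2}

# Two toy models assigning probability to the same sequences.
model_1 = {"seq_a": 0.4, "seq_b": 0.4, "seq_c": 0.2}
model_2 = {"seq_a": 0.1, "seq_b": 0.1, "seq_c": 0.8}

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log2 q(x), in bits."""
    return -sum(p[x] * math.log2(q[x]) for x in p)

# The model with the lower cross entropy against the true
# distribution is the "better" language model in this sense.
h1 = cross_entropy(true_dist, model_1)
h2 = cross_entropy(true_dist, model_2)
assert h1 < h2
```

Note that model_2 loses badly because it puts little probability on the sequences the true distribution actually favors, which is the same mechanism by which a model lacking long-term coherence loses log probability on whole sequences.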
Large language models are also going to be wildly superhuman by long-sequence metrics like “log probability assigned to sequences of Internet text”
I think this entirely depends on what you mean. There’s a version of the claim here that I think is true, but I think the most important version of it is actually false, and I’ll explain why.
I claim that if you ask a human expert to write an article (even a relatively short one) about a non-trivial topic, their output will have a higher log probability than a SOTA language model, with respect to the “true” distribution of internet articles. That is, if you were given the (entirely hypothetical) true distribution of actual internet articles (including articles that have yet to be written, and the ones that have been written in other parts of the multiverse...), a human expert is probably going to write an article that has a higher log probability of being sampled from this distribution, compared to a SOTA language model.
This claim might sound bizarre at first, because, as you noted “many such metrics are just sums over the next-character versions of the metric, which this post shows LLMs are great at”. But, first maybe think about this claim from first principles: what is the “true” distribution of internet articles? Well, it’s the distribution of actual internet articles that humans write. If a human writes an article, it’s got to have pretty high log-probability, no? Because otherwise, what are we even sampling from?
Now, what you could mean is that instead of measuring the log probability of an article with respect to the true distribution of internet articles, we measure it with respect to the empirical distribution of internet articles. This is in fact what we use to measure the log-probability of next character predictions. But the log probability of this quantity over long sequences will actually be exactly negative infinity, both for the human-written article, and for the model-written article, assuming they’re not just plagiarizing an already-existing article. That is, we aren’t going to find any article in the empirical distribution that matches the articles either the human or the model wrote, so we can’t tell which of the two is better from this information alone.
What you probably mean is that we could build a model of the true distribution of internet articles, and use this model to estimate the log-probability of internet articles. In that case, I agree, a SOTA language model would probably far outperform the human expert, at the task of writing internet articles, as measured by the log-probability given by another model. But, this is a flawed approach, because the model we’re using to estimate the log-probability with respect to the true distribution of internet articles is likely to be biased in favor of the SOTA model, precisely because it doesn’t understand things like long-sequence coherence, unlike the human.
How could we modify this approach to give a better estimate of the performance of a language model at long-sequence prediction? I think that there’s a relatively simple approach that could work.
Namely, we set up a game in which humans try to distinguish between real human texts and generated articles. If the humans can’t reliably distinguish between the two, then the language model being used to generate the articles has attained human-level performance (at least by this measure). This task has nice properties, as there is a simple mathematical connection between prediction ability and ability to discriminate; a good language model that can pass this test will likely only pass it because it is good at coming up with high log-probability articles. And this task also measures the thing we care about that’s missing from the predict-the-next-character task: coherence over long sequences.
Ah, I see your point. That said, I think calling the task we train our LMs on (learning a probabilistic model of language) “language modeling” seems quite reasonable to me—in my opinion, it seems far more unreasonable to call “generating high quality output” “language modeling”.
Note that the main difference between my suggested task and the next-character-prediction task is that I’m suggesting we measure performance over a long time horizon. “Language models” are, formally, probability distributions over sequences of text, not models over next characters within sequences. It is only via a convenient application of the Markov assumption and the chain rule of probability that we use next-character-prediction during training.
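The chain-rule decomposition mentioned above can be written out directly. The `next_char_prob` function here is a hypothetical stand-in for a trained model's conditional distribution; the uniform model at the end is purely for illustration.

```python
import math

def sequence_log_prob(seq, next_char_prob):
    """log p(seq) = sum_i log p(seq[i] | seq[:i]), by the chain rule.

    `next_char_prob(prefix, char)` is assumed to return the model's
    conditional probability of `char` given `prefix`.
    """
    return sum(
        math.log(next_char_prob(seq[:i], seq[i]))
        for i in range(len(seq))
    )

# A toy uniform model over a 4-character alphabet, for illustration.
uniform = lambda prefix, char: 0.25
assert math.isclose(sequence_log_prob("abca", uniform), 4 * math.log(0.25))
```

Training on next-character prediction and scoring whole sequences are thus two views of the same quantity; the disagreement is about which empirical comparison against humans is the natural one.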
The actual task, in the sense of what language models are fundamentally designed to perform well on, is to emulate sequences of human text. Thus, it is quite natural to ask when they can perform well on this task. In fact, I remain convinced that it is more natural to ask about performance on the long-sequence task than the next-character-prediction task.
We disagree that this measure is better. Our goal here isn’t to compare the quality of Language Models to the quality of human-generated text; we aimed to compare LMs and humans on the metric that LMs were trained on (minimize log loss/perplexity when predicting the next token).
Your measure is great for your stated goal. That said, I feel the measure gives a misleading impression to readers. In particular I’ll point to this paragraph in the conclusion,
Even current large language models are wildly superhuman at language modeling. This is important to remember when you’re doing language model interpretability, because it means that you should expect your model to have a lot of knowledge about text that you don’t have. Chris Olah draws a picture where he talks about the possibility that models become more interpretable as they get to human level, and then become less interpretable again as they become superhuman; the fact that existing LMs are already superhuman (at the task they’re trained on) is worth bearing in mind when considering this graph.
I think it’s misleading to say that language models are “wildly superhuman at language modeling” by any common-sense interpretation of that claim. While the claim is technically true if one simply means that language models do better at the predict-the-next-token task, most people, I’d guess, would not intuitively consider that the best measure of general performance at language modeling. The reason, fundamentally, is that we are building language models to compete with humans at the task of writing text, not the task of predicting the next character.
By analogy, if we train a robot to play tennis by training it to emulate human tennis players, I think most people would think that “human level performance” is reached when it can play as well as a human, not when it can predict the next muscle movement of an expert player better than humans, even if predicting the next muscle movement was the task used during training.
Building on this comment, I think it might be helpful for readers to make a few distinctions in their heads:
“True entropy of internet text” refers to the entropy rate (measured in bits per character, or bits per byte) of English text, in the limit of perfect prediction ability. Operationally, if one developed a language model such that the cross entropy between internet text and the model was minimized to the maximum extent theoretically possible, the cross entropy score would equal the “true” entropy of internet text. Scaling laws suggest it would take unbounded computation to train a model all the way down to this cross entropy score. This quantity depends on the data distribution, and is purely a hypothetical (though useful) abstraction.
“Human-level perplexity” refers to the perplexity attained by humans tested on the predict-the-next-token task. Perplexity, in this context, is defined as two raised to the power of the cross entropy (in bits) between internet text and the model.
“Human-level performance” refers to a level of performance such that a model is doing “about as well as a human”. This term is ambiguous, but is likely best interpreted as a perplexity somewhere between the perplexity implied by the true entropy and human-level perplexity (as defined previously).
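These distinctions amount to a couple of simple conversions, sketched below. The numbers are illustrative placeholders, not measured values.

```python
# The quantities above, expressed as simple conversions.
# Numbers are illustrative, not measured values.

def perplexity(cross_entropy_bits: float) -> float:
    """Perplexity is two raised to the cross entropy (in bits)."""
    return 2.0 ** cross_entropy_bits

true_entropy_bits = 0.8   # hypothetical "true" entropy per character
human_bits        = 1.3   # hypothetical human cross entropy per character

true_perplexity  = perplexity(true_entropy_bits)  # theoretical floor
human_perplexity = perplexity(human_bits)         # human reference point

# "Human-level performance" then plausibly lies between the two.
assert true_perplexity < human_perplexity
```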
The limitations detailed above are probably why these results are not consistent with the estimate by Shannon, who estimated the average per-character entropy to be between 0.6 and 1.3 bits, which would correspond to a per-token perplexity between 7 and 60 (the average length of tokens in our corpus is 4.5).
Shannon’s estimate was about a different quantity. Shannon was interested in bounding the character-level entropy of an ideal predictor, i.e., what we’d consider a perfect language model, though he leveraged human performance on the predict-the-next-character task to make his estimate.
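For what it's worth, the per-token conversion in the quoted passage checks out arithmetically, treating Shannon's bounds as entropies in bits per character and converting at roughly 4.5 characters per token:

```python
# Convert Shannon's per-character entropy bounds (in bits) to
# per-token perplexity, assuming ~4.5 characters per token.
chars_per_token = 4.5

def per_token_perplexity(bits_per_char: float) -> float:
    return 2.0 ** (bits_per_char * chars_per_token)

low  = per_token_perplexity(0.6)   # roughly 6.5
high = per_token_perplexity(1.3)   # roughly 58

# Approximately the "between 7 and 60" range quoted above.
print(round(low, 1), round(high, 1))
```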
This article cites a paper saying that, when human-level perplexity was measured on the same dataset that Shannon used, a higher estimate was obtained that is consistent with your estimate.
Cover and King framed prediction as a gambling problem. They let the subject “wager a percentage of his current capital in proportion to the conditional probability of the next symbol.” If the subject divides his capital on each bet according to the true probability distribution of the next symbol, then the true entropy of the English language can be inferred from the capital of the subject after n wagers.
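Cover and King's gambling scheme can be sketched on a toy i.i.d. source with a known entropy rate. This is a simplification (real text is not i.i.d., and a human gambler only approximates the true conditional probabilities), but it shows the identity they exploit: with fair odds, the growth rate of log capital reveals the entropy.

```python
import math
import random

random.seed(1)

# Toy i.i.d. "language": two symbols with a known entropy rate.
symbols, probs = ["a", "b"], [0.8, 0.2]
true_entropy = -sum(p * math.log2(p) for p in probs)  # ~0.722 bits/symbol

m = len(symbols)      # fair odds pay m-for-1 on the symbol that occurs
n = 100_000
log2_capital = 0.0    # track log capital to avoid float overflow

for _ in range(n):
    drawn = random.choices(symbols, weights=probs)[0]
    # Wager all capital in proportion to the (here, true) conditional
    # probabilities; the fraction on the drawn symbol wins at m-for-1.
    bet_fraction = probs[symbols.index(drawn)]
    log2_capital += math.log2(m * bet_fraction)

# If bets follow the true distribution, the capital growth rate
# reveals the entropy: H ~= log2(m) - (1/n) * log2(capital).
estimate = math.log2(m) - log2_capital / n
```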
Separately, in my opinion, a far better measure of human-level performance at language modeling is the perplexity level at which a human judge can no longer reliably distinguish between a long sequence of generated text and a real sequence of natural language. This measure has the advantage that, once human-level ability is surpassed by this measure, we can directly substitute language models for human writers.
One more personal update, which I hope will be final until the bet resolves:
I made quite a few mistakes while writing this bet. For example, I carelessly used 2022 dollars while crafting the inflation adjustment component of the second condition. These sorts of things made me update in the direction of thinking that making a good timelines bet is really, really hard.
And I’m a bit worried that people will use this bet to say that I was deeply wrong, and my credibility will blow up if I lose. Maybe I am deeply wrong, and maybe it’s right that my credibility should blow up. But for the record, I never had a very high credence on winning—just enough so that the bet seemed worth it.
Politicization. The COVID-19 response worries me much more than it worries you, and its positives outweighed its negatives only because there wasn’t any X-risk. In particular, the strong response actually decayed pretty fast, and in our world virtually everything is politicized into a culture war as soon as it actually impacts people’s lives.
Note that I’m simply pointing out that people will probably try to regulate AI, and that this could delay AI timelines. I’m not proposing that we should be optimistic about regulation. Indeed, I’m quite pessimistic about heavy-handed government regulation of AI, but for reasons I’m not going to go into here.
Separately, the reason why the COVID-19 response decayed quickly likely had little to do with politicization, given that the pandemic response decayed in every nation in the world, with the exception of China. My guess is that, historically, regulations on manufacturing particular technologies have not decayed quite so quickly.
I’m curious if you have any thoughts on the effect regulations will have on AI timelines. To have a transformative effect, AI would likely need to automate many forms of management, which involves making a large variety of decisions without the approval of other humans. The obvious effect of deploying these technologies will therefore be to radically upend our society and way of life, taking control away from humans and putting it in the hands of almost alien decision-makers. Will bureaucrats, politicians, voters, and ethics committees simply stand idly by while the tech industry takes over our civilization like this?
On the one hand, it is true that cars, airplanes, electricity, and computers were all introduced with relatively few regulations. These technologies went on to change our lives greatly in the last century and a half. On the other hand, nuclear power, human cloning, genetic engineering of humans, and military weapons each have a comparable potential to change our lives, and yet are subject to tight regulations, both formally, as the result of government-enforced laws, and informally, as engineers regularly refuse to work on these technologies indiscriminately, fearing backlash from the public.
One objection is that it is too difficult to slow down AI progress. I don’t buy this argument.
A central assumption of the Bio Anchors model, and all hardware-based models of AI progress more generally, is that getting access to large amounts of computation is a key constraint to AI development. Semiconductor fabrication plants are easily controllable by national governments and require multi-billion dollar upfront investments, which can hardly evade the oversight of a dedicated international task force.
We saw in 2020 that, if threats are big enough, governments have no problem taking unprecedented action, quickly enacting sweeping regulations of our social and business life. If anything, a global limit on manufacturing a particular technology enjoys even more precedent than, for example, locking down over half of the world’s population under some sort of stay-at-home order.
Another argument states that the incentives to make fast AI progress are simply too strong: first mover advantages dictate that anyone who creates AGI will take over the world. Therefore, we should expect investments to accelerate dramatically, not slow down, as we approach AGI. This argument has some merit, and I find it relatively plausible. At the same time, it relies on a very pessimistic view of international coordination that I find questionable. A similar first-mover advantage was also observed for nuclear weapons, prompting Bertrand Russell to go as far as saying that only a world government could possibly deter nations from developing and using nuclear weapons. Yet, I do not think this prediction was borne out.
Finally, it is possible that the timeline you state here is conditioned on no coordinated slowdowns. I sometimes see people making this assumption explicit, and in your report you state that you did not attempt to model “the possibility of exogenous events halting the normal progress of AI research”. At the same time, if regulation ends up mattering a lot—say, it delays progress by 20 years—then all the conditional timelines will look pretty bad in hindsight, as they will have ended up omitting one of the biggest, most determinative factors of all. (Of course, it’s not misleading if you just state upfront that it’s a conditional prediction).
The kinds of recursive self-improvement mentioned here aren’t exactly the frequently-envisioned scenario of a single AI system improving itself unencumbered. They instead rely on humans to make them work, and humans are inevitably slow and thus currently inhibit a discontinuous foom scenario.
It’s worth noting that the examples shown here are in line with most continuous models of AI progress. In most continuous models, AI-driven improvements first start small, with AI contributing a little bit to the development process. Over time, AI will contribute more and more to the process of innovation in AI, until they’re contributing 60% of the improvements, then 90%, then 98%, then 99.5%, and then finally all of the development happens through AI, and humans are left out of the process entirely.
I don’t know whether most people who believe in hard takeoff would say that these examples violate their model (probably not), but at the very least, these observations are well-predicted by simple continuous models of AI takeoff.
I now have an operationalization of AGI I feel happy about, and I think it’s roughly just as difficult as creating transformative AI (though perhaps still slightly easier).
I have less probability now on very long timelines (>80 years). Previously I had 39% credence on AGI arriving after 2100, but I now only have about 25% credence.
I also have a bit more credence on short timelines, mostly because I think the potential for massive investment is real, and it doesn’t seem implausible that we could spend >1% of our GDP on AI development at some point in the near future.
I still have pretty much the same reasons for having longer timelines than other people here, though my thinking has become more refined. Here are my biggest reasons, summarized: delays from regulation, the difficulty of making AI reliable, the very high bar of automating general physical labor and management, and the fact that previous impressive-seeming AI milestones ended up mattering much less in hindsight than we thought at the time.
Taking these considerations together, my new median is around 2060. My mode is still probably in the 2040s, perhaps 2042.
I want to note that I’m quite impressed with recent AI demos, and I think that we are making quite rapid progress at the moment in the field. My longish timelines are mostly a result of the possibility of delays, which I think are non-trivial.
I’m delighted to have been cited in this post. However, I must now note that this operationalization is out of date. I have a new question on Metaculus that I believe provides a more thorough and clearer definition of AGI than the one referenced here. I will quote the criteria in full:
The following definitions are provided:
A Turing test is any trial during which an AI system is instructed to pretend to be a human participant while communicating with judges who are instructed to discriminate between the AI and human confederates in the trial. This trial may take any format, and may involve communication across a wide variety of media, as long as communication through natural language is permitted.
A Turing test is said to be “long” if the AI communicates with judges for a period of at least two consecutive hours.
A Turing test is said to be an “informed” test if all of the human judges possess an expert-level understanding of contemporary AI, and the ways in which contemporary AI systems fail, and all of the human confederates possess an expert-level understanding of contemporary AI, and the ways in which contemporary AI systems fail.
A Turing test is said to be “adversarial” if the human judges make a good-faith attempt, in the best of their abilities, to successfully unmask the AI as an impostor among the participants, and the human confederates make a good-faith attempt, in the best of their abilities, to demonstrate that they are humans. In other words, all of the human participants should be trying to ensure that the AI does not pass the test.
An AI is said to “pass” a Turing test if at least 50% of judges rated the AI as more human than at least 20% of the human confederates. This condition could be met in many different ways, so long as the final determination of the judges explicitly or implicitly yields a rating for how “human” the AI acted during the trial. For example, this condition would be met if there are five human confederates, and at least half of the judges select a human confederate as their single best guess for the imposter.
This question resolves on the first date during which a credible document is published indicating that a long, informed, adversarial Turing test was passed by some AI, so long as the test was well-designed and satisfied the criteria written here, according to the best judgement of Metaculus administrators. Metaculus administrators will also attempt to exclude tests that included cheating, conflicts of interest, or rogue participants who didn’t follow the rules. All human judges and confederates should understand that their role is strictly to ensure the loss of the AI, and they collectively “fail” if the AI “passes”.
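The pass criterion in the quoted question can be expressed as a small check. The ratings below are hypothetical, purely to illustrate the 50%-of-judges and 20%-of-confederates thresholds.

```python
def turing_test_passed(ratings, num_confederates):
    """`ratings` lists, for each judge, how many human confederates
    that judge rated as less human than the AI.

    The AI passes if at least 50% of judges rated it as more human
    than at least 20% of the confederates.
    """
    threshold = 0.2 * num_confederates
    convinced = sum(1 for beaten in ratings if beaten >= threshold)
    return convinced >= 0.5 * len(ratings)

# With five confederates, beating one of them meets the 20% bar, so
# the quoted example (half the judges pick a human confederate as
# their single best guess for the impostor) counts as a pass.
assert turing_test_passed(ratings=[1, 1, 0, 0], num_confederates=5)
assert not turing_test_passed(ratings=[1, 0, 0, 0], num_confederates=5)
```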
I respond with arguments like, “In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”).”
Their response in turn is generally some variant of “well, natural selection wasn’t optimizing very intelligently” or “maybe humans weren’t all that sharply above evolutionary trends” or “maybe the power that let humans beat the rest of the ecosystem was simply the invention of culture, and nothing embedded in our own already-existing culture can beat us” or suchlike.
Rather than arguing further here, I’ll just say that failing to believe the hard problem exists is one surefire way to avoid tackling it.
It sounds like you don’t want to argue this point further here, but I would like to point something very simple out that I think your argument here glosses over.
Humanity is a species, not an individual. It wasn’t the case that a single animal arose among all the others, and out-competed everyone else. Instead, it was a large set of entities that collectively out-competed all the other animals. And I think this distinction is quite important to make.
If you think that an analogy to human evolution is critical to understanding our epistemic situation, it appears to me that the evolutionary analogy should force you to draw the opposite conclusion from the one you have drawn here (relative to credible people who disagree).
In my understanding of our situation, the conclusion to draw from human evolution is that a single species can acquire a host of very powerful technologies, and tower above everyone else, in a relatively short period of time. That is, we should predict that, in the future, a collection of AIs could eventually out-match humanity.
But you’re not arguing that thesis! (At least, as I understand your argument) You’re arguing that the evolutionary analogy shows that a single individual can outcompete everyone else. And I don’t know where that idea is coming from.
That’s interesting. One caveat I should add is that I was referring to calorie overconsumption, as opposed to volume overconsumption. Rice is not very calorie dense, making it relatively easy to become full without eating many calories.
I think I’ll pass up an opportunity for a second bet for now. My mistake was being too careless in the first place—and I’m not currently too interested in doing a deeper dive into what might be a good replacement for MATH.
Had I already thought large language models were capable of doing simple plug-and-chug problems, I’m not sure I’d have updated much on this development. There were some slightly hard problems the model was capable of doing, which Google highlighted in their paper (though they were cherry-picked), and for those I did update a bit (I said my timelines advanced by “a few years”).
I agree this is more of an update about what existing models were already capable of.
I’m confused. I am not saying that, so I’m not sure which part of my comment you’re agreeing with.
If you want to replace it with something that more represents what you thought MATH did, I will probably take this second bet at the same odds.
If I found something, I’d be sympathetic to taking another bet. Unfortunately I don’t know of any other good datasets.
The recent breakthrough on the MATH dataset has made me update substantially in the direction of thinking I’ll lose the bet. I’m now at about 50% chance of winning by 2026, and 25% chance of winning by 2030.
That said, I want others to know that, for the record, my update mostly reflects that I now think MATH is a relatively easy dataset, and my overall AGI median only advanced by a few years.
Previously, I relied quite heavily on statements that people had made about MATH, including the authors of the original paper, who indicated it was a difficult dataset full of high school “competition-level” math word problems. However, two days ago I downloaded the dataset and took a look at the problems myself (as opposed to the cherry-picked problems I saw people blog about), and I now understand that a large chunk of the dataset includes simple plug-and-chug and evaluation problems—some of them so simple that Wolfram Alpha can perform them. What’s more: the previous state of the art model, which was touted as achieving only 6.9%, was simply a fine-tuned version of GPT-2 (they didn’t fine-tune anything larger), which makes it very unsurprising that the prior SOTA was so low.
I feel a little embarrassed for not realizing all of this—and I’m certainly still going to pay out to people who bet against me, if I lose—but I want people to know that my main takeaway so far is that the MATH dataset turned out to be surprisingly easy, not that large language models turned out to be surprisingly good at math.
No, I corrected it.