Personal update:
The recent breakthrough on the MATH dataset has made me update substantially in the direction of thinking I’ll lose the bet. I’m now at about 50% chance of winning by 2026, and 25% chance of winning by 2030.
That said, I want others to know that, for the record, my update mostly reflects that I now think MATH is a relatively easy dataset, and my overall AGI median only advanced by a few years.
Previously, I relied quite heavily on statements that people had made about MATH, including the authors of the original paper, who indicated it was a difficult dataset full of high school "competition-level" math word problems. However, two days ago I downloaded the dataset and looked at the problems myself (as opposed to the cherry-picked problems I saw people blog about), and I now understand that a large chunk of the dataset consists of simple plug-and-chug and evaluation problems—some so simple that Wolfram Alpha can solve them. What's more, the previous state-of-the-art model, touted as achieving only 6.9%, was simply a fine-tuned version of GPT-2 (the authors didn't fine-tune anything larger), which makes it very unsurprising that the prior SOTA was so low.
I feel a little embarrassed for not realizing all of this—and I’m certainly still going to pay out to people who bet against me, if I lose—but I want people to know that my main takeaway so far is that the MATH dataset turned out to be surprisingly easy, not that large language models turned out to be surprisingly good at math.
I agree this is more of an update about what existing models were already capable of. I disagree that this means someone in your position should not be updating toward significantly shorter timelines. Even removing MATH, I'm pretty confident I will "win". If you want to replace it with something that better represents what you thought MATH did, I will probably take this second bet at the same odds.
I’m confused. I am not saying that, so I’m not sure which part of my comment you’re agreeing with.
If I found something, I’d be sympathetic to taking another bet. Unfortunately I don’t know of any other good datasets.
The part about the previous SOTA being a fine-tuned GPT-2, which means a lot of MATH performance was latent in LMs that existed at the time we made the bet. On top of this, the various prompting and data-cleaning changes strike me as revealing latent capacity.
If I thought large language models were already capable of doing simple plug-and-chug problems, I'm not sure why I'd update much on this development. There were some slightly harder problems the model was capable of doing, which Google highlighted in their paper (though those were cherry-picked)—and for that I did update a bit (I said my timelines advanced by "a few years").
>If I thought large language models were already capable of doing simple plug-and-chug problems, I’m not sure why I’d update much on this development.
I suppose I just have different intuitions on this. Let's just make a second bet. I imagine you can find another element for your list that you will be comfortable adding—it doesn't necessarily have to be a dataset, just something in the same spirit as the other items in the list.
I think I'll pass on a second bet for now. My mistake was being too careless in the first place—and I'm not currently very interested in doing a deeper dive into what might be a good replacement for MATH.
You could just drop MATH and make a bet at different odds on the remaining items.
One more personal update, which I hope will be final until the bet resolves:
I made quite a few mistakes while writing this bet. For example, I carelessly used 2022 dollars while crafting the inflation-adjustment component of the second condition. These sorts of things made me update in the direction of thinking that making a good timelines bet is really, really hard.
And I’m a bit worried that people will use this bet to say that I was deeply wrong, and my credibility will blow up if I lose. Maybe I am deeply wrong, and maybe it’s right that my credibility should blow up. But for the record, I never had a very high credence on winning—just enough so that the bet seemed worth it.
I don’t think this will affect your credibility too much. You made a bet, which is virtuous. And you will note how few people were interested in taking it at the time.