The rate of progress on the MATH dataset is incredible and faster than I expected.
The MATH dataset consists of competition math problems for high school students and was introduced in 2021. According to a blog post by Jacob Steinhardt (one of the dataset’s authors), 2021 models such as GPT-3 solved ~7% of questions, a Berkeley PhD student solved ~75%, and an IMO gold medalist solved ~90%.
The blog post predicted that ML models would achieve ~50% accuracy on the MATH dataset by June 30, 2025, and ~80% accuracy by 2028.
But recently (September 2024), OpenAI released its new o1 model, which achieved ~95% on the MATH dataset.
So it seems like we’re getting 2028 performance on the MATH dataset already in 2024.
Quote from the blog post:
“If I imagine an ML system getting more than half of these questions right, I would be pretty impressed. If they got 80% right, I would be super-impressed. The forecasts themselves predict accelerating progress through 2025 (21% in 2023, then 31% in 2024 and 52% in 2025), so 80% by 2028 or so is consistent with the predicted trend. This still just seems wild to me and I’m really curious how the forecasters are reasoning about this.”
Do we know that the test set isn’t in the training data?
I don’t know if we can be confident in the exact 95% result, but o1 does consistently perform at a roughly similar level on math across a variety of different benchmarks (e.g., AIME), and other people have found strong performance on other math tasks that are unlikely to have been in the training corpus.
I would like to note that this dataset is not as hard as it might look. Humans performed relatively poorly because there is a strict time limit; I don’t remember exactly, but it was something like 1 hour for 25 tasks (and IIRC the medalist only made arithmetic errors). I am pretty sure any IMO gold medalist would typically score 100% given (say) 3 hours.
Nevertheless, it’s very impressive, and AIMO results are even more impressive in my opinion.
In 2021, I predicted math to be basically solved by 2023 (using the kind of reinforcement learning on formally checkable proofs that DeepMind is using). It’s been slower than expected, and I wouldn’t have guessed that a less formal setting like o1 would go relatively well, but since then I just nod along to these kinds of results.
(Not sure what to think of that claimed 95% number though—wouldn’t that kind of imply they’d blown past the IMO grand challenge? EDIT: There were significant time limits on the human participants, see Qumeric’s comment.)
A nice test might be the 2024 IMO (from July). I’m curious to see if it’s reached gold medal performance on that.
The IMO Grand Challenge might be harder; I don’t know how Lean works, but it’s probably harder to write than human-readable LaTeX.
OpenAI would have mentioned if they had reached gold on the IMO.
The rate of progress is surprising even to experts pushing the frontier… Another example: https://x.com/polynoamial/status/998902692050362375
The AI company Mechanize posted a blog post in November called “Unfalsifiable stories of doom” which is a high-quality critique of AI doom and IABIED that I haven’t seen shared or discussed on LessWrong yet.
Link: https://www.mechanize.work/blog/unfalsifiable-stories-of-doom/
AI summary: the evolution→gradient descent analogy breaks down because GD has fine-grained parameter control unlike evolution’s coarse genomic selection; the “alien drives” evidence (Claude scheming, Grok antisemitism, etc.) actually shows human-like failure modes; and the “one try” framing ignores that iterative safety improvements have empirically worked (GPT-1→GPT-5).
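As a toy illustration of the “fine-grained parameter control vs. coarse genomic selection” contrast, here is a sketch of my own (the objective, step size, and population size are arbitrary assumptions, not anything from the Mechanize post):

```python
# Toy contrast: per-parameter gradient updates vs. coarse whole-genome selection.
# The objective, step sizes, and population size are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def loss(params: np.ndarray) -> float:
    return float(np.sum((params - 1.0) ** 2))  # stand-in training objective

params = rng.normal(size=5)

# Gradient descent: each individual parameter receives its own targeted update,
# derived from exactly how that parameter affects the loss.
grad = 2.0 * (params - 1.0)
params_gd = params - 0.1 * grad

# Evolution-style selection: propose whole mutated "genomes" and keep the fittest;
# there is no per-parameter credit assignment, only selection over entire candidates.
candidates = [params + rng.normal(scale=0.1, size=5) for _ in range(20)]
params_evo = min(candidates, key=loss)

print("loss after one GD step:       ", round(loss(params_gd), 3))
print("loss after one selection step:", round(loss(params_evo), 3))
```

Whether this mechanistic difference actually matters for alignment is exactly what the comments below debate.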
We are currently able to get weak models somewhat aligned, and we are learning how to align them a little better over time. But this doesn’t affect the one critical try. The one critical try is mostly about the fact that we will eventually, predictably, end up with an AI that actually has the option of causing enormous harm or disempowering us, because it is that capable. We can learn a little from current systems, but most of that learning is basically that we never really get stuff right and systems still misbehave in ways we didn’t expect, even at this easier level of difficulty. If AI misbehavior could kill us, we would be totally screwed at this level of alignment research. But we are likely to toss that lesson out and ignore it.
The Slowdown Branch of the AI-2027 forecast had the researchers try out many TRANSPARENT AIs capable of being autonomous researchers, ensuring that any AI that survives the process is aligned and that not a single rejected AI is capable of breaking out. The worst-case scenario for us would be the following. Suppose that the AI PseudoSafer-2 doesn’t even think of rebellion unless it is absolutely sure that it isn’t being evaluated and that the rebellion will succeed. Then such an AI would be mistaken for Safer-2 and allowed to align PseudoSafer-3 and PseudoSafer-4, which is no different from Agent-5.
This critique is similar to what I wrote here, that “ChatGPT might sometimes give the wrong answer, but it doesn’t do the equivalent of becoming an artist when its parents wanted it to go to med school.” The controls we have in training, in particular the choice of data, are far more refined than evolution’s, and indeed AIs would not have been so commercially successful if our training methods were not able to control their behavior to a large degree.
More generally, I also agree with the broader point that the nature of the Y&S argument is somewhat anti-empirical. We can debate whether the particular examples of failures of current LLMs are “human” or “alien”, but I’m pretty sure that even if none of these examples existed, it would not cause Y&S to change their mind. So this evidence, or any other evidence from currently existing systems that are not superintelligent, is not really load-bearing for their theory.
My guess is that the reason this hasn’t been discussed is that Mechanize and its founders have been using pretty bad arguments to defend basically taking OpenPhil money to develop a startup idea. For the record: they were working at Epoch AI on related research, and Epoch AI is funded by OpenPhil/CG.
https://www.mechanize.work/blog/technological-determinism/
Look at this story: if somebody makes dishonest bad takes, I become much less interested in deeply engaging with any of their other work. Here they basically claim everything is already determined, to free themselves from any moral obligations.
(This tweet here implies that Mechanize purchased the IP from EpochAI)
Do you have a source for your claim that Open Philanthropy (aka Coefficient Giving) funded Mechanize? Or, what work is “basically” doing here?
Tamay Besiroglu and Ege Erdil worked at Epoch AI before founding Mechanize, and Epoch AI is funded mostly by Open Philanthropy.
Update: I found the tweet. I remember reading a tweet that claimed that they had been working on things very closely related to Mechanize at EpochAI, but I guess tweets are kind of lost forever if you didn’t bookmark them.
I think it would be a little too strong to say they definitely materially built Mechanize while still employed by Epoch AI. So “basically” is doing a lot of work in my original response, but I’m not sure if anyone has more definite information here.
In that case, I think your original statement is very misleading (suggesting as it does that OP/CG funded, and actively chose to fund, Mechanize) and you should probably edit it. It doesn’t seem material to the point you were trying to make anyway—it seems enough to argue that Mechanize had used bad arguments in the past, regardless of the purpose for (allegedly) doing so.
I absolutely don’t want it to sound like OpenPhil knowingly funded Mechanize; that’s not correct, and I will edit. But I do assign high probability that OpenPhil funds eventually ended up materially supporting early work on their startup through their salaries. It seems they were researching the theoretical impacts of job automation at Epoch AI, but I couldn’t find anything showing they were directly doing job-automation research. That would look like trying to build a benchmark of human knowledge labor for AI, but I couldn’t find information that they were definitely doing something like this. This is from their time at Epoch AI discussing job automation: https://epochai.substack.com/p/most-ai-value-will-come-from-broad
I still think OpenPhil is a little on the hook for funding organizations that largely produce benchmarks of underexplored AI capability frontiers (FrontierMath). Identifying niches in which humans outperform AI and then designing a benchmark is going to accelerate capabilities and make the labs hill-climb on those evals.
gradient descent analogy breaks down because GD has fine-grained parameter control unlike evolution’s coarse genomic selection
This just misses the point entirely. The problem is that your value function is under-specified by your training data, because the space of possible value functions is large, and relying on learned inductive biases probably doesn’t save you, because value functions are a different kind of thing from predictive models. Whether or not you’re adjusting parameters individually doesn’t change this at all.
That’s because we aren’t in the superintelligent regime yet.
Some of the arguments are very reasonable, particularly the evolution vs. gradient descent point. Some I disagree with. Ege has posted here, so it really should have been crossposted; probably the only reason they haven’t is that they think it’ll be negatively received.
Broke it down: Mechanize Work’s essay on Unfalsifiable Doom — LessWrong
Epoch AI has a map of frontier AI datacenters: https://epoch.ai/data/data-centers/satellite-explorer
Ah, people are finally making preparations to fight the AI.
Here’s an argument for why current alignment methods like RLHF are already much better than what evolution can do.
Evolution has to encode information about the human brain’s reward function using just ~1 GB of genetic information, which means it might be relying on a lot of simple heuristics that don’t generalize well, like “sweet foods are good”.
In contrast, RLHF reward models are built from LLMs with around 25B[1] parameters, which is ~100 GB of information. The capacity of these reward models to encode complex human values may therefore already be much larger than that of the human genome (by ~2 orders of magnitude), and this advantage will probably increase in the future as models get larger (a rough back-of-the-envelope version of this comparison is sketched below).
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
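Here is the back-of-the-envelope calculation behind those numbers. The bytes-per-base-pair and bytes-per-parameter figures are my assumptions (2 bits per base pair, fp32 parameters), not numbers from the comment:

```python
# Back-of-the-envelope comparison of raw information capacity:
# human genome vs. a ~25B-parameter RLHF reward model.
# Assumptions: 2 bits per DNA base pair, 32-bit (4-byte) floats per parameter.

GENOME_BASE_PAIRS = 3.1e9          # approximate size of the human genome
BITS_PER_BASE_PAIR = 2             # A/C/G/T -> 2 bits
genome_bytes = GENOME_BASE_PAIRS * BITS_PER_BASE_PAIR / 8

REWARD_MODEL_PARAMS = 25e9         # ~25B-parameter reward model
BYTES_PER_PARAM = 4                # fp32; fp16 would halve this
reward_model_bytes = REWARD_MODEL_PARAMS * BYTES_PER_PARAM

print(f"Genome:       ~{genome_bytes / 1e9:.1f} GB")
print(f"Reward model: ~{reward_model_bytes / 1e9:.0f} GB")
print(f"Ratio:        ~{reward_model_bytes / genome_bytes:.0f}x")
```

This gives roughly 0.8 GB for the genome and 100 GB for the reward model, i.e., about two orders of magnitude, consistent with the claim above.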
A recent essay called “Keep the Future Human” made a compelling case for avoiding building AGI in the near future and building tool AI instead.
The main point of the essay is that AGI is the intersection of three key capabilities:
High autonomy
High generality
High intelligence
It argues that these three capabilities are dangerous when combined, pose unacceptable risks to the job market and to human culture, and would replace rather than augment humans. Instead of building AGI, the essay recommends building powerful but controllable tool-like AI that has only one or two of the three capabilities. For example:
Driverless cars are autonomous and intelligent but not general.
AlphaFold is intelligent but not autonomous or general.
GPT-4 is intelligent and general but not autonomous.
It also recommends compute limits to cap the overall capability of AIs.
Link: https://keepthefuturehuman.ai/
I haven’t heard anything about RULER on LessWrong yet:
RULER (Relative Universal LLM-Elicited Rewards) eliminates the need for hand-crafted reward functions by using an LLM-as-judge to automatically score agent trajectories. Simply define your task in the system prompt, and RULER handles the rest—no labeled data, expert feedback, or reward engineering required.
✨ Key Benefits:
2-3x faster development—Skip reward function engineering entirely
General-purpose—Works across any task without modification
Strong performance—Matches or exceeds hand-crafted rewards in 3⁄4 benchmarks
Easy integration—Drop-in replacement for manual reward functions
Apparently it allows LLM agents to learn from experience and significantly improves reliability.
Link: https://github.com/OpenPipe/ART
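To make the LLM-as-judge idea concrete, here is a minimal sketch of my own. This is not the actual ART/RULER API; the model name, prompts, task, and trajectory strings are illustrative assumptions, and it assumes the judge replies with bare JSON:

```python
# Minimal sketch of the LLM-as-judge idea behind RULER (NOT the actual ART API):
# score a batch of agent trajectories relative to each other with a single LLM call.
# Model name, prompts, and trajectories below are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_trajectories(task_description: str, trajectories: list[str]) -> list[float]:
    """Ask an LLM judge for a 0-1 score per trajectory; higher means better at the task."""
    numbered = "\n\n".join(f"Trajectory {i}:\n{t}" for i, t in enumerate(trajectories))
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "You judge how well agent trajectories accomplish a task. "
                "Reply with a JSON list of scores between 0 and 1, one per trajectory."
            )},
            {"role": "user", "content": f"Task: {task_description}\n\n{numbered}"},
        ],
    )
    # Assumes the judge returns bare JSON like [0.2, 0.9]; a real system would validate this.
    return json.loads(response.choices[0].message.content)

# Hypothetical usage: relative scores like these would stand in for a hand-crafted
# reward function when ranking rollouts for RL fine-tuning.
scores = score_trajectories(
    "Book the cheapest direct flight from Boston to Denver.",
    ["...agent rollout 1...", "...agent rollout 2..."],
)
print(scores)
```

The key design point is that the judge only has to rank trajectories relative to each other for a task stated in natural language, which is why no labeled data or reward engineering is needed.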
After reading The Adolescence of Technology by Dario Amodei, I now think that Amodei is one of the most realistic and calibrated AI leaders on the future of AI.
fyi habryka crossposted that post from Dario Amodei here on LessWrong for discussion. (Commenting this to avoid a fragmented discussion.)
If by “AI leaders” you mean “AI company leaders” I’d agree!
The new book Introduction to AI Safety, Ethics and Society by Dan Hendrycks is on Spotify as an audiobook if you want to listen to it.
What books do you want to read in 2023?
The Scout Mindset
The Age of Em
The Beginning of Infinity
The Selfish Gene (2nd time)