But frontier labs are deliberately working on making LLMs more agentic. Why wouldn’t they—AI that can do work autonomously is more economically valuable than a chatbot.
Another suggestion: https://cybench.github.io/
https://x.com/alexwei_/status/1946477742855532918
I believe this qualifies as “technical capability existing by end of 2025”.
For example, did any of the examples derive their improvement by some way other than chewing through bits of algebraicness?
I don’t think so.
https://arxiv.org/pdf/2506.13131
What did the system invent?
Example: matrix multiplication using fewer multiplication operations.
There were also combinatorics problems, “packing” problems (like multiple hexagons inside a bigger hexagon), and others. All of that is in the paper.
Also, “This automated approach enables AlphaEvolve to discover a heuristic that yields an average 23% kernel speedup across all kernels over the existing expert-designed heuristic, and a corresponding 1% reduction in Gemini’s overall training time.”
How did the system work?
It’s essentially an evolutionary/genetic algorithm, with LLMs providing “mutations” for the code. Then the code is automatically evaluated, bad solutions are discarded, and good solutions are kept.
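Roughly, as a minimal sketch of that loop (the function names below are placeholders standing in for the LLM call and the automatic evaluator, not AlphaEvolve’s actual interface):

```python
# Toy version of the loop described above. llm_propose_mutation and evaluate are
# hypothetical placeholders, not AlphaEvolve's real API.
import random

def evolve(seed_program, llm_propose_mutation, evaluate, generations=100, population_size=20):
    population = [(evaluate(seed_program), seed_program)]
    for _ in range(generations):
        _, parent = random.choice(population)         # pick an existing candidate
        child = llm_propose_mutation(parent)          # LLM rewrites part of the code ("mutation")
        population.append((evaluate(child), child))   # score it automatically
        # keep only the best candidates, discard the rest
        population.sort(key=lambda p: p[0], reverse=True)
        population = population[:population_size]
    return population[0]  # (best_score, best_program)
```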
What makes you think it’s novel?
These solutions weren’t previously discovered by humans. Unless the authors just couldn’t find the right references, of course, but I assume the authors were diligent.
Would it have worked without the LLM?
You mean, “could humans have discovered them, given enough time and effort?”. Yes, most likely.
I’m surprised to see zero mentions of AlphaEvolve. AlphaEvolve generated novel solutions to math problems, “novel” in the “there are no records of any human ever proposing those specific solutions” sense. Of course, LLMs didn’t generate them unprompted, humans had to do a lot of scaffolding. And it was for problems where it’s easy to verify that the solution is correct; “low messiness” problems if you will. Still, this means that LLMs can generate novel solutions, which seems like a crux for “Can we get to AGI just by incrementally improving LLMs?”.
Sounds like you could benefit either from Easy Days (available natively in the newer versions of Anki) or from Advance/Postpone from the FSRS Helper add-on.
10x more training compute = 5x greater task length (kind of)
Let’s look at another “LLMs lack true understanding” paper
https://www.virologytest.ai/
This benchmark has human expert percentiles, which makes it very convenient for exactly the kind of stuff you are doing (though I decided to calculate SDs as a function of release date rather than compute, just because it’s more intuitive).
I wrote down SOTA models, their release dates, and performance:
| Model | Release date | Normalized date (days since GPT-4 Turbo) | Accuracy | Expert percentile | z-score |
|---|---|---|---|---|---|
| GPT-4 Turbo | 2023-06-01 | 0 | 16.8% | 43% | -0.18 |
| Gemini 1.5 Pro | 2024-02-15 | 259 | 25.4% | 61% | 0.28 |
| Sonnet 3.5 | 2024-06-20 | 385 | 26.9% | 69% | 0.50 |
| Sonnet 3.5 v2 | 2024-10-22 | 509 | 33.6% | 75% | 0.67 |
| o1 | 2024-12-05 | 553 | 35.4% | 89% | 1.23 |
| o3 | 2025-04-16 | 685 | 43.8% | 94% | 1.55 |

Z-scores are based on expert percentiles. This gives roughly 0.90 SD/year for LLMs. So we should expect an LLM as good as a +6 SD human virology expert around 2030.
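For reference, a minimal sketch of how those numbers can be reproduced (percentile → z-score via the inverse normal CDF, then a linear fit of z-score against release date):

```python
# Sketch of the calculation: convert expert percentiles to z-scores, then fit a line
# of z-score vs. release date to get SD/year.
import numpy as np
from scipy.stats import norm

days = np.array([0, 259, 385, 509, 553, 685])                # days since GPT-4 Turbo
percentiles = np.array([0.43, 0.61, 0.69, 0.75, 0.89, 0.94])

z = norm.ppf(percentiles)                                    # -0.18, 0.28, 0.50, 0.67, 1.23, 1.55
slope_per_day, intercept = np.polyfit(days, z, 1)
print(slope_per_day * 365.25)                                # ~0.90 SD/year

# Extrapolate to a +6 SD expert
years_to_6sd = (6 - intercept) / slope_per_day / 365.25
print(years_to_6sd)                                          # ~7 years after 2023-06-01, i.e. ~2030
```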
I wish more benchmarks had human percentiles.
I’m curious why it seems better to you.
Because it’s not rewarding AI’s outward behavior. Any technique that just rewards the outward behavior is doomed once we get to AIs capable of scheming and deception. Self-other overlap may still be doomed in some other way, though.
It might choose to go along with its initial behavioral and ethical habits, or it might choose to deliberately undo the effects of the self-other overlap training once it is reflective and largely rational and able to make decisions about what goals/values to follow
That seems like a fully general argument that aligning a self-modifying superintelligence is impossible.
I imagine you will like the paper on Self-Other Overlap. To me this seems like a much better approach than, say, Constitutional AI. Not because of what it has already demonstrated, but because it’s a step in the right direction.
In that paper, instead of just rewarding the AI for producing similar text whether the prompt is about the AI itself or about someone else, the authors tinkered with the model’s internal activations so that the AI actually represents itself and others similarly. Of course, there is the “if I ask AI to make me a sandwich, I don’t want AI to make itself a sandwich” concern if you push this technique too far, but still. If you ask me, “What will an actual working solution to alignment look like?” I’d say it will look a lot less like Constitutional AI and a lot more like Self-Other Overlap.
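As a rough illustration of the idea (my own simplification, assuming a HuggingFace-style model; `self_other_overlap_loss` and `lambda_soo` are made-up names, not the paper’s code):

```python
# Simplified sketch: add a loss term that pulls the model's internal activations for a
# "self" prompt and a matched "other" prompt toward each other.
import torch
import torch.nn.functional as F

def self_other_overlap_loss(model, self_prompt_ids, other_prompt_ids):
    # Hidden states when the prompt refers to the model itself vs. to someone else
    # (the two prompts are assumed to be tokenized to the same length)
    h_self = model(self_prompt_ids, output_hidden_states=True).hidden_states[-1]
    h_other = model(other_prompt_ids, output_hidden_states=True).hidden_states[-1]
    return F.mse_loss(h_self, h_other)

# During fine-tuning this would be added to the usual objective, e.g.:
# total_loss = task_loss + lambda_soo * self_other_overlap_loss(model, self_ids, other_ids)
```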
Smarter Models Lie Less
Agree. I would prefer less “this guy said a thing on x.com” and more news, statistics and technical reports.
Yes, I’ve seen that benchmark (I mean, I literally linked to it in my comment) and the video.
Regarding geobench specifically: the main leaderboard on that benchmark is essentially NMPZ (No Moving, Panning or Zooming). Gemini 2.5 Pro achieves an average score of 4085. That’s certainly really good for NMPZ, but I don’t think that’s Rainbolt-tier. Rainbolt-tier is more like 4700-4800, if we want an LLM that has average-case performance equal to Rainbolt’s best-case performance.
Also, LLMs can’t do the “guess the country solely by pavement” thing like he can, so there’s room for improvement.
6. “Frontier model” means either of the following: (a) an artificial intelligence model trained using greater than 10^26 computational operations (e.g., integer or floating-point operations), the compute cost of which exceeds one hundred million dollars.
This reads like “100 million dollars AND 10^26 FLOPS” instead of OR. So 99 million and 10^26 FLOPS = no regulations, I think. Might be a loophole that gets easier to exploit as hardware becomes cheaper.
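To spell out the two readings (a hypothetical check, just to make the loophole concrete):

```python
# Hypothetical illustration of the AND vs. OR reading of the definition
flops, cost_usd = 1.1e26, 99e6                        # above the compute threshold, below the cost threshold

covered_if_and = flops > 1e26 and cost_usd > 100e6    # False -> not a "frontier model"
covered_if_or  = flops > 1e26 or  cost_usd > 100e6    # True  -> covered
```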
I’m looking at this: https://legiscan.com/NY/text/S06953/2025
I guess it was amended very quickly?
Do you plan on updating the graph every 6-12 months? It doesn’t have to be a new paper every time, obviously. Just having the graph on metr.org and regularly updating it would be very useful.
EDIT: https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
Idk if this is new or if I just somehow missed this page.
Why should I imagine that AGI would have that ability?
Modern LLMs are already like that. They have expert or at least above-average knowledge in many domains simultaneously. They may not have developed “magical” abilities yet, but “AI that has lots of knowledge from a vast number of different domains” is something that we already see. So I think “AI that has more than one magical ability” is a pretty straightforward extrapolation.
Btw, I think it’s possible that even before AGI, LLMs will have at least 2 “magical” abilities. They’re getting better at Geoguessr, so we could have a Rainbolt-level LLM in a few years; this seems like the most likely first “magical” ability IMO.
Superhuman forecasting could be the next one, especially once LLMs become good at finding relevant news articles in real time.
Identifying book authors from a single paragraph with 99% accuracy seems like something LLMs will be able to do (or maybe even already can), though I can’t find a benchmark for that.
Accurately guessing age from a short voice sample is something that machine learning algorithms can do, so with enough training data, LLMs could probably do it too.
Let me put it another way—do you expect that “LLMs do not optimize for a goal” will still be a valid objection in 2030? If yes, then I guess we have a very different idea of how progress will go.