Long-time lurker (c. 2013), recent poster. I also write on the EA Forum. Cunningham’s law is my friend.
For my own reference: some “benchmarks” (very broadly construed) I pay attention to.
gustaf’s one seems fine.
Don’t think there is one single Less Wrongist philosophy of math, but the infinite set atheism tag might be of interest; it links to 2009!Yudkowsky’s guess that “I suspect that an AI does not have to reason about large infinities, or possibly any infinities at all, in order to deal with reality”.
Where does mathematics go? How is it that human mental mathematical abilities and the related “manipulations of symbols” can be used to achieve things in the world? Cf The Unreasonable Effectiveness of Mathematics.
What do you think of Eric S. Raymond’s take?
Where does mathematics come from? On the surface, human mathematical activity seems quite different from what I imagine apes would be doing on the savanna.
Same question, what do you think of this take?
I think plenty of people intrinsically enjoy having power over others, and the ability to lord that power over them. It doesn’t end at adoration and respect: you can get a meaningfully different kick out of terrorizing other people.
To add to this, I know many such people in my circles of acquaintance, especially the ones who won postsecondary scholarships to study abroad via multi-stage interviews selecting for “leadership potential” and such. They absolutely relish crushing others’ wills; they brag about being “cold-blooded killers” in corporate settings and amateur sports competitions; etc.
Slopolis reminds me of @titotal’s Slopworld 2035: The dangers of mediocre AI from last year, even if it’s not quite the same thing.
Agree, it’s the thing Scholze seems to do spectacularly well for instance. Not sure why I wrote that, thanks for the callout.
The greatest praise that Paul Erdős, perhaps the most prolific mathematician who ever lived, could give a proof was to proclaim it “straight from The Book”.
GPT-5.4 Pro one-shot the solution to Erdős Problem #1196 in 80 minutes (plus “another 30 ish mins to convert the solution to a latex math paper”). Math Inc later formalised it in Lean:
Jared Lichtman, a world expert on this problem, proclaimed GPT-5.4 Pro’s solution to be from The Book, perhaps the first AI-generated proof to earn that label. I certainly haven’t seen anything like it; nothing from Gavin’s thread compares, for instance. There are big-shot names below: James Maynard is a recent Fields medalist at the peak of his powers, Jacob Fox is arguably a Fields-level combinatorialist, etc:
Jared later wrote:
In my doctorate, I proved the Erdős Primitive Set Conjecture, showing that the primes themselves are maximal among all primitive sets.
This problem will always be in my heart: I worked on it for 4 years (even when my mentors recommended against it!) and loved every minute of it.
[Primitive sets are a vast generalization of the prime numbers: A set S is called primitive if no number in S divides another.]
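[As a gloss on that bracketed definition: primitivity is a one-liner to state formally. Here is a Mathlib-style sketch of mine, not Math Inc’s actual formalisation:]

```lean
import Mathlib.Data.Set.Basic

-- Illustrative only: one plausible Lean 4 / Mathlib-style statement of
-- primitivity (not Math Inc's actual formalisation of Erdős #1196).
def IsPrimitive (S : Set ℕ) : Prop :=
  ∀ a ∈ S, ∀ b ∈ S, a ≠ b → ¬ a ∣ b
```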
Now Erdős#1196 is an asymptotic version of Erdős’ conjecture, for primitive sets of “large” numbers.
It was posed in 1966 by the Hungarian legends Paul Erdős, András Sárközy, and Endre Szemerédi.
I’d been working on it for many years, and consulted/badgered many experts about it, including my mentors Carl Pomerance and James Maynard.
The proof produced by GPT5.4 Pro was quite surprising, since it rejected the “gambit” that was implicit in all works on the subject since Erdős’ original 1935 paper. The idea to pass from analysis to probability was so natural & tempting from a human-conceptual point of view that it obscured a technical possibility to retain (efficient, yet counter-intuitive) analytic terminology throughout, by use of the von Mangoldt function \Lambda(n).
The closest analogy I would give would be that the main openings in chess were well-studied, but AI discovers a new opening line that had been overlooked based on human aesthetics and convention.
In fact, the von Mangoldt function itself is celebrated for its connection to primes and the Riemann zeta function—but its piecewise definition appears odd and unmotivated to students seeing it for the first time. By the same token, in Erdős#1196, the von Mangoldt weights seem odd and unmotivated but turn out to cleverly encode a fundamental identity \sum_{q|n}\Lambda(q) = \log n, which is equivalent to the unique factorization of n into primes. This is the exact trick that breaks the analytic issues arising in the “usual opening”.
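[Aside, since this identity is the crux: it really is unique factorization in disguise. Write n = \prod_i p_i^{a_i} and recall that \Lambda(d) = \log p when d = p^k is a prime power and 0 otherwise, so the only divisors of n with nonzero weight are the prime powers p_i^k with k ≤ a_i:]

```latex
\sum_{d \mid n} \Lambda(d)
  = \sum_{i} \sum_{k=1}^{a_i} \Lambda\bigl(p_i^{k}\bigr)
  = \sum_{i} a_i \log p_i
  = \log \prod_{i} p_i^{a_i}
  = \log n .
```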
Moreover, Terry Tao has long suspected that the applications of probability to number theory are unnecessarily complicated and this “trick” might actually clarify the general theory, which would have a broader impact than solving a single conjecture.
Although he still wouldn’t call it “original”, rather “clever”
Sure, whatever. (ETA: not sure why I wrote this. Garrett’s response is obviously correct)
Terry Tao seemed pretty excited about this, judging by his ~14 comments in the thread. Here he said:
In any case, I would indeed say that this is a situation in which the AI-generated paper inadvertently highlighted a tighter connection between two areas of mathematics (in this case, the anatomy of integers and the theory of Markov processes) than had previously been made explicit in the literature (though there were hints and precursors scattered therein which one can see in retrospect). That would be a meaningful contribution to the anatomy of integers that goes well beyond the solution of this particular Erdos problem.
r/math has for the longest time been almost constitutionally allergic to frontier AI x math progress updates, so the chatter there felt like quite the vibe shift in one of the last bastions of capabilities skepticism. Joshua Zelinsky, whom I used to follow on math Quora back in the day, summarised it like so:
Four things to note: First, #1196 is a decently well known problem. It wasn’t at Erdős–Straus levels of fame, but it is well known enough that I was familiar with it. Second, this is not a problem where no one had worked on it; there was a lot of prior work on it and closely related problems. Third, this is not an example where the AI made small modifications to things in the literature or recognized that large parts of the problem were in an obscure paper. The approach the AI used goes in a largely different direction than the literature on this problem did. Fourth, and closely related to three, this proof looks like parts of it will inspire subsequent proofs, because it really is going in a different direction which now looks likely to be a productive line of investigation for similar problems.
I am not fond of putting words like “stunning” in titles which can be very clickbaity and feels like a hype word, but this really is in the direction where the word isn’t unreasonable even if I myself would not go so far as to use it here.
These remarks by LANA core team member Yuichiro Hoshi seem relevant:
I joined the LANA Project in 2024 as one of the core members and have been working with the other members ever since. Looking back, I feel that this year and more in the project has been a period in which I have continuously confronted the question of how inter-universal Teichmuller theory should be explained.
The role that I myself have undertaken in the LANA project has been to explain the content of inter-universal Teichmuller theory to the members of the project as comprehensively as possible. On particularly important points, I have tried to give explanations as carefully as possible and with as much time as necessary, so that we could share not merely an outline-level understanding but also the details of the logic. In particular, one major goal toward which I have been working has been to establish, among the members of the project, an understanding of the logic by which Corollary 3.12 is derived from Theorem 3.11 of the third of the inter-universal Teichmuller theory papers published in 2021.
To that end, we have spent a great deal of time holding discussions, repeating explanations, and organizing the issues from as many angles as possible. As a result, through these efforts over the past year and more, I am confident that the members’ understanding of inter-universal Teichmuller theory has certainly advanced greatly. At the very least, I do not think that there are many people outside of those directly involved with inter-universal Teichmuller theory who have tried to understand it this seriously and worked on it this persistently. In that sense, I feel that this project is of great significance, and, moreover, I am very happy to be participating in such a project.
At the same time, however, I also believe, regrettably, that I have not yet fully fulfilled my role. In particular, with regard to the logic by which Corollary 3.12 is derived from Theorem 3.11, I must take seriously the fact that many members of the LANA Project still feel that there is some insurmountable wall there. I also keenly feel my own shortcomings in the face of this situation.
Nevertheless, I do not think at all that the efforts we have made over the past year and more have been meaningless. On the contrary, I think that precisely because we have thought so seriously, discussed so extensively, and tried so hard to understand matters up to this point, it has become far clearer than before where the true difficulty really lies. In that sense as well, I believe that the efforts we have made on this project have definite value and significance.
How Corollary 3.12 follows from Theorem 3.11 is where everyone outside of Mochizuki’s circle gets stuck (except Kirti Joshi I suppose, whom Mochizuki completely dis-endorses). The LANA project (“Lean for ANAbelian geometry”) is what Mochizuki’s note above is about.
Not yet; here’s what Mochizuki said on that:
Yeah my impression is verifying human proof-to-Lean code translation faithfulness is pretty resistant to automation.
New Mochizuki lore just dropped (to be clear this wasn’t unanticipated):
OpenClaw adoption incentives:
Afra: …In China, [OpenClaw’s] popularity has far outstripped anything seen in the US. Tencent and Baidu organized OpenClaw configuration workshops in Shenzhen that drew retirees and students alike; developer meetups in Beijing sold out instantly; China’s daily token consumption has surpassed 140 trillion — more than a thousandfold increase from the 100 billion daily tokens of early 2024. According to cybersecurity firm SecurityScorecard, China has now overtaken the US in OpenClaw adoption. So how did this happen?
Du Lei: …There’s a structural reason for this. In Silicon Valley, OpenClaw’s arrival didn’t produce any cognitive shock — because it emerged from a complete evolutionary sequence. We’d already had Devin, various agent frameworks, and a step-by-step progression of Claude and other tools steadily advancing in capability. OpenClaw felt like a natural next layer — a more flexible agentic glue. Most of what it enables, people in the Valley had already approximated six months earlier through other means. Exciting, yes. Directionally significant, yes. But perceived as an organic next step within a familiar lineage.
In China, that entire year of evolution had essentially not happened. The domestic conversation had stayed at the level of “how do I use AI to break down a PowerPoint” — consumer entertainment tracks like image and video generation, extremely sophisticated in their own right, but irrelevant to anyone who needed AI to actually assist with work, anyone who needed a genuine intelligent agent.
So when OpenClaw arrived in China, it compressed an entire year of product evolution into a single moment. The final exam answers appeared on the table overnight. OpenClaw settled the debt that the domestic AI productivity track had been accumulating for a year — all at once, in one decisive strike.
Hua Han: On the supply side: every model company has enormous incentive to accelerate this. AI capability has climbed in distinct steps — first ChatGPT-style single-turn conversation; then reasoning models, where thinking time extended but the interaction remained fundamentally one-shot; then Claude Code and open computer use, where single interaction is no longer the design target and multi-turn, long-horizon task collaboration becomes the paradigm, with token consumption rising exponentially. For model companies, this is an enormous business — their entire imperative is to sell tokens.
Shenzhen is home to vast numbers of hardware and model companies. Every laptop or phone that ships with an OpenClaw client pre-installed becomes a persistent token consumption entry point. Token-maxxing, from a pure business logic standpoint, is entirely rational for every model vendor.
Afra: I have my own theory as well. A lot of local governments in China are desperately hungry for new tech narratives right now. Against the backdrop of economic slowdown and high youth unemployment, keywords like “digital nomad” and “open source” have become the latest in a string of buzzwords that local governments race to adopt. From Liangzhu and Anji in Hangzhou to Huangshan in Anhui, cities and towns are competing to attract AI digital nomads. Local governments are anxiously latching onto each new narrative, trying to shore up their talent pipelines and fiscal resources.
Hua Han: From the user side too, the appeal is real. OpenClaw is a genuine productivity tool. Chinese corporate culture — whether in state enterprises or major internet companies — is notoriously process-heavy. Even if you’re already cycling through DingTalk and Feishu (China’s Slack) all day, a tool that can actually automate your routine work has authentic pull for ordinary users.
Manufacturing x AI:
Afra: There’s one more thread I’ve been following closely: the intersection of China’s vast manufacturing base with AI, and why that combination is particularly potent here. Du Lei walked me through a great example earlier: a Chinese luggage company that sources whole hides of cowhide for every production run. Natural leather is inherently uneven — some areas are uniform enough for the exterior face of a handbag, others have flaws and can only be used for lining. Traditionally, experienced craftsmen would eyeball each hide and sketch out a cutting pattern by hand, trying to maximize the number of usable pieces while minimizing waste.
The problem is that the optimal layout for leather is actually a complex two-dimensional combinatorial puzzle: an irregular curved surface onto which you need to fit dozens of pieces of varying shapes, where the orientation and placement of each piece affects the final yield. After introducing AI, the factory uses image recognition to scan each hide’s texture and contour, then runs an optimization algorithm to compute the optimal cutting path in real time.
This kind of application is simply invisible to Silicon Valley — because Silicon Valley doesn’t have the leather manufacturing substrate. The problem domain doesn’t exist there.
Du Lei: And the significance extends beyond leather. China has an enormous manufacturing base, and every industry within it carries the same accumulated store of “master craftsman’s intuition” waiting to be translated into AI. Steel heat treatment parameter optimization, food factory quality inspection, textile color consistency control — the pattern repeats across sectors. What these problems share is that they used to require expensive bespoke algorithm development. Now a single person equipped with the right AI tools can tackle them independently. These opportunities will keep surfacing across Chinese manufacturing — and China is exceptionally well-positioned to capture them.
Afra: I’ve been watching Bambu Lab along exactly these lines — they’ve deeply integrated AI image recognition into their 3D printers to monitor the print head for clogs in real time, automatically distinguishing normal operation from the dreaded “spaghetti” failure mode where the print spirals out of control, and halting immediately to prevent further material waste. For the 3D printing industry, it’s a meaningful leap.
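Tangent: the cutting-layout problem Du Lei describes is easy to underestimate. Here’s a toy sketch of the crudest end of that spectrum, a greedy “shelf” packer for axis-aligned rectangles; real nesting systems handle irregular contours, rotations, and defect maps on scanned hides, which is far harder. All numbers below are made up:

```python
# Toy stand-in for the cutting-layout problem: greedy "shelf" packing.
# Sort rectangular pieces tallest-first, fill rows left to right on one sheet.
def shelf_pack(pieces, sheet_w, sheet_h):
    """pieces: list of (w, h). Returns placements as (x, y, w, h) tuples."""
    placed, x, y, row_h = [], 0.0, 0.0, 0.0
    for w, h in sorted(pieces, key=lambda p: -p[1]):
        if x + w > sheet_w:                  # row full: start a new shelf
            x, y, row_h = 0.0, y + row_h, 0.0
        if y + h > sheet_h or w > sheet_w:   # piece doesn't fit on this sheet
            continue
        placed.append((x, y, w, h))
        x += w
        row_h = max(row_h, h)
    return placed

pieces = [(30, 20), (25, 15), (40, 10), (20, 20), (35, 12)]
layout = shelf_pack(pieces, sheet_w=60, sheet_h=40)
used = sum(w * h for _, _, w, h in layout)
print(f"placed {len(layout)}/{len(pieces)} pieces, yield {used / (60 * 40):.0%}")
```

Everything that separates this from a production system (irregular shapes, grain direction, flaw avoidance) is exactly the “master craftsman’s intuition” being encoded.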
From Afra Wang’s China OS vs. America OS (2026 version), “bloodline politics”:
Du Lei: It’s a phenomenon anyone operating in the field can observe directly. In Western terms, you’d call it “bloodline politics” — your academic lineage, your school, your work history directly determine which faction you belong to and who you end up founding companies with. It’s an awkward dynamic for the industry as a whole, because resources end up concentrating heavily at the top. But honestly, Silicon Valley is much the same — are you a Thiel Fellow? Did you come out of Yao’s class in Tsinghua? There’s no fundamental difference.
Afra: Right, Silicon Valley’s own bloodline politics branches outward from the authors of a handful of foundational papers — who studied under whom, who worked at which lab — and that genealogy largely determines whether you can raise money and get your hands on compute.
Du Lei: Let me trace this pureblood lineage for a moment, because it’s actually a single tree. The roots were planted by Hinton’s generation, who laid the foundations of deep learning. The 2017 Google Brain paper “Attention Is All You Need” was a critical fork — nearly every one of its eight authors went on to become the seed of a company or a major research team. Add in the DeepMind lineage — itself deeply connected to Hinton — and you’ve essentially accounted for the entire global population of people who have ever had the hands-on feel of training at ten-thousand-GPU scale. When you add it all up, it might be a few hundred people, most of whom have direct mentor-student or former-colleague relationships with each other.
… There’s a fun phrase in Chinese tech circles: training large models is like liandan (炼丹) — “alchemy,” the ancient Chinese art of refining elixirs. The difference in intuitive feel between someone who has actually trained a model on tens of thousands of GPUs and someone who hasn’t is immense. When a company is deciding whether to proceed with a training run that will burn twenty million dollars, that judgment is almost never a scientific question — it’s closer to a craft or a certain sensibility. And since only a small cohort has been trained that way, they are the ones who keep attracting disproportionate resources everywhere they go, which only deepens the structure further.
(aside: this all reads very Claude-y, as someone who’s had to read more Claude prose than I’d like)
You should also put ever-decreasing credence in reported time horizons, cf Ryan’s post
The old METR time horizon benchmark has mostly saturated when it comes to measuring 50%-reliability time-horizon (as in, scores are sufficiently high this measurement is unreliable), but at 80% reliability the best publicly deployed models are at a bit over an hour while I expect the best internal models are reaching a bit below 2 hours. I expect that increasingly this 80%-reliability score is dominated by relatively niche tasks that don’t centrally reflect automating software engineering or AI R&D. Further, the time horizon measurement is increasingly sensitive to the task distribution.
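For context on why the 80% horizon is always much shorter than the 50% one: the horizon is read off a fitted success-vs-task-length curve, and because that curve is shallow, demanding higher reliability costs a lot of task length. A minimal sketch of that calculation with made-up numbers (my reconstruction of the METR-style fit, not their code):

```python
# Fit P(success) as a logistic function of log2(task length), then invert the
# fit at a chosen reliability level to get the "time horizon" at that level.
import numpy as np
from scipy.optimize import curve_fit

def logistic(log_minutes, a, b):
    return 1.0 / (1.0 + np.exp(-(a - b * log_minutes)))

# Hypothetical data: (human task length in minutes, observed model success rate)
lengths = np.array([1.0, 4.0, 15.0, 60.0, 240.0, 960.0])
success = np.array([0.98, 0.95, 0.85, 0.62, 0.35, 0.12])

(a, b), _ = curve_fit(logistic, np.log2(lengths), success, p0=(3.0, 1.0))

def horizon(p):
    # log2(minutes) at which the fitted success probability equals p
    return 2 ** ((a - np.log(p / (1 - p))) / b)

print(f"50%-reliability horizon: {horizon(0.5):.0f} min")
print(f"80%-reliability horizon: {horizon(0.8):.0f} min")  # always much shorter
```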
if you are not currently at the cutting edge and actively advancing your field, why follow new research at all?
I like Eric Drexler’s answer (how to understand everything and why, how to learn about everything)
To avoid blunders and absurdities, to recognize cross-disciplinary opportunities, and to make sense of new ideas, requires knowledge of at least the outlines of every field that might be relevant to the topics of interest. By knowing the outlines of a field, I mean knowing the answers, to some reasonable approximation, to questions like these:
What are the physical phenomena?
What causes them?
What are their magnitudes?
When might they be important?
How well are they understood?
How well can they be modeled?
What do they make possible?
What do they forbid?
And even more fundamental than these are questions of knowledge about knowledge:
What is known today?
What are the gaps in what I know?
When would I need to know more to solve a problem?
How could I find what I need?
It takes far less knowledge to recognize a problem than to solve it, yet in key respects, that bit of knowledge is more important: With recognition, a problem may be avoided, or solved, or an idea abandoned. Without recognition, a hidden problem may invalidate the labor of an hour, or a lifetime. Lack of a little knowledge can be a dangerous thing.
and the how:
Read and skim journals and textbooks that (at the moment) you only half understand. Include Science and Nature.
Don’t halt, dig a hole, and study a particular subject as if you had to pass a test on it.
Don’t avoid a subject because it seems beyond you — instead, read other half-understandable journals and textbooks to absorb more vocabulary, perspective, and context, then circle back.
Notice that concepts make more sense when you revisit a topic.
Notice which topics link in all directions, and provide keys to many others. Consider taking a class.
Continue until almost everything you encounter in Science and Nature makes sense as a contribution to a field you know something about.
Also you seem to conflate “follow new research” with “having to follow all new research”, which is totally different. The former can be guided by things like Holden’s learning by writing or Sarah’s factposting, the latter is part of the job of a professional frontier-pusher.
The recent METR Research note: We spent 2 hours working in the future (quick take) gave a neat visual for this:
Figure 3: A future project might take ~42 days of wall-clock time, with ~8 hours of agent work (not counting running the evals) and 1000 serial hours of human IC work, evals execution, and review.
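One thing that makes those numbers hang together (my arithmetic, on the reading that the serial human hours plus agent hours tile the whole wall-clock window):

```python
print(42 * 24)   # 1008 wall-clock hours in ~42 days
print(1000 + 8)  # 1008: serial human hours + agent hours, i.e. humans remain the bottleneck
```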
From experience I still suspect it will turn into a proxy in practice once you actually implement the scalar virtuosity score and fine-tune LLMs against it, but instead of belaboring my skepticism I’ll wait for your preliminary results. I’d be keen to see something roughly as thorough as this, though that’s probably a bit much if you’re the sole person working on it.
Tom White has this non-exhaustive list of telltale tics of LLM writing:
The “It’s not X, it’s Y” Antithesis
The most common tell. A fake profundity wrapped in a neat contrast: “We’re not a company, we’re a movement.” “It’s not just a tool, it’s a journey.” Humans use this sparingly; AI uses it compulsively.
The Punchline Em-Dash
Every section feels like it’s waiting for a big reveal—until the reveal is obvious or hollow.
The Three-Item List
AI loves the rhythm of threes: “clarity, precision, and impact.” It’s a pattern baked deep into training data and reinforced in feedback.
Mirrored Metaphors & Faux Gravitas
“We don’t chase trends — trends chase us.” They sound like aphorisms, but they’re cosplay; form without experience.
Adverbial Bloat
“Importantly,” “remarkably,” “fundamentally,” “clearly.” Empty intensifiers meant to simulate significance.
Mechanical Rhythm
Sentences marching in lockstep, each about the same length. Humans sprawl, stumble, cut themselves off. AI taps its digital foot to a metronome.
Hedged Authority
The “at its core,” “in many ways,” “arguably.” A way of sounding wise without taking a stand.
Latin Sidebar Syndrome
AI’s compulsive use of e.g. and i.e. often comes with a giveaway glitch: the period-comma doublet (“e.g.,” or “i.e.,”). Almost no human would punctuate this way. Once you’ve seen the “.,” pattern, you can’t unsee it.
Closing Tautologies
Ending sections with empty recaps: “This shows why innovation matters.” It looks like a conclusion, but it’s just filler.
Please note that these are not foolproof. After all, AI claims it wrote not only the Declaration of Independence, but also my early, pre-ChatGPT writing.
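Most of these tics are shallow enough to grep for. A toy detector sketch (mine, not Tom White’s tooling; the patterns are deliberately crude and will misfire):

```python
# Flag a few of the LLM-writing tics listed above with crude regex heuristics.
import re

TICS = {
    "not-X-but-Y antithesis": re.compile(
        r"\b(?:it'?s|we'?re) not (?:just )?\w+[^.]{0,40}?, (?:it'?s|we'?re)\b", re.I),
    "punchline em-dash": re.compile("\u2014(?=[^\u2014]*$)"),  # lone em-dash near the end
    "period-comma doublet": re.compile(r"\b(?:e\.g\.|i\.e\.),"),
    "empty intensifier": re.compile(r"\b(?:importantly|remarkably|fundamentally|clearly),", re.I),
}

def flag_tics(text: str) -> list[str]:
    return [name for name, pat in TICS.items() if pat.search(text)]

sample = "Importantly, it's not just a tool, it's a journey \u2014 e.g., for teams."
print(flag_tics(sample))
# ['not-X-but-Y antithesis', 'punchline em-dash', 'period-comma doublet', 'empty intensifier']
```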
I expect LLM writing to move past these tics while still being weirdly slippery for a while for increasingly hard-to-legibilize reasons.
On the other hand, Tomas B’s Experiments With Opus 4.6’s Fiction, which he generated in a fairly low-effort way, was quite decent, even if not close to his usual standards.
Steven Byrnes wrote similarly in Jan ’23:
Let’s say I know how to build / train a human-level (more specifically, John von Neumann level) AGI. And let’s say that we (and/or the AGI itself) have already spent a few years[1] on making the algorithm work better and more efficiently.
Question: How much compute will it take to run this AGI?
(NB: I said “running” an AGI, not training / programming an AGI. I’ll talk a bit about “training compute” at the very end.)
Answer: I don’t know. But that doesn’t seem to be stopping me from writing this post. ¯\_(ツ)_/¯ My current feeling—which I can easily imagine changing after discussion (which is a major reason I’m writing this!)—seems to be:
75%: One current (Jan 2023) high-end retail gaming PC (with an Nvidia GeForce RTX 4090 GPU) will be enough (or more than enough) for human-level human-speed AGI,
85%: One future high-end retail gaming PC, of the kind that will be on sale in a decade (2033)[2], will be enough for human-level AGI, at ≥20% human speed.
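The load-bearing question behind that 75% is how the 4090’s throughput compares with what a brain-equivalent workload needs. A back-of-envelope with round numbers (the 4090’s spec-sheet throughput plus the Carlsmith-style brain-compute range Byrnes engages with; these are my citations, not his exact figures):

```python
rtx4090_fp32 = 8.3e13                 # ~83 TFLOP/s FP32; tensor-core FP16 is several x higher
brain_low, brain_median = 1e13, 1e15  # commonly cited brain-compute estimates, FLOP/s

print(f"vs low-end brain estimate: {rtx4090_fp32 / brain_low:.1f}x headroom")  # ~8.3x
print(f"vs median brain estimate:  {rtx4090_fp32 / brain_median:.2f}x")        # ~0.08x
```

As I read it, the 75% only goes through if you expect the low end of that range, or expect AGI algorithms to be substantially more compute-efficient than brain-emulation-flavored estimates suggest.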
Frontier models seem to be slightly worsening over time at converting casual reading highlights into memory flashcards that survive long-term review (a year), for lack of the taste that comes from extensive SRS review experience, per Andy Matuschak at Memory Machines (full report). Opus 4.7 is worse than Sonnet 3.7, for instance:
What taste borne of extensive SRS review looks like:
What makes a good memory flashcard for a spaced repetition system, as opposed to ordinary flashcards?
That last point seems related to how non-finetuned LLM writing tends to oddly “slide off” my eyes despite being mildly pleasant to read.
From the targeting / construction 2x2, Matuschak and Ozzie Kirkby come up with this 4-tier taxonomy:
Model taste doesn’t transfer even after Matuschak & Kirkby described and demonstrated examples:
They tried all kinds of approaches to train models on taste without success.
Here are the frontier models perf-ranked by the full-tiered taxonomy: