I believe open data pretty strongly contradicts your claims
System efficiency: ~2x, not ~20x
§1.2 estimates “up to 20x” from system optimizations (quantization, parallelization, FlashAttention, etc.). But Model FLOPs Utilization (MFU), the fraction of peak hardware FLOPs that large training runs actually achieve, has barely changed. Here are the most legible published MFU figures for large training runs:
Megatron-LM 2019: 30% on a single V100
PaLM 2022: 46% on 6K TPU v4s
Llama 3.1 2024: 38-43% on 16K H100s
MegaScale 2024: 55% on 12K GPUs
Llama 4 2025: 39% BF16-equivalent on 32K H100s
That’s less than 2x variation over 6 years. Most optimization work at scale is treading water against communication overhead as clusters grow larger, not increasing the useful training compute delivered per chip. Also, individual optimizations like FlashAttention’s ~3x attention speedup only accelerate one component of the training step, so such gains end up closer to additive than multiplicative, and many other optimizations apply only to inference, not training.
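For concreteness, MFU is the ratio of achieved training FLOP/s to the cluster's theoretical peak. Here is a minimal sketch of that calculation, assuming the standard ~6·N FLOPs-per-token approximation for a dense transformer; the numbers in the example call are hypothetical, just in the same ballpark as the runs above.

```python
def model_flops_utilization(
    n_params: float,             # trainable parameters (dense-equivalent)
    tokens_per_second: float,    # observed cluster-wide training throughput
    n_chips: int,                # accelerators in the run
    peak_flops_per_chip: float,  # advertised peak, e.g. ~989e12 for H100 BF16
) -> float:
    """MFU = achieved training FLOP/s divided by theoretical peak FLOP/s.

    Uses the common ~6 * N FLOPs-per-token estimate for a dense transformer
    (forward + backward). MoE and activation recomputation change the
    numerator, which is why labs report slightly different
    'BF16-equivalent' or mask-adjusted figures.
    """
    achieved = 6 * n_params * tokens_per_second
    peak = n_chips * peak_flops_per_chip
    return achieved / peak

# Hypothetical example (not any of the runs above): a 400B-parameter dense
# model training at 2.6M tokens/s on 16,000 H100s.
print(model_flops_utilization(400e9, 2.6e6, 16_000, 989e12))  # ≈ 0.39
```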
The Gundlach paper tests 6 innovations. The field has produced 40+.
§1.1 relies on Gundlach et al. (2025b), which ablated six things: SwiGLU, RoPE, cosine decay, pre-RMSNorm, pre-LayerNorm, and AdamW. The most algorithmically advanced models with published methodology are DeepSeek-V3, DeepSeek-R1, and Kimi K2; here’s a list of algorithms they used that Gundlach didn’t test:
Architecture:
Fine-grained MoE: 256 routed + 1 shared expert, 8 active per token
Ultra-sparse MoE: scaling to 384 experts at fixed active params
No token-dropping in MoE (enabled by better balancing)
Multi-Head Latent Attention (MLA): low-rank KV compression
Sigmoid gating with top-K normalization replacing softmax
Auxiliary-loss-free load balancing via learned bias terms (sketched together with sigmoid gating just after this list)
YaRN context extension (4K→128K in 2000 fine-tuning steps)
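To make the sigmoid-gating and load-balancing items concrete, here is a minimal sketch in the spirit of DeepSeek-V3's router: sigmoid affinity scores, top-K selection using a per-expert bias that is nudged outside of backprop to balance load, and normalization over the selected gates. The class name, the sign-based bias update, and the hyperparameters are my simplifications, not the published implementation.

```python
import torch

class SigmoidTopKRouter(torch.nn.Module):
    """Simplified MoE router: sigmoid gating, top-K normalization, and
    auxiliary-loss-free load balancing via a per-expert bias.

    The bias only affects which experts get selected; the gate values
    (and hence gradients) come from the unbiased sigmoid scores. The bias
    is nudged up for underloaded experts and down for overloaded ones,
    outside of backprop, instead of adding an auxiliary balancing loss.
    """

    def __init__(self, d_model: int, n_experts: int, top_k: int, bias_step: float = 1e-3):
        super().__init__()
        self.w_gate = torch.nn.Linear(d_model, n_experts, bias=False)
        self.top_k = top_k
        self.bias_step = bias_step
        # balancing bias is a buffer, not a gradient-trained parameter
        self.register_buffer("expert_bias", torch.zeros(n_experts))

    def forward(self, x: torch.Tensor):
        # x: [tokens, d_model]
        scores = torch.sigmoid(self.w_gate(x))                      # [tokens, n_experts]
        # selection uses biased scores; gate values use unbiased ones
        _, expert_idx = torch.topk(scores + self.expert_bias, self.top_k, dim=-1)
        gates = torch.gather(scores, -1, expert_idx)
        gates = gates / gates.sum(dim=-1, keepdim=True)             # top-K normalization

        if self.training:
            with torch.no_grad():
                # aux-loss-free balancing: push the bias toward uniform expert load
                load = torch.zeros_like(self.expert_bias)
                load.scatter_add_(0, expert_idx.flatten(),
                                  torch.ones(expert_idx.numel(), device=x.device))
                target = expert_idx.numel() / load.numel()
                self.expert_bias -= self.bias_step * torch.sign(load - target)

        return expert_idx, gates  # which experts each token goes to, and their weights
```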
Training methodology:
Multi-token prediction with sequential causal chain
MuonClip optimizer: Muon + QK-Clip for stable training (zero loss spikes over 15.5T tokens)
Fill-in-Middle pretraining (10% of data in PSM order; see the sketch after this list)
LLM-based data rephrasing for knowledge and math (outperforms naive multi-epoch)
Multi-phase learning rate schedule with annealing stages
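The Fill-in-Middle item is easy to show concretely. Here is a rough sketch of the standard PSM (prefix-suffix-middle) document transform; the sentinel strings, the per-document 10% probability, and the character-level cut points are simplifying assumptions, not how any particular model implements it.

```python
import random

# Placeholder sentinels; real tokenizers reserve dedicated special tokens for these.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def maybe_fim_transform(doc: str, fim_rate: float = 0.10, rng=random) -> str:
    """With probability `fim_rate`, rewrite a document in PSM order:
    the prefix and suffix appear first, and the model learns to generate
    the middle span last. Otherwise return the document unchanged."""
    if rng.random() >= fim_rate or len(doc) < 3:
        return doc
    i, j = sorted(rng.sample(range(1, len(doc)), 2))   # two distinct cut points
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```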
Post-training:
GRPO: RL without a critic model (see the sketch after this list)
Four-stage pipeline: cold-start SFT → reasoning RL → rejection sampling SFT → all-scenario RL
Rule-based rewards avoiding neural reward model hacking
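On GRPO: the piece that removes the critic is group-relative advantage estimation. Sample several completions per prompt, score them (e.g. with the rule-based rewards above), and use each completion's within-group z-score as its advantage. A bare-bones sketch of just that step, with the PPO-style clipped policy loss around it omitted:

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each completion's reward against
    the other completions sampled for the same prompt. No value network needed.

    rewards: [n_prompts, group_size] scalar reward per sampled completion.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# e.g. 2 prompts, 4 sampled completions each, 0/1 rule-based correctness rewards
print(grpo_advantages(torch.tensor([[1., 0., 0., 1.],
                                    [0., 0., 0., 1.]])))
```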
Of these, Muon alone is worth roughly a 2x compute multiplier, which is on par with the gap between AdamW and SGD.
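To unpack what Muon actually does: keep SGD-style momentum for each 2-D weight matrix, then approximately orthogonalize that momentum with a few Newton-Schulz iterations before applying it, so the update has a flat singular-value spectrum. A stripped-down sketch; the quintic coefficients are the ones from the public reference implementation, but the shape-dependent scaling, weight decay, and the QK-Clip part of MuonClip are all omitted.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a matrix to the nearest (semi-)orthogonal matrix
    using a quintic Newton-Schulz iteration."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (g.norm() + 1e-7)            # iteration needs norm <= 1 to converge
    transposed = x.size(0) > x.size(1)
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon update for a single 2-D weight matrix (in place)."""
    momentum_buf.mul_(beta).add_(grad)                     # plain SGD momentum
    update = newton_schulz_orthogonalize(momentum_buf)     # orthogonalized direction
    weight.add_(update, alpha=-lr)
```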
HellaSwag: an example of a 10x+ compute multiplier, without distillation or synthetic data
Three model families were each trained with a fixed recipe across multiple sizes: Gopher (2021), Cerebras-GPT (2023, Chinchilla-optimal), and Pythia (2023). None of them used distillation, data targeted at specific benchmarks, or data contractors. Gopher reaches the same HellaSwag score as Cerebras-GPT or Pythia with roughly 10x less compute, despite using Kaplan-style parameter scaling. There is pretty much no legible algorithm in the Gopher paper that obviously accounts for the difference.
There’s lots more evidence, and many models such as Qwen or R1 got significantly higher compute multipliers, but as you note, some fraction of that is from distillation or data labelling.
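(To be explicit about what "roughly 10x less compute" means operationally: fix a HellaSwag score, interpolate each family's (training compute, score) points in log-compute, and take the ratio of the two compute values. A small sketch of that comparison; the function and variable names are mine, and the data points would be whatever you read off the papers or the plot below, I'm not asserting specific values here.)

```python
import numpy as np

def compute_to_reach(score_target, flops, scores):
    """Interpolate, in log-compute, the training FLOPs a model family needs
    to reach `score_target`, given per-model (FLOPs, score) points sorted by
    increasing score."""
    return float(np.exp(np.interp(score_target, scores, np.log(flops))))

def compute_multiplier(score_target, family_a, family_b):
    """How many times less training compute family_a needs than family_b to
    hit the same benchmark score. Each family is a (flops_array, scores_array) pair."""
    return (compute_to_reach(score_target, *family_b)
            / compute_to_reach(score_target, *family_a))

# Usage: fill in (FLOPs, HellaSwag accuracy) points for Gopher and Pythia,
# then e.g. compute_multiplier(0.6, gopher_points, pythia_points)
# comes out around 10 on the curves plotted below.
```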
Note: Claude 4.6 wrote most of this, with a huge number of revisions and fixes from me; it wasn’t really an uplift compared to only using Claude to make the list and the plot.
[Plot: HellaSwag score vs. training compute for Gopher, Cerebras-GPT, and Pythia]
Why is Gopher better than Pythia or Cerebras-GPT? Mostly no comment, but I don’t think Pythia or Cerebras-GPT were making any single obvious mistake; they were just behind 2021-era DeepMind.