European countries would deploy troops to Greenland to defend it against the US
looks like this is only 200 troops, and they don’t mention any heavy equipment, so they couldn’t really defend against anything
I think this post overstates its case, but it makes important under-discussed points!
Here’s another argument against evals that resonates with me:
One big reason why people do evals is that they see evals as completely cooperative and non-adversarial, whereas regulation and advocacy could cause conflict and polarization. If all you’re doing is explaining facts, no one should regret interacting with you.
I think that’s not quite right in the case of AI wakeup and AI regulation. In retrospect, it will be obvious: all ambitious or powerful people will wish they had made AGI their top priority at some point in the now-past, such as 2023. So if your honest and cooperative communication causes people to be less afraid and more complacent about AI, they will likely regret that interaction in retrospect! At some later point, they will realize they were grossly underrating AI, lacked sufficient context to reason effectively about AI, and will probably wish that someone had pushed them harder to prioritize AI safety. In that sense, it’s actually more cooperative to selectively share information that increases wakeup and willingness to take costly safety actions, not unbiased information. Of course this consequentialist reasoning is not a solid basis for decision making, but it does make me believe that telling someone who isn’t intimately familiar with AI “actually Y model can’t cause catastrophe Z yet” is not really a mutually beneficial interaction.
Even if new people’s impact isn’t downstream of research, the inability to contribute to research significantly hinders their progress, because potential mentors and collaborators won’t benefit from working with them, and they generally get fewer opportunities to advance their careers.
Yeah this argument applies much less to policy/communications work.
I agree that you don’t need 5+ years of research to contribute, but I expect that at some point soon, if you don’t have some minimum amount of context on AI safety, your job (managing a bunch of LLM agents) can just be automated by an LLM agent.
It seems likely to me that maybe 50% of people who start seriously studying or working on AI safety in the next year will be below the intelligence escape velocity, where they forever lag behind frontier AI systems in AI research ability. If I were working in capacity building, I would already start to deprioritize the earlier parts of the funnel for this reason. For reference, my TEDAI timelines are 2028.
Maybe “accountability” is just the mixture of responsibility and dominance, and of those two I find dominance more motivating.
Why is Gopher better than Pythia or Cerebras? Mostly no comment, but I think Pythia and Cerebras weren’t making any super simple obvious mistake but were behind 2021-era DeepMind.
I believe open data pretty strongly contradicts your claims
System efficiency: ~2x, not ~20x
§1.2 estimates “up to 20x” from system optimizations (quantization, parallelization, FlashAttention, etc.). But Model FLOPs Utilization (MFU), the fraction of peak hardware FLOPs that large training runs actually achieve, has barely changed. Here are the most legible reported training-run MFUs:
Megatron-LM 2019: 30% on a single V100
PaLM 2022: 46% on 6K TPU v4s
Llama 3.1 2024: 38-43% on 16K H100s
MegaScale 2024: 55% on 12K GPUs
Llama 4 2025: 39% BF16-equivalent on 32K H100s
That’s less than 2x variation over 6 years. Most optimization work at scale is treading water against communication overhead as clusters grow larger, not increasing training compute per chip. Also, individual optimizations like FlashAttention’s 3x attention speedup contribute additively rather than multiplicatively to end-to-end throughput (attention is only a fraction of total FLOPs), and many other optimizations only apply to inference, not training.
The Gundlach paper tests 6 innovations. The field has produced 40+.
§1.1 relies on Gundlach et al. (2025b), which ablated six things: SwiGLU, RoPE, cosine decay, pre-RMSNorm, pre-LayerNorm, and AdamW. The most algorithmically advanced models with published methodology are DeepSeek-V3, DeepSeek-R1, and Kimi K2; here’s a list of algorithms they used that Gundlach didn’t test:
Architecture:
Fine-grained MoE: 256 routed + 1 shared expert, 8 active per token
Ultra-sparse MoE: scaling to 384 experts at fixed active params
No token-dropping in MoE (enabled by better balancing)
Multi-Head Latent Attention (MLA): low-rank KV compression
Sigmoid gating with top-K normalization replacing softmax
Auxiliary-loss-free load balancing via learned bias terms
YaRN context extension (4K→128K in 2000 fine-tuning steps)
Training methodology:
Multi-token prediction with sequential causal chain
MuonClip optimizer: Muon + QK-Clip for stable training (zero loss spikes over 15.5T tokens)
Fill-in-the-Middle (FIM) pretraining (10% of data in PSM order)
LLM-based data rephrasing for knowledge and math (outperforms naive multi-epoch)
Multi-phase learning rate schedule with annealing stages
Post-training:
GRPO: RL without a critic model
Four-stage pipeline: cold-start SFT → reasoning RL → rejection sampling SFT → all-scenario RL
Rule-based rewards avoiding neural reward model hacking
Of these, Muon alone is ~2x, which is on par with AdamW vs SGD.
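For reference, the core of Muon is an approximate orthogonalization of the momentum matrix via Newton-Schulz iterations. Viewed per singular value, each iteration applies a fixed quintic that pushes singular values toward 1. The coefficients below are the published ones; everything else here is a simplified scalar sketch, not the full optimizer (which evaluates the quintic as matrix products and adds momentum and shape-dependent scaling):

```python
# Scalar view of Muon's Newton-Schulz orthogonalization: each iteration
# applies this quintic to every singular value s of the Frobenius-normalized
# momentum matrix. The real update computes it as matrix products,
# X <- a*X + b*(X@X.T)@X + c*(X@X.T)@(X@X.T)@X; this is only a sketch.
def ns_quintic(s: float, steps: int = 5) -> float:
    a, b, c = 3.4445, -4.7750, 2.0315  # published Muon coefficients
    for _ in range(steps):
        s = a * s + b * s**3 + c * s**5
    return s

# Small, medium, and large singular values all land in a band around 1:
for s0 in (0.27, 0.53, 0.80):
    print(round(ns_quintic(s0), 2))
```

The point of the quintic (rather than exact SVD) is that five rounds of matrix multiplies are cheap and parallelize well on GPUs.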
HellaSwag: an example 10x+ compute multiplier, without distillation or synthetic data
Three model families were each trained with a fixed recipe across multiple sizes: Gopher (2021), Cerebras-GPT (2023, Chinchilla-optimal), and Pythia (2023). None of them used distillation, benchmark-targeted data, or data contractors. Gopher reaches the same HellaSwag score as Cerebras-GPT or Pythia with roughly 10x less compute, despite using Kaplan parameter scaling. There was pretty much no legible algorithm in the paper that obviously caused this difference.
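A compute multiplier like this can be read off benchmark scaling curves: interpolate each family’s compute-vs-score curve, then take the ratio of compute needed to hit the same score. The data points below are hypothetical placeholders, not the real Gopher/Pythia numbers:

```python
# Sketch of estimating a compute multiplier from two scaling curves.
# All (FLOP, score) points are made up for illustration.
import math

def compute_at_score(points, target):
    """Log-linearly interpolate training compute (FLOP) at a target score.

    points: list of (training_flop, benchmark_score), ascending in score.
    """
    for (f0, s0), (f1, s1) in zip(points, points[1:]):
        if s0 <= target <= s1:
            t = (target - s0) / (s1 - s0)
            return math.exp(math.log(f0) + t * (math.log(f1) - math.log(f0)))
    raise ValueError("target score outside measured range")

gopher = [(1e21, 0.45), (1e22, 0.60), (1e23, 0.72)]  # hypothetical
pythia = [(1e22, 0.45), (1e23, 0.60), (1e24, 0.72)]  # hypothetical
multiplier = compute_at_score(pythia, 0.55) / compute_at_score(gopher, 0.55)
print(round(multiplier, 1))
```

Interpolating in log-compute matters: scaling curves are roughly linear in log(FLOP), so linear interpolation in raw FLOP would bias the estimate.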
There’s lots more evidence, and many models such as Qwen or R1 got significantly higher compute multipliers, but as you note, some fraction of that is from distillation or data labelling.
Note: Claude 4.6 wrote most of this, with a huge number of revisions and fixes from me; it wasn’t really an uplift compared to only using Claude to make the list and the plot.
I would guess that 96+% of code merged into lab codebases is read by a human, but I’ve heard some startups are <50%, and are intentionally accumulating tons of tech debt / bad code for short-term gain. People write lots of personal scripts with AI, which aren’t in the merged/production code statistics; maybe that brings it to 85% human-read code in labs. There’s also a wide range in how deeply you read and review code, where maybe only 50% of code that’s read is fully understood.
Anthropic already applied some CBRN filtering to Opus 4, with the intent to bring it below Anthropic’s ASL-3 CBRN threshold, but the model did not end up conclusively below that threshold. Anthropic looked into whether they could bring more capable future models below the ASL-3 Bio threshold using pretraining data filtering, and determined that it would require filtering out too much biology and chemistry knowledge. Jerry’s comment is about nerfing the model to below the ASL-3 threshold even with tool use, which is a very low bar compared to frontier model capabilities. This doesn’t necessarily apply to sabotage or ability to circumvent monitoring, which depends on un-scaffolded capabilities.
Redwood Research did very similar experiments in 2022, but didn’t publish them. They are briefly mentioned in this podcast: https://blog.redwoodresearch.org/p/the-inaugural-redwood-research-podcast.
In this range of code lengths, 400-1800 lines, lines of code does not correlate with effort imo. It only takes like 1 day to write 1800 lines of code by hand. The actual effort is dominated by thinking of ideas and huge hyperparameter sweeps.
Another note: I was curious what they had to do to reduce torch startup time and such, and it turns out they spend 7 minutes compiling and warming up for their 2-minute training run lmao. That does make it more realistic but is a bit silly.
There’s obviously no truth to that claim. Labs absolutely have better models.
Claude 4.5 sonnet is relatively bad at vision for a frontier model. Gemini 3 pro, GPT 5.2, and Claude 4.5 opus are better.
Saying you would 2-box in Newcomb’s problem doesn’t make sense. By saying it aloud, you’re directly causing anyone you can influence, and who would later have an opportunity to acausally cooperate with you, not to do so. If you believe acausal stuff is possible, you should always respond with “1-box” or “no comment”, even if you would in reality 2-box.
I support a magically enforced 10+ year AGI ban. It’s hard for me to concretely imagine a ban enforced by governments, because it’s hard to disentangle what that counterfactual government would be like, but I support a good government-enforced AGI slowdown. I do like it when people shout doom from the rooftops though, because it’s better for my beliefs to be closer to the global average, and the global discourse is extremely far from overshooting doominess.
Yeah it goes out of its way to say the opposite, but if you know Nate and Eliezer the book gives the impression that their pdooms are still extremely high, and responding to the author’s beliefs even when those aren’t exactly the same as the text is sometimes correct, although not really in this case.
If you have a lump of 7,000 neurons, they can each connect to every other neuron, and you can spherical-cow approximate that as a 7000x7000 matrix multiplication. That matrix multiplication all happens within O(1) spikes, 1/100 of a second. That’s ~700 GFLOP. An H100 GPU takes ~1 millisecond, or ~1M clock cycles, to do that operation, i.e. to approximate one brain spike cycle! And the GPU has 70B or whatever transistors, so it’s more like 10M transistors per neuron!
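Redoing the comment’s back-of-envelope numbers explicitly; the H100 constants (dense BF16 throughput, boost clock, transistor count) are round public figures:

```python
# Back-of-envelope: one "spike cycle" of a fully-connected 7000-neuron lump,
# spherical-cow'd as a dense 7000x7000 matrix multiply. Constants are rough.
neurons = 7_000
flops = 2 * neurons**3                       # 2*n^3 FLOP for an n x n matmul
h100_flops_per_s = 1e15                      # ~1 PFLOP/s dense BF16, rough
seconds = flops / h100_flops_per_s           # time for one "spike cycle"
cycles = seconds * 1.8e9                     # H100 boost clock ~1.8 GHz
transistors_per_neuron = 80e9 / neurons      # H100 has ~80B transistors
print(f"{flops:.1e} FLOP, {seconds * 1e3:.2f} ms, "
      f"{cycles:.1e} cycles, {transistors_per_neuron:.1e} transistors/neuron")
```

This reproduces the comment’s ~700 GFLOP, ~1 ms, ~1M cycles, and ~10M transistors per neuron. Note the 2n^3 count assumes a full matrix-matrix product, as the comment does; a single matrix-vector spike step would be ~2n^2 instead.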
Note that since Paul started working for the US government a few years ago, he has withdrawn from public discussion of AI safety to avoid PR issues and conflicts of interest, so his public writings are significantly behind his current beliefs.
YC batches have grown 3x since 2016. I expect a significant market saturation / low hanging fruit effect, reducing the customer base of each startup compared to when there were only 200/year.
Yes, but in the simulation animal advocacy could still help base reality chickens by giving animal considerations more power over our hypothetical light cone. Care about chickens, but only ones after the singularity lol