Do we have reason to assume that MoE models will stretch to the entire rack memory capacity minus KV cache in batched inference? I ruminated earlier that Rubin Ultra could theoretically run 300T-parameter sparse models, but recent results suggest it's possible to squeeze much more out of far fewer parameters and to iterate much faster.
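Just to put rough numbers on it (my own back-of-envelope sketch; the rack capacity, KV-cache reservation, and FP4 weight format below are assumptions for illustration, not confirmed specs):

```python
# Back-of-envelope for the "rack memory minus KV cache" budget.
# All figures here are assumptions for illustration, not confirmed hardware specs.

rack_fast_memory_tb = 365.0   # assumed per-rack fast memory, TB
kv_cache_reserve_tb = 65.0    # assumed reservation for KV cache in batched serving
bytes_per_weight = 0.5        # FP4-quantized weights

weight_budget_bytes = (rack_fast_memory_tb - kv_cache_reserve_tb) * 1e12
max_total_params_t = weight_budget_bytes / bytes_per_weight / 1e12

print(f"Sparse model that fills the weight budget: ~{max_total_params_t:.0f}T parameters")
# ~600T under these assumptions; a 300T FP4 model would use only ~150 TB,
# leaving the rest of the rack for KV cache, activations, and overhead.
```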
This data on OpenAI's compute spend points to the same trend: https://epoch.ai/data-insights/openai-compute-spend
At the same time, some Chinese open-source models are scaling down their parameter counts. The most recent MiniMax M2 is much smaller than MiniMax M1, and Qwen3-Next-80B-A3B is close in performance to Qwen3-235B-A22B. I wouldn't be surprised if the same happened with DeepSeek-V4.
It's not that models won't grow, but it feels like intelligence density per parameter, and how sparse a model can get before quality degrades, will be tested in the search for maximum training and inference efficiency.
Thanks for the detailed response. If we assume that continued progress requires spending much more compute on RL than on pretraining, this could favor model sizes well below Chinchilla-optimal, since the Chinchilla trade-off only governs the pretraining stage and smaller models generate RL rollouts more cheaply. It's possible that future models will be trained with 99% RL compute and will employ some sort of continual-learning architecture.
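A toy calculation of that effect (my own sketch; the 1e27 FLOP budget and RL fractions are arbitrary, and the Chinchilla heuristics strictly apply only to dense pretraining):

```python
import math

# Toy illustration of why an RL-heavy compute split favors smaller models.
# Uses the rough Chinchilla heuristics C ~ 6*N*D and D ~ 20*N for the
# pretraining stage; the total budget and RL fractions are arbitrary assumptions.

total_compute_flop = 1e27
for rl_fraction in (0.0, 0.5, 0.9, 0.99):
    pretrain_flop = total_compute_flop * (1.0 - rl_fraction)
    # C = 6*N*D with D = 20*N  =>  C = 120*N^2  =>  N = sqrt(C / 120)
    n_opt = math.sqrt(pretrain_flop / 120.0)
    print(f"RL share {rl_fraction:.0%}: Chinchilla-optimal size ~{n_opt / 1e12:.1f}T params")
```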
I believe the number of active parameters will be determined by balancing compute and memory load, which will depend on the attention efficiency available at the time and the context length needed to hit agentic-ability targets. I don't know how sparse very large (>20T) models could be, but they could probably be extremely sparse (maybe 0.1% active or less, with most of the volume taken up by the router).
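For scale (again, the sizes here are just illustrative assumptions):

```python
# Toy arithmetic for the sparsity guess above; the sizes are assumptions.

total_params = 20e12     # a hypothetical >20T-parameter MoE
active_fraction = 0.001  # 0.1% of weights active per token

active_params = total_params * active_fraction
train_flops_per_token = 6 * active_params  # rough ~6*N_active FLOP per trained token

print(f"Active parameters per token: ~{active_params / 1e9:.0f}B")
print(f"Training FLOPs per token: ~{train_flops_per_token:.1e}")
# ~20B active and ~1.2e11 FLOP per token: compute comparable to a 20B dense
# model while total weights stay at 20T.
```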
I think it's possible that current and planned AI compute designs are not even close to optimal for training methods that may be developed within a year or so. Assuming something like DeepSeek's DSA (sparse attention) is relatively close to maximum attention efficiency, and RL trajectories produced by agents stretch to many millions of tokens, it may turn out that HBM is finally too slow and that moving to SRAM is the only way forward.
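A rough sense of the bandwidth pressure (every figure below is an assumption I picked to show the shape of the problem, not a measurement):

```python
# Rough decode-bandwidth sketch for a multi-million-token agent trajectory.
# Every figure is an assumption picked only to show the shape of the problem.

active_params      = 50e9    # active parameters per generated token
bytes_per_weight   = 0.5     # FP4 weights
kv_bytes_per_token = 70e3    # assumed compressed KV footprint per cached token
context_tokens     = 5e6     # multi-million-token RL/agent trajectory
attended_fraction  = 0.02    # share of the cache a sparse-attention scheme actually reads
hbm_bandwidth      = 8e12    # assumed aggregate HBM bandwidth, bytes/s

weight_bytes = active_params * bytes_per_weight
kv_bytes = context_tokens * kv_bytes_per_token * attended_fraction
per_token_bytes = weight_bytes + kv_bytes

print(f"Per decoded token: ~{weight_bytes / 1e9:.0f} GB weights + ~{kv_bytes / 1e9:.0f} GB KV")
print(f"Bandwidth-bound ceiling: ~{hbm_bandwidth / per_token_bytes:.0f} tokens/s")
```

Even a few hundred tokens per second is slow when a single rollout runs to millions of tokens and has to be generated serially (batching amortizes the weight reads but not the per-sequence KV reads), which, if these numbers are even roughly right, is where the pull toward SRAM-class memory would come from.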
There's also a completely different consideration. For example, inclusionAI/Ring-1T has 1T total and 50B active parameters, and its creators claim it can reach IMO silver using a multi-agent framework, with plans to train it further toward gold, all with only 50B active parameters.
As you said, we're probably entering compute ranges that enable training models with many trillions of active parameters (assuming RL compute stays in roughly one-to-one proportion to pretraining). The question is whether the pattern density of our environment is currently large enough to make use of it.