I basically agree with the intended point that general intelligence in a compute-limited world is necessarily complicated (and think that a lot of people are way too invested in trying to simplify the brain into the complexity of physics), but I do think you are overselling the similarities between deep learning and the brain, and in particular you are underselling the challenge of actually updating the model, mostly because unlike current AIs, humans can update their weights at least once a day always, and in particular there’s no training date cutoff after which the model isn’t updated anymore, and in practice human weight updates almost certainly have to be done all the time without a training and test separation, whereas current AIs do update their weights, but it lasts only a couple of months in training and then the weights are frozen and served to customers.
(For those in the know, this is basically what people mean when they talk about continual learning).
So while there are real similarities, there are also differences.
Because most of the world is actually complication. This is another thing Alan Kay talks about — the complexity curve versus the complication curve. If you have physics brain, you model the world as being mostly fundamental complexity with low Kolmogorov complexity, and you expect some kind of hyperefficient Solomonoff induction procedure to work on it. But if you have biology brain or history brain, you realize that the complication curve of the outcomes implied by the rules of the cellular automaton that is our reality is vastly, vastly bigger than the fundamental underlying complexity of the basic rules of that automaton.
Another way to put this, if you’re skeptical: the actual program size of the universe is not just the standard model. It is the standard model plus the gigantic seed state after the Big Bang. If you think of it like that, you realize the size of this program is huge. And so it’s not surprising that the model you need to model it is huge, and that this model quickly becomes very difficult to interpret due to its complexity.
I would slightly change this, and say that if you can’t brute-force simulate the universe based on it’s fundamental laws, you must take into account the seed, but otherwise a very good point that is unheeded by a lot of people (the change doesn’t matter for AI capabilities in the next 50-100 years, and it also doesn’t matter for AI alignment with p(0.9999999), but does matter from a long-term perspective on the future/longtermism.
I actually disagree with this, and would say that if you believe AI alignment is hard and there isn’t a way to make superhuman AI safe without immense capabilities restraint, then data-center bans are net positive for the following reason:
Even under the assumption that new paradigms are required, training and experiment compute is still helpful because of scale-dependent algorithmic efficiency, which means that algorithmic progress requires training compute to increase, and it’s a significant portion of the algorithmic efficiency that we do get in practice, as Epoch notes below:
For example, @MITFutureTech found that shifting from LSTMs (green) to Modern Transformers (purple) has an efficiency gain that depends on the compute scale: - At 1e15 FLOP, the gain is 6.3× - At 3e16 FLOP, the gain is 26×
Naively extrapolating to 1e23 FLOP, the gain is 20,000×!
Also the AI Futures Model argues for a 4x slowdown (but this has to be appropriately timed, but even later pauses slow down takeoff).
This should probably be a recurring question, ala the Open Threads LW moderators make, but to put it in a short sentence, alignment has gotten easier, but humanity has gotten more incompetent and is unwilling to pay large costs for safety.
The reason I say alignment has gotten easier is that we have slowly started to realize that the original goal needed to be revised in part by lowering the capability target.
One of the insights of AI control is that we (probably) don’t actually need to consider aligning super-intelligences in the limit of technological development, or anywhere close to that, and that the first AIs that are both massively useful and pose non-negligible risk of AI takeover are able to be controlled in a way that doesn’t depend on AI alignment working.
To be clear, it’s still quite daunting as a challenge and AI companies/governments have started to be more reckless in AI deployment/progress, so it’s still easy for misalignment to occur, especially if we get more unfavorable paradigms (neuralese actually working would be the big one here, but even more prosaic continual learning/long-term memory could be a big problem for AI alignment)
My median/modal expectation conditional on AI being able to automate all of AI R&D is that we implement half-baked control/alignment, and things are very messy and lots of balls are dropped, but we ultimately survive the ordeal based on cheap strategies like satiating AI preferences working, but that we incur a terrifying amount of risk (as in for example taking on 1-5%, or even 10-90% risk of AI takeover) while attempting to solve AI alignment.
My current take is that the chatbot parasitism, even at it’s most severe was basically what was expected when you let the general population use a tech that can speak back to them, and I basically agree with Ben-Landau Taylor’s theory that the demand of horrifying stories around AI psychosis is way in excess of the true supply, and the biggest reason it was focused on was because of GPT-4o and the fact that we are generally bad at base rates, so I’m unconvinced persona parasitology actually matters that much.
AI existential risks, especially extinction risks from a long-termist perspective are now way overfunded compared to better futures work, and longtermism properly interpreted agrees with the common view amongst the general public that sub-existential catastrophes that collapse civilization are at least as important as risks that kill everybody, and are more important to prevent in practice than extinction risks.
One major upshot of this is that bio-threats, wars that can collapse civilization entirely, or other threats that kill off a large fraction of the population but don’t make them extinct, especially coming from AI is quite a bit more important to prevent than classical AI risk scenarios, and probably deserve more funding than current AI safety.
A better heuristic is to instead focus on a wider portfolio of grand challenges, which were defined in the article as decisions that could affect the value of the future by at least 0.1%, and another better heuristic related to long-term alignment of ASI is to scrap the Coherent Extrapolated Volition target and instead make ASIs execute optimal moral trades.
The counting arguments for misalignment, even if they were correct do not show that AI safety is as difficult as some groups like MIRI claim without other very contestable premises that we could attempt to make false.
While I generally agree with you, I’m getting more worried that the caveat of “they’re not studying the latest and greatest frontier models” is particularly applicable here due to a Liu et al paper (2025) which does show that in some cases, RLVR can create capabilities out of whole cloth.
So while I do think 2025-era frontier models aren’t influenced much by RLVR, I do expect 2026 and especially 2027-era LLMs to be influenced by RLVR much more relative to today, on both capabilities and alignment.
I agree that induction on data does require an inductive bias without resorting to look-up tables, but I do claim that you were arguing that modern AI’s behavior was more determined by the data than the architecture relative to what Max H was saying.
The data, as Zack M Davis argues, and one of the takeaways from the deep learning revolution is that inductive biases mattered a lot less than we thought, and data is much more important for AI behavior than we thought.
And once you realize this, the entire scenario falls apart, and ultimately it’s anti-capitalist pablum that has it’s bottom line already written, as shown by Alex Armlovich here:
Bad econ. “White collar workers switch to Doordash, driving down real wages there too” is partial equilibrium thinking
Robot firms only grow if they’re producing more real goods & services. If production is growing, real incomes are rising not falling. No doom loop!
If production is growing but somehow consumption stalls, that’s just a failure of monetary & fiscal policy. Cut rates to zero quickly; below zero, fiscal policy kicks in to redistribute income to consumers & restore consumption growth to trend
Either there’s no doom loop in the first place, or else New Keynesian monetary & fiscal policy kicks in to close any emergent wedge between robot output & human consumption
This piece is ultimately just anticapitalist pablum (teasing a belief that most markets are just scams and rent seeking in the intro!)--and after that, the piece simply underrates or misunderstands the stabilizing powers of liberal democratic Keynesian capitalism to keep output & consumption in balance
We don’t have to allow what @delong calls “a failure of the exchange mechanism” in the future any more than we did in the 1930s
Driving down costs to shift equilibria. Probably the largest driver of decreased CO₂ emissions has been the emergence of cheap wind and solar power. This illustrates a powerful dynamic: as production scales, costs fall along experience curves (sometimes called Wright’s law), until clean energy becomes the default rather than the alternative. Before this inflection point, decarbonization meant fighting against economic incentives; after it, the incentives pulled in the same direction, and the market did the work of driving further R&D.
Also relevant here is that it broke the need to address the politicization of climate change, and meant that you didn’t need political will anymore once the 2010s and later rolled around for the clean energy transition (assuming you don’t care about animals much).
One of my hopes is that even if the discourse on AI safety degenerates into a politicized mess, that boring technical work will still progress enough via Wright s Law to make at least large parts of AI safety demands be incentivized by markets alone.
IMO, you mention the main downside in the post about aligning to virtues, in that it allows AIs more leeway to make decisions on values, and a core divergence from you is that while we will need to defer to AIs eventually, I don’t expect massive enough breakthroughs in alignment to make it net-positive to allow AIs to have the level of control over generalizing on values that I’d want for safe AI, and I tend to think corrigibility plus AI control is likely to be our first-best mainline AI safety plan, and some of the purported benefits are much less likely to occur than you think it will.
The other issue is that while people do agree on virtues more than consequentialist preferences, a lot of the reason why people agree on virtues both now in the past are at least consistent with 2 phenomena occuring, and which I’d argue explain the super-majority of the effect in practice:
Technology, while it has unbundled certain things that were bundled in the distant past, and already made virtues like honor go from 80-99% of the population to ~0%, it’s still the case that on a lot of different virtues, it turns out that a lot of the option space harms or helps a lot of virtues by default, and it’s still very difficult to engineer the vast space of possible virtue disagreements between humans because it’s hard to improve one virtue without improving another, and most of our virtues are in practice built out of valuing instrumental goods, but in a post-AGI future (for now I’ll assume the AI safety problem is solved), it’s a lot easier to create goods that are valued differently by many OOMs between different virtues, which Tyler M John explains well here.
To a large extent, humans need to live and work with other humans, and it’s not really possible for anyone, even for the richest and most powerful people to simply ignore societal norms without paying heavy prices, even if only informally, and interactions are repeated often enough combined with the fact that enforcement is possible due to humans requiring a lot of logistical inputs that other humans can take away, that we can turn prisoner’s dilemma’s into either iterated prisoner’s dilemma’s or stag hunts or schelling problems (I ignore acausal trade/cooperation for decision theories like EDT/FDT/UDT because it relies on people having more impartial values than they actually have, and pure reciprocity motivations don’t work because humans can’t reason well about each other (yes, we’re surprisingly good at modeling given compute constraints, but it’s nowhere near enough)). But post-AGI, humans will be able to choose to be independent of social constraints/pressure, meaning that the forces for convergence to certain virtues will weaken a lot. Vladimir Nesov talks about that here.
IMO, I think the more plausible versions of value alignment/good futures looks like moral trade, like in this short afterward of “What We Owe The Future” or earlier on, viatopia as discussed by William Macaskill here (conditional on solving alignment).
IMO, the main way things change if we assume that automated jailbreaking/persuasion attacks is the most likely strategy, compared to just attempting to communicate illegibly via steganography is that as in this experiment, paraphrasing CoTs almost entirely works to fix the monitor’s weaknesses without degrading any capabilities, because it’s way easier to separate dangerous reasoning from benign reasoning if the message itself is sent in plain text, and this remains true even if we cannot solve the issue of jailbreaking monitors or persuasion attacks.
More minorly, it hints that LLMs may have a similar issue to humans that Roger Dearnaley describes here:
It’s like they’re much less good at turning a description of an algorithm into an implementation if it has to be carried out in their initial and final layers than in their middle layers: it almost reminds me of the conscious and unconscious mind phenomenon in humans: except they seem to think they have conscious control when they don’t.
And depending on how much humans use neuralese, this capability deficit may persist even if CoT reasoning got replaced by huge neuralese models with memory banks, meaning neuralese is less bad for safety than feared.
It matters because if we don’t get steganography, it’s likely possible to fix monitors in ways that even superhuman AIs cannot avoid.
I think this is an inaccurate way of thinking about things. AI progress has sped up due to the development of test-time compute scaling. GPT-5 is probably[4] less than our around 1 OOM bigger than GPT-4 in terms of training compute, while previous N to N + 1 scalings have been 2 OOMs.
This is exactly the kind of thing that Yudkowsky was talking about! It explains a huge amount of the speed-up. It doesn’t really feel quite like a paradigm shift the way going from RNNs to transformers did, but it does change the scaling dynamics, the one thing that Ajeya was relying on.[5]
I’ll somewhat defend Ajeya here. While the test-time compute boost was a big deal in that it let us get 2 years of progress in AI for nearly free, it ultimately doesn’t change the scaling dynamics much because test-time compute quickly becomes costlier than hiring humans, and inference cost is much more variable than training cost, meaning scaling it to allow the changing of the scaling curves into ones more favorable isn’t practical except in narrow domains.
Compared to the many OOMs that pre-training achieved and the OOMs remaining for pre-training, test-time compute is much less relevant here.
IMO, this post plus this other post on similar issues (that cultural/physical evolution, if left unchecked will probably remove any alignment to humans that remain absent solutions that are perfect and work unboundedly long) suggest that value lock-in/preserving of initial alignment even through long time periods matters a lot, but I think the main 2 disagreements I have with the floor section of the post are these:
I think value stability/lock-in preventing cultural evolution, or at least massively slowing it down relative to human time scales is possible, and once achieved makes it much easier for even imperfect alignment with humans to persist, meaning that it’s a lot less likely AIs drift/intentionally go into this sort of failure mode. Forethought has a good article on this issue.
Due to standard instrumental convergence arguments plus real world history, I broadly expect AIs and humans to preserve their values if they can attempt to do so, but for AIs it’s a lot easier to prevent value change, meaning that we don’t need to work on gradual disempowerment very much.
So from a human perspective, I do believe alignment is basically all you need (though a caveat here is that you do need AI alignment before AIs can do superpersuasion, but thankfully, superpersuasion seems likely to be one of the abilities AI achieves late into the industrial/software intelligence explosions, and a large portion of that reason is that it’s actually quite hard to change human minds using intelligence alone (absent just inventing superpersuasive nanobot/drug/genetic therapy X, which I do think is possible but at that point killing all humans using more boring super-robots/nanotech would have been possible long before, so either AIs are aligned well enough or we are extinct well before then)).
The reason I think that the cultural disempowerment point of gradual disempowerment relies on superpersuasion working is that otherwise, it’s quite easy for humans to prevent even the worst-case scenarios of AI-AI culture from affecting human lives (assuming humans at least have economic power through aligned AIs).
Culture. They discuss how AI will increasingly produce culture and this will shift culture towards AI-friendly versions, but they don’t argue that humans will stop being the primary consumers of culture – except by repeating the conclusion from the economics section that I didn’t find convincing. So again it seems to me that humans could constrain cultural evolution through their role as consumers alone.
I do appreciate that it’s much more possible for a completely anti-human ideology to flourish – e.g. one advocating for human death/extinction/non-agency – in this post-AGI world. It would not be selected against on the production side – groups proposing it wouldn’t lose out competitively (e.g. by killing themselves). But on the consumption side it still seems like it would lose – humans have strong biological instincts not to die and (I claim) they will own huge wealth.
And the production side will be influenced by the consumption side.
Even AI-AI culture, if it promotes bad outcomes for humans and humans can understand this, will be indirectly selected against as humans (who have money) prefer interacting with AI systems that have good consequences for their well-being.
My own take re rationalization/motivated reasoning is that at the end of the day, no form of ethics can meaningfully slow it down if the person either can’t credibly commit to their future selves, or simply isn’t bound/want to follow ethical rules, so the motivated reasoning critique isn’t EA specific, but rather shows 2 things:
People are more selfish than they think themselves to be, and care less about virtues, so motivated reasoning is very easy.
We can’t credibly commit our future selves to do certain things, especially over long timeframes, and even when people do care about virtues, motivated reasoning still harms their thinking.
Motivated reasoning IMO is a pretty deep-seated problem within our own brains, and is probably unsolvable in the near term.
A core issue here that is repeated is that since AI progress has been (so far) slower than super-exponential or faster-growing functions, and is merely growing at an exponential rate as defined by time horizons, it turns out that there’s a very, very large difference between acing benchmarks and actually posing enough of an existential risk to actually serve as a useful red line, and due to the jagged frontier plus progress coming from many small improvements in compute, it’s a lot harder to make clear red lines or get definitional clarity.
More generally, one of the takeaways from time horizons work is that by default, there will probably be no legible clear red lines, so any warning shots need to be prepared for.
I agree with a lot of this post, but one other motivation (at least for me) is that checking how much algorithmic progress exists/comes from compute is that it’s relevant to predictions around recursive self-improvement/software intelligence explosion without increasing compute, assuming you have AIs that can fully automate AI research, and more generally informs takeoff speeds, and I take the Gundleach et al paper as evidence that pure software improvements are more likely to cap out at relatively modest impacts than creating a self-sustaining feedback loop of progress, slowing down AI takeoff by quite a bit until we massively scale up compute for AGIs.
(The evidence value is reduced but not eliminated by the fact that they tested it on LLMs).
The nuance was in saying that their framework can’t predict whether or not data or compute scaling made the majority of improvements, nor can they separate out data and compute improvements, but the core finding of algorithmic efficiency being almost all compute-scaling dependent still holds, so if we had a fixed stock of compute now, we would essentially have 0 improvements in AI forever.
Another possibility, in principle, for why automating AI R&D doesn’t lead to an intelligence explosion is because a very large percentage of the progress (at that part of development trajectory) is driven by scaling relative to algorithmic progress.
This is actually happening today, so the real question is why algorithmic progress returns will increase once we attempt to fully automate AI R&D, rather than why we won’t get an intelligence explosion.
More specifically, the algorithmic progress that has happened is basically all downstream of more compute going into AI, and algorithmic efficiency is dependent on compute scale being larger and larger to reap the gains of better algorithms.
I basically agree with the intended point that general intelligence in a compute-limited world is necessarily complicated (and think that a lot of people are way too invested in trying to simplify the brain into the complexity of physics), but I do think you are overselling the similarities between deep learning and the brain, and in particular you are underselling the challenge of actually updating the model, mostly because unlike current AIs, humans can update their weights at least once a day always, and in particular there’s no training date cutoff after which the model isn’t updated anymore, and in practice human weight updates almost certainly have to be done all the time without a training and test separation, whereas current AIs do update their weights, but it lasts only a couple of months in training and then the weights are frozen and served to customers.
(For those in the know, this is basically what people mean when they talk about continual learning).
So while there are real similarities, there are also differences.
I would slightly change this, and say that if you can’t brute-force simulate the universe based on it’s fundamental laws, you must take into account the seed, but otherwise a very good point that is unheeded by a lot of people (the change doesn’t matter for AI capabilities in the next 50-100 years, and it also doesn’t matter for AI alignment with p(0.9999999), but does matter from a long-term perspective on the future/longtermism.