Steven Byrnes comments on Steve Byrnes’s Shortform

Steven Byrnes 22 Jun 2026 19:30 UTC
48 points
MORE UPDATES TO MY OLD BLOG POSTS:
1. Revisiting an old take on LLMs
I’ll probably annoy all sides with this not-really-an-apology that I just tacked onto the beginning of my LLM-skepticism-related post from Apr 2023:
Update June 2026: I basically stand by this post, except that I regret using the word “plateau”. By analogy, chess engines show no signs of “plateau-ing”: Stockfish 18 (2026) crushes Stockfish 17 (2024), which in turn crushes Stockfish 16 (2023), etc.^[1] But chess engines ain’t gonna take over the world. So anyway, what I should have said was: There’s a school of thought, in which LLMs might (or might not) continue to get ever better at some things, but they will definitely always be bad at other things, and those ‘other things’ are very important, such that their continued absence will prevent LLMs from ever becoming “transformative AI” as discussed below. This post is about how this school of thought relates to AI doom. And separately, if that school of thought is right, should we describe this situation using the word “plateau”? Arguably we could, in the same sense that chess engines would hit a “plateau” of 50% on a test with half chess puzzles and half sports trivia. But all things considered, I think the word “plateau” was a poor choice born of sloppy thinking. Sorry.
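A toy sketch of the 50% arithmetic. It assumes (hypothetically) that the test is scored as an unweighted average of accuracy on the two halves, and that a chess engine's sports-trivia accuracy stays at zero however strong it gets. The engine accuracies are made up for illustration.

```python
def combined_score(chess_accuracy: float, trivia_accuracy: float) -> float:
    """Score on a hypothetical test that is half chess puzzles, half sports trivia."""
    return 0.5 * chess_accuracy + 0.5 * trivia_accuracy

# Hypothetical successive engine versions: chess accuracy keeps climbing,
# but sports-trivia accuracy stays at zero.
chess_accuracies = [0.70, 0.90, 0.99, 1.00]
scores = [combined_score(c, 0.0) for c in chess_accuracies]
print(scores)  # scores keep rising, but each one stays at or below 0.5
```

Every version beats the previous one, yet the combined score can never exceed 0.5. In that sense, and only that sense, there is a "plateau".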
2. More clarity on what “under-sculpting” means
I added a few sentences to Perils of under- vs over-sculpting AGI desires (2025) to hopefully make it clearer what “under-sculpting” means, especially this part:
Think of it like: the AGI’s desires are a boat, and there’s a radio beacon (the reward function) that we know how to steer towards. If we keep steering towards the beacon (over-sculpting), then we can get very close to the beacon, but alas, the beacon is built on jagged rocks that will kill us (specification gaming, §2). On the other hand, if we stop navigating towards the beacon at some point (under-sculpting), then we’re just adrift on the sea, and our two problems are: (1) we don’t know where we are right now (path dependence, §8.1), and (2) we don’t know where the unpredictable currents will take us in the future (concept extrapolation upon distribution shifts, §8.2).
3. A bunch of edits to Intro to Brain-Like-AGI Safety
See changelogs at the bottom of Posts 1, 5, 6, and 15 for details. The biggest changes were thorough rewrites of §5.3 on valence and reinforcement learning, §6.4 on what I mean by “model-based RL”, and §6.6.1 on ego-syntonic desires. (Those changes are not yet in the archival PDF version, sorry; I’ll upload a new PDF when I get a chance.)
The new version of §6.6.1 is fun; here’s most of it:
6.6.1 The distinction between internalized ego-syntonic desires and externalized ego-dystonic urges is unrelated to Learning Subsystem vs. Steering Subsystem
Many people (including me) have a strong intuitive distinction between ego-syntonic drives that are “part of us” or “what we want”, versus ego-dystonic drives that feel like urges which intrude upon us from the outside.
For example, if someone is on a hunger strike for freedom, they might say that their desire to eat comes from their innate drives, whereas their desire to fight for freedom comes from “reason”, or “their best self”, or whatever. What does that mean? What’s going on?
…
I propose that we should not take this intuition at face value. In reality, the hunger-striker’s desire to eat comes ultimately from their innate drives, and their desire to fight for freedom also comes ultimately from their innate drives! These are just two innate drives that are pointing in different directions, and thus they duke it out. One side will win, and then the person will either keep their hunger strike, or break down and eat.
It’s not so different from if you feel simultaneously very sleepy and very hungry; you can’t satisfy both drives, so the drives will duke it out, and one of them will win, and either you’ll nap despite your hunger, or you’ll eat despite your sleepiness.
That said, there do seem to be very important differences between the mundane hunger-vs-sleepiness battle (urge vs urge) and the hunger-vs-fight-for-freedom battle (urge vs ego-syntonic desire). In particular, here are three obvious questions that I need to address:
First, hunger and sleep drives are easy to understand. But it’s much less obvious how innate drives could lead a person to care so much about freedom that they’ll go on a hunger strike. What exactly is the innate drive in question?
Second, if I’m both hungry and sleepy, it’s sorta “a fair fight”. Probably whichever feeling is more immediately powerful will win. By contrast, the internal battle between the hunger-striker’s desire for freedom and their desire for food is not a fair fight. The former desire will punch above its weight by bringing far more intelligence and foresight to bear towards its objective. Thus, if the person is setting up a commitment mechanism, or tying their own hands, or attempting to control themselves and their mood, those actions will almost definitely be in pursuit of freedom, not in pursuit of hunger-satisfaction. Why the asymmetry? If both drives are in the Steering Subsystem, shouldn’t they be equally stupid and myopic? Relatedly, if the person is “applying willpower”, why is it in support of freedom rather than hunger-satisfaction? And by the way, what the heck does “applying willpower” even mean at a nuts-and-bolts level??
Third, if the person’s desire for freedom is not really more a “part of them” than their desire to eat when hungry, then … why does it feel that way to them? In other words, this is an honest introspective report. I’m allowed to claim that the report should be interpreted as a perceptual illusion rather than taken at face value, but you have no reason to believe me unless I can also explain what exactly they were introspecting upon, and what they saw when they did so, and why it left them with the impression that it did.
These are all great questions! And I have answers to all of them! But they’re rather involved.
For the first question, I claim that hunger-striking for freedom is driven by social instincts. You can learn about those at a high level in Post #13 of this series, and then proceed to my follow-up work Neuroscience of human social instincts: a sketch (2024), and Social drives 1: “Sympathy Reward”, from compassion to dehumanization (2025), and Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking (2025), especially §3 of that last post on how Approval Reward leads to pride in one’s self-image, and staying true to your principles even when nobody is watching. (That said, if people are watching, and you thus win the approval of people you admire, so much the better!)
For the second question, my short answer is that people have a strong, salient association between thoughts of themselves, and thoughts of how they would look in someone else’s eyes. The former thoughts are required for strategically tying one’s hands, making precommitments, etc. And the latter thoughts summon social instincts like pride. This is the source of the asymmetry where social instincts “punch above their weight” in the context of intelligent self-reflective plans, such as precommitments, describing one’s life aspirations, and so on. For more detail (along with how “willpower” fits in), see §8.5.5–§8.5.6 of my “Intuitive Self-Models” series (2024), but be warned that it might not make sense without reading the earlier posts in that Intuitive Self-Models series, in which I try to unravel a bunch of misleading intuitions in how we think about our own minds.
The third question (on why and how ego-syntonic desires are “internalized”) is also addressed in that same series; see “Intuitive Self-Models” §3.5.4.
…
One way to put it is: why does the hunger-striker care about freedom, and not about, I dunno, ironing the wrinkles out of dollar bills? There has to be some explanation, right? And if you reply “it’s because freedom leads to (blah)”, then I’ll just reply, “OK, and why do they care about (blah), rather than, I dunno, measuring the distance between pebbles on the sidewalk?” We can go back and forth forever. The answer is: at the end of the day, something has to just plain feel intuitively good or bad. And that feeling has to come from an innate drive, one way or another.
Here are three more views on why we should believe that the Steering Subsystem is the ultimate source of not only ego-dystonic urges like hunger, but also ego-syntonic desires like friendship and justice.
- AI perspective: We don’t yet know in full detail how model-based RL and model-based planning work in the human brain (we don’t have brain-like AGI yet). But we do at least vaguely know how these kinds of algorithms work. And we know enough to say for sure that these algorithms don’t develop prosocial motivations out of nowhere. For example, if you set the reward function of MuZero to always return 0, then the algorithm will emit random outputs forever; it won’t start fighting for justice.
- Rodent model perspective: For what it’s worth, researchers have been equally successful in finding little cell groups in the rodent hypothalamus that orchestrate “antisocial” behaviors like aggression, and that orchestrate “prosocial” behaviors like parenting and sociality. I fully expect that the same holds for humans.
- Philosophy perspective: Without the Steering Subsystem, the only thing the cortex can do is build a world-model from predictive learning of sensory inputs (§4.7). That’s “is”, not “ought”. And “Hume’s law” says that you can’t get “ought”-statements from exclusively “is”-statements. Granted, not everyone believes in Hume’s law. But I do—see an elegant and concise argument for it here.
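A toy sketch of the zero-reward point from the AI perspective above. This is not MuZero, whose training loop is far more involved. It swaps in a much simpler one-step (bandit) value-learning agent. The action names are hypothetical labels borrowed from the examples earlier in this section. If the reward function always returns 0, every learned value stays at 0, so a greedy agent has no basis for preferring any action and ends up picking uniformly at random among the tied options.

```python
import random

# Hypothetical action labels; nothing about them is special to the agent.
ACTIONS = ["fight_for_justice", "iron_dollar_bills", "measure_pebbles"]

def reward(action: str) -> float:
    """A reward function that always returns 0, as in the thought experiment."""
    return 0.0

def train_values(steps: int = 1000, lr: float = 0.1, seed: int = 0) -> dict:
    """Incremental value estimates for a one-step (bandit) agent."""
    rng = random.Random(seed)
    values = {a: 0.0 for a in ACTIONS}
    for _ in range(steps):
        action = rng.choice(ACTIONS)  # explore uniformly
        values[action] += lr * (reward(action) - values[action])
    return values

def greedy_action(values: dict, rng: random.Random) -> str:
    """Pick the highest-valued action, breaking ties uniformly at random."""
    best = max(values.values())
    return rng.choice([a for a, v in values.items() if v == best])

values = train_values()
rng = random.Random(1)
choices = [greedy_action(values, rng) for _ in range(3000)]
print(values)  # every value is still 0.0
print({a: choices.count(a) for a in ACTIONS})  # roughly uniform counts
```

Nothing in the learning rule can single out "fight_for_justice". Any preference has to enter through a reward function that says some outcomes are better than others.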
This new section I added to Post 1 is also worth copying here; you wouldn’t believe how often people are confused by this:
1.3.4 So is “Brain-like AGI” a good plan? Or is it a threat model?
Lots of people assume that, if I’m devoting my career to brain-like-AGI safety, I must be very enthusiastic about brain-like AGI.
By analogy, if someone is devoting their career to rocket engine safety, odds are high that they’re a space nerd who thinks that rocket engines are really cool and great.
…But on the other hand, people devote their careers to earthquake safety too! Do those people think earthquakes are really cool and great? Of course not! But they recognize that earthquakes will come, whether we want them or not, so we’d better prepare.
Now as it turns out, I’m much more like the earthquake safety person than the rocket safety person: I think of brain-like AGI as a threat model. Honestly, I expect that brain-like AGI will probably kill us all, in a manner that makes Skynet (from the Terminator movies) look primitive and sentimental. You don’t have to agree! Indeed, both enthusiasts and naysayers have a strong shared interest in understanding potential safety problems and designing mitigations, in a constructive, pedagogical, technical, and detail-oriented way. That’s my aim in this series. But I did want to lay my cards on the table.
4. Is valence a linear function on “thoughts”?
In [Valence series] 2. Valence & Normativity (2023), I made a claim in §2.4.1 that “valence is a (roughly) linear function over compositional thought-pieces”, and then immediately in §2.4.1.1 listed a bunch of examples where that claim seems totally wrong—for example, “someone I hate is suffering” or “I’m gonna avoid traffic” (both “someone I hate” and “traffic” are bad, but those thoughts seem overall good). I had (and still have) sound neuroanatomical and algorithmic reasons to posit linearity, but in 2023 I had a pretty hazy understanding of why this hypothesis was not immediately ruled out by those examples. Anyway, I can explain this substantially better now (albeit still a bit hazy), and I rewrote §2.4.1.1 accordingly.
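A toy illustration of the linearity hypothesis and of why the §2.4.1.1 examples look like counterexamples. The thought-pieces and their weights are entirely made up. They do not come from the Valence series, and the sketch does not attempt the resolution in the rewritten §2.4.1.1. It only shows that, with these weights, a naive weighted sum over the pieces of each example thought comes out negative, even though both thoughts seem overall good.

```python
# Hypothetical weights: the valence contribution of each thought-piece.
WEIGHTS = {
    "someone I hate": -3.0,
    "is suffering": -2.0,
    "I'm gonna": 0.5,
    "avoid": 0.5,
    "traffic": -4.0,
}

def linear_valence(thought_pieces: list[str]) -> float:
    """Valence as a linear function: the sum of the weights of the active pieces."""
    return sum(WEIGHTS.get(piece, 0.0) for piece in thought_pieces)

schadenfreude = ["someone I hate", "is suffering"]
avoiding_traffic = ["I'm gonna", "avoid", "traffic"]
print(linear_valence(schadenfreude))     # -5.0, yet the thought seems overall good
print(linear_valence(avoiding_traffic))  # -3.0, yet the thought seems overall good
```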
(Almost all of those edits were in response to criticisms from Rif A. Saurous, many thanks to him.)
- Raemon 22 Jun 2026 23:29 UTC
  11 points
  FYI, every time I hear you say LLMs will plateau (or, “will never be able to take over the world” or whatever), I have to do a bunch of work to figure out if you mean “LLMs + shit tons of diverse RL environments.”
  I think you also think LLMs + shit tons of diverse RL environments are unlikely to take over the world, but I’m not sure. And your framing makes it feel like you’re centering the argument in a very weird place. Who cares whether LLMs in their original form are going to scale to AGI? That’s clearly not the mechanism by which they will scale to AGI. “LLM-base-models + tons of diverse RL environments” are obviously the default path, and I think even people who are relatively bullish on LLMs who don’t think very hard about it are still implicitly assuming that.
  (for me specifically, I’m also mentally tacking on “diverse RL environments that require long horizon conceptual reasoning to solve”, which most LLM-bulls are not thinking through, but, I think they’ll eventually bump into by accident)
  But, I don’t get why the particular limitations of LLMs should be the particularly loadbearing part of the description of “the thing that won’t work.”
  (I’m not very educated here, but my impression is that right now RL is like 20% of the compute spent training current LLM-agents, and I’m imagining a world where it’s more like 80%, 95%, or something. The main bottleneck seems to be figuring out how to construct the relevant RL environments at scale, but that doesn’t have much to do with LLMs. Someone tell me if I’m being dumb here.)
  So, this is a) asking for specific clarity on that, b) arguing that your framing about this is weird and you should change it somehow.
  - Steven Byrnes 23 Jun 2026 0:35 UTC
    9 points
    I think we’re mostly on the same page, in the sense that I feel like RLVR is now thoroughly baked into the definition of the word “LLM”, and thus if anyone is talking about “future LLM progress” in 2026, then they are definitely referring to the use of ever more (and better) RLVR, in addition to the use of ever more (and better) pretraining data, parameter count, inference-time compute, and so on. Right? At least, that seems self-evident to me.
    If you think that the right kind and quantity of RLVR environments will lead to LLMs that can take over the world (or whatever), then you’re disagreeing with me, and I’m confident that you’d also be disagreeing with other LLM naysayers like LeCun, Sutton, Melanie Mitchell, Gary Marcus, and all the rest of my strange bedfellows on this issue. (And that’s fine, of course.)
    The 2023 OP in question was AI doom from an LLM-plateau-ist perspective, which is a bit more specific (see the little Venn diagram thing, although it’s too outdated to discuss RLVR specifically; see also footnote 4). But that post was not a case for LLM-plateau-ism (or whatever we want to call it) anyway. I’ve actually never published a Steve’s Case Against LLMs Ever Becoming A Proper Superintelligence Even With Much More And Better RLVR And Also More Parameter Count And All The Other Things, and I’m not planning to, although my perspective has a bunch of facets and I’ve written about some of them in scattered posts and comments. It’s also possible that LLMs might be able to take over the world and/or kill everyone even without becoming a proper superintelligence; I’d bet against it, but it’s hard to be sure.
    Also, I often talk about my opinions without defending them. But I try to make it clear that that’s what I’m doing.
    - Raemon 23 Jun 2026 2:32 UTC
      4 points
      Cool. Yeah I had just re-read AI doom from an LLM-plateau-ist perspective and still was a bit confused.
      Is the part that the model you’re starting with is an LLM, as opposed to some different RL base architecture, particularly loadbearing?
      And yeah, seems fine to state the opinion without defending it. I just wanted more clarity on which opinion you weren’t defending :P
      - Steven Byrnes 23 Jun 2026 2:58 UTC
        5 points
        Loadbearing on what?
        I mean: one topic of discussion is the alignment problem. I work on the technical alignment problem for certain non-LLM AI architectures. I find that LLM alignment (and control etc.) literature is mostly not applicable to my worldview. (I try to skim it just enough to know what I don’t know.) And we can chat about what are the important alignment-relevant disanalogies that I see between LLMs and these other kinds of AIs.
        OR, a different topic of discussion is how soon ASI will arrive. Again, I have different expectations, and we can talk about the timelines-relevant disanalogies that I see between LLMs and these other kinds of AIs.
        Or, yet another topic is whether ASI is likely to be invented at Anthropic, OpenAI, etc., versus some different outfit. Or, yet another topic is how much compute it will require. Or how suddenly it will emerge. Etc. These are all different discussions. :)
        Raemon 23 Jun 2026 3:51 UTC
        2 points
        Okay yeah fair enough.
    - RobinHa 23 Jun 2026 6:49 UTC
      2 points
      Just to make sure: when you say “will lead to LLMs that can take over the world (or whatever)”, does this include very, very high RSI? A lot of ML lends itself very well to RLVR (in theory); it’s just that the durations (and training compute) are still a bit challenging as of right now. Personally, my timeline would roughly be very high RSI in ML using LLMs, inventing new and better architectures and training algorithms, and quite possibly ending up far removed from LLMs. Do you also consider this unrealistic, or is your point more that what we land on down the line isn’t an LLM?
      - Steven Byrnes 23 Jun 2026 10:44 UTC
        3 points
        I’m expecting a future paradigm shift, and I don’t think it matters terribly much whether that next paradigm of AI is invented by humans, or by old-paradigm AIs, or by the two working together (although for the record I expect it to be mainly humans). At least, it doesn’t matter much for technical alignment. I guess it’s relevant for timelines.
        Raemon 23 Jun 2026 20:15 UTC
        4 points
        Okay, the last time I was having this argument with a (different) someone, it was primarily about timelines, and they specifically disbelieved LLM-descendants would be capable of inventing the next paradigm. Sounds like you don’t have a strong take on that?
        
        I totally agree there will be a new paradigm by the time we get to overwhelming superintelligence. I don’t think there is necessarily a new paradigm by the time we get to “human-reasoning-complete” AGI. Is that something you have a strong belief on?
        Steven Byrnes 23 Jun 2026 21:24 UTC
        5 points
        I think LLMs are (and will always be) worse than humans at the kind of research involved in inventing a new AI paradigm. I think they’ll be helpful in the same way that PyTorch and GitHub and arXiv are helpful, but not involved in the way that’s similar to massively multiplying the quantity or quality of intellectual labor working on the problem. So when I think about timelines, I kinda assume business as usual, and am thinking loosely about historical analogies and how things seem to be going, and I wind up saying … I dunno, probably 5–25 years, or maybe more than 25 years, who knows, or maybe less than 5 years, who knows. I have a bit of discussion in §1.9 here (+ §1.4.1).
        Andrii Vasylenko 23 Jun 2026 22:16 UTC
        1 point
        I think LLMs are (and will always be) worse than humans at the kind of research involved in inventing a new AI paradigm. I think they’ll be helpful in the same way that PyTorch and GitHub and arXiv are helpful, but not involved in the way that’s similar to massively multiplying the quantity or quality of intellectual labor working on the problem.
        I basically agree with that description, as applied to the LLMs of 2026. However, I think that there’s a smooth-ish path in design space from here to actually-dangerous AGI. It seems plausible to me that the AI companies will be able to follow that path to its destination over the course of several years.
  - Random Developer 23 Jun 2026 7:03 UTC
    4 points
    Who cares whether LLMs in their original form are going to scale to AGI? That’s clearly not the mechanism by which they will scale to AGI. “LLM-base-models + tons of diverse RL environments” are obviously the default path, and I think even people who are relatively bullish on LLMs who don’t think very hard about it are still implicitly assuming that.
    
    I’m not the person you’re replying to, but I suspect that LLMs+more RL will not scale to AGI. I think that current LLMs are a convincing existence proof that we’ll get to AGI, probably within my lifetime, but they’re not the actual thing. The actual thing will probably require a major breakthrough in either flobination ^[1] or grinkling.
    
    But on the other hand, I don’t think this makes much difference. So what if you turn out to need LLMs+RL+grinkling before you build AGI? I fear we’ll make it there quickly, and once we do, we’ll pass quickly enough to ASI. This wouldn’t necessarily require some dramatic kind of RSI. My impression of Opus and especially Fable is that they’re already superhuman in important ways, but that they’re “spiky.” If the spikes in Fable were to be smoothed out, then I think it would already be a weak superintelligence ^[2] at the far upper end of the human distribution.
    
    [1] Since I personally believe that robust, enforced alignment is straight up impossible in principle, and that actual AGI has a significant chance of being a diagnosis of stage 4 invasive cancer for everyone I know, I make a point of not discussing specifically what I suspect is missing from current AIs. There is no real point in disclosing inherently unfixable security problems. “Security by obscurity” is inferior for problems that can be fixed. But in the real world, there’s an enormous difference between “a physical security risk that experts can figure out” and “a security risk that is common cultural knowledge.”
    
    [2] “Weak” superintelligence: Does basically what the most capable humans do, but significantly faster or with much higher output. “Strong” superintelligence: We couldn’t do much of what it does even with many human lifetimes of effort from our best thinkers. I think we’ll get the weak version more or less automatically once we get AGI at all.
- Thane Ruthenis 22 Jun 2026 21:13 UTC
  4 points
  I’ll probably annoy all sides with this not-really-an-apology that I just tacked onto the beginning of my LLM-skepticism-related post from Apr 2023:
  Not so: the side consisting of myself is in full agreement with that addendum.
