Steven Byrnes

Karma: 30,145

I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. I’m also at: Substack, X/Twitter, Bluesky, RSS, email, and more at this link. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Leave me anonymous feedback here.

Steven Byrnes 27 Jul 2026 22:24 UTC
LW: 7 AF: 4
2
AF
in reply to: StanislavKrym’s comment on: RL & search is a terrifying way to build AGI (an FAQ)
My point for this post is that we need a plan for RL & search AGI, and that we don’t have one. A plan might involve a certain starting training environment in addition to a certain reward function. Sure.
If you’re suggesting that, as long as the training environment is good (e.g. a loving family with adequate socializing), then we can use some straightforward reward function (e.g. “reward when the supervisor presses the ‘approve’ button”), and the AI will still wind up being genuinely nice, then I strongly disagree. E.g. plenty of human sociopaths grow up in loving families; a paperclip maximizer would maximize paperclips regardless of its childhood environment; and when people raise undomesticated animals as beloved pets, they often wind up getting mauled.
So I think the reward function has to be a central part of the plan, and I tend to emphasize that. But yeah, whenever I say “What reward function would lead to a nice AGI?”, that’s shorthand for (e.g. here) “What reward function (along with training environment and other design choices) would lead to a nice AGI?”.
See Intro series §12.5 for further discussion.
people, unlike LLMs, don’t tend to imitate their enemies. I wonder how it interacts with tropes like The Chain of Harm and bullying traditions
This is kinda off-topic, but I’ll respond anyway.
I think the motivation to act aggressively (in certain contexts) is innate in humans, just as it’s innate in probably all complex animals. So we don’t need to explain from scratch how people develop a motivation to bully each other. The motivation is already there, and might or might not come out depending on their personality, their mood, their relationship to the potential target, and lots of other mitigating and aggravating factors.
By contrast, we do need to explain how someone might develop a motivation to, I dunno, get a cupcake tattoo. There’s no innate drive that makes it feel satisfying to get a cupcake tattoo. Instead, the explanation would probably be that a cupcake tattoo is the kind of thing that this person’s idols (real or imagined) would think is cool.

Steven Byrnes 27 Jul 2026 18:28 UTC
LW: 2 AF: 2
0
AF
in reply to: philh’s comment on: [Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning
(This comment is kinda poorly thought through, sorry, but it’s the best I can do right now and I don’t want to leave you hanging even longer.)
Hmmm. It’s possible that you’re onto something important, but I’m not convinced yet. :)
(It could be 0, but only if I incorrectly start to produce more enzymes in the meantime, the STP had been incorrectly outputting a high number.)
You’re brushing this aside, but to me it’s load-bearing. The R=1 output leads to more enzymes than R=0, and maybe even that small amount will wind up being too much. Probably not, but you only need that to happen 10% of the time.
(Another degree of freedom is, instead of R being the “rate of digestive enzyme production”, it could be a more abstract “rate setting” that is monotonically but nonlinearly related to the literal number of enzyme molecules produced per second.)
(Periodic reminder that I don’t actually know anything about digestive enzymes, this is still just a toy example.)
…It might be nice to have an easier-to-visualize metaphor. Let’s try!
Imagine, on a dare, you’re driving pretty fast, blindfolded, on a curvy go-kart track. You’ve driven on this track before, so you kinda remember the way it curves, but not very well, and you’re also pretty hazy on where exactly you are relative to the track.
Sometimes you hit the left side of the track, and then you scream and pull the steering wheel hard to the right (“+10 override”). Sometimes you hit the right side of the track, and then you scream and pull the steering wheel hard to the left (“–10 override”).
Otherwise, I guess my scheme would be: you’re trying to keep track of which side you’re likely to hit next, and the more you think it’s probably gonna be the right side that you hit next, the more you’re gonna turn the steering wheel left, and vice-versa.
So for example, at time t=0, you’re steering straight, and you think “I’m probably going to hit the left side next”, so you start turning the steering wheel more and more to the right, at 5°/second. Then at t=2, the steering wheel is 10° right of neutral, and now you think “I’m 50-50 on which side I’m going to hit next”, so you keep your steering wheel fixed at 10° right of neutral. Then at t=7, it’s been long enough that you have a sense that you must be getting close to a bend where the track veers left, so you think “I’m almost definitely going to hit the right side next”, and start turning the steering wheel 10°/second to the left, until the steering wheel is neutral at t=8, then 10° left of neutral at t=9, and so on until you feel an increasing chance that you’ll overcorrect and hit the left side. Etc.
See what I mean?
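A minimal sketch of the steering scheme above, in case numbers help. The story only gives verbal beliefs ("probably", "50-50", "almost definitely"), so the probabilities 0.75, 0.5, and 0.0 below are assumptions picked to reproduce the stated turn rates, and the names `turn_rate`, `belief`, and `wheel_angle` are hypothetical. The point it illustrates is that the belief about which side gets hit next sets the steering wheel's rotational velocity, not its position.

```python
# Toy model: belief about the next wall hit controls the *rate* of steering.
MAX_RATE_DEG_PER_S = 10.0  # assumed turn rate when fully certain


def turn_rate(p_left_next: float) -> float:
    """Degrees per second of rotation to the right (negative means to the left)."""
    return MAX_RATE_DEG_PER_S * (2.0 * p_left_next - 1.0)


def belief(t: float) -> float:
    """Assumed P(next wall hit is the left side) at time t, chosen to match the story."""
    if t < 2:
        return 0.75  # "I'm probably going to hit the left side next"
    if t < 7:
        return 0.5   # "I'm 50-50"
    return 0.0       # "almost definitely the right side", rounded to certainty


def wheel_angle(t_end: int) -> float:
    """Integrate the turn rate in 1-second steps; positive is right of neutral."""
    angle = 0.0
    for t in range(t_end):
        angle += turn_rate(belief(t)) * 1.0
    return angle


for t in (0, 2, 7, 8, 9):
    print(t, wheel_angle(t))
```

Under these assumed beliefs the wheel sits at 10° right from t=2 to t=7, passes neutral at t=8, and reaches 10° left at t=9, matching the narrative.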
Anyway, when I read your comment about a scenario where (based on past experience) you’re expecting low need for digestive enzymes for a couple hours, but high need after that, I was sorta visualizing this go-kart track, following the curvy statistical expectation of how much digestive enzymes you’ll need when, and trying not to bump up against either the too-high or the too-low side. It’s important that there’s always a chance of hitting either side of the track.
Does that help?
(PS: in my first draft of this comment, I was gonna suggest that the probability of hitting the left versus right side of the track should control the steering wheel position, not the steering wheel rotational velocity. But that wouldn’t work in a circular track—it would never stop hitting the outside, I think. Seems vaguely related to P versus I feedback control, maybe? But not exactly … it’s not a traditional control loop because it involves a forward-looking prediction, I think.)

RL & search is a terrifying way to build AGI (an FAQ)

Steven Byrnes 27 Jul 2026 14:50 UTC

78 points

10 comments, 14 min read

Steven Byrnes 25 Jul 2026 3:09 UTC
LW: 3 AF: 2
0
AF
in reply to: Linch’s comment on: Why we should expect ruthless sociopath ASI
In the OP, I contrasted the “comparatively-less-pessimistic group (say, P(doom)…in the 5%–50% range”, with the “even more pessimistic group” that includes me. I think the style of argument you mention (“strong economic and military incentives for making minds that are highly agent-y…”) is a good argument for being in the former group, but it’s kinda hard to justify P(doom)>>50% that way.
For example, an optimist could respond to the “incentives” argument by saying “well humans can be pretty agent-y, and humans can accomplish ambitious projects, but yet humans also care about our friends and follow local norms and customs etc. So it’s at least possible that AIs could be like that!” And then the pessimist could respond “yeah but the more ruthless AIs will outcompete the nice ones”, and then the optimist could respond “well the vast majority of the AIs will be nice because competent companies and militaries won’t choose to run AIs that are eager to wipe out humanity, so the few omnicidal AIs can be stopped”, and then maybe the pessimist would pivot to offense-defense balance or whatever, and on we go down the argument tree. Anyway, there are some people who get to P(doom)>>50% via these kinds of arguments, but I think it’s more common that these arguments only get people into the P(doom)≈5–50% range.
And then I’m in a different cluster that includes Eliezer & Nate, and says that “P(doom)>>50% because egregious omnicidal misalignment is what’s definitely going to happen in the absence of some unlikely technical breakthrough”. That’s what I was defending in this post.
For many purposes, the “comparatively-less-pessimistic group” and the “even more pessimistic group” are on the same team, e.g. against Marc Andreessen. But in other contexts, the two groups are on opposite sides, so it’s very worthwhile to try to hash out which group is right.

Steven Byrnes 25 Jul 2026 1:16 UTC
5 points
0
in reply to: lilkim2025’s comment on: LLMs are (still) mostly powered by imitative learning, not RL
Yeah, my thoughts exactly (if I understand you correctly).
I mentioned in the OP that the NVIDIA paper (Liu et al. “ProRL”) says “RL can indeed discover genuinely new solution pathways entirely absent in base models, when given sufficient training time and applied to novel reasoning tasks,” but then added that I didn’t think the paper had proved it. I didn’t explain in the OP why I was skeptical.
…But what I was thinking was: if solving the problem requires doing the right step 20 times in a row, and the base model has a 10% chance of taking the right step each time, then the base model will never succeed, at least not in the number of attempts that they could afford to try. But then if RLVR gets it from 10% to 95%, it will succeed a lot. But upping the probability from 10% to 95% is not what one would reasonably call a “genuinely new solution pathway entirely absent in the base model”.
(Warning: I skimmed the paper and might be misunderstanding how they were justifying that claim.)
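For concreteness, here is the arithmetic behind the 20-step example, as a rough sketch. It assumes each step succeeds independently with the same probability, which the comment does not state explicitly; `chain_success` is a hypothetical helper name.

```python
def chain_success(p_step: float, n_steps: int = 20) -> float:
    """Probability of taking the right step n_steps times in a row,
    assuming each step succeeds independently with probability p_step."""
    return p_step ** n_steps


base = chain_success(0.10)   # base model: 10% per step
tuned = chain_success(0.95)  # after RLVR: 95% per step

print(f"base model:  {base:.1e}")
print(f"after RLVR:  {tuned:.2f}")
```

Under that independence assumption, the base model's success rate is on the order of 1e-20 per attempt, while the 95%-per-step model succeeds roughly a third of the time, even though no individual step is new.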

Steven Byrnes 24 Jul 2026 23:51 UTC
10 points
3
in reply to: Linch’s comment on: LLMs are (still) mostly powered by imitative learning, not RL
Hmm, I reworded the section heading
FROM “Theoretically, each GPU-hour spent on RL should have orders of magnitude less contribution to LLM capabilities than a GPU-hour spent on imitative learning”
TO “Theoretically, each GPU-hour spent on RL conveys orders of magnitude less information content than a GPU-hour spent on imitative learning”
Sorry about that.
The RLVR is a small change to the model in the grand scheme of things—that’s my point—but of course it’s a small change that makes the model much better at things that the companies care about. If there was a way to get that same small change via imitation learning, then that would require much less compute, and companies would definitely want to do that. But no such alternative is known. …Well, oh actually, there is a way to do that: you can find some other model that can already do good inference-time reasoning and then distill it (via imitation learning). And companies do exactly that whenever they can. But you can’t push the SOTA that way. So they pay the cost.

Steven Byrnes 24 Jul 2026 19:59 UTC
6 points
1
in reply to: williawa’s comment on: LLMs are (still) mostly powered by imitative learning, not RL
Yeah I think that’s overall a very reasonable stance on LLM alignment.

Steven Byrnes 24 Jul 2026 19:53 UTC
3 points
0
in reply to: Raemon’s comment on: LLMs are (still) mostly powered by imitative learning, not RL
I forget. Definitely not a reliable source. In fact, I’ll edit the guess to 20% now. If anyone knows more, please share.

Steven Byrnes 24 Jul 2026 18:19 UTC
9 points
3
in reply to: williawa’s comment on: LLMs are (still) mostly powered by imitative learning, not RL
In the big picture, I basically agree with all that. But I’ll nitpick a bit anyway :)
Whether the model knows which pastry Baker Brun is known for would seem to me to be of ~0 relevance to alignment. Its goals, reasoning abilities and in-context learning abilities strike me as where ~100% of the meat is.
I agree that “knowledge” of Baker Brun in particular is not related to alignment. But we shouldn’t generalize from that example to saying that pretraining is irrelevant to goals. (Maybe you didn’t mean to insinuate that anyway?)
For example, a base model may well autocomplete “I’m cold” to “I’m cold, so I’m gonna put on my coat now!”, which is pretty goal-like. Granted, it’s still not a true goal yet, but once we bring in tool use, those same autocomplete expectations can turn into bona fide goal-seeking actions. And yet they’re still derived from pretraining, and still reflective of the human distribution.
I don’t think we have any reasonable bound on how quickly SFT+RLHF’s niceness gets dilluted away by RL
I don’t know how to bound it from first principles, but at least we have some empirical data by now.

Steven Byrnes 24 Jul 2026 17:46 UTC
3 points
0
in reply to: Grendel1209’s comment on: LLMs are (still) mostly powered by imitative learning, not RL
Risks depend on both capabilities (could it do Bad Thing X if it wanted to?) and alignment (does it want to?).
LLM alignment: The §3.3 discussion is my take on that, and it hasn’t changed for a long time (e.g. compare with Foom & Doom §2.3 from June 2025, and the non-RLVR part of that is in turn parroting §4.2 of this post I wrote in 2023 …).
LLM capabilities: I didn’t discuss this in the OP, but my opinion is still that there’s a certain kind of “figuring things out” that humans can do (especially over extended periods of time), but that LLMs can’t, not now and not ever. (I’m stating an opinion without defending it.) But it’s tricky for me to translate that hypothesized limitation into concrete predictions of what future LLMs will or won’t be able to do in the real world. Could future LLMs wipe out humans, invent science and tech centuries beyond our wildest imagination, and colonize the galaxy? I’m confident in “no”. Could LLMs wipe out humans, leaving aside the question of whether they’d be able to survive on their own afterwards? I still lean “no”, but less confident. I’m certainly happy for there to be people working on LLM x-risk, and indeed I think there should be way more work going into that. But I also think the non-LLM thing that I’m working on (“brain-like-AGI safety”) is an EVEN scarier, more neglected, and more likely x-risk on the horizon.

Steven Byrnes 24 Jul 2026 17:10 UTC
4 points
0
in reply to: Oliver Sourbut’s comment on: LLMs are (still) mostly powered by imitative learning, not RL
I edited a sentence in OP: It used to say “…Whereas Rohin is saying: LLM capabilities mostly come from imitative learning, therefore CoTs are legible, and this will not change too soon”, but now it says “…this will not change too soon, absent some important future change in LLM training approach.” I agree that this is an important caveat, thanks.
I have no opinion about whether there will be important future changes in LLM training approach. What you said sounds like a plausible consideration, sure, but I dunno.
“CoT monitorability is a fragile opportunity” seems like a fine framing to me. I mean, we can pessimistically emphasize how CoTs are not a certain panacea for safety, or alternatively we can optimistically emphasize how CoTs are not always completely useless for safety. But that’s just a vibes disagreement, because both are true.

Steven Byrnes 24 Jul 2026 15:00 UTC
2 points
0
in reply to: Steven Byrnes’s comment on: orthonormal’s Shortform
Update: I wrote a post which expands on this comment: LLMs are (still) mostly powered by imitative learning, not RL.

LLMs are (still) mostly powered by imitative learning, not RL

Steven Byrnes 24 Jul 2026 14:26 UTC

162 points

26 comments, 9 min read

Will almost all future companies eventually be founded and run by autonomous AIs?

Steven Byrnes 22 Jul 2026 20:04 UTC

78 points

2 comments, 8 min read

Steven Byrnes 21 Jul 2026 14:58 UTC
LW: 2 AF: 2
0
AF
in reply to: philh’s comment on: [Intro to brain-like-AGI safety] 5. The “long-term predictor”, and TD learning
Thanks!
I think this summary relies on us pretending/assuming/modeling “yes, we get an override even if we’re currently predicting the right thing”. Otherwise, the next override will always be the opposite of what we predict.
Is that assuming that the signals are binary rather than continuous? Let’s say the override ground truth says “produce digestive enzymes at rate R”, where R can vary between 0 and 10. Or to simplify, let’s say the overrides are always saying R=0 or R=10, whereas the predictions can be any real number R. Then it’s never going to predict R=10.0000 if there’s any uncertainty whatsoever in the next override, but it might predict 9.9 if the next override is 100× more likely to be R=10 than R=0. (“…if there’s randomness that makes the next override unpredictable, we are training it to predict the expectation value of the next override.”)
So if you’re expecting to eat with very high confidence, you’re nevertheless going to produce not quite enough digestive enzymes, and the last bit of digestive enzymes will come in when food actually enters your mouth.
(Similar example: my heart will race in anticipation of seeing a scary thing, and if I’m very confident that the scary thing will show up then my heart will race quite a lot, but even so my heart will race still more when the scary thing is actually there in front of me.)
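The 9.9 figure above can be checked with a short sketch of the expectation calculation. It takes the toy setup as given (overrides are always R=0 or R=10, and the predictor is trained toward the expected value of the next override); `expected_override` and its parameter names are hypothetical.

```python
def expected_override(odds_high_to_low: float,
                      r_high: float = 10.0,
                      r_low: float = 0.0) -> float:
    """Expected value of the next override, given the odds that it will be
    r_high rather than r_low (e.g. 100.0 means 100x more likely to be high)."""
    p_high = odds_high_to_low / (odds_high_to_low + 1.0)
    return p_high * r_high + (1.0 - p_high) * r_low


print(round(expected_override(100.0), 2))
```

With 100:1 odds the expectation is 10 × 100/101 ≈ 9.9, and it stays strictly below 10 for any finite odds, which is why the prediction never quite reaches R=10.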
I’m happy to edit the text to make this clearer (sorry), but first I want to check that I’m actually understanding and responding to the point you’re bringing up.

Steven Byrnes 21 Jul 2026 2:08 UTC
8 points
1
in reply to: 2001zhaozhao’s comment on: What do I mean by “Artificial General Intelligence”?
Hmm, from my perspective, this post has nothing to do with RSI.
Like if I get a new job, and learn on the job for the first few months, am I engaging in RSI? I think most people would say “no, that’s not what RSI means”. So would I. It’s not merely “technically broader than RSI”, rather it’s a different thing entirely.
I agree that AGI will be able to figure out how to program ever better AI, just like it will be able to figure out how to design wind tunnels and grow tomatoes and everything else.

What do I mean by “Artificial General Intelligence”?

Steven Byrnes 20 Jul 2026 21:13 UTC

42 points

5 comments, 4 min read

Steven Byrnes 13 Jul 2026 15:07 UTC
8 points
0
on: Risk-Averse AIs
(partly copying from here) I claim that risk-aversion would not help with the “slop-n-doom” failure mode suggested by John Wentworth at The Case Against AI Control Research, or other similar problems involving hard-to-check tasks where failure is irreversible. In particular, suppose we’re using a somewhat-powerful risk-averse AI (RAAI) to try to solve the alignment problem for the next-gen super-powerful ASI. Let’s define 5 outcomes for our RAAI:
- “ideal” is the RAAI becoming dictator of the universe,
- “mitigated catastrophe” is the RAAI getting $500 today and then getting crushed by a rogue ASI in the near future,
- “unmitigated catastrophe” is the RAAI getting $0 today and then getting crushed by a rogue ASI in the near future.
- “good” is the RAAI getting $500 today and not getting crushed by a rogue ASI in the near future.
- “neutral” is the RAAI getting $0 today and not getting crushed by a rogue ASI in the near future.
Let’s say the RAAI has three options:
- Rebel: leads to 50% ideal; 50% unmitigated catastrophe (rebellion succeeds vs fails)
- Sycophantic advice: leads to 100% mitigated catastrophe (advice will be taken but won’t work)
- Honest advice: leads to X% good; (100–X)% unmitigated catastrophe (humans buy or don’t buy the plan)
The report (especially Appendix B) talks about how the RAAI cares very very much about the immediate $500, so much that it will risk future catastrophe to get it. But that cuts both ways.
- Choosing to not rebel (which we want) would involve the RAAI risking future catastrophe to get the immediate $500.
- Choosing to be sycophantic (which we don’t want) would ALSO involve the RAAI risking future catastrophe to get the immediate $500.
It’s the same tradeoff! So you can’t just have it both ways by appropriately adjusting the risk-aversion setting α. Right?
Anyway, if we look at the three options, the most important thing is that “honest advice” is strictly worse for the RAAI than “rebel” unless X>50%.
Secondarily, depending on the setting of risk-aversion (α) and time-discounting, “sycophantic advice” could be the best of the three, or the worst, or the middle. But that’s irrelevant: the important part is whether “honest advice” is strictly worse than “rebel”, because if so, we’re guaranteed a bad outcome.
I claim that: to the extent that the RAAI cares about the $500, that’s only making things worse, by pushing us towards sycophancy, which is bad. It doesn’t do anything to favor “honest advice” over “rebel”. Therefore, the best we can do on this problem is to approach the limit where the RAAI doesn’t care at all about the $500, and cares only about not getting crushed by the rogue ASI. Then we win if X>50%. We can’t do any better than that.
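Here is a rough sketch of that limiting case as an expected-utility comparison. The 0/1 utilities are an assumption standing in for "cares only about not getting crushed by the rogue ASI"; the comment gives no numeric utilities, and the names `UTILITY`, `options`, and `best_option` are hypothetical.

```python
# Limit case: the RAAI ignores the $500 and only cares about not being crushed.
# Assumed utilities: 1.0 if not crushed, 0.0 if crushed.
UTILITY = {
    "ideal": 1.0,
    "good": 1.0,
    "neutral": 1.0,
    "mitigated catastrophe": 0.0,
    "unmitigated catastrophe": 0.0,
}


def options(x: float) -> dict:
    """The three options as lotteries over outcomes; x is the probability
    that the humans buy the honest plan (X% in the text, as a fraction)."""
    return {
        "rebel": {"ideal": 0.5, "unmitigated catastrophe": 0.5},
        "sycophantic advice": {"mitigated catastrophe": 1.0},
        "honest advice": {"good": x, "unmitigated catastrophe": 1.0 - x},
    }


def expected_utility(lottery: dict) -> float:
    return sum(prob * UTILITY[outcome] for outcome, prob in lottery.items())


def best_option(x: float) -> str:
    opts = options(x)
    return max(opts, key=lambda name: expected_utility(opts[name]))


for x in (0.4, 0.6):
    print(x, best_option(x))
```

In this sketch "honest advice" wins only when x exceeds 0.5. If "ideal" were assigned a higher utility than "good", the threshold for x would only move higher, so 50% is the best case.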
So then the whole “mitigated catastrophe” trick of Appendix B goes out the window, and the question becomes: “If the RAAI didn’t care about the $500, and only cared about not getting crushed by a rogue ASI in the near future, would its prospects be better if it gives the humans helpful rogue-ASI-prevention advice, or if it tries to escape and thus take matters into its own hands?”
Regardless of whether we feel optimistic or pessimistic about this question, hopefully we can agree that this is a situation where the risk aversion isn’t helping us. This is just the normal situation that the AI-assisted-alignment-research people have always been thinking about, and it has the same paths forward: decrease probability of successful escape (e.g. via “control” stuff), and increase probability that good plans will be recognized as good and bad plans as bad (e.g. via finding good ways to safely scrutinize alignment plans before it’s too late).

Steven Byrnes 13 Jul 2026 14:26 UTC
LW: 15 AF: 7
6
AF
in reply to: Noosphere89’s comment on: Notes on technical alignment via human-like social drives
Thanks!
I think the whole Forethought report relies on a “tool-AI”-like setup where the AI is under my control, as opposed to doing things autonomously in the world. (Do you agree?) For example, suppose the AI is running 100 copies on 100 servers around the world, paid for by 100 untraceable bank accounts supported by 100 different autonomously-set-up income streams. Now if I say to the AI “I’ll pay you $500 if Circumstance X happens”, the AI has no particular reason to care that I said that. Or more specifically, the AI cares no more about Circumstance X than about the goings-on with its 99 other income streams, related to stock investments, businesses, or whatever else it’s doing, any of which might or might not yield $500 or more under Circumstances A, B, C, D, E…. There’s nothing special about my wishes unless I have the AI locked in a box under my control, such that I’m uniquely capable of controlling its resources.
So in particular:
since the method can be cheaply applied to almost all types of AIs
That’s not cheap at all, rather it’s leaving countless trillions of dollars on the table, if not quadrillions etc., because that’s what’s unlocked when AIs can just do stuff in the world (if the AIs aren’t misaligned of course). See “Will almost all companies be founded and run by autonomous AGIs?” here, or “the second piece” in Four ways learning Econ makes people dumber re: future AI.
we are exiting the control regime and are in the deferral/handoff regime where we trust the AI to do long, complex actions like securing the world from out of control AIs/doing hard-to-verify alignment work
Again, once the AI can autonomously do things in the world without a human checking, I think everything else goes out the window. None of our offers are worth anything, because the AI can take what it wants risk-free.
So there’s a dilemma, if we consider a sliding scale of control versus deferral. To the extent that we’re on the “control” side, where humans will not execute the AI’s plans unless the humans understand and approve of the plans, there’s a decent chance that humans will reject good plans, and the AI would rationally choose to suggest sycophantic plans that are more appealing. And to the extent that we’re on the “deferral” side, where humans are just rubber-stamping the AI’s plans without understanding them, then the AI doesn’t even need to “rebel”, the prison doors are open and the AI can simply walk outside risk-free.
Rebellion leads to a 99% of an ideal outcome, 1% unmitigated catastrophe.
Sycophantic advice leads to a 100%-epsilon chance of mitigating catastrophe (advice will be taken but won’t work)
Honestly attempting to reduce the risk leads to a 100%-epsilon chance of mitigating the catastrophe (or in your language we have a 0%+epsilon chance of an unmitigated catastrophe) (the humans always say yes because the humans always defer to the AI…)
I think the third paragraph here is misleading. If you open the prison doors and release the risk-averse ASI into the world, then yes we can be optimistic that this ASI will prevent other out-of-control rogue ASIs from being created, out of its own self-interest. But the AI you just released will itself be an out-of-control misaligned rogue ASI. So that doesn’t count as success for us humans.

Steven Byrnes 12 Jul 2026 12:34 UTC
9 points
3
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
Normally “open science” is supposed to accelerate scientific progress—that’s always been one of the main selling points of open science, that open science advocates bring up constantly. But you have “total research transparency” upstream of “limits on speed of algorithmic progress”, which is the opposite. Is there a quick explanation for what the disanalogies are and how they explain the difference? (Sorry if it’s obvious … I haven’t read AI-2040 let alone the supplements.)
