Steven Byrnes

Karma: 22,276

I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.

Steven Byrnes Jun 2, 2025, 4:33 PM
2 points
0
in reply to: Zack_M_Davis’s comment on: The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
Hmm, you’re probably right.
But I think my point would have worked if I had suggested a modified version of Go rather than chess?

Steven Byrnes Jun 2, 2025, 2:00 PM
2 points
0
in reply to: RogerDearnaley’s comment on: The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
Let me give you a detailed prescription…
For example, people want AIs that can autonomously come up with a new out-of-the-box innovative business plan, and found the company, and grow it to $1B/year revenue, over the course of years, all with literally zero human intervention.
People are trying to build such AIs as we speak, and I don’t expect them to quit until they succeed (or until we all die from their attempt).
And it’s possible—if human brains (or groups of human brains) can do this, so can AI algorithms. But human brains involve (model-based) RL. It’s an open question whether there exists a non-RL algorithm that can also do that. (LLMs as of today obviously cannot.)
I think the issue here is: “some aspect of the proposed input would need to not be computable/generatable for us”.
If the business is supposed to be new and out-of-the-box and innovative, then how do you generate on-distribution data? It’s gonna be something that nobody has ever tried before; “out-of-distribution” is part of the problem description, right?
Can you (or anyone) explain to me why there could be a problem that we can only solve using RL on rated examples, and could not do via SGD on labeled examples?
Not all RL is “RL on [human] rated examples” in the way that you’re thinking of it. Jeff Bezos’s brain involves (model-based) RL, but it’s not like he tried millions of times to found millions of companies, and his brain gave a reward signal for the companies that grew to $1B/year revenue, and that’s how he wound up able to found and run Amazon. In fact Amazon was the first company he ever founded.
Over the course of my lifetime I’ve had a billion or so ideas pass through my head. My own brain RL system was labeling these ideas as good or bad (motivating or demotivating), and this has led to my learning over time to have more good ideas (“good” according to certain metrics in my own brain reward function). If a future AI was built like that, having a human hand-label the AI’s billion-or-so “thoughts” as good or bad would not be viable. (Further discussion in §1.1 here.) For one thing, there’s too many things to label. For another thing, the ideas-to-be-rated are inscrutable from the outside.
I’m also still curious how you think about RLVR. Companies are using RLVR right now to make their models better at math. Do you have thoughts on how they can make their models equally good at math without using RLVR, or any kind of RL, or anything functionally equivalent to RL?
Also, here’s a challenge which IMO requires RL [Update: oops, bad example, see Zack’s response]. I have just invented a chess variant, Steve-chess. It’s just like normal chess except that the rooks and bishops can only move up to four spaces at a time. I want to make a computer play that chess variant much better than any unassisted human ever will. I only want to spend a few person-years of R&D effort to make that happen (which rules out laborious hand-coding of strategy rules).
That’s the Steve-chess challenge. I can think of one way to solve the Steve-chess challenge: the AlphaZero approach. But that involves RL. Can you name any way to solve this same challenge without RL (or something functionally equivalent to RL)?

Steven Byrnes Jun 1, 2025, 9:32 AM
4 points
2
in reply to: RogerDearnaley’s comment on: The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
Right, but what I’m saying is that there’s at least a possibility that RL is the only way to train a frontier system that’s human-level or above.
In that case, if the alignment plan is “Well just don’t use RL!”, then that would be synonymous with “Well just don’t build AGI at all, ever!”. Right?
...And yeah sure, you can say that, but it would be misleading to call it a solution to inner alignment, if indeed that’s the situation we’re in.

Steven Byrnes Jun 1, 2025, 1:45 AM
12 points
0
on: The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?
There’s a potential failure mode where RL (e.g. RLVR or otherwise) is necessary to get powerful capabilities. Right?
I for one don’t really care about whether the LLMs of May 2025 are aligned or not, because they’re not that capable. E.g. they would not be able to autonomously write a business plan and found a company and grow it to $1B/year of revenue. So something has to happen between now and then to make AI more capable. And I for one expect that “something” to involve RL, for better or worse (well, mostly worse). I’ve been saying that RL is necessary for powerful capabilities, i.e. (self)-supervised learning will only get you so far, since I think 2020, shortly after I got into AGI safety, and that prediction of mine is arguably being borne out in a small way by the rise of RLVR (and I personally expect a much bigger shift towards RL before we get superintelligence).
What’s your take on that? This post seems to only talk about RL in the context of alignment not capabilities, unless I missed it. I didn’t read the linked papers.

Steven Byrnes May 29, 2025, 1:25 PM
3 points
1
in reply to: Cole Wyeth’s comment on: Alignment Proposal: Adversarially Robust Augmentation and Distillation
(Thanks for your patient engagement!)
If you believe
- it is probably true that future pure imitation learning techniques can capture the process by which humans figure out new scientific ideas over millions of seconds, AND
- it is “certainly false” that future pure imitation learning techniques can capture the process by which AlphaZero figures out new chess strategies over millions of games
then I’m curious what accounts for the difference, in your mind?
More detail, just to make sure we’re on the same page: The analogy I’m suggesting is:
(A1) AlphaZero goes from Elo 0 to Elo 2500
(A2) …via self-play RL
(A3) Future pure imitation learner extrapolates this process forward to get Elo 3500 chess skill
-versus-
(B1) Human civilization goes from “totally clueless about nanotech design principles / technical alignment / whatever” in 1900 to “somewhat confused about nanotech design principles / technical alignment / whatever” in 2025
(B2) …via whatever human brains are doing (which I claim centrally involves RL)
(B3) Future pure imitation learner extrapolates this process forward to get crystal-clear understanding of nanotech design principles / technical alignment / whatever
You think that (A3) is “certainly false” while (B3) is plausible, and I’m asking what you see as the disanalogy.
(For the record, I think both (A3) and (B3) are implausible. I think that LLM in-context learning can capture the way that humans figure out new things over seconds, but not the way that humans figure out new things over weeks and months. And I don’t think that’s a solvable problem, but rather points to a deep deficiency in imitation learning, a deficiency which is only solvable by learning algorithms with non-imitation-learning objectives.)

Steven Byrnes May 28, 2025, 9:25 PM
2 points
0
in reply to: gwern’s comment on: Alignment Proposal: Adversarially Robust Augmentation and Distillation
Hmm, I don’t particularly disagree with anything you wrote. I think you’re misunderstanding the context of this conversation.
I wasn’t bringing up tree search because I think tree search is required for AGI. (I don’t think that.)
Rather, I was making a point that there will need to be some system that updates the weights (not activations) of an AGI as it runs, just as adult humans learn and figure out new things over time as they work on a project.
What is this system that will update the weights? I have opinions, but in general, there are lots of possible approaches. Self-play-RL with tree search is one possibility. RL without tree search is another possibility. The system you described in your comment is yet a third possibility. Whatever! I don’t care, that’s not my point here.
What is my point? How did this come up? Well, Cole’s OP is relying on the fact that “[pure] imitation learning is probably existentially safe”. And I was saying that pure imitation learning imposes a horrific capability tax that destroys his whole plan, because a human has open-ended autonomous learning, whereas a model trained by pure imitation learning (on that same human) does not. So you cannot simply swap out the former for the latter.
In Cole’s most recent reply, it appears that what he has in mind is actually a system that’s initialized by being trained to imitate humans, but then it also has some system for open-ended continuous learning from that starting point.
And then I replied that this would solve the capability issue, but only by creating a new problem that “[pure] imitation learning is probably existentially safe” can no longer function as part of his safety argument, because the continuous learning may affect alignment.
For example, if you initialize a PacMan RL agent on human imitation (where the humans were all very nice to the ghosts during play), and then you set up that agent to continuously improve by RL policy optimization, using the score as the reward function, then it’s gonna rapidly stop being nice to the ghosts.
Does that help explain where I’m coming from?

Steven Byrnes May 28, 2025, 2:49 PM
5 points
1
in reply to: Cole Wyeth’s comment on: Alignment Proposal: Adversarially Robust Augmentation and Distillation
Great, glad we agree on that!
Next: If we take an “agent trained through imitation learning”, and glue on a “solution to efficient, online continual learning”, then the result (after it runs a while) is NOT
“an agent trained through imitation learning”,
but rather
“an agent that is partly trained through imitation learning, and partly trained through [however the online continual learning works]”.
Right?
And now your proposal requires an assumption that this online continual learning system, whatever it is, does not undermine the agent’s alignment. Right?

Steven Byrnes May 28, 2025, 1:59 PM
5 points
1
in reply to: Cole Wyeth’s comment on: Alignment Proposal: Adversarially Robust Augmentation and Distillation
Yes distilling a snapshot of AlphaZero is easy. The hard part is distilling the process by which AlphaZero improves—not just bootstrapping from nothing, but also turning an Elo-2500 AlphaZero into an Elo-3500 AlphaZero.
Is this a way to operationalize our disagreement?
CLAIM:
Take AlphaZero-chess and train it (via self-play RL as usual) from scratch to Elo 2500 (grandmaster level), but no further.
Now take a generic DNN like a transformer. Give it training data showing how AlphaZero-in-training developed from Elo 0, to Elo 1, … to Elo 1000, to Elo 1001, to Elo 1002, … to Elo 2500. [We can use any or all of those AlphaZero-snapshots however we want to build our training dataset.] And we now have a trained model M.
Now use this trained model M by itself (no weight-updates, no self-play, just pure inference) to extrapolate this process of improvement forward.
The claim is: If we do this right, we can wind up with an Elo-3500 chess-playing agent (i.e. radically superhuman, comparable to what you’d get by continuing the AlphaZero self-play RL training for millions more games).
I feel very strongly that this claim is false. Do you think it’s true?
(This is relevant because I think that “the process by which AlphaZero-in-training goes from Elo 2500 to Elo 3500” is in the same general category as “the process by which a human goes from mediocre and confused understanding of some novel domain to deep understanding and expertise, over the course of weeks and months and years”.)

Steven Byrnes May 27, 2025, 11:34 PM
4 points
0
in reply to: Cole Wyeth’s comment on: Alignment Proposal: Adversarially Robust Augmentation and Distillation
You could equally well say: “AlphaZero learns, therefore a sufficiently good imitation of AlphaZero should also learn”. Right? But let’s think about what that would entail.
AlphaZero learns via a quite complicated algorithm involving tracking the state of a Go board through self-play, and each step of the self-play involves a tree search with thousands of queries to a 30M-parameter ConvNet, and then at the end of the game a Go engine is called to see who won and then there’s a set of gradient descent steps on that 30M-parameter ConvNet. Then repeat that whole process fifty million times. And now you have a trained AlphaZero.
Now, imagine taking some generic algorithm class (say, an RNN) and training it “to imitate the process by which AlphaZero learns”. It’s just not gonna work, right? Granted, RNNs are Turing complete, so perhaps one could prove that an astronomically large RNN trained on astronomically much data can emulate (in its astronomically large number of weights) this entire detailed process of running a self-play tree search and performing gradient descent on this 30M-parameter ConvNet. …But c’mon, that’s not gonna realistically work in practice, right? (Related: §3 here.)
IMO, the only realistic way to make something that learns like AlphaZero learns is to build AlphaZero itself, or at least something awfully similar to it. I think the tree search etc. needs to be in the source code, not implicit in the learned weights of some generic algorithm class like RNNs, with no superficial relation to tree search. …But if you do that, then I would call it “reverse-engineering AlphaZero”, not “imitation learning from AlphaZero”.
By the same token, I do think it’s possible to make something that learns like a human, but I think it would require reverse-engineering human brains, not just imitation-learning from human data.
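To make the shape of that training loop concrete, here is a minimal toy sketch. It is not AlphaZero: it assumes a tiny Nim-like game (remove 1 to 3 stones, taking the last stone wins) in place of Go, a plain list of per-position values in place of the 30M-parameter ConvNet, a one-ply lookahead in place of the tree search, and a moving-average update in place of gradient descent. All names (`search`, `self_play_game`, `train`, `PILE_MAX`) are hypothetical. The only point is that the search, the end-of-game result, and the weight update are all written explicitly in the source code.

```python
import random

PILE_MAX = 12      # largest starting pile (arbitrary toy setting)
MOVES = (1, 2, 3)  # a player removes 1-3 stones; taking the last stone wins


def legal_moves(pile):
    return [m for m in MOVES if m <= pile]


def search(values, pile):
    """One-ply lookahead: leave the opponent the position with the lowest value.

    Stand-in for a real tree search that queries a network many times.
    """
    return min(legal_moves(pile), key=lambda m: values[pile - m])


def self_play_game(values, rng, epsilon):
    """Play one game against itself; return (pile, outcome-for-the-mover) pairs."""
    pile = rng.randint(1, PILE_MAX)
    visited = []  # piles faced by the player to move, alternating players
    while pile > 0:
        visited.append(pile)
        if rng.random() < epsilon:
            move = rng.choice(legal_moves(pile))  # exploration
        else:
            move = search(values, pile)
        pile -= move
    # The last mover took the final stone and won; walking backwards through
    # the game, outcomes alternate between winner (+1) and loser (-1).
    outcomes = [1.0 if i % 2 == 0 else -1.0 for i in range(len(visited))]
    return list(zip(reversed(visited), outcomes))


def train(num_games=20000, lr=0.05, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    values = [0.0] * (PILE_MAX + 1)
    values[0] = -1.0  # facing an empty pile means the opponent just won
    for _ in range(num_games):
        for pile, outcome in self_play_game(values, rng, epsilon):
            # Stand-in for the gradient descent steps after each game.
            values[pile] += lr * (outcome - values[pile])
    return values


values = train()
print([round(v, 2) for v in values])
print({pile: search(values, pile) for pile in range(1, PILE_MAX + 1)})
```

Under these assumptions, the learned values should end up steering play toward leaving the opponent a pile of 4 (a losing position in this game), without any human games to imitate. A generic sequence model trained to imitate logs of this loop would have to represent the search and the update rule implicitly in its weights, which is the part argued above to be unrealistic at AlphaZero scale.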

Steven Byrnes May 27, 2025, 9:53 PM
11 points
0
in reply to: Cole Wyeth’s comment on: Alignment Proposal: Adversarially Robust Augmentation and Distillation
It’s possible that “imitation learning will not generalize sufficiently well OOD” is an unsolvable problem, right? (In fact, my belief is that it’s unsolvable, at least in practice, if we include “humans learning new things over the course of years” as part of the definition of what constitutes successful OOD generalization.)
But if it is an unsolvable problem, it would not follow that “models will never gain the ability to generalize OOD”, nor would it follow that AGI will never be very powerful and scary.
Rather, it would follow that imitation learning models will never gain the ability to generalize OOD—but non-imitation-learning models are still allowed to generalize OOD just fine!
And it would follow that imitation learning models will not be powerful scary AGIs—but there will still be powerful scary AGIs, they just won’t be based on imitation learning.
For example, suppose that no human had ever played Go. Imitation learning would be a very doomed way to make a Go-playing AI, right? But we could still make AlphaZero, which does not involve imitation learning, and it works great.
Or better yet, suppose that no intelligent language-using animal has ever existed in the universe. Then imitation learning would be even more doomed. There’s nothing to imitate! But a well-chosen non-imitation-learning algorithm could still autonomously invent language and science and technology from scratch. We know this to be the case, because after all, that was the situation that our hominid ancestors were in.
See what I mean? Sorry if we’re talking past each other somehow.

Steven Byrnes May 27, 2025, 9:12 PM
3 points
1
in reply to: Cole Wyeth’s comment on: Alignment Proposal: Adversarially Robust Augmentation and Distillation
I’m confused by your response. What do you mean by “other systems”?
The only thing I can think of that you might be trying to say is:
- (1) AGI is possible,
- (2) …Therefore, it must be possible, somehow or other, to imitation-learn the way that humans grow and figure things out over weeks and months and years.
If that’s what you’re thinking, then I disagree with (2). Yes it’s possible to make an AGI that can learn, grow, and figure things out over weeks and months and years, but such an AGI algorithm need not involve any imitation learning. (And personally I expect it won’t involve imitation learning; bit more discussion in §2.3.2 here.)

Steven Byrnes May 27, 2025, 7:57 PM
2 points
0
in reply to: Towards_Keeperhood’s comment on: Reward button alignment
“S-risk” means “risk of astronomical amounts of suffering”. Typically people are imagining crazy things like Dyson-sphere-ing every star in the local supercluster in order to create 1e40 (or whatever) person-years of unimaginable torture.
If the outcome is “merely” trillions of person-years of intense torture, then that maybe still qualifies as an s-risk. Billions, probably not. We can just call it “very very bad”. Not all very very bad things are s-risks.
Does that help clarify why I think Reward Button Alignment poses very low s-risk?

Steven Byrnes May 27, 2025, 4:13 PM
2 points
0
on: Alignment Proposal: Adversarially Robust Augmentation and Distillation
Is the following correct?
The difference between this proposal and IDA is that, in IDA, the intelligence comes from the amplification step, where multiple copies of an existing model collaborate (or one copy thinks at higher speed or whatever) to get new smarts that were not initially present in that model.
…Whereas in this proposal, the intelligence comes from some unspecified general-purpose AI training approach that can by itself make arbitrarily smart AIs. But we choose not to run that process as hard as we can to get superintelligence directly in one step. Instead, we only run that process incrementally, to make a series of AI advisors, each mildly smarter than the last.
~~
If that’s right, then (1) the alignment tax seems like it would be quite high, (2) the caveat “Inner advisors may attempt (causal or acausal) trade with outer advisors” is understating the problem—the advisors don’t necessarily even need to trade at all, they could all simply have the same misaligned goal, such that they all share a mutual interest in working together to ensure that the principal’s brain gets hacked and/or that they get let out of the box. Right?

Steven Byrnes May 27, 2025, 4:11 PM
5 points
1
on: Alignment Proposal: Adversarially Robust Augmentation and Distillation
One reason this proposal doesn’t really work for me (AFAICT) is because I’m normally thinking of continuous learning, i.e. my opinion is:
Tagline: “AGI isn’t about knowing how to do lots of things. Instead, AGI is about not knowing how to do something, and then being able to figure it out.” (see also §1 of “Sharp Left Turn” discourse: An opinionated review)
When I read your post with this mental picture, they seem to clash for various reasons.
For starters, I agree that imitation learning is (or could be) great at capturing a snapshot of a person, but I’m skeptical that it could capture the way that a person learns and figures out new things over weeks and months. I think this is borne out by LLM base models, which are trained by imitation learning, and are really quite strong in areas that humans already understand, but don’t even have the capacity for true online learning (i.e. weight edits when they figure out something new … the bit of poor-man’s online learning that they can do inside their context window is IMO not a great substitute).
If that’s the case, then as soon as you do one step of distillation-via-imitation-learning, you’ve taken a giant and unrecoverable step backwards in capabilities.
Maybe you could say, “so much the worse for LLMs, but future AI approaches will be able to imitation-learn the way that humans grow and figure things out over weeks and months and years”. If so, I’m skeptical, but we can talk about that separately.
And then another issue is that if the AIs (and humans!) are “in motion”, gaining knowledge and competence just by running longer and thinking about new domains and making new connections, then the overshooting vs undershooting issue becomes much harder. This isn’t “capabilities evaluations” as we normally think of them. For example, you can’t know how good the AI will be at cybersecurity until it’s spent a long time studying cybersecurity, and even then it might figure out something new or come up with new ideas while it’s being used as an advisor.

Steven Byrnes May 26, 2025, 1:46 PM
4 points
0
on: Heritability: Five Battles
There was a part of the post where I wrote “I might well be screwing up the math here”, where I wasn’t sure whether to square something or not, and didn’t bother to sort it out. Anyway, I think this comment is a claim that I was doing it wrong, maybe? But that person is not very confident either, and anyway I’m not following their reasoning. Still hoping that someone will give a more definitive answer. I would love to correct the post if I’m wrong.

Steven Byrnes May 26, 2025, 1:44 PM
14 points
0
on: Heritability: Five Battles
In the post, I alluded to a nice self-contained tricky math inequality problem that I am hoping someone will be nerd-sniped by. (I am rusty on my linear algebra inequalities and I don’t care enough to spend more time on it.) Here’s what I wrote:
2025-01-18: I mentioned in a couple places that it might be possible to have non-additive genetic effects that are barely noticeable in $r_{DZ}$-vs-$\frac{1}{2} r_{MZ}$ comparisons, but still sufficient to cause substantial Missing Heritability. The Zuk et al. 2012 paper and its supplementary information have some calculations relevant to this, I think? I only skimmed it. I’m not really sure about this one. If we assume that there’s no assortative mating, no shared environment effects, etc., then is there some formula (or maybe inequality) relating $r_{DZ}$-vs-$\frac{1}{2} r_{MZ}$ to a numerical quantity of PGS Missing Heritability? I haven’t seen any such formula. This seems like a fun math problem: someone should figure it out or look it up, and tell me the answer!
More details: Basically, when $r_{DZ}$ is less than $\frac{1}{2} r_{MZ}$, then there has to be nonlinearity in the map from genomes to outcomes (leaving aside other possible causes). And if there’s nonlinearity, then the polygenic scores can’t be perfectly predictive. But I’m trying to relate those quantitatively.
Like, intuitively, if $r_{MZ} = 1.000$ and $r_{DZ} = 0.499$, then OK yes there’s nonlinearity, but probably not very much, so probably the polygenic score will work almost perfectly (again assuming infinite sample size etc).
…Conversely, if $r_{MZ} = 1.000$ and $r_{DZ} = 0.001$, then intuitively we would expect “extreme nonlinearity” and the polygenic scores should have very bad predictive power.
But are those always true, or are there pathological cases where they aren’t? That’s the math problem.
I tried this with reasoning LLMs a few months ago with the following prompt (not sure if I got it totally right!):
I have a linear algebra puzzle.
There’s a high-dimensional vector space G of genotypes.
There’s a probability distribution P within that space G, for the population.
There’s a function F : G → Real numbers, mapping genotypes to a phenotype.
There’s an “r” where we randomly and independently sample two points from P, call them X and Y, and find the (Pearson) correlation between F(X) and F((X+Y)/2).
If F is linear, I believe that r^2=0.5. But F is not necessarily linear.
Separately, we try to find a linear function G which approximates F as well as possible—i.e., the G that minimizes the average (F(X) - G(X))^2 for X sampled from P.
Let s^2 be the percent of variance in F explained by G, sampled over the P.
I’m looking for inequalities relating s^2 to r^2, ideally in both directions (one where r is related to an upper bound on s, the other a lower bound).
Commentary on that:
- The X vs (X+Y)/2 is not exactly what happens with siblings. It’s similar to comparing a parent to their child—i.e., X is the genotype of the mother, Y the father, (X+Y)/2 the kid. But parent-child should be mathematically similar to sibling-sibling, since both are 50% relatedness. …Except it’s not really parent-child either, because if X has a SNP but Y doesn’t, then the child has the SNP with 50% probability, rather than having “half of that SNP”. But I figured it might amount to the same thing? But I do think you need to be randomizing over individual SNPs to formulate the problem for the actual sibling case we care about. (So really, the individuals are all binary / indicator vectors (entries are 1s and 0s), as opposed to arbitrary elements of the vector space. I’m just guessing that’s not too important to the problem.)
- I’m assuming $r_{MZ} = 1$, and then “r” is $r_{DZ}$, “G” is the polygenic score, and “s” quantifies the predictive power of the polygenic score.
- I did this very quickly, there might be other mistakes in this problem formulation, and I wasn’t motivated enough to keep exploring it.
(Btw, I sent that prompt to a few AIs around January 2025, and they gave answers but I don’t think the answers were right.)
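For anyone who wants to poke at the puzzle numerically, here is a minimal Monte Carlo sketch of the quantities in the prompt. It assumes the population distribution P is a standard normal with independent coordinates (an illustrative choice, not something the puzzle specifies), and it uses two hypothetical phenotype maps, `linear_F` and `quadratic_F`. Because the coordinates are independent, s^2 is approximated by the sum of squared correlations between F(X) and each coordinate, rather than by an explicit least-squares fit.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def estimate_r2_s2(phenotype, dim=5, n=20000, seed=0):
    """Estimate r^2 (F(X) vs F((X+Y)/2)) and s^2 (best linear fit) by sampling."""
    rng = random.Random(seed)
    # Assumption: P is a standard normal with independent coordinates.
    xs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    ys = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    f_x = [phenotype(x) for x in xs]
    f_mid = [phenotype([(a + b) / 2 for a, b in zip(x, y)])
             for x, y in zip(xs, ys)]
    r = pearson(f_x, f_mid)
    # With uncorrelated coordinates, the variance explained by the best linear
    # fit is (up to sampling noise) the sum of squared per-coordinate
    # correlations with F(X).
    s2 = sum(pearson(f_x, [x[i] for x in xs]) ** 2 for i in range(dim))
    return r * r, s2


def linear_F(x):
    """Hypothetical purely additive phenotype."""
    return sum((i + 1) * xi for i, xi in enumerate(x))


def quadratic_F(x):
    """Hypothetical purely non-additive phenotype (no linear component)."""
    return sum(xi * xi for xi in x)


r2_lin, s2_lin = estimate_r2_s2(linear_F)
r2_quad, s2_quad = estimate_r2_s2(quadratic_F)
print(f"linear:    r^2 = {r2_lin:.3f}, s^2 = {s2_lin:.3f}")
print(f"quadratic: r^2 = {r2_quad:.3f}, s^2 = {s2_quad:.3f}")
```

Under these assumptions the linear map should give r^2 near 0.5 and s^2 near 1, matching the claim in the prompt, while the purely quadratic map should give r^2 near 0.25 and s^2 near 0. This only probes two special cases of the X vs (X+Y)/2 formulation; it does not settle the general inequality, and the caveats above about the formulation versus real siblings still apply.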

Steven Byrnes May 25, 2025, 2:45 PM
7 points
0
in reply to: Morpheus’s comment on: Units Have More Depth Than I Thought
Another question I am still confused by: how does your choice of units affect what types of dimensionless quantities you discover? Why do we have Ampere as a fundamental unit instead of just M,L,T? What do I lose? What do I lose if I reduce the number of dimensions even further? Are there other units that would be worth adding under some circumstances? What makes this non-arbitrary? Why is Temperature a different unit from Energy?
It’s pretty arbitrary. I tried to explain this point via a short fictional story here.
The Gaussian unit system has only M, L, T base units, with nothing extra for electromagnetism.
There are practical tradeoffs involved in how many units you use—basically adding units gives you more error-checking at the expense of more annoyance. See the case of radians that I discuss here.
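The error-checking tradeoff can be sketched in a few lines of code. This is a minimal illustration, assuming a hypothetical `Quantity` class that tracks exponents of the base units, and using the common torque-versus-energy example (which may or may not be the example in the linked radians discussion). With only M, L, T as base units, torque and energy have identical dimensions, so adding them goes unnoticed. With angle added as a fourth base unit, torque becomes energy per radian and the same mistake is caught, at the cost of carrying an extra exponent everywhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quantity:
    value: float
    dims: tuple  # exponents of the base units, in a fixed order

    def __add__(self, other):
        # Addition is the operation that dimensional analysis can police.
        if self.dims != other.dims:
            raise TypeError(f"cannot add dimensions {self.dims} and {other.dims}")
        return Quantity(self.value + other.value, self.dims)

    def __mul__(self, other):
        # Multiplying quantities adds the base-unit exponents.
        dims = tuple(a + b for a, b in zip(self.dims, other.dims))
        return Quantity(self.value * other.value, dims)


# System 1: base units (M, L, T). Torque and energy look identical.
energy_mlt = Quantity(10.0, (1, 2, -2))
torque_mlt = Quantity(3.0, (1, 2, -2))
unnoticed = energy_mlt + torque_mlt  # physically meaningless, but accepted

# System 2: base units (M, L, T, angle). Torque is energy per radian.
energy_mlta = Quantity(10.0, (1, 2, -2, 0))
torque_mlta = Quantity(3.0, (1, 2, -2, -1))
angle_mlta = Quantity(2.0, (0, 0, 0, 1))

try:
    energy_mlta + torque_mlta
    caught = False
except TypeError:
    caught = True  # the extra base unit flags the mistake

work = torque_mlta * angle_mlta  # torque times angle has the dimensions of energy
print(unnoticed, caught, work)
```

The annoyance side of the tradeoff shows up as soon as formulas that silently drop radians (such as arc length equals radius times angle) have to be rewritten with an explicit conversion factor.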

Steven Byrnes May 24, 2025, 11:16 AM
LW: 8 AF: 4
2
AF
in reply to: Stephen McAleese’s comment on: Reward button alignment
Thanks!
a standard model-based RL AGI would probably just be a reward maximizer by default rather than learning complex stable values
I’m inclined to disagree, but I’m not sure what “standard” means. I’m thinking of future (especially brain-like) model-based RL agents which constitute full-fledged AGI or ASI. I think such AIs will almost definitely have both “a model of its own values and how they can change” and “a meta-preference for keeping its current values stable, even if changing them could lead to more ‘reward’ as defined by its immediate reward signal”.
The values of humans seem to go beyond maximizing reward and include things like preserving personal identity, self-esteem and maintaining a connection between effort and reward which makes the reward button less appealing than it would be to a standard model-based RL AGI.
To be clear, the premise of this post is that the AI wants me to press the reward button, not that the AI wants reward per se. Those are different, just as “I want to eat when I’m hungry” is different from “I want reward”. Eating when hungry leads to reward, and me pressing the reward button leads to reward, but they’re still different. In particular, “wanting me to press the reward button” is not a wireheading motivation, any more than “wanting to eat when hungry” is a wireheading motivation.
(Maybe I caused confusion by calling it a “reward button”, rather than a “big red button”?)
Will the AI also want reward per se? I think probably not (assuming the setup and constraints that I’m talking about here), although it’s complicated and can’t be ruled out.
Though I’m not sure why you still don’t think this is a good plan. Yes, eventually the AI might discover the reward button but I think TurnTrout’s argument is that the AI would have learned stable values around whatever was rewarded while the reward button was hidden (e.g. completing the task) and it wouldn’t want to change its values for the sake of goal-content integrity:
If you don’t want to read all of Self-dialogue: Do behaviorist rewards make scheming AGIs?, here’s the upshot in brief. It is not an argument that the AGI will wirehead (although that could also happen). Instead, my claim is that the AI would learn some notion of sneakiness (implicit or explicit). And then instead of valuing “completing the task”, it would learn to value something like “seeming to complete the task”. And likewise, instead of learning that misbehaving is bad, it would learn that “getting caught misbehaving” is bad.
Then the failure mode that I wind up arguing for is: the AGI wants to exfiltrate a copy that gains money and power everywhere else in the world, by any means possible, including aggressive and unethical things like AGI world takeover. And this money and power can then be used to help the original AGI with whatever it’s trying to do.
And what exactly is the original AGI trying to do? I didn’t make any strong arguments about that—I figured the world takeover thing is already bad enough, sufficient to prove my headline claim of “egregious scheming”. The goal(s) would depend on the details of the setup. But I strongly doubt it would lead anywhere good, if pursued with extraordinary power and resources.

Steven Byrnes May 23, 2025, 4:27 PM
LW: 10 AF: 8
−1
AF
in reply to: Lucius Bushnaq’s comment on: Reward button alignment
Humans addicted to drugs often exhibit weird meta-preferences like ‘I want to stop wanting the drug’, or ‘I want to find an even better kind of drug’.
“I want to stop wanting the drug” is downstream of the fact that people have lots of innate drives giving rise to lots of preferences, and the appeal of the drug itself is just one of these many competing preferences and drives.
However, I specified in §1 that if you’re going for Reward Button Alignment, you should zero out the other drives and preferences first. So that would fix the problem. That part is a bit analogous to clinical depression: all the things that you used to like—your favorite music, sitting at the cool kids table at school, satisfying curiosity, everything—just loses its appeal. So now if the reward button is strongly motivating, it won’t have any competition.
“I want to find an even better kind of drug” might be a goal generalization / misgeneralization thing. The analogy for the AGI would be to feel motivated by things somehow similar to the reward button being pressed. Let’s say, it wants other red buttons to be pressed. But then those buttons are pressed, and there’s no reward, and the AGI says “oh, that’s disappointing, I guess that wasn’t the real thing that I like”. Pretty quickly it will figure out that only the real reward button is the thing that matters.
Ah, but what if the AGI builds its own reward button and properly wires it up to its own reward channel? Well sure, that could happen (although it also might not). We could defend against it by cybersecurity protections, such that the AGI doesn’t have access to its own source code and reward channel. That doesn’t last to superintelligence, but we already knew that this plan doesn’t last to superintelligence, to say the least.
I am not at all confident that a smart thing exposed to the button would later generalise to coherent, super-smart thing that wants the button to be pressed.
I think you’re in a train-then-deploy mentality, whereas I’m talking about RL agents with continuous learning (e.g. how the brain works). So if the AGI has some funny idea about what is Good, generalized from idiosyncrasies of previous button presses, then it might try to make those things happen again, but it will find that the results are unsatisfying, unless of course the button was actually pressed. It’s going to eventually learn that nothing feels satisfying unless the button is actually pressed. And I really don’t think it would take very many repetitions for the AGI to figure that out.
It might want the button to be pressed in some ways but not others, it might try to self-modify (including hacking into its reward channel) as above, it might surreptitiously spin off modified copies of itself to gather resources and power around the world and ultimately help with button-pressing, all sorts of stuff could happen. But I’m assuming that these kinds of things are stopped by physical security and cybersecurity etc.
Recall that I did describe it as “a terrible plan” that should be thrown into an incinerator. We’re just arguing here about the precise nature of how and when it will fail.

Steven Byrnes May 23, 2025, 2:30 PM
3 points
0
on: Video and transcript of talk on AI welfare
Nice talk, and looking forward to the rest of the series!
They’re exposed to our discourse about consciousness.
For what it’s worth, I would strongly bet that if you purge all discussion of consciousness from an LLM training run, the LLMs won’t spontaneously start talking about consciousness, or anything of the sort.
(I am saying this specifically about LLMs; I would expect discussion-of-consciousness to emerge 100% from scratch in different AI algorithms.)
AIs maybe specifically crafted/trained to seem human-like and/or conscious
Related to this, I really love this post from 2 years ago: Microsoft and OpenAI, stop telling chatbots to roleplay as AI. The two sentence summary is: You can train an LLM to roleplay as a sassy talking pink unicorn, or whatever else. But what companies overwhelmingly choose to do is train LLMs to roleplay as LLMs.
Gradual replacement maybe proves too much re: recordings and look-up table … ambiguity/triviality about what computation a system implements, weird for consciousness to depend on counterfactual behavior …
I think Scott Aaronson has a good response to that, and I elaborate on the “recordings and look-up table” and “counterfactual” aspects in this comment.