I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. I’m also at: Substack, X/Twitter, Bluesky, RSS, email, and more at this link. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Leave me anonymous feedback here.
Steven Byrnes
Whoops, the wikipedia article was deleted a few months ago.
I meant “kinda the same idea” in the sense that, at the end of the day, a similar problem is being solved by the communicative signal. I agree that there’s a sign-flip.
Anyway, I’ll reword, thanks.
Sure, if you have an RNN (e.g. SSM) with a (say) billion-dimensional hidden state, then in principle the hidden state could imitate the billion weights of some other entirely different learning algorithm, and the RNN propagation steps could imitate the weight-update steps (e.g. gradient descent or TD learning or whatever) of that other learning algorithm, along with the querying-the-model steps, the replay-learning steps, and/or whatever else is involved.
But I have a rather strong belief that this would never happen, in real life, in any practical, AGI-relevant sense. Even if such an RNN update step exists in principle, I think it would not be learnable in practice, nor runnable without many orders of magnitude of performance overhead. I won’t get into details here, but this old discussion of mine is vaguely related.
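To make the in-principle possibility above concrete, here's a toy sketch (my own illustration, not from any paper): an "RNN" whose hidden state *is* the weight vector of an inner linear-regression model, and whose state-transition function performs one SGD step on that hidden state per input. This is the kind of emulation I'm saying exists in principle but wouldn't arise in practice at AGI-relevant scale.

```python
# Toy illustration: an RNN-style state transition whose hidden state
# stores the weights of an inner learning algorithm (linear regression
# trained by SGD). Each "RNN step" consumes one (x, y) pair and does one
# gradient-descent update on the hidden state.

def rnn_step(hidden, inp, lr=0.1):
    """One 'RNN' update: hidden = weights of the inner linear model."""
    x, y = inp
    pred = sum(w * xi for w, xi in zip(hidden, x))  # inner model's forward pass
    err = pred - y
    # gradient step on squared error -- the inner algorithm's
    # weight update, implemented as a state transition
    return [w - lr * err * xi for w, xi in zip(hidden, x)]

# Target function the inner learner should discover: y = 2*x0 - 3*x1
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -3.0), ((1.0, 1.0), -1.0)] * 200

hidden = [0.0, 0.0]  # initial hidden state = untrained inner weights
for inp in data:
    hidden = rnn_step(hidden, inp)

print(hidden)  # converges near [2.0, -3.0]
```

Note that this works only because I hand-coded the transition function to be a gradient step; the claim in question is whether a *generic* trained RNN would ever stumble into (and efficiently run) such a structure, which is where my skepticism lies.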
some evidence of this: https://arxiv.org/abs/2506.13892
I’m sorry, but the more I read about “algorithm distillation”, the more I want to treat that term as a giant red flag that the paper is probably garbage. I cited this example in the post (which is I think overly diplomatic), and for a second one see my discussion thread with glazgogabgolab on this page.
Basically, nobody in that subfield seems to be carefully distinguishing “learning object-level things from the teacher” versus “learning how to learn from the teacher”. The second is exciting, the first is boring.
As far as I can tell, “in-context reinforcement learning” has never been demonstrated to exist at all, at least in the sense that matters. I.e., real RL algorithms can figure out how to do complicated new things that they’ve never seen demonstrated, whereas the so-called “ICRL” models seem to only be capable of doing things very similar to what they’ve seen the teacher do in their context window.
…And this paper does not change my mind on that. For example, in figure 1, none of the four learning curves shows the student doing better than it saw the teacher do within its context window.
Even outside of that graph, I really think that if the ICRL agent was using some innovative clever strategy that the teacher never used, the way actual RL algorithms do every day, then the authors would have noticed that, and been very excited by it, and centered their whole paper around it, all the way up to the title. The fact that they don’t mention anything like that is I think a strong sign that it didn’t happen.
I think it’s important that the AI doesn’t need to do all of its continual learning in the activations of a single forward pass: It can autoregressively generate tokens too. This way it can leave notes to itself, like “next time I run into a bug like this, I should look in this place first.”
I don’t think that’s adequate for (what I was calling) “real” continual learning. There’s a trivial sense in which an LLM can do anything via a context window because it can e.g. emulate a Turing Machine without understanding what the Turing Machine is doing. But that’s not realistic (nor alignment-relevant). Realistically, I claim LLM “understanding” has to be in the weights, not the context window.
Here’s a thought experiment I often bring up: imagine training an LLM purely on data before linear algebra existed (or equivalently, train a new LLM from scratch while carefully filtering out anything related to or downstream of linear algebra from the training data). Then put a linear algebra textbook (or many textbooks) in the context window.
My question is: can the LLM answer tricky questions that are not directly in those textbooks, to build on those linear algebra ideas and make further progress?
My strong prediction is: No.
Why do I think that? The issue is: linear algebra is a giant pile of interrelated concepts: matrices, bases, rank, nullity, spans, determinants, trace, eigenvectors, dual space, unitarity, etc. Any one sentence in the textbook makes no sense to someone who doesn’t already know some linear algebra, because it’s probably describing some connection between one nonsensical concept and another nonsensical concept.
E.g. here’s a sentence from a linear algebra textbook: “As a reminder, for any matrix M, and a matrix M′ equal to M after a row operation, multiplying by an elementary matrix E gave M′ = EM.” Try looking at that sentence through the eyes of someone who has never heard the words “matrix”, “row operation”, etc. It’s totally unintelligible gobbledygook, right?
The LLM needs to somehow make sense of this gobbledygook within the duration of a single forward pass, well enough to write down the first token on its scratchpad.
Now we do the second forward pass to add the second token to the CoT. But the weights haven’t changed! So the textbook is still gobbledygook! And the LLM still has only the duration of one forward pass to make sense of it.
No matter how many tokens are appended to the end of the CoT, you still have the issue that, each time you do a new forward pass, the LLM looks at its context window (textbooks + CoT scratchpad) “with fresh eyes”, and what it sees is a bunch of unintelligible gobbledygook that it has only the duration of one forward pass to make sense of.
Even if it somehow manages to print out some tokens that constitute progress on the linear algebra problem, those very tokens that it just printed out will also be gobbledygook, when it looks at them “with fresh eyes” on the next forward pass.
By contrast, if you give a human the same problem, i.e. she doesn’t know linear algebra but she has these textbooks and a scratchpad, she would be able to make progress on the problem, as long as you give her enough time (probably weeks or months), but she would make progress in a very different way from LLM CoT inference: she would be learning as she goes, changing the “weights” in her brain.

After a few weeks, she could look at a sentence in the textbook, and it would no longer be unintelligible gobbledygook, but rather describing something about concepts that she is beginning to understand, and she can thus refine her understanding more and more. And likewise, if she writes down notes on her scratchpad, she will be able to understand those notes afterwards, because she has been learning (changing the weights) the whole time.

The learning (changing weights) is the essential part, the scratchpad is incidental and optional. A scratchpad without “real” continual learning (changing weights) would be useless to her. Indeed, if she could time-travel to her past self, who didn’t yet know anything about linear algebra, and gift her own scratchpad to her past self, it wouldn’t help much. Her past self would still need to spend weeks learning all these new concepts. Indeed, time-traveled-notes-to-self is kinda what a textbook is—but owning a library full of unread math textbooks does not make someone a mathematician :-)
OK, so that’s my hypothesis: the linear-algebra-holdout LLM experiment would definitely fail. Nobody has done that experiment, but I claim that my guess is consistent with observations of actual LLMs:
For one thing, we might notice that companies care an awful lot about pretraining data (1,2), spending billions of dollars a year on it, which dovetails with my theory that LLMs are generally great at using concepts that already exist in the pretraining data, but bad at inventing and using new concepts that aren’t. It’s just that there’s so much pretraining data that you can do quite a lot without ever exiting the concept space that exists in the pretraining data.
For another thing, at least some brilliant people doing bleeding-edge stuff report that, when you’re doing something sufficiently innovative, LLMs get confused and fall back to concepts in the pretraining data. Relatedly, mathematicians seem to agree that LLMs, for all their impressive achievements, have not been coming up with new useful conceptual frames. See discussion here.
For another thing, I think it’s widely agreed that LLMs are best at self-contained tasks, and at things that have been done lots of times before, and that the more you get into weird idiosyncratic proprietary codebases, with lots of interrelated complexities that are not anywhere on the internet, the more likely they are to fail. This likewise seems to fit my theory that LLMs get “real understanding” ~only from the pretraining process, and that they crash and burn when the context window has lots of interconnected layered complexity that differs from anything in the pretraining data.
My main takeaway from your post is that naively training LLMs to imitate the behaviors of continually-learning policies (e.g., humans) who don’t leave externalized traces of their continual learning process is unlikely to work. (And I believe this is your main point.)
No, I believe something stronger than that, because I don’t think “externalized traces of their continual learning process” is relevant. I think that in the linear algebra holdout thought experiment above, LLMs would fail equally hard if we digitize Arthur Cayley’s notes from when he was inventing the matrix in the 1800s and put them into the context window, along with Hermann Grassmann’s notes etc. That’s not relevant.
It refers a few paragraphs earlier, i.e.: “I originally had two justifications for putting this in. (1) Provine found that laughter was 30× less frequent when people were alone; (2) Evolutionarily, there’s no point in emitting communicative signals when there’s no one around to hear them.” (Sorry that’s unclear, I’ll reword.)
This post contains no plan for technical AGI alignment (or anything else). I have no such plan. See the last two paragraphs of the post.
I am trying to find such a plan (or prove that none exists), and in the course of doing so, occasionally I come across a nugget of deconfusion that I want to share :-) Hence this post.
As a general rule, I take interest in certain things that humans sometimes do or want, not because I’m interested in copying those things directly into AGIs, but rather because they are illustrative case studies for building my nuts-and-bolts understanding of aspects of motivation and learning etc. And then I can use that understanding to try to dream up some engineered system that might be useful in AGIs. The resulting engineered system might or might not resemble anything in humans or biology. By analogy, the Wright Brothers learned a lot from soaring birds, but their plane did not look like a bird.
I think they’re mainly trying to win approval of other actual humans.
I think what people “mainly” do is not of much interest to me right now. If a few people sometimes do X, then it follows that X is a possible thing that a brain can do, and then I can go try to figure out how the brain does that, and maybe learn something useful for technical alignment of brain-like AGI.
So along those lines: I think that there exist people who have a self-image as a person with such-and-such virtue, and take pride in that, and will (sometimes) make decisions driven by that self-image even when they have high confidence that nobody will ever find out, or worse, when they have high confidence that the people they care most about will despise them for it. They (sometimes) make that decision anyway.
I think this kind of self-image-related motivation has a deep connection to other people’s approval, and is causally downstream of their experience of such approval over a lifetime. But it is definitely NOT the same as consequentialist planning to maximize future approval / status.
is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm? Even with eg. 3-5 OOMs more compute than GPT-4.5?
I say yes. You left out an important part, here it is in italics: “is it really practically impossible for a transformer forward pass to simulate a deep-Q style learning algorithm churning for millions of steps?”
Yes, because an awful lot can happen in millions of steps, including things that build on each other in a serial way.
I worry you could’ve made this same argument ten years ago for simulating human expert behavior over 8 hour time horizons — which involves some learning, eg navigating a new code base, checking code on novel unit tests. It’s shallow learning, sure. You don’t have to update your world model that much. But it’s not nothing, and ten years ago I probably would’ve been convinced that a transformer forward pass could never practically approximate it.
I disagree that it should be called “learning” at all. It would be “learning” for a human in real life, but if you imagine a person who has read 2 billion lines of code [that’s the amount of GitHub code in The Pile … actually today’s LLMs probably see way more code than that], which would correspond to reading code 24 hours a day for 100 years, then I believe that such a person could do the METR 8 hour tasks without “learning” anything new whatsoever. You don’t need to “learn” new things to mix-and-match things you already know in novel ways—see my example here of “imagine a pink fuzzy microphone falling out of a helicopter into a football stadium full of bunnies”. And see also: related discussion here.
why have a transformer simulate a neural net running some RL algorithm when you could just train the RL agent yourself?
Yup, that’s my main point in this post, I expect that sooner or later somebody will invent real-deal continual learning, and it will look like a bona fide learning algorithm written in PyTorch with gradient descent steps and/or TD learning steps and/or whatever else, as opposed to (so-called) “in-context learning” or RAG etc.
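For concreteness, here's the flavor of thing I mean by "a bona fide learning algorithm with TD learning steps" (a deliberately minimal toy of my own, not a proposal for how real continual learning will work): tabular Q-learning on a trivial chain environment, where the understanding accumulates in persistent parameters (the Q-table), not in a context window.

```python
# Minimal sketch of a "bona fide learning algorithm": tabular Q-learning
# on a 5-state chain. Action 1 moves right, action 0 moves left; reaching
# the rightmost state pays reward 1 and resets to the start. Learning
# lives in the persistent Q-table, updated by TD steps.
import random

random.seed(0)

N, GAMMA, LR = 5, 0.9, 0.5
Q = [[0.0, 0.0] for _ in range(N)]  # persistent parameters ("weights")

def env_step(s, a):
    s2 = min(s + 1, N - 1) if a == 1 else max(s - 1, 0)
    return (0, 1.0) if s2 == N - 1 else (s2, 0.0)  # (next state, reward)

s = 0
for _ in range(20000):
    a = random.randrange(2)  # explore at random (off-policy behavior)
    s2, r = env_step(s, a)
    # TD update: nudge Q(s,a) toward r + gamma * max_a' Q(s',a')
    Q[s][a] += LR * (r + GAMMA * max(Q[s2]) - Q[s][a])
    s = s2

policy = [0 if Q[s][0] >= Q[s][1] else 1 for s in range(N - 1)]
print(policy)  # learned greedy policy: always move right
```

The point of the toy is the shape of the thing: a weight-update rule that runs persistently, so the agent can discover strategies nobody ever demonstrated to it, which is exactly what "in-context learning" over a frozen-weight forward pass doesn't give you.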
Thanks, I just deleted that whole part. I do believe there’s something-like-that which is true, but it would take some work to pin down, and it’s not very relevant to this post, so I figure, I should just delete it. :-)
In case anyone’s curious, here’s the edit I just made:
OLD VERSION:
Anyway, by assuming “brain-like AGI”, I get the right to make certain assumptions about the cognitive architecture, representations, learning algorithms, and so on.
Some of these “brain-like AGI ingredients” are universal parts of today’s popular ML algorithms (e.g. learning algorithms; distributed representations).
Others of these “brain-like AGI ingredients” are (individually) present in a subset of today’s popular ML algorithms but absent from others (e.g. reinforcement learning; predictive [a.k.a. self-supervised] learning; explicit planning).
Still others of these “brain-like AGI ingredients” seem mostly or totally absent from today’s most popular ML algorithms (e.g. ability to form “thoughts” [e.g. “I’m going to the store”] that blend together immediate actions, short-term predictions, long-term predictions, and flexible hierarchical plans, inside a generative world-model that supports causal and counterfactual and metacognitive reasoning).
So in this sense, “brain-like AGI” is a specific thing that might or might not happen, independently of “prosaic AGI”. Much more on “brain-like AGI”, or at least its safety-relevant aspects, in the subsequent posts.
NEW VERSION:
Anyway, by assuming “brain-like AGI”, I get the right to make certain assumptions about the cognitive architecture, representations, learning algorithms, and so on. Some of those assumptions would also apply to some existing AI algorithms. But if you take the whole package together—all the parts and how they interconnect—it constitutes a yet-to-be-invented AI architecture. So in this sense, “brain-like AGI” is a specific thing that might or might not happen, independently of “prosaic AGI”. Much more on “brain-like AGI”, or at least its safety-relevant aspects, in the subsequent posts.
No worries, seems fine.
FWIW, my current feeling is like 25% probability that narrowing eyes (in anger etc.) has a functional explanation related to vision (as opposed to changing how your face looks to other people, or defending your eyes from attack, or whatever), and 80% probability that widening eyes (in fear etc.) has a functional explanation related to vision. But I didn’t think about it too hard.
In both cases, regardless of whether it’s functional or not, I have very high confidence that it’s an innate reaction, not a product of within-lifetime learning.
I think the point I was trying to make in this post is both narrower and weirder than the general topics of humans supervising more competent AIs, and generation-verification gaps. For example, my self-image might be partly formed from admiration of the character traits of a cartoon character, or Jesus, etc., and I might feel pride in acting in ways that I imagine them approving of, and that might influence how I go about my day-to-day conduct as a string theory researcher. But Jesus is long gone, and the cartoon character doesn’t even exist at all, and certainly neither was able to evaluate string theory ideas. They’re not “supervising” me in that sense.
Actual humans supervising actual AGIs is something that Paul talked about in IDA stuff, and like I said in the OP, I reject that entire line of research as a dead end.
Separately, I agree that “humans are an existence proof that safe & beneficial brain-like AGI is possible in principle” needs a heavy dose of nuance and caveats (humans are working towards misaligned AGI right now, plus I’d generally expect tech progress to drive humanity off the rails even without AGI or other destructive tech, among other things). But I do think there is some “…existence proof…” argument that goes through. E.g. at least some humans are making the overall situation better not worse (or if not, then we’re screwed no matter what), and AGIs don’t have to match the human population distribution.
Huh, I find the disgust example pretty plausible.
I agree that “universality-through-functionality” (§4.2) is implausible as a theory explaining all universal facial expressions. At least some universal facial expressions do not have immediate functional explanations, seems to me. E.g. the angry open-mouth tooth-showing grimace / scowl was presumably functional in chimps, because they’re showing off their fangs as a credible signal that they’re dangerous. We don’t have any fangs to show off, but we still have that same expression.
But I’m also sympathetic to there being more than zero universal facial expressions that do have immediate functional explanations. Not sure if I’m disagreeing with you or not.
(I don’t currently have a strong opinion one way or the other about whether Barrett’s claims here are plausible.)
I tried narrowing my eyes. This does not help improve my vision.
Well, there is a relation between the aperture of a camera and its depth-of-field. (Famous example: pinhole cameras render every depth in focus despite having no lens at all.) (Another famous example: I think this is why people squint when they aren’t wearing their glasses.) If the story is real at all, it might be more apparent in a dark environment, since then your pupil will be dilated, and also more apparent when trying to view something that has both near and far parts such that you can’t focus on both simultaneously. Yes it’s possible that this is too subtle an effect to matter in practice, I’m just trying to steelman it. I can’t immediately think of a DIY demo to try.
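For what it's worth, the aperture/depth-of-field relation can be put in back-of-envelope numbers with a toy thin-lens model (my own illustration; the 17 mm image distance and pupil sizes are rough textbook figures): the retinal blur disc for a defocused object scales linearly with pupil diameter, so squinting from a ~6 mm dark-adapted pupil down to a ~2 mm effective aperture cuts the blur by ~3×.

```python
# Toy defocus model: for an eye focused at distance d_f, an object at
# distance d_o produces a retinal blur disc of diameter roughly
#   pupil_diameter * |1/d_o - 1/d_f| * image_distance.
# Blur scales linearly with aperture, so a narrower pupil = sharper image.

EYE_IMAGE_DIST = 0.017  # m, approximate lens-to-retina distance

def blur_disc(pupil_diam_m, focus_dist_m, object_dist_m):
    defocus_diopters = abs(1 / object_dist_m - 1 / focus_dist_m)
    return pupil_diam_m * defocus_diopters * EYE_IMAGE_DIST

# Eye focused at 0.5 m, object at 2 m:
wide = blur_disc(0.006, 0.5, 2.0)    # ~6 mm pupil (dark-adapted)
squint = blur_disc(0.002, 0.5, 2.0)  # ~2 mm effective aperture (squinting)
print(wide, squint)  # blur shrinks in proportion to aperture (3x here)
```

This is consistent with the dark-environment prediction above: a dilated pupil makes the baseline blur larger, so narrowing the aperture has more to gain.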
Widening my eyes does not seem to improve my peripheral vision.
Hmm, I think I disagree. I think there’s a part of my field-of-view (mainly the top part obviously) that’s black when my eyes are relaxed, but that I can see when my eyes are widened. As usual with peripheral vision, you kinda have to be paying attention to it, and it’s also easier to notice when there’s something moving. Here’s the procedure I tried just now: Hold your head straight, pick a fixation point in front of you (or better yet downward), and hold your hand with wiggling fingers as high up as it can go until you can’t see the wiggling, repeat with and without widening your eyes. Seemed like a nonzero effect to me (but not huge).
See also a thread here where I was also complaining about this.
I don’t think this is related to the points I was making in the post … But happy to chat about that anyway.
Yeah sure, common sense says that smart people will tend to enjoy being in more meritocratic intellectual fields, rather than less meritocratic ones, and also that fields in general tend to be more meritocratic when quality is easy to judge (although other things matter too, e.g. glamorous fields have it tougher because they attract grifters).
See e.g. what I wrote here about experimental science.
The mathematics community has successfully kept the cranks out, as far as I know, but two grimly amusing failures (in my controversial opinion) are: (1) in the 2000s, the (correct) theoretical physics consensus that we should be focusing on string theory was somewhat broken by an invasion of people unable to tell good physics theory from bad (e.g. “loop quantum gravity”), and there were enough of such people (including department chairs etc) that they broke the blockade and wound up with positions and credentials; (2) this funny anecdote in Dan Dennett’s memoir:
The hegemony of the analytic philosophers evaporated in 1979, at the Eastern Division meeting of the APA [American Philosophical Association] in Boston, when a coup d’état was staged by a group of mostly American but Continental philosophers who called themselves pluralists (let a thousand flowers bloom). I wonder how many of today’s young philosophers and graduate students have ever heard about this. It was an academic earthquake at the time. Frustrated by the short shrift given them by members of the “analytic monolith,” these philosophers studied the bylaws of the APA and discovered that although for decades the nominating committee had put forward a single candidate for vice president who was then elected by acclaim and would succeed as president the following year, the rules allowed nominations from the floor and actual elections! In secret, the pluralists put together their slate, prepared their challenges to the parliamentarian and other officers, and made sure their members were all set to descend en masse on the lightly attended business meeting and take over the APA Eastern Division. About half an hour before the meeting, their security broke down: a coup was rumored to be in the offing, and we monolith members were rounded up in the bar and hustled to the meeting to try to fend off the usurpation. Dick Rorty was president that year, and it was an irony (one of his favorite topics) that he—the most ecumenical and open-minded of the “analytic monolith” leaders—presided over the meeting, while Tom Nagel executed his duties as parliamentarian with aplomb. There were nominating speeches and rebuttals, the most memorable of which was by Ruth Marcus, whose Yale colleague John Smith, a philosopher of religion and a theologian, was the pluralists’ candidate. She explicitly trashed his whole career, his character, his books. I had never heard a philosopher speak so ill of a colleague in public, and seldom in private.
We lost. The establishment had nominated Adolf Grünbaum, a Pittsburgh philosopher of science, to be the new vice president. Not wanting to offend innocent Adolf, the victorious pluralists nominated and elected him vice president the following year, so that in 1982 he finally got to deliver the presidential address he had expected to give earlier. He did not accept the olive branch with equanimity. Adolf was famous for his tirades against Freud as an unscientific poseur, and his address was vintage Grünbaum. I happened to follow a cluster of pluralists out of the hall at the close of his address and overheard the reply when a pluralist who had stayed away asked how Grünbaum’s address had gone: “It was nasty, brutish and long.”
Thereafter, the APA’s programs were filled with papers on topics, and by philosophers, that would never have made the cut before the pluralist coup. Was this a good thing? Yes, said some monolith members, since it meant there was more guilt-free time to spend in the bar at conventions. Yes, said others, since the pluralists had justice on their side. My verdict is mixed. Still, the published programs of the APA meetings list dozens of talks whose titles are so ripe for parody that when I recently perused a few looking for likely examples to anonymize, I had difficulty “improving” on the actual candidates, but ask yourself whether you are aching to go to the sessions where the following talks will be given:
“The Ineffability of History and the Problem of the Unitary Self”
“Dialectical Encroachment: Humiliation and Integrity”
“Can Relationalistic Ontology Avoid Incoherence through a Recursive Metatheory?”
“Art as War: The Resilience of Autonomy”
Having said all that…
If your proposal is:
von Neumann and Tao did math-y stuff rather than other stuff because they got adulation when they did math-y stuff and they got heckled by idiots when they did other stuff.
…then I think that’s part of it but not all of it. I would note that they presumably got good at math by thinking about math all the time, and if they were thinking about math all the time, it’s probably because they found it very satisfying and enjoyable to think about math. I have a kid like that—when he was like 8 years old, I might be talking about politics at dinner or whatever, and he would interrupt me to share something he just thought of about perfect squares that he found very exciting. I.e., some people, when their mind is wandering, think about other people, and some people think about sports, and he was evidently thinking about perfect squares. Anyway, if a person intrinsically enjoys thinking about numbers and symbols, then it stands to reason that they would probably choose a career where they get to think about numbers and symbols all day.
I sometimes wonder why the AI x-risk community was so overrepresented in physicists in the early-ish days (e.g. Hawking, Tegmark, Wilczek, Musk, Tallinn, Rees, Omohundro, Aguirre…). The best I can come up with is that people who self-select into physics are unusually likely to have the combination of (1) smart & quantitative, and (2) really, deeply, profoundly bothered by not understanding important things about the world.
“Act-based approval-directed agents”, for IDA skeptics
I wrote:
Are there any examples where their “GLA” gets much higher reward than anything it ever observed in the training data, in the very same environment that the training data was drawn from, by discovering better strategies that were not seen in the training data (just as PPO itself would do if you keep running it)? E.g. there should be an easy experiment where you just cut off the training data well before the PPO teacher finishes converging to the optimal policy, and see if the GLA keeps rising and rising, just as the (unseen) PPO teacher would do. That seems like a really easy experiment—it all happens in just one RL environment. The fact that they don’t talk about anything like that is fishy.
Then you replied:
Yeah. The graph on the right in Figure 3 illustrates that the learned in-context RL algorithm performs better than, and improves on, the PPO agent whose data it used…
But now I think you’re conceding that you were wrong about that after all, and in fact this graph provides no information either way on whether the GLA agent attained a higher score than it ever saw the PPO agent attain, because the GLA agent probably got to see the PPO agent continue to improve beyond the 31 episodes that we see before the figure cuts off.
Right?
Or if not, then you’re definitely misunderstanding my complaint. The fact that the GLA curve rises faster than the PPO curve in the right side of figure 3 is irrelevant. It proves nothing. It’s like … Suppose I watch my friend play a video game and it takes them an hour to beat the boss after 20 tries, most of which is just figuring out what their weak point is. And then I sit down and beat the same boss after 2 tries in 5 minutes by using the same strategy. That doesn’t prove that I “learned how to learn” by watching my friend. Rather, I learned how to beat the boss by watching my friend.
(That would be a natural mistake to make because the paper is trying to trick us into making it, to cover up the fact that their big idea just doesn’t work.)
Is the reason that you think it could work for a minute but not 100yr a practical matter of efficiency or one that has a more fundamental limitation that you couldn’t get around with infinite context window/training data/etc?
The “one minute” thing is less about what LLMs CAN do in one minute, and more about what humans CAN’T do in one minute. My claim would be that humans have a superpower of “real” continual learning, which nobody knows how to do with LLMs. But if you give a human just 60 seconds, then they can’t really use that superpower very much, or at least, they can’t get very far with it. It usually takes much more than one minute for people to build and internalize new concepts and understanding to any noticeable degree.
Even with a context window that contains all 10M moves, or do you mean within reasonably limited context windows?
Yes even with a context window that contains all 10M moves. Making that argument was the whole point of the second half of the OP. If you don’t find that convincing, I’m not sure what else to add. ¯\_(ツ)_/¯
OK. The “parametrized family of continual learning algorithms” frame makes a lot of your earlier comments make more sense now. Thanks.
Next: I guess we’re assuming that (1) we have a parametrized family of continual learning algorithms, and that (2) human learning and thinking is part of that family (although we don’t know a priori which one), and that (3) you can take some adult human “Joe”, and search through the parametrized family to find one that matches his behavior, and thus wind up with a Joe-imitating algorithm.
I’ll set aside for now whether these assumptions are plausible, and ask a different question: If we make those assumptions then … aren’t we already done? Just make a Joe-imitation and run a million copies of it at 100× speed, and have them work together on AI x-risk (pivotal act, alignment research, whatever).
To me, this seems much simpler than the iterative protocol you discuss in the OP, and equally viable if not more so. What am I missing?
What is Grog in the context of our conversation? You seem to admit at the end that LLMs are not really at all like Grog, in that Grog has no underlying bedrock of understanding, while modern LLMs do.
Grog understands some things (e.g. intuitive physics) but not others (e.g. pulsed lasers). Likewise, LLMs understand some things (e.g. pulsed lasers) but not others (e.g. some new field of science that hasn’t been invented yet). Right? We’re not at the end of history, where everything that can possibly be understood is already understood, and there’s nothing left.
If I hibernated you until the year 2100, and then woke you up and gave you a database with “actionable knowledge” from 1000 textbooks of [yet-to-be-invented fields of science], and asked you to engineer a state-of-the-art [device that no one today has even conceived of], then you would be just as helpless as Grog. You would have to learn the new fields until you understood them, which might take years, before you could even start on the task. This process involves changing the “weights” in your brain. I.e., you would need “real” learning. The database is not a replacement for that.
So think of it this way: there’s some set of things that are understood (by anyone), and that set of things is not increased via a system for pulling up facts from a database. Otherwise Grog would be able to immediately design LIDAR. And yet, humans are able to increase the set of things that are understood, over time. After all, “the set of things that are understood” sure is bigger today than it was 1000 years ago, and will be bigger still in 2100. So evidently humans are doing something very important that is entirely different from what can be done with database systems. And that thing is what I’m calling “real” continual learning.
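The database-vs.-“real”-learning distinction can be made concrete with a toy contrast (everything below is invented for illustration): a retrieval system can only hand back facts that were put into it, whereas a learner changes its weights and can then handle inputs it was never directly told about:

```python
# Toy contrast (invented for illustration): database lookup vs. weight updates.

def database_answer(db, query):
    # A retrieval system: returns a stored fact or nothing. The set of
    # queries it can answer never grows except by inserting more facts.
    return db.get(query)

class TinyLearner:
    """One-weight model fit by gradient descent: learns y = w * x."""
    def __init__(self):
        self.w = 0.0
    def update(self, x, y, lr=0.05):
        # Gradient step on squared error (w*x - y)^2 -- the weight changes.
        self.w -= lr * 2 * (self.w * x - y) * x
    def predict(self, x):
        return self.w * x

db = {1: 3.0, 2: 6.0}
learner = TinyLearner()
for _ in range(200):
    for x, y in db.items():
        learner.update(x, y)

# The database is helpless on a query it never stored...
print(database_answer(db, 4))        # None
# ...while the learner generalizes, because its weight now encodes y = 3x.
print(round(learner.predict(4), 2))  # 12.0
```

The analogy is loose (generalizing a linear rule is nothing like inventing a new field of science), but it shows where the capability lives: in the updated weights, not in the stored facts.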
The graph on the right in Figure 3 illustrates that the learned in-context RL algorithm performs better than, and improves on, the PPO agent whose data it used.
(still talking about this paper) Are you saying that the GLA was trained ONLY on imitation learning during the 31 episodes shown, in which the PPO “teacher” performed no better than a random policy, and then the GLA got way higher scores?
If so … no way, that’s patently absurd. Even if I grant the premise of the paper for the sake of argument, the GLA can’t learn to improve itself via imitating a PPO teacher that is not actually improving itself!
So, if the right-side-of-figure-3 data is not totally fabricated or mis-described, then my next guess would be that they ran the PPO for many more episodes than the 31 shown, and trained the GLA on all that, and that by the end of the training data, the PPO “teacher” was performing much better than shown in the figure, and at least as well as the top of the GLA curve.
IIUC, it’s legal everywhere on Earth to do basic research that might eventually lead to a new, much cheaper and harder-to-monitor method of enriching uranium to weapons grade.
I’m thinking mainly of laser isotope enrichment, which was first explored in the 1970s. No super-inexpensive method has turned up, thankfully. (The best-known approach seems to be in the same ballpark as gas centrifuges in terms of cost, specialty parts etc., or if anything somewhat worse. Definitely not radically simpler and cheaper.) But I think there’s a big space of possible techniques, and meanwhile people in academia keep inventing new types of lasers and new optical excitation and separation paradigms. I don’t think there’s any general impossibility proof that kg-scale uranium enrichment in a random basement with only widely-available parts can’t ever get invented someday by this line of research.
(If it did, it probably wouldn’t be the death of nonproliferation, because you can still try to monitor and control the un-enriched uranium. But it would still make nonproliferation substantially harder. By the way, once you have a lot of weapons-grade uranium, making a nuclear bomb is trivial; the fancy implosion design is only needed for plutonium bombs, not uranium ones.)
AFAICT, if someone is explicitly developing a system for “kg-scale uranium enrichment via laser isotope separation”, then the authorities will definitely go talk to them. But for every step prior to that last stage, where you’re doing “basic R&D”, building new types of lasers, etc., my impression is that people can freely do whatever they want, and publish it, and nobody will ask questions. I mean, it’s possible that there’s someone in some secret agency who is on the ball, five steps ahead of everyone in academia and industry, and who knows where problems might arise on the future tech tree and is ready to quietly twist arms if necessary. But I dunno man, that seems pretty over-optimistic, especially when the research can happen in any country.
My former PhD advisor wrote a book in the 1980s with a whole chapter on laser isotope separation techniques, and directions for future research. The chapter treats it as completely unproblematic! Not even one word about why this might be bad. I remember feeling super weirded out by that when I read it (15 years ago), but I figured, maybe I’m the crazy one? So I never asked him about it.
(Low confidence on all this.)