I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. I’m also at: Substack, X/Twitter, Bluesky, RSS, email, and more at this link. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Leave me anonymous feedback here.
In every war, both attack and defense have “oneshotness”. But obviously, one side of a war can and often does succeed. In the OP’s Maginot Line example, Germany’s “oneshot” plan wound up working great!
(I’m not sure exactly what OP means by “curse”. Wars have “oneshotness” but are not particularly “cursed” if there’s a 50% chance of success on priors.)
So, I think the relevant factors that make it hard are mainly
(1) distribution shift between safe tests and the “oneshot” situation we care about, and
(2) some general sense of hardness-of-the-problem, which is low for winning a war (you merely need to botch it less than the other guy) and which is high for space travel (a.k.a. filling a giant container with 5000 tonnes of the most flammable substance imaginable, strapping delicate equipment onto the front of it, traveling through intense heat, vibration, radiation, and vacuum, and on and on).
(Plus numerous other factors outside the scope of this post.)
Of these, (1) is discussed in the OP. (“Someone could, conceivably, argue that the change to “there being enough machine superintelligence around that ASI could kill humanity if they tried”, from “AIs being experimented-upon that couldn’t kill us if they tried”, will be less than the sort of change from “the sort of tests you can do on a Mars Observer probe on Earth”, to “the actual conditions…” But that would be an argument so incredibly stupid that it might actually sound stupid when they thought about saying it.”)
But (2) is not really discussed in the OP (I think), and seems like a big crux between people. (I’m basically on Eliezer’s side of that debate FWIW.)
Another point related to (1) is my aphorism: “if you’re worried that a nuclear weapon with yield Y might ignite the atmosphere, it doesn’t help to first test a nuclear weapon with yield 0.1×Y, and then if the atmosphere hasn’t been ignited yet, next try testing one with yield 0.2×Y, etc.” I.e., I sometimes (not always!) see people talking about gradual AI progress without having a clear and plausible (to me) mechanism by which the earlier steps actually de-risk the later steps.
Lol oops yeah SFT is posttraining, that explains why I found your comment confusing.
The quote emphasized “after just 250 training examples”, and I thought that was the context, i.e. that you were impressed by the “just 250” part and commenting on that. But I guess the “just 250” was irrelevant to your comment.
On top of that, I tend to mentally lump pretraining and SFT together because they’re algorithmically exactly the same thing, except maybe different hyperparameters. So that’s the other half of why I misread your comment.
Still, given that pretraining and SFT are algorithmically exactly the same thing, it would follow that you need no pretraining data whatsoever if you have enough of the right kind of SFT data. …In principle. Probably not in practice. But still, that’s relevant context here I think.
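To spell out what I mean by “algorithmically exactly the same thing”, here’s a minimal sketch (my own illustrative code, not from any paper, assuming `model` maps token ids to next-token logits): both pretraining and SFT are the same next-token cross-entropy loop, just run on different token streams, possibly with different hyperparameters.

```python
# Minimal sketch (my own illustrative code), assuming `model` maps token ids
# (batch, seq) to next-token logits (batch, seq, vocab).
import torch
import torch.nn.functional as F

def next_token_loss(model, tokens):
    logits = model(tokens[:, :-1])            # predict token t+1 from tokens up to t
    targets = tokens[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def train(model, batches, lr):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for tokens in batches:                    # tokens: (batch, seq) LongTensor
        loss = next_token_loss(model, tokens)
        opt.zero_grad()
        loss.backward()
        opt.step()

# "Pretraining" vs "SFT" is this same loop run on different token streams:
# train(model, web_text_batches, lr=3e-4)       # pretraining
# train(model, curated_sft_batches, lr=1e-5)    # SFT
```

(In practice, SFT often also masks the prompt tokens out of the loss, but that’s a detail, not a different algorithm.)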
You say “with only a few examples”, I say “after ~13 million tokens of some mix of code and agentic reasoning about code”. Is it a “shock” because you were expecting it to take much more than 13 million tokens? Or is it a “shock” because you expected “number of examples” to be an important constraint independent of the length of each example? Or something else?
(FYI, I just added notes to the top of §4.7 & §4.7.1 that I no longer endorse those sections as written.)
I think a lot of sense-of-self is related to imagining how you look in someone else’s eyes, which invokes a rewarding sense of pride in one’s self-image, as I discussed later in §3 of “Social drives 2” (2025).
So yeah, it makes sense that that reaction might gradually fade away upon prolonged isolation.
Here’s another example: I think that the relevant innate drive is very weak in many sociopaths, and that this explains the fact that at least one sociopath describes herself as not really having any sense of self in the way that most other people do:
I feel like, because it’s almost like a jailbroken phone that’s capable of connecting to everything, it feels like I’m untethered; I don’t have, kind of like normal people, who think of their sense of self as being like an anchor. Even though the ocean could push them one way or the other at any given moment, they’re at work, they’re acting a little bit different. They’re at home, they’re acting a little bit different, but they’re never too far away from this anchor, this center that they have. For me, it’s more like there’s no anchor. It’s very flexible the way that I’m able to interact with people. I can show up pretty much any way that I want to. Obviously not physically, but even sometimes physically. The other day, I had a very distinctive walk. I was walking towards my girlfriend, and I wondered if she would recognize me, so I kind of hid the walk. She said that she wondered if she was in a horror film. The way that I was walking was so weird. There is some ability to kind of adjust physically as well.
I think “predict sensory input” is the main training signal for the Thought Generator, loosely analogous to how “predict next token” is the training signal for LLM pretraining. (Cf. §4.7.) So “predict sensory inputs” wouldn’t be a separate box from the Thought Generator, but rather a core function of the Thought Generator. Does that help? Sorry if I’m missing your point.
it also shows that good pretraining data doesn’t matter nearly as much as one could think
I don’t think it shows that. It arguably suggests that abundant pretraining data doesn’t matter as much as one could think, as opposed to good pretraining data. I presume that the codebases + agentic coding transcripts that they SFT’d on were high quality, right? [ETA: WHOOPS SEE MATRICE REPLY]

As for data efficiency, after the pre-1930 pretraining, IIUC it takes 250 training examples ≈ 13 million tokens before “the model solves its first [SWE-bench] issue”, and 75,000 training examples ≈ 4 billion tokens gets to pass@1 of 4.5%.
Is that more or less than expected? I dunno, it depends on what you were expecting. For what it’s worth, Gemini says 13 million tokens is about what a human could read in 650 hours non-stop (40 hours/week for 16 weeks).
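For concreteness, here’s the back-of-envelope arithmetic behind those numbers; the ~0.75 words per token and ~250 words per minute reading speed are my own rough conventions, not figures from the paper:

```python
# Rough sanity check of the numbers above. Assumptions (mine, not the paper's):
# ~0.75 words per token, ~250 words per minute human reading speed.
tokens_first_solve = 13_000_000      # ≈ 250 training examples
tokens_pass_at_1   = 4_000_000_000   # ≈ 75,000 training examples

print(tokens_first_solve / 250)      # ≈ 52,000 tokens per example
print(tokens_pass_at_1 / 75_000)     # ≈ 53,000 tokens per example

words = tokens_first_solve * 0.75    # ≈ 9.75 million words
hours = words / 250 / 60             # ≈ 650 hours of non-stop reading
print(hours, hours / 40)             # ≈ 650 hours ≈ 16 weeks at 40 hours/week
```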
Given that we live in a world where people are prone to outrage and over-updating when someone warns of a possible problem that winds up (in hindsight) not being a big deal, this constitutes a valid reason to not warn of possible problems, so as to conserve credibility for when you most need it. (But there are other considerations too; no comment on what’s the best policy all things considered.)
BUT, I think the OP is making a different point: We shouldn’t resign ourselves to living in that world! Instead we can say: People should stop being like that! People should stop being prone to outrage and over-updating when someone warns of a possible problem that winds up (in hindsight) not being a big deal. Let us criticize people for being that way, and let’s try to get them to change, including by writing nice blog post explanations of why this is so dumb and bad.
I revised 2 old posts based on a deeper appreciation for the orienting reflex as exemplifying a broader neuroscience motif:
(1) In Neuroscience of human social instincts: a sketch (2024):
I changed terminology from “the ‘thinking of a conspecific’ flag” to “the social attention reflex”. I think the new term has better connotations, especially the way it invokes a parallel to “orienting reflex” and “startle reflex”, which likewise are associated with fast, transient, and involuntary changes in both attention and other innate signals like pleasure and arousal.
(2) In Valence series §3.3.5 (2023):
(Update April 2026: I’ve refined my model a bit: I now think of anxiety-related involuntary attention as being more closely analogous to the famous orienting reflex wherein people turn to look at an unexpected loud sound or motion. Traditional orienting reflexes involve involuntary attention towards exteroceptive inputs, coupled with innate motor commands, physiological arousal, etc. By analogy, if you’re anxious, then (I claim) you’ll likewise experience sporadic interoceptive “orienting reflexes” that involve involuntary attention towards the feeling of anxiety, coupled with a synchronized squirt of negative valence and displeasure (see Appendix A), plus physiological arousal etc. These interoceptive “orienting reflexes” might occur multiple times per second for intense anxiety, or less often for milder anxiety. I’m using anxiety as an example, but the same idea obviously applies as well to fear, hunger, itches, etc.)
Thanks @Lucius Bushnaq , @Linda Linsefors , and @philh for asking tough questions :)
I think there are decent nihilistic justifications for working on AI safety (e.g. it’s fun, it’s cool, it makes me feel important, etc).
I think you are misunderstanding the implications of nihilism. Copying from here:
Compare “I want the oppressed masses to find justice” with “I’ve been standing too long, I want to sit down”. These two “wants” are fundamentally built out of the same mind-stuff. They both derive from positive valence, which in turn ultimately comes from innate drives (specifically, mainly social drives in the first case, and homeostatic energy-conserving drives in the second case). So if “true morality” or “true human morality” or whatever doesn’t exist, then that does not constitute a reason to sit down rather than to seek justice. You still have to make decisions. That’s what I meant by “nihilism is not decision-relevant”, or Yudkowsky by “What would you do without morality?”. …
Here’s that link in the last sentence:
However, nihilism is not decision-relevant. Imagine being a nihilist, deciding whether to spend your free time trying to bring about an awesome post-AGI utopia, vs sitting on the couch and watching TV. Well, if you’re a nihilist, then the awesome post-AGI utopia doesn’t matter. But watching TV doesn’t matter either. Watching TV entails less exertion of effort. But that doesn’t matter either. Watching TV is more fun (umm, for some people). But having fun doesn’t matter either. There’s no reason to throw yourself at a difficult project. There’s no reason NOT to throw yourself at a difficult project! So nihilism is just not a helpful decision criterion!! What else is there?
I propose a different starting point—what I call Dentin’s prayer: Why do I exist? Because the universe happens to be set up this way. Why do I care (about anything or everything)? Simply because my genetics, atoms, molecules, and processing architecture are set up in a way that happens to care. …
Shankar’s original claim was that the 2016 election was BEFORE functional prediction markets, and that the bit of “raising the sanity waterline” in question happened between then and today.
I really don’t think PredictIt should count as a prediction market at all in this context; I recall that they had crazy rules that made it basically impossible for serious people to make serious money by correcting even blindingly obvious market errors. (Don’t know anything about PredictWise.)
Katja Grace has a recent blog post from that genre: How I love running.
For my part, I have never hated exercise, but I would sure do it much less if not for a longstanding policy that I never watch media (TV shows, movies, youtube, etc) for fun alone, except while exercising (exercise bike, stair-stepper, elliptical, etc). And I have an endless backlog of highly-addictive trashy media that I really want to watch, and that nobody else wants to watch with me.
Let’s say that, at a population level, X% of deadly skin cancers come from sunburns and Y% from suntans. (For skin cancers that get contributions from both burns and tans, we can divvy it up by Shapley value or whatever).
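In case the Shapley-value divvying isn’t obvious, here’s a toy two-factor version; the risk numbers are invented purely to illustrate the attribution rule, not real epidemiology:

```python
from itertools import combinations
from math import factorial

# Toy illustration only: invented risk numbers, just to show how Shapley value
# splits a combined risk between two contributing exposures.
risk = {
    frozenset():                  0.0,
    frozenset({"burns"}):         8.0,
    frozenset({"tans"}):          2.0,
    frozenset({"burns", "tans"}): 11.0,   # combined, with some interaction term
}

def shapley(player, players, value):
    others = [p for p in players if p != player]
    n = len(players)
    total = 0.0
    for k in range(len(others) + 1):
        for coalition in combinations(others, k):
            s = frozenset(coalition)
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            total += weight * (value[s | {player}] - value[s])
    return total

print(shapley("burns", ["burns", "tans"], risk))  # 8.5: burns' share of the 11.0
print(shapley("tans",  ["burns", "tans"], risk))  # 2.5: tans' share of the 11.0
```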
After thinking about it a bit more, my claims would be (1) Y is much lower than X, (2) You should really only bother thinking about suntan risk (Y) at all when making decisions that impact the equivalent of 1000s of full days of sun exposure, like occupational decisions (what career to pursue, whether to wear a hat every day at work), not when you’re thinking about a few weeks here or there, or walking the dog for 20 minutes a day, etc. By contrast, even a single blistering sunburn once in your entire life seems to be a measurable cancer risk factor. So on a day-to-day basis, basically you should be focusing 100% on sunburns, 0% on suntans, from a deadly cancer perspective.
For “what would I expect in a world where I’m wrong”, let’s take (1) and (2) separately.
For (1), what would the world look like if Y≳X? Well, for one thing, skin cancers would be super-common on the face, ears, neck, hands, etc. Doctors, even way back in the 19th century, would have noticed this obvious pattern, and also noticed that certain groups like sailors were getting skin cancer at way higher rates than everyone else. And IIUC, this is exactly what happened! …For squamous cell carcinoma (SCC). But not for melanoma. E.g., if you google SCC, all the public health websites seem to say things like: it’s super-common among farmers and sailors, it’s very often on the face, ears, neck, backs-of-hands, etc. I.e., SCC has all the hallmarks that I would expect from a condition related to chronic suntans. So the medical community is evidently capable of noticing these signs—SCC is an existence proof! (And this was even understood before 1900, I think.) And those signs are conspicuously NOT what I find when I google melanoma. Instead, the melanoma pages seem to talk about blistering sunburns, and how it’s common on the torso, etc. (SCC is rarely fatal, IIUC.)
For (2), what would the world look like if (2) was a bad frame for thinking about things? Well, I guess I’m getting (2) from a combination of (2A) a ballpark sense of how much sun exposure the average person gets (which I believe comes from a quite heavy-tailed distribution), and (2B) the assumption that cancer risk is a linear or concave-up function of sun exposure, as opposed to concave-down. If I’m wrong on (2A), then doing a more careful Fermi estimate would show that my ballpark sense was actually way off (in terms of micromorts per hour outside); I could go through such an estimate if it’s a crux. If I’m wrong on (2B), I would expect that either concave-down carcinogens are common (I don’t think they are), or that there would be some legible explanation for why chronic sun exposure is different from normal carcinogens (I don’t think there is).
I don’t have much background on carcinogenesis but if you use jargon I will look it up! :-)
To be clear, hedonic tone is a “genetically hardwired signal” in a certain sense, but many of the inputs to that signal are the dozens to hundreds of thought assessors (for disgust, for arousal, for all the social stuff I discuss here, etc.). I’ll edit to make that clearer.
Yeah, “primary reward” (as I’m using the term here) can definitely involve defer-to-predictor on one or more thought assessors (e.g. disgust, physiological arousal … any of them besides valence).
I’ll bow out of that argument. Time will tell!
its seeming resemblance to how human mathematical progress happens
Well, one important-to-me disanalogy is that they used the Lean proof-assistant as ground truth for an LLM’s purported proof being valid or not. Whereas human mathematical progress obviously does not require proof-assistants—humans were doing math long before proof-assistants existed. (More on this in §1 of my post “Sharp Left Turn” discourse: An opinionated review.)
Steve’s model is not a multi agent model, and I can’t think of a multi agent model that works.
If it helps, I have a brief take on “subagent” terminology in §1.5.2 here.
Self sabotage
If the idea is that there’s something secretly motivating about the idea of failing (in some context), and that it’s helpful to bring this secret motivation up to conscious awareness and consideration, then yeah, definitely, and that sounds related to David Burns’s “positive reframing” (1,2) and related ideas in other psychiatry traditions I’m less familiar with (3).
Why not google docs?
Most things in biology are on a spectrum, I would be surprised if psychopathy is not one of those.
One way to think of it is: there’s a spectrum of how Person A cares about Person B, and this spectrum goes from positive (compassion, desire to help) to neutral (callous indifference) to negative (schadenfreude, desire to pick a fight).
So “it’s a spectrum” is not in itself an argument for optimism here. (Or sorry if I’m misunderstanding.)
I maybe should write a general post about “why I don’t believe in most neat psychopathologies”. I do really wish this field of study was higher quality, and maybe I should do a deep dive and form a more consistent opinion on this…
In case it helps, my take on the psychopathy literature is mostly the same as it was 3 years ago when I wrote this comment.
Everyone agrees sunburns are bad, and so if someone is in a situation where the only way they can avoid sunburns is sunscreen, then they should obviously use sunscreen. That’s what I had in mind when I wrote my post,
but maybe I’ll tweak the wording to make it clearer.

Update: I have now added a little addendum making that explicit:

ADDENDUM APRIL 16: I should clarify that for some people in some situations (apparently white people in Australia are often in this category), it might be the case that your body is simply incapable of developing enough of a tan to avoid getting a sunburn. If so, then you should obviously wear sunscreen! At the end of the day, if you’re getting sunburns, then whatever you’re doing is the wrong thing to do, and you should do something different. Sunburns are bad.
As in my other comment, winning a war has “oneshotness” but is not especially hard or “cursed”, in the sense that you just need to botch it less than the other side which also has “oneshotness”.
(Actually, it’s worse than that, because I for one am very skeptical that a failed AI takeover attempt would in fact lead to some durable prevention of future AI takeover attempts.)