I’m an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. I’m also at: Substack, X/Twitter, Bluesky, RSS, email, and more at this link. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Leave me anonymous feedback here.
Steven Byrnes
I think “predict sensory input” is the main training signal for the Thought Generator, loosely analogous to how “predict next token” is the training signal for LLM pretraining. (Cf. §4.7.) So “predict sensory inputs” wouldn’t be a separate box from the Thought Generator, but rather a core function of the Thought Generator. Does that help? Sorry if I’m missing your point.
it also shows that good pretraining data doesn’t matter nearly as much as one could think
I don’t think it shows that. It arguably suggests that abundant pretraining data doesn’t matter as much as one could think. As opposed to good pretraining data. I presume that the codebases + agentic coding transcripts that they SFT’d on were high quality, right?
As for data efficiency, after the pre-1930 pretraining, IIUC it takes 250 training examples ≈ 13 million tokens before “the model solves its first [SWE-bench] issue”, and 75000 training examples ≈ 4 billion tokens gets to pass@1 of 4.5%.
Is that more or less than expected? I dunno, it depends on what you were expecting. For what it’s worth, Gemini says 13 million tokens is about what a human could read in 650 hours non-stop (40 hours/week for 16 weeks).
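For anyone who wants to sanity-check those numbers, here is a quick back-of-envelope sketch. It only uses the figures quoted above (which I'm relaying IIUC, not re-deriving), and the variable names are mine.

```python
# Figures as quoted in this comment (approximate, not independently verified).
first_solve_examples = 250
first_solve_tokens = 13_000_000
final_examples = 75_000
final_tokens = 4_000_000_000

# Average length of one training example at each point.
tokens_per_example_early = first_solve_tokens / first_solve_examples
tokens_per_example_late = final_tokens / final_examples

# Reading speed implied by "13 million tokens in 650 hours".
reading_hours = 650
implied_tokens_per_hour = first_solve_tokens / reading_hours
implied_tokens_per_minute = implied_tokens_per_hour / 60

# How many 40-hour weeks that is.
weeks_at_40h = reading_hours / 40

print(f"{tokens_per_example_early:,.0f} tokens/example at first solve")
print(f"{tokens_per_example_late:,.0f} tokens/example at 75k examples")
print(f"{implied_tokens_per_minute:,.0f} tokens/minute implied reading speed")
print(f"{weeks_at_40h:.2f} weeks at 40 hours/week")
```

So the two data points are at least consistent with each other (roughly 50k tokens per example in both cases), and the human comparison assumes a reading speed of a few hundred tokens per minute.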
Given that we live in a world where people are prone to outrage and over-updating when someone warns of a possible problem that winds up (in hindsight) not being a big deal, this constitutes a valid reason to not warn of possible problems, so as to conserve credibility for when you most need it. (But there are other considerations too; no comment on what’s the best policy all things considered.)
BUT, I think the OP is making a different point: We shouldn’t resign ourselves to living in that world! Instead we can say: People should stop being like that! People should stop being prone to outrage and over-updating when someone warns of a possible problem that winds up (in hindsight) not being a big deal. Let us criticize people for being that way, and let’s try to get them to change, including by writing nice blog post explanations of why this is so dumb and bad.
I revised 2 old posts based on deeper appreciation for the orienting reflex as exemplifying a broader neuroscience motif:
(1) In Neuroscience of human social instincts: a sketch (2024):
I changed terminology from “the ‘thinking of a conspecific’ flag” to “the social attention reflex”. I think the new term has better connotations, especially the way it invokes a parallel to “orienting reflex” and “startle reflex”, which likewise are associated with fast, transient, and involuntary changes in both attention and other innate signals like pleasure and arousal.
(2) In Valence series §3.3.5 (2023):
(Update April 2026: I’ve refined my model a bit: I now think of anxiety-related involuntary attention as being more closely analogous to the famous orienting reflex wherein people turn to look at an unexpected loud sound or motion. Traditional orienting reflexes involve involuntary attention towards exteroceptive inputs, coupled with innate motor commands, physiological arousal, etc. By analogy, if you’re anxious, then (I claim) you’ll likewise experience sporadic interoceptive “orienting reflexes” that involve involuntary attention towards the feeling of anxiety, coupled with a synchronized squirt of negative valence and displeasure (see Appendix A), plus physiological arousal etc. These interoceptive “orienting reflexes” might occur multiple times per second for intense anxiety, or less often for milder anxiety. I’m using anxiety as an example, but the same idea obviously applies as well to fear, hunger, itches, etc.)
Thanks @Lucius Bushnaq, @Linda Linsefors, and @philh for asking tough questions :)
I think there are decent nihilistic justifications for working on AI safety (e.g. it’s fun, it’s cool, it makes me feel important, etc).
I think you are misunderstanding the implications of nihilism. Copying from here:
Compare “I want the oppressed masses to find justice” with “I’ve been standing too long, I want to sit down”. These two “wants” are fundamentally built out of the same mind-stuff. They both derive from positive valence, which in turn ultimately comes from innate drives (specifically, mainly social drives in the first case, and homeostatic energy-conserving drives in the second case). So if “true morality” or “true human morality” or whatever doesn’t exist, then that does not constitute a reason to sit down rather than to seek justice. You still have to make decisions. That’s what I meant by “nihilism is not decision-relevant”, or Yudkowsky by “What would you do without morality?”. …
Here’s that link in the last sentence:
However, nihilism is not decision-relevant. Imagine being a nihilist, deciding whether to spend your free time trying to bring about an awesome post-AGI utopia, vs sitting on the couch and watching TV. Well, if you’re a nihilist, then the awesome post-AGI utopia doesn’t matter. But watching TV doesn’t matter either. Watching TV entails less exertion of effort. But that doesn’t matter either. Watching TV is more fun (umm, for some people). But having fun doesn’t matter either. There’s no reason to throw yourself at a difficult project. There’s no reason NOT to throw yourself at a difficult project! So nihilism is just not a helpful decision criterion!! What else is there?
I propose a different starting point—what I call Dentin’s prayer: Why do I exist? Because the universe happens to be set up this way. Why do I care (about anything or everything)? Simply because my genetics, atoms, molecules, and processing architecture are set up in a way that happens to care. …
Shankar’s original claim was that the 2016 election was BEFORE functional prediction markets, and that the bit of “raising the sanity waterline” in question happened between then and today.
I really don’t think PredictIt should count as a prediction market at all in this context. I recall that they had crazy rules that made it basically impossible for serious people to make serious money by correcting even blindingly obvious market errors. (Don’t know anything about PredictWise.)
Katja Grace has a recent blog post from that genre: How I love running.
For my part, I have never hated exercise, but I would sure do it much less if not for a longstanding policy that I never watch media (TV shows, movies, youtube, etc) for fun alone, except while exercising (exercise bike, stair-stepper, elliptical, etc). And I have an endless backlog of highly-addictive trashy media that I really want to watch, and that nobody else wants to watch with me.
Let’s say that, at a population level, X% of deadly skin cancers come from sunburns and Y% from suntans. (For skin cancers that get contributions from both burns and tans, we can divvy it up by Shapley value or whatever).
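In case the Shapley-value bit is unfamiliar, here is a minimal sketch of how that divvying-up would work for two contributors. The risk numbers are completely made up for illustration (they are not estimates of anything), and the function name is mine.

```python
from itertools import permutations

def shapley(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value[with_p] - value[coalition]
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}

# HYPOTHETICAL excess risk (arbitrary units) for each combination of exposures.
value = {
    frozenset(): 0.0,
    frozenset({"burn"}): 6.0,
    frozenset({"tan"}): 1.0,
    frozenset({"burn", "tan"}): 10.0,  # more than 6 + 1, i.e. some interaction
}

shares = shapley(["burn", "tan"], value)
print(shares)
```

With these made-up inputs, the interaction term gets split evenly, so the shares add up to the total for the both-exposures case. X and Y would then be the population-level sums of shares like these.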
After thinking about it a bit more, my claims would be (1) Y is much lower than X, (2) You should really only bother thinking about suntan risk (Y) at all when making decisions that impact the equivalent of 1000s of full days of sun exposure, like occupational decisions (what career to pursue, whether to wear a hat every day at work), not when you’re thinking about a few weeks here or there, or walking the dog for 20 minutes a day, etc. By contrast, even a single blistering sunburn once in your entire life seems to be a measurable cancer risk factor. So on a day-to-day basis, basically you should be focusing 100% on sunburns, 0% on suntans, from a deadly cancer perspective.
For “what would I expect in a world where I’m wrong”, let’s take (1) and (2) separately.
For (1), what would the world look like if Y≳X? Well, for one thing, skin cancers would be super-common on the face, ears, neck, hands, etc. Doctors, even way back in the 19th century, would have noticed this obvious pattern, and also noticed that certain groups like sailors were getting skin cancer at way higher rates than everyone else. And IIUC, this is exactly what happened! …For squamous cell carcinoma (SCC). But not for melanoma. E.g., if you google SCC, all the public health websites seem to say things like: it’s super-common among farmers and sailors, it’s very often on the face, ears, neck, backs-of-hands, etc. I.e., SCC has all the hallmarks that I would expect from a condition related to chronic suntans. So the medical community is evidently capable of noticing these signs—SCC is an existence proof! (And this was even understood before 1900, I think.) And those signs are conspicuously NOT what I find when I google melanoma. Instead, the melanoma pages seem to talk about blistering sunburns, and how it’s common on the torso, etc. (SCC is rarely fatal, IIUC.)
For (2), what would the world look like if (2) was a bad frame for thinking about things? Well, I guess I’m getting (2) from a combination of (2A) a ballpark sense of how much sun exposure the average person gets (which I believe comes from a quite heavy-tailed distribution), and (2B) the assumption that cancer risk is a linear or concave-up function of sun exposure, as opposed to concave-down. If I’m wrong on (2A), then when I try to do a more careful Fermi estimate, I would find that my ballpark sense was actually way off (in terms of micromorts per hour outside). I could go through such an estimate if it’s a crux. If I’m wrong on (2B), I would expect that either concave-down carcinogens are common (I don’t think they are), or that there would be some legible explanation for why chronic sun exposure is different from normal carcinogens (I don’t think there is).
I don’t have much background on carcinogenesis but if you use jargon I will look it up! :-)
To be clear, hedonic tone is a “genetically hardwired signal” in a certain sense, but many of the inputs to that signal are the dozens to hundreds of thought assessors (for disgust, for arousal, for all the social stuff I discuss here, etc.). I’ll edit to make that clearer.
Yeah, “primary reward” (as I’m using the term here) can definitely involve defer-to-predictor on one or more thought assessors (e.g. disgust, physiological arousal … any of them besides valence).
I’ll bow out of that argument. Time will tell!
its seeming resemblance to how human mathematical progress happens
Well, one important-to-me disanalogy is that they used the Lean proof-assistant as ground truth for an LLM’s purported proof being valid or not. Whereas human mathematical progress obviously does not require proof-assistants—humans were doing math long before proof-assistants existed. (More on this in §1 of my post “Sharp Left Turn” discourse: An opinionated review.)
Steve’s model is not a multi agent model, and I can’t think of a multi agent model that works.
If it helps, I have a brief take on “subagent” terminology in §1.5.2 here.
Self sabotage
If the idea is that there’s something secretly motivating about the idea of failing (in some context), and that it’s helpful to bring this secret motivation up to conscious awareness and consideration, then yeah, definitely, and that sounds related to David Burns’s “positive reframing” (1,2) and related ideas in other psychiatry traditions I’m less familiar with (3).
Why not google docs?
Most things in biology are on a spectrum, I would be surprised if psychopathy is not one of those.
One way to think of it is: there’s a spectrum of how Person A cares about Person B, and this spectrum goes from positive (compassion, desire to help) to neutral (callous indifference) to negative (schadenfreude, desire to pick a fight).
So “it’s a spectrum” is not in itself an argument for optimism here. (Or sorry if I’m misunderstanding.)
I maybe should write a general post about “why I don’t believe in most neat psychopathologies”. I do really wish this field of study was higher quality, and maybe I should do a deep dive and form a more consistent opinion on this…
In case it helps, my take on the psychopathy literature is mostly the same as it was 3 years ago when I wrote this comment.
Everyone agrees sunburns are bad, and so if someone is in a situation where the only way they can avoid sunburns is sunscreen, then they should obviously use sunscreen. That’s what I had in mind when I wrote my post,
but maybe I’ll tweak the wording to make it clearer. Update: I have now added a little addendum making that explicit:
ADDENDUM APRIL 16: I should clarify that for some people in some situations (apparently white people in Australia are often in this category), it might be the case that your body is simply incapable of developing enough of a tan to avoid getting a sunburn. If so, then you should obviously wear sunscreen! At the end of the day, if you’re getting sunburns, then whatever you’re doing is the wrong thing to do, and you should do something different. Sunburns are bad.
Thanks for all this great info! Alas, I am not convinced.
Seems like the two biggest cruxes between us are:
(1) you see tanning beds as strong evidence about actual suntans whereas I see them as merely suggestive. I see it as plenty plausible that getting a tan on a tanning bed is just a different thing from getting a tan in real sunlight. E.g. tanning beds have 10× more intense UV but almost no visible light, and even within UV the precise spectra are presumably somewhat different, etc. I acknowledge that I don’t have a detailed mechanistic story here, just an open-mindedness to the possibility that such a story exists. So that might be hard to resolve. So let’s move on to the other big crux for me:
(2) I really do get a lot of my confidence in “sun-tans are basically fine” theory from the fact that plenty of white people in the USA are outside a ton and simply never wear sunscreen. And when there’s a decent-sized subpopulation that has a much, much higher exposure to an important carcinogen than the median person, then that should be just blindingly obvious to everyone. Like it is with smoking. If you look at a group of lung cancer patients, then most of them will be (current or former) chain smokers, even though chain smokers are a pretty small fraction of the population. And I really don’t think melanoma patients are like that. Like, I don’t think it’s the case that most melanoma patients are members of the subpopulation “white people with outdoor jobs who never wear sunscreen”. If that were the case, everyone would have noticed by now; it would be screaming out of the statistics, indeed you wouldn’t even need statistics to see it. We can argue about these little 20% effects or whatever, but we don’t see “white outdoor-working sunscreen-abstainers” screaming out of the statistics, crowding the melanoma wards, and getting skin cancers at 10× or 50× (or whatever) the rate of the white population median.
Byrnes offered the confusing statement that because there are confounders, he is rounding 20% to zero. But confounders go both ways. The study could be vastly underestimating melanoma risk associated with agricultural work, for example because workers are physically healthier or of darker complexion than controls. The data is so noisy it’s close to useless for estimating the risks of tanning.
Oh sorry, I think you misunderstood, I was not rounding 20% to zero because I expect that the 20% is overstated due to confounders, rather I am rounding to zero because a factor of 1.2× is very much closer to a factor of 1× than to a factor of 10× or 50× or whatever. Again, my stance is that a real and decision-relevant effect of suntans-without-sunburns would be screaming out of the data, blindingly obvious to everyone, the way that smoking-vs-lung-cancer is, because some people get much, much more sunscreen-free sun exposure than the median.
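To spell out the "much closer to 1× than to 10×" arithmetic: risk ratios are multiplicative, so the natural way to compare them is on a log scale. A tiny sketch (the helper name is mine; the 1.2×, 10×, and 50× figures are the ones from this comment):

```python
import math

def log_distance(a, b):
    """Distance between two risk ratios on a log scale."""
    return abs(math.log(a / b))

observed = 1.2  # the "20% effect" under discussion

to_null = log_distance(observed, 1.0)    # distance to "no effect"
to_ten = log_distance(observed, 10.0)    # distance to a 10x effect
to_fifty = log_distance(observed, 50.0)  # distance to a 50x effect

print(f"to 1x:  {to_null:.2f}")
print(f"to 10x: {to_ten:.2f}")
print(f"to 50x: {to_fifty:.2f}")
```

On that scale, 1.2× sits more than ten times closer to "no effect" than to 10×.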
In general, you should not expect linear dose-response relationships between carcinogen exposure and cancer risk. As an example to build intuition, the excess incidence of lung cancer in smokers is approximately “proportional to the fourth power of smoking duration multiplied by the number of cigarettes smoked per day.” The reasons why are complicated and depend heavily on the etiology of the cancer in question.
The fact that people who get 50x more sun exposure than you don’t have 50x higher rates of skin cancer does not mean your current level of sun exposure is not significantly contributing to your cancer risk.
As I understand it, dose-response curves for carcinogens are usually linear or concave-up, as in your smoking example, whereas your suggestion here seems to be that it might be concave-down, which I think would be very strange. After some guy has been working outside every day without sunscreen for 10 years, now it’s 10 years + 1 day, and he goes outside without sunscreen as usual, and … what? The UV light no longer causes as many CPDs?? The CPDs no longer cause as much melanoma?? That doesn’t make any sense, right? Or what? Sorry if I’m misunderstanding.
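Here is a toy illustration of what I mean by the three shapes, looking at the "10 years + 1 day" guy. The curves are purely illustrative: the fourth power echoes the smoking-duration relation quoted above, and the square root is just a stand-in I picked for "concave-down", not a claim about any real carcinogen.

```python
def marginal(risk, d):
    """Extra risk from one more day of exposure, after d days."""
    return risk(d + 1) - risk(d)

# Toy dose-response curves: cumulative risk (arbitrary units) vs. days of exposure.
curves = {
    "linear": lambda d: d,
    "concave_up": lambda d: d ** 4,       # like the quoted duration^4 relation
    "concave_down": lambda d: d ** 0.5,   # hypothetical stand-in
}

day = 3650  # roughly 10 years of daily exposure

# How harmful is day 3651 compared to the very first day?
ratio = {name: marginal(f, day) / marginal(f, 0) for name, f in curves.items()}

for name, r in ratio.items():
    print(f"{name}: day {day + 1} is {r:.3g}x as harmful as day 1")
```

Only in the concave-down case does the extra day become nearly harmless for the heavily-exposed guy, and that is the case that would need some mechanistic story.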
Green et al. (2011) conducted a 10-year study (n=1621) in Nambour, a town in Queensland, Australia. Participants were randomly assigned to daily or discretionary sunscreen application. After 10 years, 11 melanomas were found in the daily group and 22 in the discretionary group.
Cool study, thanks! But we don’t know whether that’s mediated by tans vs burns. (Again, everyone agrees that sunburns are bad.) The paper talks about how many sunburns people had at baseline but not during the study period (unless I missed it).
I know I didn’t respond to everything, I hope to keep reading and commenting when I get a chance (might be a couple weeks though…). Thanks again, these are great resources, and I’m looking forward to doing a second more careful pass through this post! It also sounds like you caught some unambiguous errors I made, so I’ll want to fix those. I figured I’d comment anyway just to share my quick response in the meantime.
There’s further discussion here including the comment section. No one found a smoking gun, but one of the two authors was also lead author on this other paper that is obvious BS.
I think of “brain-like AGI” as a threat model, not a plan. And then a question is whether the threat model is plausible vs far-fetched, and I guess you’re saying that my argument for “it’s plausible” is solid but I’m not communicating it clearly?
Nominally, my argument for “it’s plausible” is back in §1.5 (“What’s the probability that we’ll eventually wind up with brain-like AGI?”), and it sounds like you’re in “Opinion #4” camp. (“Opinion #4: “Brains are SO complicated—and we understand them SO little after SO much effort—that there’s just no way we’ll get brain-like AGI even in the next 100 years.”) I could flesh out my answer to Opinion #4, e.g. by adding three one-sentence bullet points that follow the three “brain complexity is easy to overstate” slides here. Hmm, I’ll think about it. The post is already quite long.
Sure, but that could still be consistent with “sunburns bad, suntans fine” theory, I figure. Maybe even if our ancestors were outside all the time, they would still sometimes lose their tan during a cloudy week and get sunburned?
I’m definitely open to the possibility that e.g. people of Scandinavian descent living in Nairobi simply cannot accommodate to the UV exposure by tanning, i.e. even if they are as tanned as they can possibly be, they’ll still get burned, there’s just too much UV. If so, then sunscreen (+ clothes, shade, etc.) is their only option to avoid sunburns, and again everyone agrees that sunburns are bad, both immediately and long-term.
(FYI, I just added notes to the top of §4.7 & §4.7.1 that I no longer endorse those sections as written.)
I think a lot of sense-of-self is related to imagining how you look in someone else’s eyes, which invokes a rewarding sense of pride in one’s self-image, as I discussed later in §3 of “Social drives 2” (2025).
So yeah, it makes sense that that reaction might gradually fade away upon prolonged isolation.
Here’s another example: I think that the relevant innate drive is very weak in many sociopaths, and that this explains the fact that at least one sociopath describes herself as not really having any sense of self in the way that most other people do:
–M.E. Thomas on Clearer Thinking podcast