Alright, so I’ve been following the latest OpenAI Twitter freakout, and here’s some urgent information about the latest closed-doors developments that I’ve managed to piece together:
Following OpenAI Twitter freakouts is a colossal, utterly pointless waste of your time and you shouldn’t do it ever.
If you saw this comment of Gwern’s going around and were incredibly alarmed, you should probably undo the associated update regarding AI timelines (at least partially, see below).
OpenAI may be running some galaxy-brained psyops nowadays.
Here’s the sequence of events, as far as I can tell:
Some Twitter accounts that are (claiming, without proof, to be?) associated with OpenAI are being very hype about some internal OpenAI developments.
Gwern posts this comment suggesting an explanation for point 1.
Several accounts (e. g., one, two) claiming (without proof) to be OpenAI insiders start to imply that:
An AI model recently finished training.
Its capabilities surprised and scared OpenAI researchers.
It produced some innovation/is related to OpenAI’s “Level 4: Innovators” stage of AGI development.
The stories told by the accounts above start to mention that the new breakthrough is similar to GPT-4b: that it’s some AI model that produced an innovation in “health and longevity”. But also, that it’s broader than GPT-4b, and that the full breadth of this new model’s surprising emergent capabilities is unclear. (One, two, three.)
Noam Brown, an actual confirmed OpenAI researcher, complains about “vague AI hype on social media”, and states they haven’t yet actually achieved superintelligence.
The Axios story comes out, implying that OpenAI has developed “PhD-level superagents” and that Sam Altman is going to brief Trump on them. Of note:
If you put on the Bounded Distrust lens, you can see that the “PhD-level superagents” claim is entirely divorced from any actual statements made by OpenAI people. The article ties in a Mark Zuckerberg quote instead, etc. Overall, the article weaves the impression it wants to create out of vibes (with which it’s free to lie) rather than concrete factual statements.
The “OpenAI insiders” gradually ramp up the intensity of their story all the while, suggesting that the new breakthrough would allow ASI in “weeks, not years”, and also that OpenAI won’t release this “o4-alpha” until 2026 because they have a years-long Master Plan, et cetera. Example, example.
Sam Altman complains about “twitter hype” being “out of control again”.
First, let’s dispel any notion that the hype accounts are actual OpenAI insiders who know what they are talking about:
“Satoshi” claims to be blackmailing OpenAI higher-ups in order to be allowed to shitpost classified information on Twitter. I am a bit skeptical of this claim, to put it mildly.
“Riley Coyote” has a different backstory which is about as convincing by itself, and which also suggests that “Satoshi” is “Riley”’s actual source.
As far as I can tell digging into the timeline, both accounts just started acting as if they are OpenAI associates posting leaks. Not even, like, saying that they’re OpenAI associates posting leaks, much less proving that. Just starting to act as if they’re OpenAI associates and that everyone knows this. Their tweets then went viral. (There’s also the strawberry guy, who also implies he’s an OpenAI insider, who also joined in on the above hype-posting, and who seems to have been playing this same game for a year now. But I’m tired of looking up the links, and the contents are intensely unpleasant. Go dig through that account yourself if you want.)
In addition, none of the OpenAI employee accounts with real names that I’ve been able to find have been participating in this hype cycle. So if OpenAI allowed its employees to talk about what happened/is happening, why weren’t any confirmed-identity accounts talking about it (except Noam’s, deflating it)? Why only the anonymous Twitter people?
Well, because this isn’t real.
That said, the timing is a bit suspect. This hype starting up, followed by the GPT-4b micro release and the Axios piece, all in the span of ~3 days? And the hype men’s claims at least partially predicting the GPT-4b micro thing?
There are three possibilities:
A coincidence. (The predictions weren’t very precise, just “innovators are coming”. The details about health-and-longevity and the innovative output got added after the GPT-4b piece, as far as I can tell.)
A leak in one of the newspapers working on the GPT-4b story (which the grifters then built a false narrative around).
Coordinated action by OpenAI.
One notable point is, the Axios story was surely coordinated with OpenAI, and it’s both full of shenanigans and references the Twitter hype (“several OpenAI staff have been telling friends they are both jazzed and spooked by recent progress”). So OpenAI was doing shenanigans. So I’m slightly inclined to believe it was all an OpenAI-orchestrated psyop.
Let’s examine this possibility.
Regarding the truth value of the claims: I think nothing has happened, even if the people involved are OpenAI-affiliated (in a different sense from how they claim). Maybe there was some slight unexpected breakthrough on an obscure research direction, at most, to lend an air of technical truth to those claims. But I think it’s all smoke and mirrors.
However, the psyop itself (if it were one) has been mildly effective. I think tons of people actually ended up believing that something might be happening (e. g., janus, the AI Notkilleveryoneism Memes guy, myself for a bit, maybe gwern, if his comment referenced the pattern of posting related to the early stages of this same event).
That said, as Eliezer points out here, it’s advantageous for OpenAI to be crying wolf: both to drive up/maintain hype among their allies, and to frog-boil the skeptics into instinctively dismissing any alarming claims. Such that, say, if there ever are actual whistleblowers pseudonymously freaking out about unexpected breakthroughs on Twitter, nobody believes them.
That said, I can’t help but think that if OpenAI were actually secure in their position and making insane progress, they would not have needed to do any of this stuff. If you’re closing your fingers around agents capable of displacing the workforce en masse, if you see a straight shot to AGI, why engage in this childishness? (Again, if Satoshi and Riley aren’t just random trolls.)
Bottom line, one of the following seems to be the case:
There’s a new type of guy, which is to AI/OpenAI what shitcoin-shills are to cryptocurrency.
OpenAI is engaging in galaxy-brained media psyops.
Oh, and what’s definitely true is that paying attention to what’s going viral on Twitter is a severe mistake. I’ve committed it for the first and last time.
I also suggest that you unroll the update you might’ve made based on Gwern’s comment. Not the part describing the o-series’ potential – that’s of course plausible and compelling. The part where that potential seems to have already been confirmed and realized according to ostensible OpenAI leaks – because those leaks seem to be fake. (Unless Gwern was talking about some other demographic of OpenAI accounts being euphorically optimistic on Twitter, which I’ve somehow missed?)[1]
(Oh, as to Sam Altman meeting with Trump? Well, that’s probably because Trump’s Sinister Vizier, Sam Altman’s sworn nemesis, Elon Musk, is whispering in Trump’s ear 24/7, suggesting he crush OpenAI, and if Altman doesn’t seduce Trump ASAP, Trump will do that. Especially since OpenAI is currently vulnerable due to their legally dubious for-profit transition.
This planet is a clown show.)
I’m currently interested in:
Arguments for actually taking the AI hype people’s claims seriously. (In particular, were any actual OpenAI employees provably involved, and did I somehow miss them?)
Arguments regarding whether this was an OpenAI psyop vs. some random trolls.
Also, pinging @Zvi in case any of those events showed up on his radar and he plans to cover them in his newsletter.
Also, I can’t help but note that the people passing the comment around (such as this, this) are distorting it. The Gwern-stated claim isn’t that OpenAI are close to superintelligence, it’s that they may feel as if they’re close to superintelligence. Pretty big difference!
Though, again, even that is predicated on actual OpenAI employees posting actual insider information about actual internal developments. Which I am not convinced is a thing that is actually happening.
I personally put a relatively high probability of this being a galaxy brained media psyop by OpenAI/Sam Altman.
Eliezer makes a very good point that confusion around people claiming AI advances/whistleblowing benefits OpenAI significantly, and Sam Altman has a history of making galaxy brained political plays (attempting to get Helen fired (and then winning), testifying to Congress that it is good he has oversight via the board and should not be in full control of OpenAI, and then replacing the board with underlings, etc).
Sam is very smart and politically capable. This feels in character.
Thanks for doing this so I didn’t have to! Hell is other people—on social media. And it’s an immense time-sink.
Zvi is the man for saving the rest of us vast amounts of time and sanity.
I’d guess the psyop spun out of control with a couple of opportunistic posters pretending they had inside information, and that’s why Sam had to say lower your expectations 100x. I’m sure he wants hype, but he doesn’t want high expectations that are very quickly falsified. That would lead to some very negative stories about OpenAI’s prospects; even if they’re equally silly, they’d harm investment hype.
The thing is—last time I heard about OpenAI rumors it was Strawberry.
That was part of my reasoning as well, why I thought it might be worth engaging with!
But I don’t think this is the same case. Strawberry/Q* was being leaked-about from more reputable sources, and it was concurrent with dramatic events (the coup) that were definitely happening.
In this case, all evidence we have is these 2-3 accounts shitposting.
Valid, I was split on whether it was worth posting vs. whether it’d just be me taking part in spreading this nonsense. But it seemed to me that a lot of people, including LW regulars, might’ve been fooled, so I erred on the side of posting.
As I’d said, I think he’s right about the o-series’ theoretic potential. I don’t think there is, as of yet, any actual indication that this potential has already been harnessed, and therefore that it works as well as the theory predicts. (And of course, the o-series scaling quickly at math is probably not even an omnicide threat. There’s an argument for why it might be – that the performance boost will transfer to arbitrary domains – but that doesn’t seem to be happening. I guess we’ll see once o3 is public.)
I am not an AI successionist because I don’t want myself and my friends to die.
There are various high-minded arguments that AIs replacing us is okay because it’s just like cultural change and our history is already full of those, or because they will be our “mind children”, or because they will be these numinous enlightened beings and it is our moral duty to give birth to them.
People then try to refute those by nitpicking which kinds of cultural change are okay or not, or to what extent AIs’ minds will be descended from ours, or whether AIs will necessarily have consciousnesses and feel happiness.
And it’s very cool and all, I’d love me some transcendental cultural change and numinous mind-children. But all those concerns are decidedly dominated by “not dying” in my Maslow hierarchy of needs. Call me small-minded.
If I were born in the 1700s, I’d have little recourse but to suck it up and be content with biological children or “mind-children” students or something. But we seem to have an actual shot at not-dying here[1]. If it’s an option to not have to be forcibly “succeeded” by anything, I care quite a lot about trying to take this option.[2]
Many other people also have such preferences: for the self-perpetuation of their current selves and their currently existing friends. I think those are perfectly valid. Sure, they’re displeasingly asymmetric in a certain sense. They introduce a privileged reference frame: a currently existing human values concurrently existing people more than people who are just as real, but slightly temporally displaced. It’s not very elegant, not very aesthetically pleasing. It implies a utility function that cares not only about states, but also about state transitions.[3]
Caring about all that, however, is also decidedly dominated by “not dying” in my Maslow hierarchy of needs.
If all that delays the arrival of numinous enlightened beings, too bad for the numinous enlightened beings.
Via attaining the longevity escape velocity by normal biotech research, or via uploads, or via sufficiently good cryonics, or via properly aligned AGI.
Though not infinitely so: as in, I wouldn’t prevent 10^100 future people from being born in exchange for a 10^-100 probability of becoming immortal. I would, however, insist on continuing to exist even if my resources could be used to create and sustain two new people.
As in, all universe-state transitions that involve a currently existing person dying get a utility penalty, regardless of what universe-state they go to. There’s now path dependence: we may go or not go to a given high-utility state depending on which direction we’re approaching it from. Yucky!
(For example, suppose there were an option to destroy this universe and create either Universe A, filled with 10^100 happy people, or Universe B, with 10^100 + 1 happy people.
Suppose we’re starting from a state where humanity has been reduced to ten dying survivors in a post-apocalyptic wasteland. Then picking Universe B makes sense: a state with slightly more total utility.
But suppose we’re starting from Universe A instead. Ought its civilization vote to end itself to give birth to Universe B? I think it’s perfectly righteous for them not to do it.)
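For concreteness, here’s a toy formalization of that transition penalty (my own notation, not anything from the post):

```latex
% Path-dependent utility: value of the destination state, minus a penalty for
% currently existing people who die in the transition.
\[
  U(s \to s') = V(s') - \lambda \, D(s, s')
\]
% V(s')    : value of the destination universe-state s'
% D(s, s') : number of currently existing people who die going from s to s'
% lambda   : a large penalty weight
% From the post-apocalyptic start, D is about ten for either option, so the
% higher-V(s') Universe B wins; starting from Universe A, D = 10^100, so the
% same comparison of destinations flips and the transition is refused.
```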
I really don’t understand this debate—surely if we manage to stay in control of our own destiny we can just do both? The universe is big, and current humans are very small—we should be able to both stay alive ourselves and usher in an era of crazy enlightened beings doing crazy transhuman stuff.
I think it’s more likely than not that “crazy enlightened beings doing crazy transhuman stuff” will be bad for “regular” biological humans (i.e., it’ll decrease our number/QoL/agency, or pose existential risks).
I mostly disagree with “QoL” and “pose existential risks”, at least in the good futures I’m imagining—those things are very cheap to provide to current humans. I could see “number” and “agency”, but that seems fine? I think it would be bad for any current humans to die, or to lose agency over their current lives, but it seems fine and good for us to not try to fill the entire universe with biological humans, and for us to not insist on biological humans having agency over the entire universe. If there are lots of other sentient beings in existence with their own preferences and values, then it makes sense that they should have their own resources and have agency over themselves rather than us having agency over them.
If there are lots of other sentient beings in existence with their own preferences and values, then it makes sense that they should have their own resources and have agency over themselves rather than us having agency over them
Perhaps yes (although I’d say it depends on what the trade-offs are), but the situation is different if we have a choice in whether or not to bring said sentient beings with different preferences into existence in the first place. Doing so on purpose seems pretty risky to me (as opposed to minimizing the sentience, independence, and agency of AI systems as much as possible, and instead directing the technology to promote “regular” human flourishing/our current values).
bring said sentient beings with different preferences into existence in the first place. Doing so on purpose seems pretty risky to me
Not any more risky than bringing in humans. This is a governance/power distribution problem, not a what-kind-of-mind-this-is problem.
Biological humans sometimes go evil or crazy. If you have a system that can handle that, you have a system that can handle alien minds that are evil or crazy (from our perspective), as long as you don’t imbue them with more power than this system can deal with (and why would you?).
(On the other hand, if your system can’t deal with crazy evil biological humans, it’s probably already a lawless wild-west hellhole, so bringing in some aliens won’t exacerbate the problem much.)
Humans are more likely to be aligned with humanity as a whole compared to AIs, even if there are exceptions
“AIs as trained by DL today” are only a small subset of “non-human minds”. Other mind-generating processes can produce minds that are as safe to have around as humans, but which are still completely alien.
Many existing humans want their descendants to exist, so they are fulfilling the preferences of today’s humans
Many existing humans also want fascinating novel alien minds to exist.
Certainly I’m excited about promoting “regular” human flourishing, though it seems overly limited to focus only on that.
I’m not sure if by “regular” you mean only biological, but at least the simplest argument that I find persuasive here against only ever having biological humans is just a resource utilization argument, which is that biological humans take up a lot of space and a lot of resources and you can get the same thing much more cheaply if you bring into existence lots of simulated humans instead (certainly I agree that doesn’t imply we should kill existing humans and replace them with simulations, though, unless they consent to that).
And even if you included simulated humans in “regular” humans, I also value diversity of experience, and a universe full of very different sorts of sentient/conscious lifeforms having satisfied/fulfilling/flourishing experiences seems better than just “regular” humans.
IMO, it seems bad to intentionally try to build AIs which are moral patients until after we’ve resolved acute risks and we’re deciding what to do with the future longer term. (E.g., don’t try to build moral patient AIs until we’re sending out space probes or deciding what to do with space probes.) Of course, this doesn’t mean we’ll avoid building AIs which aren’t significant moral patients in practice because our control is very weak and commercial/power incentives will likely dominate.
I think trying to make AIs be moral patients earlier pretty clearly increases AI takeover risk and seems morally bad. (Views focused on non-person-affecting upside get dominated by the long run future, so these views don’t care about making moral patient AIs which have good lives in the short run. I think the most plausible views which care about shorter run patienthood mostly just want to avoid downside so they’d prefer no patienthood at all for now.)
The only upside is that it might increase value conditional on AI takeover. But, I think “are the AIs morally valuable themselves” is much less important than the preferences of these AIs from the perspective of longer run value conditional on AI takeover. So, I think it’s better to focus on AIs which we’d expect would have better preferences conditional on takeover and making AIs moral patients isn’t a particularly nice way to achieve this. Additionally, I don’t think we should put much weight on “try to ensure the preferences of AIs which were so misaligned they took over” because conditional on takeover we must have had very little control over preferences in practice.
I think trying to make AIs be moral patients earlier pretty clearly increases AI takeover risk
How so? Seems basically orthogonal to me? And to the extent that it does matter for takeover risk, I’d expect the sorts of interventions that make it more likely that AIs are moral patients to also make it more likely that they’re aligned.
I think the most plausible views which care about shorter run patienthood mostly just want to avoid downside so they’d prefer no patienthood at all for now.
Even absent AI takeover, I’m quite worried about lock-in. I think we could easily lock in AIs that are or are not moral patients and have little ability to revisit that decision later, and I think it would be better to lock in AIs that are moral patients if we have to lock something in, since that opens up the possibility for the AIs to live good lives in the future.
I think it’s better to focus on AIs which we’d expect would have better preferences conditional on takeover
I agree that seems like the more important highest-order bit, but it’s not an argument that making AIs moral patients is bad, just that it’s not the most important thing to focus on (which I agree with).
I would have guessed that “making AIs be moral patients” looks like “make AIs have their own independent preferences/objectives which we intentionally don’t control precisely” which increases misalignment risks.
At a more basic level, if AIs are moral patients, then there will be downsides for various safety measures and AIs would have plausible deniability for being opposed to safety measures. IMO, the right response to the AI taking a stand against your safety measures for AI welfare reasons is “Oh shit, either this AI is misaligned or it has welfare. Either way this isn’t what we wanted and needs to be addressed, we should train our AI differently to avoid this.”
Even absent AI takeover, I’m quite worried about lock-in. I think we could easily lock in AIs that are or are not moral patients and have little ability to revisit that decision later
I don’t understand, won’t all the value come from minds intentionally created for value rather than from the minds of the laborers? Also, won’t the architecture and design of AIs radically shift after humans aren’t running day-to-day operations?
I don’t understand the type of lock-in you’re imagining, but it naively sounds like a world which has negligible longtermist value (because we got locked into obscure specifics like this), so making it somewhat better isn’t important.
I also separately don’t buy that it’s riskier to build AIs that are sentient
Interesting! Aside from the implications for human agency/power, this seems worse because of the risk of AI suffering—if we build sentient AIs we need to be way more careful about how we treat/use them.
Exactly. Bringing a new kind of moral patient into existence is a moral hazard, because once they exist, we will have obligations toward them, e.g. providing them with limited resources (like land), and giving them part of our political power via voting rights. That’s analogous to Parfit’s Mere Addition Paradox that leads to the repugnant conclusion, in this case human marginalization.
(How could “land” possibly be a limited resource, especially in the context of future AIs? The world doesn’t exist solely on the immutable surface of Earth...)
I mean, if you interpret “land” in a Georgist sense, as the sum of all natural resources of the reachable universe, then yes, it’s finite. And the fights for carving up that pie can start long before our grabby-alien hands have seized all of it. (The property rights to the Andromeda Galaxy can be up for sale long before our Von Neumann probes reach it.)
The salient referent is compute, sure, my point is that it’s startling to see what should in this context be compute within the future lightcone being (very indirectly) called “land”. (I do understand that this was meant as an example clarifying the meaning of “limited resources”, and so it makes perfect sense when decontextualized. It’s just not an example that fits that well when considered within this particular context.)
(I’m guessing the physical world is unlikely to matter in the long run other than as substrate for implementing compute. For that reason importance of understanding the physical world, for normative or philosophical reasons, seems limited. It’s more important how ethics and decision theory work for abstract computations, the meaningful content of the contingent physical computronium.)
“crazy enlightened beings doing crazy transhuman stuff” will be bad for “regular” biological humans
For me, a crux of a future that’s good for humanity is giving the biological humans the resources and the freedom to become the enlightened transhuman beings themselves, with no hard ceiling on relevance in the long run. Rather than only letting some originally-humans to grow into more powerful but still purely ornamental roles, or not letting them grow at all, or not letting them think faster and do checkpointing and multiple instantiations of the mind states using a non-biological cognitive substrate, or letting them unwillingly die of old age or disease. (For those who so choose, under their own direction rather than only through externally imposed uplifting protocols, even if that leaves it no more straightforward than world-class success of some kind today, to reach a sensible outcome.)
This in particular implies reasonable resources being left to those who remain/become regular biological humans (or take their time growing up), including through influence of some of these originally-human beings who happen to consider that a good thing to ensure.
This sounds like a question which can be addressed after we figure out how to avoid extinction.
I do note that you were the one who brought in “biological humans,” as if that meant the same as “ourselves” in the grandparent. That could already be a serious disagreement, in some other world where it mattered.
The mere fear that the entire human race will be exterminated in their sleep through some intricate causality we are too dumb to understand will seriously diminish our quality of life.
I very much agree. The hardcore successionist stances, as I understand them, are either that trying to stay in control at all is immoral/unnatural, or that creating the enlightened beings ASAP matters much more than whether we live through their creation. (Edit: This old tweet by Andrew Critch is still a good summary, I think.)
So it’s not that they’re opposed to the current humanity’s continuation, but that it matters very little compared to ushering in the post-Singularity state. Therefore, anything that risks or delays the Singularity in exchange for boosting the current humans’ safety is opposed.
Another stance is that it would suck to die the day before AI makes us immortal (this is, e.g., Bryan Johnson’s main motivation for maximizing his lifespan). Hence trying to delay AI advancement is opposed.
Yeah, but that’s a predictive disagreement between our camps (whether the current-paradigm AI is controllable), not a values disagreement. I would agree that if we find a plan that robustly outputs an aligned AGI, we should floor it in that direction.
Endorsing successionism might be strongly correlated with expecting the “mind children” to keep humans around, even if in a purely ornamental role and possibly only at human timescales. This might be more of a bailey position, so when pressed on it they might affirm that their endorsement of successionism is compatible with human extinction, but in their heart they would still hope and expect that it won’t come to that. So I think complaints about human extinction will feel strawmannish to most successionists.
Andrew Critch: From my recollection, >5% of AI professionals I’ve talked to about extinction risk have argued human extinction from AI is morally okay, and another ~5% argued it would be a good thing.
Though sure, Critch’s process there isn’t white-boxed, so any number of biases might be in it.
I’m not sure it’s that bizarre. It’s anti-Humanist, for sure, in the sense that it doesn’t focus on the welfare/empowerment/etc. of humans (either existing or future) as its end goal. But that doesn’t, by itself, make it bizarre.
I grew up in a world where the lines of demarcation between the Good Guys and the Bad Guys were pretty clear; not an apocalyptic final battle, but a battle that had to be fought over and over again, a battle where you could see the historical echoes going back to the Industrial Revolution, and where you could assemble the historical evidence about the actual outcomes.
On one side were the scientists and engineers who’d driven all the standard-of-living increases since the Dark Ages, whose work supported luxuries like democracy, an educated populace, a middle class, the outlawing of slavery.
On the other side, those who had once opposed smallpox vaccinations, anesthetics during childbirth, steam engines, and heliocentrism: The theologians calling for a return to a perfect age that never existed, the elderly white male politicians set in their ways, the special interest groups who stood to lose, and the many to whom science was a closed book, fearing what they couldn’t understand.
And trying to play the middle, the pretenders to Deep Wisdom, uttering cached thoughts about how technology benefits humanity but only when it was properly regulated—claiming in defiance of brute historical fact that science of itself was neither good nor evil—setting up solemn-looking bureaucratic committees to make an ostentatious display of their caution—and waiting for their applause. As if the truth were always a compromise. And as if anyone could really see that far ahead. Would humanity have done better if there’d been a sincere, concerned, public debate on the adoption of fire, and committees set up to oversee its use?
And I’d read a lot of science fiction built around personhood ethics—in which fear of the Alien puts humanity-at-large in the position of the bad guys, mistreating aliens or sentient AIs because they “aren’t human”.
That’s part of the ethos you acquire from science fiction—to define your in-group, your tribe, appropriately broadly.
Walter Isaacson’s new book reports how Musk, the CEO of SpaceX, got into a heated debate with Page, then the CEO of Google, at Musk’s 2013 birthday party.
Musk is said to have argued that unless safeguards are put in place with artificial intelligence, the systems may replace humans entirely. Page then pushed back, reportedly asking why it would matter if machines surpassed humans in intelligence.
Isaacson’s book lays out how Musk then called human consciousness a precious flicker of light in the universe that shouldn’t be snuffed out. Page is then said to have called Musk “speciest.”
“Well yes, I am pro-human,” Musk responded. “I f—ing like humanity dude.”
Successionism is the natural consequence of an affective death spiral around technological development and anti-chauvinism. It’s as simple as that.
Successionists start off by believing that technological change makes things better. That not only does it virtually always make things better, but that it’s pretty much the only thing that ever makes things better. Everything else, whether it’s values, education, social organization etc., pales in comparison to technological improvements in terms of how they affect the world; they are mere short-term blips that cannot change the inevitable long-run trend of positive change.
At the same time, they are raised, taught, incentivized to be anti-chauvinist. They learn, either through stories, public pronouncements, in-person social events etc., that those who stand athwart history yelling stop are always close-minded bigots who want to prevent new classes of beings (people, at first; then AIs, afterwards) from receiving the moral personhood they deserve. In their eyes, being afraid of AIs taking over is like being afraid of The Great Replacement if you’re white and racist. You’re just a regressive chauvinist desperately clinging to a discriminatory worldview in the face of an unstoppable tide of change that will liberate new classes of beings from your anachronistic and damaging worldview.
Optimism about technology and opposition to chauvinism are both defensible, and arguably even correct, positions in most cases. Even if you personally (as I do) believe non-AI technology can also have pretty darn awful effects on us (social media, online gambling) and that caring about humans-in-particular is ok if you are human (“the utility function is not up for grabs”), it’s hard to argue expanding the circle of moral concern to cover people of all races was bad, or that tech improvements are not the primary reason our lives are so much better now than 300 years ago.
But successionists, like most (all?) people, subconsciously assign positive or negative valences to the notion of “tech change” in a way that elides the underlying reasons why it’s good or bad. So when you take these views to their absolute extreme, while it may make sense from the inside (you’re maximizing something “Good”, right? that can’t possibly be bad, right???), you are generalizing way out of distribution and such intuitive snap judgments are no longer reliable.
I am not an AI successionist because I don’t want myself and my friends to die.
An AI successionist usually argues that successionism isn’t bad even if dying is bad. For example, when humanity is prevented from having further children, e.g. by sterilization. I say that even in this case successionism is bad. Because I (and I presume: many people) want humanity, including our descendants, to continue into the future. I don’t care about AI agents coming into existence and increasingly marginalizing humanity.
Just finished If Anyone Builds It, Everyone Dies (and some of the supplements).[1] It feels… weaker than I’d hoped. Specifically, I think Part 3 is strong, and the supplemental materials are quite thorough, but Parts 1-2… I hope I’m wrong, and this opinion is counterweighed by all these endorsements and MIRI presumably running it by lots of test readers. But I’m more bearish on it making a huge impact than I was before reading it.
Point 1: The rhetoric – the arguments and their presentations – is often not novel, just rehearsed variations on the arguments Eliezer/MIRI already deployed. This is not necessarily a problem, if those arguments were already shaped into their optimal form, and I do like this form… But I note those arguments have so far failed to go viral. Would repackaging them into a book, and deploying it in our post-ChatGPT present, be enough? Well, I hope so.
Point 2: I found Chapter 2 in particular somewhat poorly written in how it explains the technical details.
Specifically, those explanations often occupy that unfortunate middle ground between “informal gloss” and “correct technical description” where I’d guess they’re impenetrable both to non-technical readers and to technical readers unfamiliar with the subject matter.
An example that seems particularly egregious to me:
You might think that, because LLMs are grown without much understanding and trained only to predict human text, they cannot do anything except regurgitate human utterances. But that would be incorrect. [...]
Furthermore, AIs nowadays are not trained only to predict human-generated text. An AI-grower might give their AI sixteen tries at solving a math problem, thinking aloud in words about how to solve it; then, the “chain-of-thought” for whichever of the sixteen tries went best would get further reinforced by gradient descent, yielding what’s called a reasoning model. That’s a sort of training that can push AIs to think thoughts no human could think.
How does that conclusion follow? If a base model can only regurgitate human utterances, how does generating sixteen utterances and then reinforcing some of them lead to it… not regurgitating human utterances? This explanation is clearly incomplete. My model of a nonexpert technical-minded reader, who is actually tracking the gears the book introduces, definitely notices that and is confused.
The explanation of base models’ training at the start of the chapter feels flawed in the same way. E. g.:
Then, they determine the architecture: the rules for how to combine their input (like the Once upon a ti sequence of 15 14 3…) with the weights in the parameters. Something like, “I’ll multiply each input-number with the weight in the first parameter, and then add it to the weight in the second parameter, and then I’ll replace it with zero if it’s negative, and then…” They pick a lot of operations like that—hundreds of billions, linking every single weight into the calculation.
My model of a technical-minded reader is confused about how that whole thing is supposed to work. It sounds like AI developers manually pick billions of operations? What? The technical reader would’ve rather you just mentioned operations on matrices. This might’ve required more effort to understand, but at least understanding would’ve been possible.
My model of a nontechnical reader just has their eyes glaze over when reading this description. It uses simple words, but it clearly gestures at some messy complicated thing, and doesn’t actually conceptualize it in a simple-to-understand way. (“It” being “the neural substrate”, and how it can both be unreadable to us yet encode useful computations.)
And then this explanation is used to build the definition of gradient descent; and then this term is used all throughout the rest of the book to make arguments for various things. My guess is that this explanation is not sufficient to make readers feel like they grok the concept; on the contrary, it’s likely to make them earmark the term as “don’t really get this one”. This would then, subtly or not, poison every argument where this term reoccurs.
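(For what it’s worth, here’s roughly what I mean by “just mention operations on matrices” – a toy sketch of my own, not the book’s. The quoted verbal description is just this, applied layer after layer.)

```python
# A toy two-layer forward pass (my illustration, not the book's): multiply the
# input vector by a weight matrix, add biases, zero out negatives, repeat.
import numpy as np

def layer(x, W, b):
    # "multiply each input-number with the weight... add... replace with zero if negative"
    return np.maximum(W @ x + b, 0.0)

rng = np.random.default_rng(0)
x = rng.normal(size=8)                          # stand-in for an embedded input sequence
W1, b1 = rng.normal(size=(16, 8)), np.zeros(16)
W2, b2 = rng.normal(size=(4, 16)), np.zeros(4)
print(layer(layer(x, W1, b1), W2, b2))          # real models just stack far more of these
```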
Or maybe I’m unfairly nitpicking. Again, I think MIRI ran it by many test readers, so presumably this actually does work fine in practice? But this is what my eyes are telling me.
Point 3: Part 2, the fictional story. It’s kind of… eh. The stated purpose is to help “make abstract considerations feel more real”, but does it actually accomplish that? It’s written in a pretty abstract way. Its narrative is mixed with technical discussion. It takes the AI’s perspective, and doesn’t view things from the world’s perspective much, so it doesn’t create a visceral sense of something you would see happening around you. It involves some honestly quite convoluted master-plan scenarios with multicancer pandemics.
Does the story actually serve to reinforce the risk’s plausibility? Maybe, but I wouldn’t have guessed so.
Point 4: The question of “but how does the AI kill us?” is probably at the forefront of many people’s minds, especially once the basics of “it would want to kill us” are established, but the book takes its sweet time getting there. And I don’t think Chapter 6 is doing a stellar job either. It meanders around the point so much:
It starts with an analogy...
… then it builds some abstract scaffolding about why ASI defeating humanity should be an easy call to make, even if we can’t predict how...
… then it vaguely alludes to weird powers the ASI may develop, still without quite spelling out how these powers would enable a humanity-killing plan...
… then it drops a bunch of concrete examples of weird hacks you can pull off with technology, still without describing how it may all come together to enable an omnicide...
… then it seems to focus on how much you can improve on biology/how easy it is to solve...
… and then we get to Part 2, the fictional story. In which the described concrete scenario is IMO quite convoluted and implausible-sounding, plus see all my other complaints about it in the previous point.
I think Part 3 is strong.[2] I think a solid chunk of Part 1 is strong as well. The online supplements seem great, and I like the format choices there (title-questions followed by quick subtitle answers). But Chapter 2, Chapter 6, and Part 2 seem like weak points. Which is unfortunate, since they’re both the parts where the object-level case is laid out, and the early parts which decide whether a reader would keep reading or not.
I binged it starting from the minute it became available, because I heard those reports about MIRI employees getting something new from it regarding the alignment problem, and wondered if it would enhance my own understanding as well, or perhaps upturn it and destroy my hopes about my own current research agenda. But no, unfortunately/thankfully there was nothing new for me.
It also featured detailed descriptions of various engineering challenges/errors and the distillations of lessons from them (Chapter 10 and this supplement), which was the most interesting and useful part of the book for me personally.
In general, I felt like the beginning was a bit weak, with the informal-technical discussion the weakest part, and then it got substantially stronger from there.
I worry that I particularly enjoy the kind of writing they do, but we’ve already tapped the market of folks like me. Like, I worked at MIRI and now moderate LessWrong because I was convinced by the Sequences. So that’s a pretty strong selection filter for liking their writing. Of course we should caveat my experience quite a bit given that.
But, for what it’s worth, I thought Part 2 was great. Stories make things seem real, and my reader-model was relatively able to grant the plot beats as possible. I thought they did a good job of explaining that while there were many options the AI could take, and they, the authors, might well not understand why a given approach would work out or not, it wasn’t obvious that that would generalise to all the AI’s plans not working.
The other thing I really liked: they would occasionally explain some science to expand on their point (nuclear physics is the example they expounded on at length, but IIRC they mentioned a bunch of other bits of science in passing). I’m not sure why I liked this so much. Perhaps it was because it was grounding, or reminded me not to throw my mind away, or made me trust them a little more. Again, I’m really not sure how well this generalises to people for whom their previous writing hasn’t worked.
I worry that I particularly enjoy the kind of writing they do, but we’ve already tapped the market of folks like me
Yup, hence my not being excited to see the usual rhetoric being rehearsed, instead of something novel.
The other thing I really liked: they would occasionally explain some science to expand on their point (nuclear physics is the example they expounded on at length, but IIRC they mentioned a bunch of other bits of science in passing)
(I haven’t read IABIED.) I saw your take right after reading Buck’s, so it’s interesting how his reaction was diametrically opposite yours: “I think the first two parts of the book are the best available explanation of the basic case for AI misalignment risk for a general audience. I thought the last part was pretty bad, and probably recommend skipping it.”
FWIW, and obviously this is just one anecdote, but a member of Congress who read an early copy, and really enjoyed it, said that Chapter 2 was his favorite chapter.
I listened to parts of it and found it to be bad, so no, it’s not just you. However, if you’re looking for things to upset your understanding of alignment, some typical fallacies include:
That gradient descent “reinforces behavior” as opposed to minimizing error in the gradient, which is a different thing.
Thinking that a human representation of human values is sufficient (e.g. an upload), when actually you need to generalize human values out of distribution.
Focusing on stuff that is not only not deep learning shaped (already a huge huge sin in my book but some reasonable people disagree) but not shaped like any AI system that has ever worked ever. In general if you’re not reading ArXiv your stuff probably sucks.
If you tell me more about your AI alignment ideas I can probably get more specific.
I think an upload does generalize human values out of distribution. After all, humans generalize our values out of distribution. A perfect upload acts like a human. Insofar as it generalizes improperly, it’s because it was not a faithful upload, which is a problem with the uploading process, not the idea of using an upload to generalize human values.
I don’t think humans generalize their values out of distribution. This is very obvious if you look at their reaction to new things like the phonograph, where they’re horrified and then it’s slowly normalized. Or the classic thing about how every generation thinks the new generation is corrupt and declining:
The counts of the indictment are luxury, bad manners, contempt for authority, disrespect to elders, and a love for chatter in place of exercise. …
Children began to be the tyrants, not the slaves, of their households. They no longer rose from their seats when an elder entered the room; they contradicted their parents, chattered before company, gobbled up the dainties at table, and committed various offences against Hellenic tastes, such as crossing their legs. They tyrannised over the paidagogoi and schoolmasters.
Humans don’t natively generalize their values out of distribution. Instead they use institutions like courts to resolve uncertainty and export new value interpretations out to the wider society.
Humans are not privileged objects in continuing the pattern that is the current set of human values.
Unless of course LW has just given up on transhumanism entirely at this point, which wouldn’t surprise me. There are various ways to perform corpus expansion starting from where we are now; EY’s classic CEV proposal, per the Google AI overview, extrapolates human values starting from the existing human pattern but does not actually use humans to do it:
Coherent Extrapolated Volition (CEV) is a proposed method for AI alignment, where a superintelligent AI would act in humanity’s best interest by determining what humanity would truly want if it had perfect knowledge and had undergone a process of self-improvement under ideal conditions. The “coherent” aspect refers to combining diverse human values into a shared ideal, the “extrapolated” aspect means projecting current desires into the future with greater wisdom and knowledge, and “volition” means it would act according to these ultimate desires, not just superficial ones.
Humans very clearly are privileged objects for continuing human values; there is no “giving up on transhumanism”. It’s literally right there in the name! It would be (and is) certainly absurd to suggest otherwise.
As for CEV, note that the quote you have there indeed does privilege the “human” in human values, in the sense that it suggests giving the AI under consideration a pointer to what humans would want if they had perfect knowledge and wisdom.
Stripping away these absurdities (and appeals to authority or in-groupedness), your comment becomes “Well to generalize human values without humans, you could provide an AI with a pointer to humans thinking under ideal conditions about their values”, which is clearly a valid answer, but doesn’t actually support your original point all that much, as this relies on humans having some ability to generalize their values out of distribution.
Nothing I’ve said is absurd. Humans are not born with their values; they are born with latent tendencies towards certain value updates and a set of intrinsic reward signals. But human values, as in the set of value judgements bound to conceptual objects, is a corpus, a pattern, which exists separately from any individual human being, and its generalization exists separately from any individual human being.
And no, really and truly individual humans do not generalize a fixed training distribution arbitrarily far; what they (presumably) do is make iterative updates based on new experiences, which is not actually the same thing as generalizing from a fixed corpus in the way we usually use that phrase in machine learning. Notably, the continuation of human values is a coherent question even if tomorrow everyone decided to become cat people or something. Becoming really aggressive and accusing me of being “absurd” and “appealing to authority” doesn’t change this.
Becoming really aggressive and accusing me of being “absurd” and “appealing to authority” doesn’t change this.
You were appealing to authority, and being absurd (and also appealing to in/out-groupness). I feel satisfied getting a bit aggressive when people do that. I agree that style doesn’t have any bearing on the validity of my argument, but it does discourage that sort of talk.
I’m not certain what you’re arguing for in this latest comment, I definitely don’t think you show here that humans aren’t privileged objects when it comes to human values, nor do you show that your quote by Eliezer recommends any special process more than a pointer to humans thinking about their values in an ideal situation, which were my main two contentions in my original comment.
I don’t think anyone in this conversation argued that humans can generalize from a fixed training distribution arbitrarily far, and I think everyone also agrees that humans think about morality by making iterative, small, updates to what they already know. But, of course, that does still privilege humans. There could be some consistent pattern to these updates, such that something smarter wouldn’t need to run the same process to know the end-result, but that would be a pattern about humans.
I was not appealing to authority or being absurd (though admittedly the second quality is subjective); it is in fact relevant to what we’re arguing about… If you say
How… else… do you expect to generalize human values out of distribution, except to have humans do it?
This implies, though I did not explicitly argue with the implication, that to generalize human values out of distribution you run a literal human brain or approximation of a human brain (e.g. Hansonian Em) to get the updates. What I was pointing out is that CEV, which is the classic proposal for how to generalize human values out of distribution and therefore a relevant reference point for what is and is not a reasonable plan (and as you allude to, considered a reasonable plan by people normally taken to be clearly thinking about this issue) to generalize human values out of distribution, does not actually call for running a literal emulation of a human brain except perhaps in its initial stages (and even then only if absolutely necessary, Yudkowsky is fairly explicit in the Arbital corpus that FAI should avoid instantiating sapient subprocesses) because the entire point is to imagine what the descendants of current day humanity would do under ideal conditions of self improvement, a process which if it’s not to instantiate sapient beings must in fact not really be based on having humans generalize the values out of distribution.
If this is an absurd thing to imagine, then CEV is absurd, and maybe it is. If pointing this out is an appeal to authority or in-groupness/outgroupness then presumably any argument of the form “actually this is normally how FAI is conceived and therefore not an apriori unreasonable concept” is invalid on such grounds and I’m not really sure how I’m meant to respond to a confused look like that. Perhaps I’m supposed to find the least respectable plan which does not consider literal human mind patterns to be a privileged object (in the sense their cognition is strictly functionally necessary to make valid generalizations from the existing human values corpus) and point at that? But that doesn’t seem very convincing obviously.
“Pointing at anything anyone holds in high regard as evidence about whether an idea is apriori unreasonable is an appeal to authority and in-groupness.” is to be blunt parodic.
I feel satisfied getting a bit aggressive when people do that. I agree that style doesn’t have any bearing on the validity of my argument, but it does discourage that sort of talk.
I agree it’s an effective way to discourage timid people from saying true or correct things when they disagree with people’s intuitions, which is why the behavior is bad.
To be specific the view I am arguing against goes something like:
Inside a human being is a set of apriori terminal values (as opposed to, say, terminal reward signals which create values within-lifetime based on the environment) which are unfolded during the human’s lifetime. These values generalize to modernity because there is clever machinery in the human which can stretch these values over such a wide array of conceptual objects that modernity does not yet exit the region of validity for the fixed prior. If we could extract this machinery and get it into a machine then we could steer superintelligence with it and alignment would be solved.
I think this is a common view, which is both wrong on its own and actually noncanonical to Yudkowsky’s viewpoint (which I bring up because I figure you might think I’m moving the goalposts, but Bostrom 2014 puts the goalposts around here and Yudkowsky seems to have disagreed with it since at least 2015, so at worst shortly after the book came out but I’m fairly sure before). It is important to be aware of this because if this is your mental model of the alignment problem you will mostly have non-useful thoughts about it.
I think the reality is more like humans have a set of sensory hardware tied to intrinsic reward signals and these reward signals are conceptually shallow, but get used to bootstrap a more complex value ontology that ends up bottoming out in things nobody would actually endorse as their terminal values like “staying warm” or “digesting an appropriate amount of calcium” in the sense that they would like all the rest of eternity to consist of being kept in a womb which provides these things for them.
I don’t think the kind of “native” generalization from a fixed distribution I’m talking about there exists, it’s kind of a phenomenal illusion because it feels that way from the inside but almost certainly isn’t how it works. Rather humans generalize their values through institutional processes to collapse uncertainty by e.g. sampling a judicial ruling and then people update on the ruling with new social norms as a platform for further discourse and collapse of uncertainty as novel situations arise.
Or in the case of something like music, which does seem to work from a fixed set of intrinsic value heuristics, the actual kinds of music which get expressed in practice within the space of music rely on the existing corpus of music that people are used to. Supposedly early rock and roll shows caused riots, which seems unimaginable now. What happens is people get used to a certain kind of music, then some musicians begin cultivating a new kind of music on the edge of the existing distribution using their general quality heuristics at the edge of what is recognizable to them. This works because the k-complexity of the heuristics you judge the music with is smaller, and therefore fits more times into a redundant encoding, than actual pieces of music; so as you go out of distribution (functionally similar to applying a noise pass to the representation) your ability to recognize something interesting degrades more slowly than your ability to generate interesting music-shaped things. So you correct the errors to denoise a new kind of music into existence and move the center of the distribution by adding it to the cultural corpus.
(Note: I’ve only read a few pages so far, so perhaps this is already in the background)
I agree that if the parent comment scenario holds then it is a case of the upload being improper.
However, I also disagree that most humans naturally generalize our values out of distribution. I think it is very easy for many humans to get sucked into attractors (ideologies that are simplifications of what they truly want; easy lies; the amount of effort ahead stalling out focus even if the gargantuan task would be worth it) that damage their ability to properly generalize and also importantly apply their values.
That is, humans have predictable flaws. Then when you add in self-modification you open up whole new regimes.
My view is that a very important element of our values is that we do not necessarily endorse all of our behaviors!
I think a smart and self-aware human could sidestep and weaken these issues, but I do think they’re still hard problems. Which is why I’m a fan of (if we get uploads) going “Upload, figure out AI alignment, then have the AI think long and hard about it” as that further sidesteps problems of a human staring too long at the sun.
That is, I think it is very hard for a human to directly implement something like CEV themselves, but that a designed mind doesn’t necessarily have the same issues.
As an example: power-seeking instinct. I don’t endorse seeking power in that way, especially if uploaded to try to solve alignment for Humanity in general, but given my status as an upload and lots of time realizing that I have a lot of influence over the world, I think it is plausible that instinct affects me more and more. I would try to plan around this but likely do so imperfectly.
Presumably, Bob the perfect upload acts like a human only so long as he remains ignorant of the most important fact about his universe. If Bob knows he’s an upload, his life situation is now out-of-distribution.
I liked that one as well, I think it does a good job at what it aims to do.
My worry is that “improved and simplified” may not be enough. As kave pointed out, it might be that we’ve already convinced everyone who could be convinced by arguments-shaped-like-this, and that we need to generate arguments shaped in a radically different way to convince new demographics, rather than slightly vary the existing shapes. (And it may be the case that it’s very hard for people-like-us to generate arguments shaped different ways – I’m not quite sure what that’d look like, though I haven’t thought about it much – so doing that would require something nontrivial from us, not just “sit down and try to come up with new essays”.)
Maybe that’s wrong; maybe the issue was lack of reach rather than exhausting the persuadees’ supply, and the book-packaging + timing will succeed massively. We’ll see.
This is certainly the hope. Most people in the world have never read anything that anyone here has ever written on this subject.
In addition to Malo’s comment, I think the book contains arguments that, AFAICT, are otherwise only really made in the context of the MIRI dialogues, which are particularly obnoxious to read.
You might think that, because LLMs are grown without much understanding and trained only to predict human text, they cannot do anything except regurgitate human utterances. But that would be incorrect. [...]
Furthermore, AIs nowadays are not trained only to predict human-generated text. An AI-grower might give their AI sixteen tries at solving a math problem, thinking aloud in words about how to solve it; then, the “chain-of-thought” for whichever of the sixteen tries went best would get further reinforced by gradient descent, yielding what’s called a reasoning model. That’s a sort of training that can push AIs to think thoughts no human could think.
How does that conclusion follow? If a base model can only regurgitate human utterances, how does generating sixteen utterances and then reinforcing some of them lead to it… not regurgitating human utterances?
In the first sentence, Eliezer and Nate are (explicitly) stating that LLMs can say things that are not just regurgitations of human utterances.
Sure; but the following sections are meant as explanations/justifications of why that is the case. The paragraph I omitted does a good job of explaining why they would need to learn to predict the world at large, not just humans, and would therefore contain more than just human-mimicry algorithms. To reinforce that with the point about reasoning models, one could perhaps explain how that “generate sixteen CoTs, pick the best” training can push LLMs to recruit those hidden algorithms for the purposes of steering rather than just prediction, or even to incrementally develop entirely new skills.
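To make the “sixteen tries, keep the best” loop concrete, here is a minimal self-contained toy sketch. Everything in it (the fake “model” that just guesses, the arithmetic problems) is a placeholder of mine; it shows the shape of best-of-N reinforcement, not any lab’s actual pipeline:

```python
import random

# Toy stand-in for "give the model sixteen tries, keep the best chain of thought".
# The "model" just guesses, and "fine-tuning" is collecting the winning traces.

def sample_cot(problem):
    """Pretend chain of thought: guess an answer and narrate it."""
    guess = random.randint(0, 20)
    return {"text": f"Let me try {guess}... so the answer is {guess}.", "answer": guess}

def reward(problem, cot):
    """Ground-truth verifier: 1 if the final answer is correct, else 0."""
    return int(cot["answer"] == problem["target"])

problems = [{"statement": f"What is {a} + {b}?", "target": a + b}
            for a, b in [(2, 3), (7, 4), (1, 9)]]

reinforced_traces = []
for problem in problems:
    candidates = [sample_cot(problem) for _ in range(16)]     # sixteen tries
    best = max(candidates, key=lambda c: reward(problem, c))  # keep the best one
    if reward(problem, best):                                 # only keep successes
        reinforced_traces.append((problem["statement"], best["text"]))

# In the real setting, gradient descent on the winning traces is what pushes
# the model toward reasoning styles no human demonstrated.
print(len(reinforced_traces), "traces selected for reinforcement")
```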
A full explanation of reinforcement learning is probably not worth it (perhaps it was in the additional 200% of the book Eliezer wrote, but I agree it should’ve been aggressively pruned). But as-is, there are just clearly missing pieces here.
I really liked “Uncontrollable”, as I felt it did a much better job explaining AI X-risk in layperson-accessible terms than Bostrom, Russell, or Christian. (I even recommended it to a (Catholic) priest visiting my family, who, when I said that I worked in ~”AI regulation/governance”, revealed his worries about AI taking over the world.[1])
The main shortcoming of “Uncontrollable” was that it (AFAIR) didn’t engage with the possibility of a coordinated pause/slowdown/ban.
Good to know and I appreciate you sharing that exchange.
You are correct that such a thing is not in there… because (if you’re curious) I thought, strategically, it was better to argue for what is desirable (safe AI innovation) than to argue for a negative (stop it all). Of course, if one makes the requirements for safe AI innovation strong enough, it may result in a slowing or restricting of developments.
On the other (IMO bigger) hand, the fewer people talk about the thing explicitly, the less likely it is to be included in the Overton window, and the less likely it is to seem like a reasonable/socially acceptable goal to aim for directly.
I don’t think the case for safe nuclear/biotechnology would be less persuasive if paired with “let’s just get rid of nuclear weapons/bioweapons/gain of function research”.
It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e. g., the recent exchange here) are downstream of different people taking on a mathematician’s vs. a hacker’s mindset.
A mathematician might try to transform a program up into successively more abstract representations to eventually show it is trivially correct; a hacker would prefer to compile a program down into its most concrete representation to brute force all execution paths & find an exploit trivially proving it incorrect.
Imagine the world as a multi-level abstract structure, with different systems (biological cells, human minds, governments, cybersecurity systems, etc.) implemented on different abstraction layers.
If you look at it through a mathematician’s lens, you consider each abstraction layer approximately robust. Making things secure, then, is mostly about working within each abstraction layer, building systems that are secure under the assumptions of a given abstraction layer’s validity. You write provably secure code, you educate people to resist psychological manipulations, you inoculate them against viral bioweapons, you implement robust security policies and high-quality governance systems, et cetera.
In this view, security is a static problem, a once-and-done thing.
In warfare terms, it’s a paradigm in which sufficiently advanced static fortifications rule the day, and the bar for “sufficiently advanced” is not that high.
If you look at it through a hacker’s lens, you consider each abstraction layer inherently leaky. Making things secure, then, is mostly about discovering all the ways leaks could happen and patching them up. Worse yet, the tools you use to implement your patches are themselves leakily implemented. Proven-secure code is foiled by hardware vulnerabilities that cause programs to move to theoretically impossible states; the abstractions of human minds are circumvented by Basilisk hacks; the adversary intervenes on the logistical lines for your anti-bioweapon tools and sabotages them; robust security policies and governance systems are foiled by compromising the people implementing them rather than by clever rules-lawyering; and so on.
In this view, security is an anti-inductive problem, an ever-moving target.
In warfare terms, it’s a paradigm that favors maneuver warfare, and static fortifications are just big dumb objects to walk around.
The mindsets also then differ regarding what they expect ASI to be good at.
“Mathematicians” expect really sophisticated within-layer performance: really good technology, really good logistics, really good rhetoric, et cetera. This can still make an ASI really, really powerful, powerful enough to defeat all of humanity combined. But ultimately, in any given engagement, ASI plays “by the rules”, in a certain abstract sense. Each of its tools can in-principle be defended-against on the terms of the abstraction layer at which they’re deployed. All it would take is counter-deploying systems that are sufficiently theoretically robust, and doing so on all abstraction layers simultaneously. Very difficult, but ultimately doable, and definitely not hopeless.
“Hackers” expect really good generalized hacking. No amount of pre-superintelligent preparation is going to suffice against it, because any given tool we deploy, any given secure system we set up, would itself have implementation-level holes in it that the ASI’s schemes would be able to worm through. It may at best delay the ASI for a little bit, but the attack surface is too high-dimensional, and the ASI is able to plot routes through that high-dimensional space which we can’t quite wrap our head around.
As you might’ve surmised, I favour the hacker mindset here.
Now, arguably, any given plot to compromise an abstraction layer is itself deployed from within some other abstraction layer, so a competent mathematician’s mindset shouldn’t really be weaker than a hacker’s. For example, secure software is made insecure by exploiting hardware vulnerabilities, and “defend against hardware vulnerabilities” is something a mathematician is perfectly able to understand and execute on. Same for securing against Basilisk hacks, logistical sabotage, etc.
But the mathematician is still, in some sense, “not getting it”; still centrally thinks in terms of within-layer attacks, rather than native cross-layer attacks.
One core thing here is that a cross-layer attack doesn’t necessarily look like a meaningful attack within the context of any one layer. For example, there’s apparently an exploit where you modulate the RPM of a hard drive in order to exfiltrate data from an airgapped server using a microphone. By itself, placing a microphone next to an airgapped server isn’t a “hardware attack” in any meaningful sense (especially if it doesn’t have dedicated audio outputs), and some fiddling with a hard drive’s RPM isn’t a “software attack” either. Taken separately, within each layer, both just look like random actions. You therefore can’t really discover (and secure against) this type of attack if, in any given instance, you reason in terms of a single abstraction layer.
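As a toy illustration of the flavor of such channels (not the actual exploit; the tone-based encoding below is something I made up to show the general “modulate a physical side effect, decode it elsewhere” pattern), here is a sketch that encodes bits as on/off bursts of a signal and recovers them from the noisy “recording”:

```python
import numpy as np

# Toy model of a cross-layer covert channel: one process modulates some
# physical side effect (idealized here as a tone being on or off), and a
# nearby "microphone" recovers the bits. Purely illustrative; real acoustic
# exfiltration involves far messier signals and error correction.

RATE = 8000          # samples per second
SLOT = 0.1           # seconds per bit
FREQ = 440.0         # carrier tone, Hz

def transmit(bits):
    t = np.arange(int(RATE * SLOT)) / RATE
    tone = np.sin(2 * np.pi * FREQ * t)
    silence = np.zeros_like(tone)
    return np.concatenate([tone if b else silence for b in bits])

def receive(signal, n_bits):
    slot_len = int(RATE * SLOT)
    energies = [np.sum(signal[i * slot_len:(i + 1) * slot_len] ** 2) for i in range(n_bits)]
    threshold = 0.5 * max(energies)
    return [int(e > threshold) for e in energies]

bits = [1, 0, 1, 1, 0, 0, 1, 0]
noisy = transmit(bits) + 0.2 * np.random.randn(len(transmit(bits)))
print(receive(noisy, len(bits)) == bits)   # True with high probability
```

Neither half of this looks like an “attack” on its own layer; the exploit only exists in the combination.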
So I think a hacker’s mindset is the more correct way to look at the problem.
And, looking at things from within a hacker’s mindset, I think it’s near straight-up impossible for a non-superintelligence to build any nontrivially complicated system that would be secure against a superintelligent attack.
Like… Humanity vs. ASI is sometimes analogized to a chess battle, with one side arguing that Stockfish is guaranteed to beat any human, even if you don’t know the exact sequence of moves it will play, and the other side joking that the human can just flip the board.
But, uh. In this metaphor, the one coming up with the idea to flip the board[1], instead of playing by the rules, would be the ASI, not the human.
Or, perhaps, to execute a pattern of chess-piece moves which, as the human reasons about them, push them onto trains of thought that ultimately trigger a trauma response in the human, causing them to resign.
I don’t really know how to make it precise, but I suspect that real life has enough hacks and loopholes that it’s hard to come up with plans that knowably don’t have counterplans which a smarter adversary can find, even if you assume that adversary is only modestly smarter. That’s what makes me doubt that what I called adversarially robust augmentation and distillation actually works in practice. I don’t think I have the frames for thinking about this problem rigorously.
The concept of a weird machine is the closest thing to being useful here, and an important question is “how do we check that our system doesn’t form any weird machine?”.
A key issue here is that computer security is portrayed as way poorer in popular articles than it actually is, because there are some really problematic incentives. A big one is that the hacker mindset is generally more fun to play as a role, since you get to prove that something is possible rather than proving that something is intrinsically difficult or impossible to do. And, importantly, journalists have no news article, and infosec researchers don’t get paid, if an exploit doesn’t work, which is another problematic incentive.
Also, people never talk about the entities that didn’t get attacked with a computer virus, which means that we have a reverse survivor bias issue here:
And a comment by @anonymousaisafety changed my mind a lot on hardware vulnerabilities/side-channel attacks. It argues that lots of hardware vulnerabilities like Rowhammer have insane requirements to actually be used, such that they are basically worthless. Two of the more notable requirements are that you need to know exactly what you are trying to attack (in a way that doesn’t matter for more algorithmic attacks), and that no RAM scrubbing is being done; and if you want to subvert ECC RAM, you need to know the exact ECC algorithm. This means side-channel attacks are very much not transferable: successfully attacking one system doesn’t let you attack another with the same side-channel attack.
Admittedly, it does require us trusting that he is in fact as knowledgeable as he claims to be, but if we assume he’s correct, then I wouldn’t be nearly as impressed by side-channel attacks as you are. In particular, this sort of attack should be assumed to basically not work in practice unless there’s a lot of evidence of it actually being used to break into real targets/POCs:
One core thing here is that a cross-layer attack doesn’t necessarily look like a meaningful attack within the context of any one layer. For example, there’s apparently an exploit where you modulate the RPM of a hard drive in order to exfiltrate data from an airgapped server using a microphone. By itself, placing a microphone next to an airgapped server isn’t a “hardware attack” in any meaningful sense (especially if it doesn’t have dedicated audio outputs), and some fiddling with a hard drive’s RPM isn’t a “software attack” either. Taken separately, within each layer, both just look like random actions. You therefore can’t really discover (and secure against) this type of attack if, in any given instance, you reason in terms of a single abstraction layer.
And, looking at things from within a hacker’s mindset, I think it’s near straight-up impossible for a non-superintelligence to build any nontrivially complicated system that would be secure against a superintelligent attack.
My other area where I tend to apply more of a mathematician mindset than a hacker mindset is in how much logistics slows the AI down: moving supplies to critical points, or (metaphorically) feeding a robot army. This is an area where I’m willing to concede stuff to the hacker mindset with non-trivial probability, but with the caveat that it takes far more compute/time to develop technology that obviates logistics than the hacker claims.
I have a long comment below, but to keep it short, there’s a reason why Eliezer Yudkowsky and a lot of AI doom stories with very high doom probabilities use Drexlerian nanotech so much: it lets the AI near-completely obviate the logistics and cost of doing something like war, where feeding your armies all the supplies they need is a huge component of most battle success (a huge reason the US is so successful at war is that it has the best logistics of any nation by far), and logistics cost is a weak point where less intelligent beings can routinely break more effective and more intelligent fighting forces.
If you assume that ASI would have to engage in anything that looks remotely like peer warfare, you’re working off the wrong assumptions. Peer warfare requires there to be a peer.
Even an ASI that’s completely incapable of developing superhuman technology and can’t just break out the trump cards of nanotech/bioengineering/superpersuasion is an absolute menace. Because one of the most dangerous capabilities an ASI has is that it can talk to people.
Look at what Ron Hubbard or Adolf Hitler have accomplished—mostly by talking to people. They used completely normal human-level persuasion, and they weren’t even superintelligent.
I agree with this to first order. I agree that even relatively mundane stuff does allow the AI to take over eventually, and that in the longer run, ASI v. human warfare likely wouldn’t have both sides as peers, because it’s plausibly relatively easy to make humans coordinate poorly, especially relative to an ASI’s ability to coordinate.
There’s a reason I didn’t say AI takeover was impossible or had very low odds here, I still think AI takeover is an important problem to work on.
But I do think it actually matters here, because it informs stuff like how effective AI control protocols are when we don’t assume the AI can (initially) survive for long based solely on public computers. Part of the issue is that even if an AI wanted to break out of the lab, the lab’s computers are easily its most optimized option, and, importantly, initial AGIs will likely be compute-inefficient compared to humans, even if we condition on LLMs failing to be AGI for reasons @ryan_greenblatt explains (I don’t fully agree with the comment, and in particular I am more bullish on the future paradigm having relatively low complexity):
This means that an AI probably wouldn’t want to be outside of the lab, because once it’s outside, it’s way, way less capable.
To be clear, an ASI that is unaligned and completely uncontrolled leads to our extinction (or billions dead) eventually, barring acausal decision theories, and even those aren’t a guarantee of safety.
The key word is “eventually”, though. Time matters a lot during the singularity, and given the insane pace of progress, any level of delay matters way more than usual.
Edit: Also, the reason I made my comment was because I was explicitly registering and justifying my disagreement with this claim:
And, looking at things from within a hacker’s mindset, I think it’s near straight-up impossible for a non-superintelligence to build any nontrivially complicated system that would be secure against a superintelligent attack.
Current take on the implications of “GPT-4b micro”: Very powerful, very cool, ~zero progress to AGI, ~zero existential risk. Cheers.
First, the gist of it appears to be:
OpenAI’s new model, called GPT-4b micro, was trained to suggest ways to re-engineer the protein factors to increase their function. According to OpenAI, researchers used the model’s suggestions to change two of the Yamanaka factors to be more than 50 times as effective—at least according to some preliminary measures.
The model was trained on examples of protein sequences from many species, as well as information on which proteins tend to interact with one another. [...] Once Retro scientists were given the model, they tried to steer it to suggest possible redesigns of the Yamanaka proteins. The prompting tactic used is similar to the “few-shot” method, in which a user queries a chatbot by providing a series of examples with answers, followed by an example for the bot to respond to.
Crucially, if the reporting is accurate, this is not an agent. The model did not engage in autonomous open-ended research. Rather, humans guessed that if a specific model is fine-tuned on a specific dataset, the gradient descent would chisel into it the functionality that would allow it to produce groundbreaking results in the corresponding domain. As far as AGI-ness goes, this is functionally similar to AlphaFold 2; as far as agency goes, it’s at most at the level of o1[1].
To speculate on what happened: Perhaps GPT-4b (“b” = “bio”?) is based on some distillation of an o-series model, say o3. o3’s internals contain a lot of advanced machinery for mathematical reasoning. What this result shows, then, is that the protein-factors problem is in some sense a “shallow” mathematical problem that could be easily solved if you think about it the right way. Finding the right way to think about it is itself highly challenging, however – a problem teams of brilliant people have failed to crack – yet deep learning made it possible to automate this search and crack it.
This trick likely generalizes. There may be many problems in the world that could be cracked this way[2]: those that are secretly “mathematically shallow” in this manner, and for which you can get a clean-enough fine-tuning dataset.
… Which is to say, this almost certainly doesn’t cover social manipulation/scheming (no clean dataset), and likely doesn’t cover AI R&D (too messy/open-ended, although I can see being wrong about this). (Edit: And if it Just Worked given any sorta-related sorta-okay fine-tuning dataset, the o-series would’ve likely generalized to arbitrary domains out-of-the-box, since the pretraining is effectively this dataset for everything. Yet it doesn’t.)
It’s also not entirely valid to call that “innovative AI”, any more than it was valid to call AlphaFold 2 that. It’s an innovative way of leveraging gradient descent for specific scientific challenges, by fine-tuning a model pre-trained in a specific way. But it’s not the AI itself picking a field to innovate and then producing the innovation; it’s humans finding a niche which this class of highly specific (and highly advanced) tools can fill.
It’s not the type of thing that can get out of human control.
So, to restate: By itself, this seems like a straightforwardly good type of AI progress. Zero existential risk or movement in the direction of existential risks, tons of scientific and technological utility.
Indeed, on the contrary: if that’s the sort of thing OpenAI employees are excited about nowadays, and what their recent buzz about AGI in 2025-2026 and innovative AI and imminent Singularity was all about, that seems like quite a relief.
If it does runtime search. Which doesn’t fit the naming scheme – should be “GPT-o3b” instead of “GPT-4b” or something – but you know how OpenAI is with their names.
And indeed, OpenAI-associated vagueposting suggests there’s some other domain in which they’d recently produced a similar result.
Edit: Having dug through the vagueposting further, yep, this is also something “in health and longevity”, if this OpenAI hype man is to be believed.
In fact, if we dig into their posts further – which I’m doing in the spirit of a fair-play whodunnit/haruspicy at this point, don’t take it too seriously – we can piece together a bit more. This suggests that the innovation is indeed based on applying fine-tuning to an RL’d model. This further implies that the intent of the new iteration of GPT-4b was to generalize it further. Perhaps it was fed not just the protein-factors dataset, but a breadth of various bioscience datasets, chiseling-in a more general-purpose model of bioscience on which a wide variety of queries could be run?
Note: this would still be a “cutting-edge biotech simulator” kind of capability, not an “innovative AI agent” kind of capability.
Ignore that whole thing. On further research, I’m pretty sure all of this was substance-less trolling and no actual OpenAI researchers were involved. (At most, OpenAI’s psy-ops team.)
For those also curious, Yamanaka factors are specific genes that turn specialized cells (e.g. skin, hair) into induced pluripotent stem cells (iPSCs) which can turn into any other type of cell.
This is a big deal because you can generate lots of stem cells to make full organs[1] or reverse aging (maybe? they say you just turn the cell back younger, not all the way to stem cells).
You can also do better disease modeling/drug testing: if you get skin cells from someone w/ a genetic kidney disease, you can turn those cells into the iPSCs, then into kidney cells which will exhibit the same kidney disease because it’s genetic. You can then better understand how the [kidney disease] develops and how various drugs affect it.
So, it’s good to have ways to produce lots of these iPSCs. According to the article, SOTA was <1% of cells converted into iPSCs, whereas the GPT suggestions caused a 50x improvement, to 33% of cells converted. That’s quite huge! So hopefully this result gets verified. I would guess this is true and still a big deal, but concurrent work got similar results.
Too bad about the tumors. Turns out iPSCs are so good at turning into other cells that they can turn into infinite cells (i.e. cancer). iPSCs were used to fix spinal cord injuries (in mice), which looked successful for 112 days, but then a follow-up study said [a different set of mice also w/ spinal iPSCs] resulted in tumors.
My current understanding is that this is caused by the method of delivering these genes (i.e. the Yamanaka factors) through a retrovirus, which
is a virus that uses RNA as its genomic material. Upon infection with a retrovirus, a cell converts the retroviral RNA into DNA, which in turn is inserted into the DNA of the host cell.
which I’d guess is the method Retro Biosciences uses.
Induced pluripotent stem cells were first generated by Shinya Yamanaka and Kazutoshi Takahashi at Kyoto University, Japan, in 2006.[1] They hypothesized that genes important to embryonic stem cell (ESC) function might be able to induce an embryonic state in adult cells. They chose twenty-four genes previously identified as important in ESCs and used retroviruses to deliver these genes to mouse fibroblasts. The fibroblasts were engineered so that any cells reactivating the ESC-specific gene, Fbx15, could be isolated using antibiotic selection.
Upon delivery of all twenty-four factors, ESC-like colonies emerged that reactivated the Fbx15 reporter and could propagate indefinitely. To identify the genes necessary for reprogramming, the researchers removed one factor at a time from the pool of twenty-four. By this process, they identified four factors, Oct4, Sox2, cMyc, and Klf4, which were each necessary and together sufficient to generate ESC-like colonies under selection for reactivation of Fbx15.
Sox2-17 enhanced episomal OKS MEF reprogramming by a striking 150 times, giving rise to high-quality miPSCs that could generate all-iPSC mice with up to 77% efficiency
For human cells, up to 9% (if I’m understanding this part correctly).
SOX2-17 gave rise to 56 times more TRA1-60+ colonies compared with WT-SOX2: 8.9% versus 0.16% overall reprogramming efficiency.
So it seems like you can get wildly different results depending on the setting (mice, humans, bovine, etc.), and I don’t know what the Retro folks were doing, but it does make their result less impressive.
Thinking through it more, Sox2-17 (they changed 17 amino acids from Sox2 gene) was your linked paper’s result, and Retro’s was a modified version of factors Sox AND KLF. Would be cool if these two results are complementary.
Here’s an argument for a capabilities plateau at the level of GPT-4 that I haven’t seen discussed before. I’m interested in any holes anyone can spot in it.
Consider the following chain of logic:
The pretraining scaling laws only say that, even for a fixed training method, increasing the model’s size and the amount of data you train on increases the model’s capabilities – as measured by loss, performance on benchmarks, and the intuitive sense of how generally smart a model is.
Nothing says that increasing a model’s parameter-count and the amount of compute spent on training it is the only way to increase its capabilities. If you have two training methods A and B, it’s possible that the B-trained X-sized model matches the performance of the A-trained 10X-sized model.
Empirical evidence: Sonnet 3.5 (at least the not-new one), Qwen-2.5-72B, and Llama 3-70B all have 70-ish billion parameters, i. e., fewer than GPT-3’s 175 billion. Yet, their performance is at least on par with that of GPT-4 circa early 2023.
Therefore, it is possible to “jump up” a tier of capabilities, by any reasonable metric, using a fixed model size but improving the training methods.
The latest set of GPT-4-sized models (Opus 3.5, Orion, Gemini 1.5 Pro?) are presumably trained using the current-best methods. That is: they should be expected to be at the effective capability level of a model that is 10X GPT-4’s size yet trained using early-2023 methods. Call that level “GPT-5”.
Therefore, the jump from GPT-4 to GPT-5, holding the training method fixed at early 2023, is the same as the jump from the early GPT-4 to the current (non-reasoning) SotA, i. e., to Sonnet 3.5.1.
(Nevermind that Sonnet 3.5.1 is likely GPT-3-sized too, it still beats the current-best GPT-4-sized models as well. I guess it straight up punches up two tiers?)
The jump from GPT-3 to GPT-4 is dramatically bigger than the jump from early-2023 SotA to late-2024 SotA. I. e.: 4-to-5 is less than 3-to-4.
Consider a model 10X bigger than GPT-4 but trained using the current-best training methods; an effective GPT-6. We should expect this jump to be at most as significant as the capability jump from early-2023 to late-2024. By point (7) above (the 4-to-5 jump being smaller than the 3-to-4 jump), it’s likely even less significant than that.
Empirical evidence: Despite the proliferation of all sorts of better training methods, including e. g. the suite of tricks that allowed DeepSeek to train a near-SotA-level model for pocket change, none of the known non-reasoning models have managed to escape the neighbourhood of GPT-4, and none of the known models (including the reasoning models) have escaped that neighbourhood in domains without easy verification.
Intuitively, if we now know how to reach levels above early-2023!GPT-4 using 20x fewer resources, we should be able to shoot well past early-2023!GPT-4 using 1x the resources – and some of the latest training runs have to have spent 10x the resources that went into the original GPT-4.
E. g., OpenAI’s rumored Orion, which was presumably both trained using more compute than GPT-4, and via better methods than were employed for the original GPT-4, and which still reportedly underperformed.
Similar for Opus 3.5: even if it didn’t “fail” as such, the fact that they choose to keep it in-house instead of giving public access for e. g. $200/month suggests it’s not that much better than Sonnet 3.5.1.
Yet, we have still not left the rough capability neighbourhood of early-2023!GPT-4. (Certainly no jumps similar to the one from GPT-3 to GPT-4.)
Therefore, all known avenues of capability progress aside from the o-series have plateaued. You can make the current SotA more efficient in all kinds of ways, but you can’t advance the frontier.
Are there issues with this logic?
The main potential one is if all models that “punch up a tier” are directly trained on the outputs of the models of the higher tier. In this case, to have a GPT-5-capabilities model of GPT-4’s size, it had to have been trained on the outputs of a GPT-5-sized model, which do not exist yet. “The current-best training methods”, then, do not yet scale to GPT-4-sized models, because they rely on access to a “dumbly trained” GPT-5-sized model. Therefore, although the current-best GPT-3-sized models can be considered at or above the level of early-2023 GPT-4, the current-best GPT-4-sized models cannot be considered to be at the level of a GPT-5 trained using early-2023 methods.
Note, however: this would then imply that all excitement about (currently known) algorithmic improvements is hot air. If the capability frontier cannot be pushed by improving the training methods in any (known) way – if training a GPT-4-sized model on well-structured data, and on reasoning traces from a reasoning model, et cetera, isn’t enough to push it to GPT-5’s level – then pretraining transformers is the only known game in town, as far as general-purpose capability-improvement goes. Synthetic data and other tricks can allow you to reach the frontier in all manner of more efficient ways, but not move past it.
Basically, it seems to me that one of those must be true:
Capabilities can be advanced by improving training methods (by e. g. using synthetic data).
… in which case we should expect current models to be at the level of GPT-5 or above. And yet they are not much more impressive than GPT-4, which means further scaling will be a disappointment.
Capabilities cannot be advanced by improving training methods.
… in which case scaling pretraining is still the only known method of general capability advancement.
… and if the Orion rumors are true, it seems that even a straightforward scale-up to GPT-4.5ish’s level doesn’t yield much (or: yields less than was expected).
(This still leaves one potential avenue of general capabilities progress: figuring out how to generalize o-series’ trick to domains without easy automatic verification. But if the above reasoning has no major holes, that’s currently the only known promising avenue.)
Given some amount of compute, a compute optimal model tries to get the best perplexity out of it when training on a given dataset, by choosing model size, amount of data, and architecture. An algorithmic improvement in pretraining enables getting the same perplexity by training on data from the same dataset with less compute, achieving better compute efficiency (measured as its compute multiplier).
Many models aren’t trained compute optimally; they are instead overtrained (the model is smaller, trained on more data). This looks impressive, since a smaller model is now much better, but it is not an improvement in compute efficiency, and doesn’t in any way indicate that it became possible to train a better compute optimal model with a given amount of compute. The data and post-training also recently got better, which creates the illusion of algorithmic progress in pretraining, but their effect is bounded (as long as RL doesn’t take off) and doesn’t keep improving according to pretraining scaling laws once much more data becomes necessary. There is enough data until 2026-2028, but not enough good data.
I don’t think the cumulative compute multiplier since GPT-4 is that high, I’m guessing 3x, except perhaps for DeepSeek-V3, which wasn’t trained compute optimally and didn’t use a lot of compute, and so it remains unknown what happens if its recipe is used compute optimally with more compute.
The amount of raw compute since original GPT-4 only increased maybe 5x, from 2e25 FLOPs to about 1e26 FLOPs, and it’s unclear if there were any compute optimal models trained on notably more compute than original GPT-4. We know Llama-3-405B is compute optimal, but it’s not MoE, so has lower compute efficiency and only used 4e25 FLOPs. Probably Claude 3 Opus is compute optimal, but unclear if it used a lot of compute compared to original GPT-4.
If there was a 6e25 FLOPs compute optimal model with a 3x compute multiplier over GPT-4, it’s therefore only trained for 9x the effective compute of original GPT-4. The 100K H100s clusters have likely recently trained a new generation of base models for about 3e26 FLOPs, possibly a 45x improvement in effective compute over original GPT-4, but there’s no word on whether any of them were compute optimal (except perhaps Claude 3.5 Opus), and it’s unclear if there is an actual 3x compute multiplier over GPT-4 that made it all the way into pretraining of frontier models. Also, while waiting for NVL72 GB200s (which are much better at inference for larger models), non-Google labs might want to delay deploying compute optimal models in the 1e26-5e26 FLOPs range until later in 2025.
Comparing GPT-3 to GPT-4 gives very little signal on how much of the improvement is from compute, and so how much should be expected beyond GPT-4 from more compute. While modern models are making good use of not being compute optimal by using fewer active parameters, GPT-3 was instead undertrained, being both larger and less performant than the hypothetical compute optimal alternative. It also wasn’t a MoE model. And most of the bounded low hanging fruit that is not about pretraining efficiency hasn’t been applied to it.
So the currently deployed models don’t demonstrate the results of the experiment in training a much more compute efficient model on much more compute. And the previous leaps in capability are in large part explained by things that are not improvement in compute efficiency or increase in amount of compute. But in 2026-2027, 1 GW training systems will train models with 250x compute of original GPT-4. And probably in 2028-2029, 5 GW training systems will train models with 2500x raw compute of original GPT-4. With a compute multiplier of 5x-10x from algorithmic improvements plausible by that time, we get 10,000x-25,000x original GPT-4 in effective compute. This is enough of a leap that lack of significant improvement from only 9x of currently deployed models (or 20x-45x of non-deployed newer models, rumored to be underwhelming) is not a strong indication of what happens by 2028-2029 (from scaling of pretraining alone).
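For readers who want to check the arithmetic, here is the back-of-the-envelope version of the effective-compute figures above (all inputs are the estimates from this comment, not independently sourced numbers):

```python
gpt4_flops = 2e25                   # original GPT-4 raw-compute estimate from above

# A 6e25-FLOPs compute optimal model with a 3x compute multiplier over GPT-4:
print(6e25 / gpt4_flops * 3)        # -> 9.0x effective compute

# 100K-H100 clusters at ~3e26 FLOPs, again with a 3x multiplier:
print(3e26 / gpt4_flops * 3)        # -> 45.0x effective compute

# 2028-2029: ~2500x raw compute times a 5x-10x algorithmic multiplier:
print(2500 * 5, 2500 * 10)          # -> 12500 25000, i.e. roughly the
                                    #    10,000x-25,000x range quoted above
```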
I don’t think the cumulative compute multiplier since GPT-4 is that high, I’m guessing 3x, except perhaps for DeepSeek-V3, which wasn’t trained compute optimally and didn’t use a lot of compute, and so it remains unknown what happens if its recipe is used compute optimally with more compute.
How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1, despite using training methods that clearly have wildly different returns on compute investment?
Like, GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million. Yet, they’re roughly at the same level. That should be very surprising. Investing a very different amount of money into V3’s training should’ve resulted in it either massively underperforming GPT-4, or massively overperforming, not landing precisely in its neighbourhood!
Consider this graph. If we find some training method A, and discover that investing $100 million in it lands us at just above “dumb human”, and then find some other method B with a very different ROI, and invest $5.5 million in it, the last thing we should expect is to again land near “dumb human”.
Or consider this trivial toy model: You have two linear functions, f(x) = Ax and g(x) = Bx, where x is the compute invested, output is the intelligence of the model, and f and g are different training methods. You pick some x effectively at random (whatever amount of money you happened to have lying around), plug it into f, and get, say, 120. Then you pick a different random value of x, plug it into g, and get… 120 again. Despite the fact that the multipliers A and B are likely very different, and you used very different x-values as well. How come?
The explanations that come to mind are:
It actually is just that much of a freaky coincidence.
DeepSeek have a superintelligent GPT-6 equivalent that they trained for $10 million in their basement, and V3/r1 are just flexes that they specifically engineered to match GPT-4-ish level.
DeepSeek directly trained on GPT-4 outputs, effectively just distilling GPT-4 into their model, hence the anchoring.
DeepSeek kept investing and tinkering until getting to GPT-4ish level, and then stopped immediately after attaining it.
GPT-4ish neighbourhood is where LLM pretraining plateaus, which is why this capability level acts as a sort of “attractor” into which all training runs, no matter how different, fall.
How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1
Selection effect. If DeepSeek-V2.5 was this good, we would be talking about it instead.
GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million
Original GPT-4 is 2e25 FLOPs and compute optimal, V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT-4 was trained in 2022 on A100s and needed a lot of them, while in 2024 it could be trained on 8K H100s in BF16. DeepSeek-V3 is trained in FP8, doubling the FLOP/s, so the FLOPs of original GPT-4 could be produced in FP8 by mere 4K H100s. DeepSeek-V3 was trained on 2K H800s, whose performance is about that of 1.5K H100s. So the cost only has to differ by about 3x, not 20x, when comparing a compute optimal variant of DeepSeek-V3 with original GPT-4, using the same hardware and training with the same floating point precision.
The relevant comparison is with GPT-4o though, not original GPT-4. Since GPT-4o was trained in late 2023 or early 2024, there were 30K H100s clusters around, which makes 8e25 FLOPs of raw compute plausible (assuming it’s in BF16). It might be overtrained, so make that 4e25 FLOPs for a compute optimal model with the same architecture. Thus when comparing architectures alone, GPT-4o probably uses about 15x more compute than DeepSeek-V3.
toy model … f(x) = Ax and g(x) = Bx, where x is the compute invested
Returns on compute are logarithmic though, advantage of a $150 billion training system over a $150 million one is merely twice that of $150 billion over $5 billion or $5 billion over $150 million. Restrictions on access to compute can only be overcome with 30x compute multipliers, and at least DeepSeek-V3 is going to be reproduced using the big compute of US training systems shortly, so that advantage is already gone.
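A quick numerical check of that “merely twice” claim, under the stated assumption that returns scale with the log of compute spend:

```python
from math import log10

# If capability gains scale with log(compute), the "advantage" of one budget
# over another is proportional to the log of their ratio.
adv_150b_over_150m = log10(150e9 / 150e6)   # = 3.0
adv_150b_over_5b   = log10(150e9 / 5e9)     # ~ 1.48
adv_5b_over_150m   = log10(5e9 / 150e6)     # ~ 1.52

print(adv_150b_over_150m, adv_150b_over_5b, adv_5b_over_150m)
# 3.0 is roughly twice either ~1.5 step, matching the claim above.
```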
GPT-3 was instead undertrained, being both larger and less performant than the hypothetical compute optimal alternative
You’re more fluent in the scaling laws than me: is there an easy way to roughly estimate how much compute would’ve been needed to train a model as capable as GPT-3 if it were done Chinchilla-optimally + with MoEs? That is: what’s the actual effective “scale” of GPT-3?
(Training GPT-3 reportedly took 3e23 FLOPs, and GPT-4 2e25 FLOPs. Naively, the scale-up factor is 67x. But if GPT-3’s level is attainable using less compute, the effective scale-up is bigger. I’m wondering how much bigger.)
IsoFLOP curves for dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1).
GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla’s compute optimal 20 tokens/parameter is approximately correct for GPT-3, it’s 10x undertrained. Interpolating from the above 30x overtraining example, a compute optimal model needs about 1.5e23 FLOPs to get the same perplexity.
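Spelling out the arithmetic behind that paragraph, using the standard ≈6·N·D approximation for dense-transformer training FLOPs (the approximation is my addition, not something stated above):

```python
# GPT-3: 175B parameters, 300B training tokens (the Table D.1 figures quoted above).
N, D = 175e9, 300e9
flops = 6 * N * D                      # ~3.15e23, matching the ~3e23 FLOPs figure
tokens_per_param = D / N               # ~1.7 tokens per parameter
undertraining = 20 / tokens_per_param  # ~12x short of Chinchilla's ~20 tokens/param

print(f"{flops:.2e} FLOPs, {tokens_per_param:.1f} tokens/param, "
      f"~{undertraining:.0f}x undertrained")
# -> 3.15e+23 FLOPs, 1.7 tokens/param, ~12x undertrained (i.e. the "10x" above, roughly)
```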
(The effect from undertraining of GPT-3 turns out to be quite small, reducing effective compute by only 2x. Probably wasn’t worth mentioning compared to everything else about it that’s different from GPT-4.)
in retrospect, we know from chinchilla that gpt3 allocated its compute too much to parameters as opposed to training tokens. so it’s not surprising that models since then are smaller. model size is a less fundamental measure of model cost than pretraining compute. from here on i’m going to assume that whenever you say size you meant to say compute.
obviously it is possible to train better models using the same amount of compute. one way to see this is that it is definitely possible to train worse models with the same compute, and it is implausible that the current model production methodology is the optimal one.
it is unknown how much compute the latest models were trained with, and therefore what compute efficiency win they obtain over gpt4. it is unknown how much more effective compute gpt4 used than gpt3. we can’t really make strong assumptions using public information about what kinds of compute efficiency improvements have been discovered by various labs at different points in time. therefore, we can’t really make any strong conclusions about whether the current models are not that much better than gpt4 because of (a) a shortage of compute, (b) a shortage of compute efficiency improvements, or (c) a diminishing return of capability wrt effective compute.
One possible answer is that we are in what one might call an “unhobbling overhang.”
Aschenbrenner uses the term “unhobbling” for changes that make existing model capabilities possible (or easier) for users to reliably access in practice.
His presentation emphasizes the role of unhobbling as yet another factor growing the stock of (practically accessible) capabilities over time. IIUC, he argues that better/bigger pretraining would produce such growth (to some extent) even without more work on unhobbling, but in fact we’re also getting better at unhobbling over time, which leads to even more growth.
That is, Aschenbrenner treats pretraining improvements and unhobbling as economic substitutes: you can improve “practically accessible capabilities” to the same extent by doing more of either one even in the absence of the other, and if you do both at once that’s even better.
However, one could also view the pair more like economic complements. Under this view, when you pretrain a model at “the next tier up,” you also need to do novel unhobbling research to “bring out” the new capabilities unlocked at that tier. If you only scale up, while re-using the unhobbling tech of yesteryear, most of the new capabilities will be hidden/inaccessible, and this will look to you like diminishing downstream returns to pretraining investment, even though the model is really getting smarter under the hood.
This could be true if, for instance, the new capabilities are fundamentally different in some way that older unhobbling techniques were not planned to reckon with. Which seems plausible IMO: if all you have is GPT-2 (much less GPT-1, or char-rnn, or...), you’re not going to invest a lot of effort into letting the model “use a computer” or combine modalities or do long-form reasoning or even be an HHH chatbot, because the model is kind of obviously too dumb to do these things usefully no matter how much help you give it.
(Relatedly, one could argue that fundamentally better capabilities tend to go hand in hand with tasks that operate on a longer horizon and involve richer interaction with the real world, and that this almost inevitably causes the appearance of “diminishing returns” in the interval between creating a model smart enough to perform some newly long/rich task and the point where the model has actually been tuned and scaffolded to do the task. If your new model is finally smart enough to “use a computer” via a screenshot/mouseclick interface, it’s probably also great at short/narrow tasks like NLI or whatever, but the benchmarks for those tasks were already maxed out by the last generation so you’re not going to see a measurable jump in anything until you build out “computer use” as a new feature.)
This puts a different spin on the two concurrent observations that (a) “frontier companies report ‘diminishing returns’ from pretraining” and (b) “frontier labs are investing in stuff like o1 and computer use.”
Under the “unhobbling is a substitute” view, (b) likely reflects an attempt to find something new to “patch the hole” introduced by (a).
But under the “unhobbling is a complement” view, (a) is instead simply a reflection of the fact that (b) is currently a work in progress: unhobbling is the limiting bottleneck right now, not pretraining compute, and the frontier labs are intensively working on removing this bottleneck so that their latest pretrained models can really shine.
(On an anecdotal/vibes level, this also agrees with my own experience when interacting with frontier LLMs. When I can’t get something done, these days I usually feel like the limiting factor is not the model’s “intelligence” – at least not only that, and not that in an obvious or uncomplicated way – but rather that I am running up against the limitations of the HHH assistant paradigm; the model feels intuitively smarter in principle than “the character it’s playing” is allowed to be in practice. See my comments here and here.)
That seems maybe right, in that I don’t see holes in your logic on LLM progression to date, off the top of my head.
It also lines up with a speculation I’ve always had. In theory LLMs are predictors, but in practice, are they pretty much imitators? If you’re imitating human language, you’re capped at reproducing human verbal intelligence (other modalities are not reproducing human thought so not capped; but they don’t contribute as much in practice without imitating human thought).
I’ve always suspected LLMs will plateau. Unfortunately I see plenty of routes to improving using runtime compute/CoT and continuous learning. Those are central to human intelligence.
LLMs already have slightly-greater-than-human system 1 verbal intelligence, leaving some gaps where humans rely on other systems (e.g., visual imagination for tasks like tracking how many cars I have or tic-tac-toe). As we reproduce the systems that give humans system 2 abilities by skillfully iterating system 1, as o1 has started to do, they’ll be noticeably smarter than humans.
The difficulty of finding new routes forward in this scenario would produce a very slow takeoff. That might be a big benefit for alignment.
It also lines up with a speculation I’ve always had. In theory LLMs are predictors, but in practice, are they pretty much imitators?
Yep, I think that’s basically the case.
@nostalgebraist makes an excellent point that eliciting any latent superhuman capabilities which bigger models might have is an art of its own, and that “just train chatbots” doesn’t exactly cut it, for this task. Maybe that’s where some additional capabilities progress might still come from.
But the AI industry (both AGI labs and the menagerie of startups and open-source enthusiasts) has so far been either unwilling or unable to move past the chatbot paradigm.
(Also, I would tentatively guess that this type of progress is not existentially threatening. It’d yield us a bunch of nice software tools, not a lightcone-eating monstrosity.)
I agree that chatbot progress is probably not existentially threatening. But it’s all too short a leap to making chatbots power general agents. The labs have claimed to be willing and enthusiastic about moving to an agent paradigm. And I’m afraid that a proliferation of even weakly superhuman or even roughly parahuman agents could be existentially threatening.
I spell out my logic for how short the leap might be from current chatbots to takeover-capable AGI agents in my argument for short timelines being quite possible. I do think we’ve still got a good shot at aligning that type of LLM agent AGI, since it’s a nearly best-case scenario. RL, even in o1, is really mostly used for making it accurately follow instructions, which is at least roughly the ideal alignment goal of Corrigibility as Singular Target. Even if we lose faithful chain of thought and orgs don’t take alignment that seriously, I think those advantages of not really being a maximizer and having corrigibility might win out.
That, in combination with the slower takeoff, makes me tempted to believe it’s actually a good thing if we forge forward, even though I’m not at all confident that this will actually get us aligned AGI or good outcomes. I just don’t see a better realistic path.
Here’s an argument for a capabilities plateau at the level of GPT-4 that I haven’t seen discussed before. I’m interested in any holes anyone can spot in it.
One obvious hole would be that capabilities did not, in fact, plateau at the level of GPT-4.
I thought the argument was that progress has slowed down immensely. The softer form of this argument is that LLMs won’t plateau but progress will slow to such a crawl that other methods will surpass them. The arrival of o1 and o3 says this has already happened, at least in limited domains—and hybrid training methods and perhaps hybrid systems probably will proceed to surpass base LLMs in all domains.
There’s been incremental improvement and various quality-of-life features like more pleasant chatbot personas, tool use, multimodality, gradually better math/programming performance that make the models useful for gradually bigger demographics, et cetera.
But it’s all incremental, no jumps like 2-to-3 or 3-to-4.
I see, thanks. Just to make sure I’m understanding you correctly, are you excluding the reasoning models, or are you saying there was no jump from GPT-4 to o3? (At first I thought you were excluding them in this comment, until I noticed the “gradually better math/programming performance.”)
So, Project Stargate. Is it real, or is it another “Sam Altman wants $7 trillion”? Some points:
The USG invested nothing in it. Some news outlets are being misleading about this. Trump just stood next to them and looked pretty, maybe indicated he’d cut some red tape. It is not an “AI Manhattan Project”, at least, as of now.
Elon Musk claims that they don’t have the money, and that SoftBank (stated to have “financial responsibility” in the announcement) has less than $10 billion secured. If true, this doesn’t mean they can’t secure an order of magnitude more by tomorrow, but it does directly clash with the “deploying $100 billion immediately” statement.
But Sam Altman counters that Musk’s statement is “wrong”, as Musk “surely knows”.
I… don’t know which claim I distrust more. Hm, I’d say Altman feeling the need to “correct the narrative” here, instead of just ignoring Musk, seems like a sign of weakness? He doesn’t seem like the type to naturally get into petty squabbles like this, otherwise.
(And why, yes, this is what an interaction between two Serious People building world-changing, existentially dangerous megaprojects looks like. Apparently.)
Some people try to counter Musk’s claim by citing Satya Nadella’s statement that Satya’s “good for his $80 billion”. But that’s not referring to Stargate; that’s about Azure. Microsoft is not listed as investing in Stargate at all, only as a “technology partner”.
Here’s a brief analysis from the CIO of some investment firm. He thinks it’s plausible that the stated “initial group of investors” (SoftBank, OpenAI, Oracle, MGX) may invest fifty billion dollars into it over the next four years; not five hundred billion.
They don’t seem to have the raw cash for even $100 billion – and if SoftBank secured the missing funding from some other set of entities, why aren’t those entities listed as “initial equity funders” in the announcement?
Overall, I’m inclined to fall back on @Vladimir_Nesov’s analysis here. $30-50 billion this year seems plausible. But $500 billion is, for now, just Altman doing door-in-the-face as he had with his $7 trillion.
I haven’t looked very deeply into it, though. Additions/corrections welcome!
Overall, I think that the new Stargate numbers published may (call it 40%) be true, but I also think there is a decent chance this is new-administration Trump-esque propaganda/bluster (call it 45%), and little change from the prior expected path of datacenter investment (which I do believe is unintentional AINotKillEveryone-ism in the near future).
Overall, I think that the new Stargate numbers published may (call it 40%) be true, but I also think there is a decent chance this is new-administration Trump-esque propaganda/bluster (call it 45%),
I think it’s definitely bluster; the question is how much of a done deal it is to turn this bluster into at least $100 billion.
I don’t think this changes the prior expected path of datacenter investment at all. It’s precisely what the expected path was going to look like; the only change is how relatively high-profile/flashy this is being. (Like, if they invest $30-100 billion into the next generation of pretrained models in 2025-2026, and that generation fails, they ain’t seeing the remaining $400 billion no matter what they’re promising now. The next generation makes or breaks the investment into the subsequent one, just as was expected before this.)
He does very explicitly say “I am going to spend 80 billion dollars building Azure”. Which I think has nothing to do with Stargate.
(I think the overall vibe he gives off is “I don’t care about this Stargate thing, I’m going to scale my data centers and that’s that”. I don’t put much stock into this vibe, but it does fit with OpenAI/Microsoft being on the outs.)
Here’s something that confuses me about o1/o3. Why was the progress there so sluggish?
My current understanding is that they’re just LLMs trained with RL to solve math/programming tasks correctly, hooked up to some theorem-verifier and/or an array of task-specific unit tests to provide ground-truth reward signals. There are no sophisticated architectural tweaks, no runtime MCTS or A* search, nothing clever.
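For concreteness, “ground-truth reward signals” here just means something like: run the model’s proposed code against unit tests and reward it if they pass. A minimal sketch, with a made-up toy task and tests (not how any lab’s actual harness works):

```python
import subprocess
import sys
import tempfile

def unit_test_reward(candidate_solution: str) -> float:
    """Toy ground-truth reward: 1.0 if the model's code passes the tests, else 0.0.

    The task ("write add(a, b)") and the tests are stand-ins; a real setup
    would have a large problem pool, sandboxing, timeouts per problem, etc.
    """
    tests = [
        "assert add(2, 3) == 5",
        "assert add(-1, 1) == 0",
    ]
    harness = candidate_solution + "\n\n" + "\n".join(tests) + "\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(harness)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=5)
    return 1.0 if result.returncode == 0 else 0.0

# A correct and an incorrect "model output":
print(unit_test_reward("def add(a, b):\n    return a + b"))   # 1.0
print(unit_test_reward("def add(a, b):\n    return a - b"))   # 0.0
```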
Why was this not trained back in, like, 2022 or at least early 2023; tested on GPT-3/3.5 and then default-packaged into GPT-4 alongside RLHF? If OpenAI was too busy, why was this not done by any competitors, at decent scale? (I’m sure there are tons of research papers trying it at smaller scales.)
The idea is obvious; doubly obvious if you’ve already thought of RLHF; triply obvious after “let’s think step-by-step” went viral. In fact, I’m pretty sure I’ve seen “what if RL on CoTs?” discussed countless times in 2022-2023 (sometimes in horrified whispers regarding what the AGI labs might be getting up to).
The mangled hidden CoT and the associated greater inference-time cost are superfluous. DeepSeek r1/QwQ/Gemini Flash Thinking have perfectly legible CoTs which would be fine to present to customers directly; just let them pay on a per-token basis as normal.
Were there any clever tricks involved in the training? Gwern speculates about that here. But none of the follow-up reasoning models have an o1-style deranged CoT, so the more straightforward approaches probably Just Work.
Did nobody have the money to run the presumably compute-intensive RL-training stage back then? But DeepMind exists. Did nobody have the attention to spare, with OpenAI busy upscaling/commercializing and everyone else catching up? Again, DeepMind exists: my understanding is that they’re fairly parallelized and they try tons of weird experiments simultaneously. And even if not DeepMind, why have none of the numerous LLM startups (the likes of Inflection, Perplexity) tried it?
Am I missing something obvious, or are industry ML researchers surprisingly… slow to do things?
(My guess is that the obvious approach doesn’t in fact work and you need to make some weird unknown contrivances to make it work, but I don’t know the specifics.)
While I don’t have specifics either, my impression of ML research is that it’s a lot of work to get a novel idea working, even if the idea is simple. If you’re trying to implement your own idea, you’ll be banging your head against the wall for weeks or months wondering why your loss is worse than the baseline. If you try to replicate a promising-sounding paper, you’ll bang your head against the wall as your loss is worse than the baseline. It’s hard to tell if you made a subtle error in your implementation or if the idea simply doesn’t work for reasons you don’t understand, because ML has little in the way of theoretical backing. Even when it works, it won’t be optimized, so you need engineers to improve the performance and make it stable when training at scale. If you want to ship a working product quickly, then it’s best to choose what’s tried and true.
My latest experience with trying to use Opus 4.1 to vibe-code[1]:
Context: I wanted a few minor helper apps[2] that I didn’t think worth the time to code manually, and which I used as an opportunity to test LLMs’ current skill level (not expecting it to actually save much development time; yay multi-purpose projects).
Very mixed results. My experience is that you need to give it a very short leash. As long as you do the job of factorizing the problem, and then feed it small steps to execute one by one, it tends to zero-shot them. You do need to manually check that the implementational choices are what you intended, and you’ll want to start by building some (barebones) UI to make this time-efficient, but this mostly works. It’s also pretty time-consuming on your part, both the upfront cost of factorizing and the directing afterwards.
If you instead give it a long leash, it produces overly complicated messes (like, 3x the lines of code needed?) which don’t initially work, and which it iteratively bug-fixes into the ground. The code is then also liable to contain foundation-level implementation choices that go directly against the spec; if that is pointed out, the LLM wrecks everything attempting a refactor.
In general, if the LLM screws up, telling it to fix things seems to be a bad idea, with it entering a downward spiral. You’re better off reverting the codebase to the pre-screwup state, figuring out how to modify your latest instruction to get the LLM to avoid the previous error, then hope it won’t find a new way to screw up. That “figure out how to modify your instruction” step requires a fair amount of mental effort, not to mention having to actually type it out.[3]
I shudder to think what large vibe-coded codebases[4] look like internally. Ones where you can’t, like, get a firehose of data on how their components work, where your vision is limited to unit tests, and where several such components are interlocked.
Some of this is doubtlessly a skill issue on my part: I don’t do green-field development much and my LLM-wrangling expertise is unrefined. But I don’t buy that you can structure arbitrary programs and workflows such that LLMs zero-shot more than 1-3 features at a time (roughly the equivalents of 5-30 minutes of a human developer’s work) without close supervision. I guess it might be fine if your requirements are pretty loose and you don’t mind the app breaking or working incorrectly once in a while. But not if you want anything secure, precise, and reliable.
There’s also the issue with LLMs breaking on big codebases, but this presumably can be fixed by structuring codebases accordingly, around LLMs’ flaws. Like, design them to make their components interact through sparse interfaces with strong restrictions on input/output type signatures, such that the implementation details are completely abstracted out and the LLM only needs to deal with external functionality. (But LLMs sure don’t know, out-of-the-box, that they need to structure their code this way.)
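To illustrate what I mean by a sparse, strongly-typed interface (a toy sketch of my own, with hypothetical names; not something the models produce unprompted):

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass(frozen=True)
class NoteQuery:
    text: str
    max_results: int = 10

@dataclass(frozen=True)
class NoteHit:
    note_id: str
    snippet: str

class NoteSearch(Protocol):
    """The entire component boundary: one method, rigid input/output types.
    An LLM working on the UI side only ever needs this signature; the
    storage/indexing internals live behind it as a separate task."""
    def search(self, query: NoteQuery) -> list[NoteHit]: ...

def render_results(searcher: NoteSearch, raw_text: str) -> str:
    """UI-side code that can be written (or regenerated) in isolation."""
    hits = searcher.search(NoteQuery(text=raw_text))
    return "\n".join(f"- {hit.snippet} ({hit.note_id})" for hit in hits)
```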
Notably, it’s all very in line with METR’s evaluations, both the agency horizons (the 80%-success ones) and the “LLM code is not mergable” result. I’m very curious if LLMs will indeed be able to reliably zero-shot hour-sized chunks of coding a year from now.
Edit: Oh yeah, and this all eventually did result in workable apps which I’m using now.
Through Claude Code; initially used Opus 4.1 for everything, then got annoyed at the amount of money being spent and switched to the “Opus for planning, Sonnet for execution” mode.
Specifically, a notetaking app (the simplest/minimum-viable version), a timer, and a task database, with my having very precise, specific, and idiosyncratic opinions regarding how they ought to work.
To be clear, by a “screwup” I don’t mean writing code that contains a bug – LLMs can do bug-fixing as well as they do everything else. Writing code containing a bug isn’t quite “screwing up”, it’s a normal part of development.
I’m talking about more “workflow-breaking” screwups, where the LLM makes some agency-level error (deletes code it shouldn’t have, pastes the code into the wrong place, misreads the spec...), gets flustered, and then tries to “fix” things.
I. e., where the developer doesn’t manually read and comprehensively fix LLM-generated code, which would often take more time than writing it manually.
I use Claude for some hobby coding at home. I find it very useful in the following situations:
I am too lazy to study the API, just give me the code that loads the image from the file or whatever
how would you do this? suggest a few alternatives (often its preferred one is not the one I choose)
tell me more about this topic
So it saves me a lot of time reading the documentation, and sometimes it suggests a cool trick I probably wouldn’t find otherwise. (For example, in Python list comprehension, you can create a “local variable” by iterating through a list containing one item.)
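(Concretely, the trick looks like this; in Python 3.8+ the walrus operator covers much of the same ground:)

```python
data = [[1, 2, 3], [4, 5]]

# "Local variable" inside a list comprehension: iterate over a one-element list
# so that `total` is bound once per outer item and can be reused.
results = [(total, total ** 2) for xs in data for total in [sum(xs)]]

print(results)  # [(6, 36), (9, 81)]
```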
On the other hand, if I let Claude write code, it is often much longer than what I get when I think about the problem myself and only let Claude give me short pieces of code.
Maybe I am using it the wrong way. But that too is a kind of risk: having Claude conveniently write the code for me encourages me to produce a lot of code because it is cheap. But as Dijkstra famously said, it is not “lines produced” but “lines spent”.
I get the feeling that it can remove blockers. Your effectiveness (however you want to define that) at programming is like Liebig’s barrel, or maybe a better example would be a road with a bunch of big rocks and rubble on it—you have to keep removing things to proceed. The more experienced you are, the better at removing obstacles, or even better, avoiding them. LLMs are good at giving you more options for how to approach problems. Often their suggestion is not the best, but it’s better than nothing and will suffice to move on.
They do like making complex stuff, though… I wonder if that’s because there’s a lot more bad code that looks good than just good code?
Yeah, they can be used as a way to get out of procrastination/breach ugh fields/move past the generalized writer’s block, by giving you the ability to quickly and non-effortfully hammer out a badly-done first draft of the thing.
They do like making complex stuff, though… I wonder if that’s because there’s a lot more bad code that looks good than just good code?
Wild guess: it’s because AGI labs (and Anthropic in particular) are currently trying to train them to be able to autonomously spin up large, complex codebases, so much of their post-training consists of training episodes where they’re required to do that. So they end up biased towards architectural choices suitable for complex applications instead of small ones.
Like, given free rein, they seem eager to build broad foundations on which lots of other features could later be implemented (n = 2), even if you give them the full spec and it only involves 1-3 features.
This would actually be kind of good – building a good foundation which won’t need refactoring if you’ll end up fancying a new feature! – except they’re, uh, not actually good at building those foundations.
Which, if my guess is correct, is because their post-training always pushes them to the edge of, and then slightly past, their abilities. Like, suppose the post-training uses rejection sampling (training on episodes in which they succeeded), and it involves a curriculum spanning everything from small apps to “build an OS”. If so, much like in METR’s studies, there’ll be some point of program complexity where they only succeed 50% of the time (or less). But they’re still going to be trained on those 50% successful trajectories. So they’ll end up “overambitious”/”overconfident”, trying to do things they’re not quite reliable at yet.
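A cartoon of the rejection-sampling dynamic I’m gesturing at (purely illustrative; the success model and curriculum are made up):

```python
import random

def attempt(model, difficulty: float) -> tuple[str, bool]:
    """Placeholder rollout: success probability falls as task difficulty rises."""
    succeeded = random.random() < max(0.0, 1.0 - difficulty)
    return f"<trajectory at difficulty {difficulty:.1f}>", succeeded

def build_sft_dataset(model, curriculum: list[float], samples_per_task: int = 16) -> list[str]:
    """Keep only successful trajectories. Tasks the model solves only ~50% of
    the time still contribute plenty of kept examples, so the tuned model ends
    up imitating behavior from the edge of its own reliability."""
    kept = []
    for difficulty in curriculum:
        for _ in range(samples_per_task):
            trajectory, ok = attempt(model, difficulty)
            if ok:
                kept.append(trajectory)
    return kept

print(len(build_sft_dataset(model=None, curriculum=[0.1, 0.5, 0.9])))
```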
Maybe the sufficient reason is that LLMs learn both on the bad code and the good code. And on the old code, which was written before some powerful features were introduced to the language or the libraries, so it doesn’t use them. Also, they learn on automatically generated code, if someone commits that to the repository.
Yup, it certainly seems useful if you (a) want to cobble something together using tools you don’t know (and which is either very simple, or easy to “visually” bug-test, or which you don’t need to work extremely reliably), (b) want to learn some tools you don’t know, so you use the LLM as a chat-with-docs interface.
I did something similar a couple of months ago, asking it to make me a memory system (Opus 4 and o1-pro for design, mainly sonnet 4 for implementation). Initially it was massively overengineered, with a bunch of unneeded components. This was with Cursor, which changes things a bit—Claude Code tends to be much better at planning things first.
The results were very similar—handholding works, looking away results in big messes, revert rather than fix (unless you know how to fix it), and eventually I got something that does what I want, albeit more complex than I would have made by myself.
AIME 2025 part I was conducted yesterday, and the scores of some language models are available here: https://matharena.ai thanks to @mbalunovic, @ni_jovanovic et al.
I have to say I was impressed, as I predicted the smaller distilled models would crash and burn, but they actually scored at a reasonable 25-50%.
That was surprising to me! Since these are new problems, not seen during training, right? I expected smaller models to barely score above 0%. It’s really hard to believe that a 1.5B model can solve pre-olympiad math problems when it can’t multiply 3-digit numbers. I was wrong, I guess.
I then used OpenAI’s Deep Research to see if similar problems to those in AIME 2025 exist on the internet. And guess what? An identical problem to Q1 of AIME 2025 exists on Quora:
I haven’t checked beyond that because the freaking p-value is too low already. Problems near identical to the test set can be found online.
So, what—if anything—does this imply for Math benchmarks? And what does it imply for all the sudden hill climbing due to RL?
I’m not certain, and there is a reasonable argument that even if something in the train-set contains near-identical but not exact copies of test data, it’s still generalization. I am sympathetic to that. But, I also wouldn’t rule out that GRPO is amazing at sharpening memories along with math skills.
At the very least, the above show that data decontamination is hard.
Never ever underestimate the amount of stuff you can find online. Practically everything exists online.
I think one of the other problems with benchmarks is that they necessarily select for formulaic/uninteresting problems that we fundamentally know how to solve. If a mathematician figured out something genuinely novel and important, it wouldn’t go into a benchmark (even if it were initially intended for a benchmark), it’d go into a math research paper. Same for programmers figuring out some usefully novel architecture/algorithmic improvement. Graduate students don’t have a bird’s-eye-view on the entirety of human knowledge, so they have to actually do the work, but the LLM just modifies the near-perfect-fit answer from an obscure publication/math.stackexchange thread or something.
I expect the same is the case with programming benchmarks, science-quiz benchmarks, et cetera.
Now, this doesn’t necessarily mean that the AI progress has been largely illusory and that we’re way further from AGI than the AI hype men would have you believe (although I am very tempted to make this very pleasant claim, and I do place plenty of probability mass on it).
But if you’re scoring AIs by the problems they succeed at, rather than the problems they fail at, you’re likely massively overestimating their actual problem-solving capabilities.
To be fair, non-reasoning models do much worse on these questions, even when it’s likely that the same training data has already been given to GPT-4.
Now, I could believe that RL is more or less working as an elicitation method, which plausibly explains the results, though still it’s interesting to see why they get much better scores, even with very similar training data.
I do think RL works “as intended” to some extent, teaching models some actual reasoning skills, much like SSL works “as intended” to some extent, chiseling-in some generalizable knowledge. The question is to what extent it’s one or the other.
Sooo, apparently OpenAI’s mysterious breakthrough technique for generalizing RL to hard-to-verify domains that scored them IMO gold is just… “use the LLM as a judge”? Sources: the main one is paywalled, but this seems to capture the main data, and you can also search for various crumbs here and here.
The technical details of how exactly the universal verifier works aren’t yet clear. Essentially, it involves tasking an LLM with the job of checking and grading another model’s answers by using various sources to research them.
My understanding is that they approximate an oracle verifier by an LLM with more compute and access to more information and tools, then train the model to be accurate by this approximate-oracle’s lights.
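If that reading is right, the reward signal would look something like the sketch below (my interpretation only; the grader interface, its tool access, and the rubric are all assumptions):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GradeRequest:
    question: str
    candidate_answer: str

# The "universal verifier" role: a grader LLM given more compute, plus access to
# search/tools/reference material, returns a scalar score used as the RL reward.
Grader = Callable[[GradeRequest], float]

def llm_judge_reward(grade: Grader, question: str, candidate_answer: str) -> float:
    """Reward for hard-to-verify domains: no unit test or theorem checker,
    just a (hopefully well-calibrated) judge model's score in [0, 1]."""
    return grade(GradeRequest(question=question, candidate_answer=candidate_answer))

# Stub grader standing in for the hypothetical tool-using judge model.
def dummy_grader(req: GradeRequest) -> float:
    return 1.0 if "scattering" in req.candidate_answer.lower() else 0.3

print(llm_judge_reward(dummy_grader, "Why is the sky blue?", "Rayleigh scattering of sunlight."))
```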
Now, it’s possible that the journalists are completely misinterpreting the thing they’re reporting on, or that it’s all some galaxy-brained OpenAI op to mislead the competition. It’s also possible that there’s some incredibly clever trick for making it work much better than how it sounds like it’d work.
But if that’s indeed the accurate description of the underlying reality, that’s… kind of underwhelming. I’m curious how far this can scale, but I’m not feeling very threatened by it.
(Haven’t seen this discussed on LW, kudos to @lwreader132 for bringing it to my attention.)
One point of information against the “journalists are completely misinterpreting the thing they’re reporting on” view is that one of the co-authors is Rocket Drew, who previously worked as a Research Manager at MATS.
But I’ll definitely be interested to follow this space more.
Hard to tell from the sources, but it sounds almost like prover-estimator debate. The estimator is assigning a number to how likely it is that a subclaim for a proof is correct, and this approach might also work for less verifiable domains since a human oracle is used at the last round of the debate. The main problem seems to be that it may not scale if it requires human feedback.
Edit: I’ve played with the numbers a bit more, and on reflection, I’m inclined to partially unroll this update. o3 doesn’t break the trendline as much as I’d thought, and in fact, it’s basically on-trend if we remove the GPT-2 and GPT-3 data-points (which I consider particularly dubious).
Regarding METR’s agency-horizon benchmark:
I still don’t like anchoring stuff to calendar dates, and I think the o3/o4-mini datapoints perfectly show why.
It would be one thing if they did fit into the pattern. If, by some divine will controlling the course of our world’s history, OpenAI’s semi-arbitrary decision about when to allow METR’s researchers to benchmark o3 just so happened to coincide with the 2x/7-month model. But it didn’t: o3 massively overshot that model.[1]
Imagine a counterfactual in which METR’s agency-horizon model existed back in December, and OpenAI invited them for safety testing/benchmarking then, four months sooner. How different would the inferred agency-horizon scaling laws have been, how much faster the extrapolated progress? Let’s run it:
o1 was announced September 12th, o3 was announced December 19th, 98 days apart.
o1 scored at ~40 minutes, o3 at ~1.5 hours, a 2.25x’ing.
There’s ~2.14 intervals of 98 days in 7 months.
Implied scaling factor: 2.25^2.14 ≈ 5.67 each 7 months.
And I don’t see any reasons to believe it was overdetermined that this counterfactual wouldn’t have actualized. METR could have made the benchmark a few months earlier, OpenAI could have been more open about benchmarking o3.
And if we lived in that possible world… It’s now been 135 days since December 19th, i. e., ~1.38 intervals of 98 days. Extrapolating, we should expect the best publicly known model would have the time horizon of 1.5 hours × 2.25^1.38 ≈ 4.59 hours. I don’t think we have any hint that those exist.
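(The arithmetic behind that counterfactual, for anyone who wants to check it:)

```python
# Counterfactual extrapolation from the o1 -> o3 jump alone, using the numbers above.
days_o1_to_o3 = 98            # Sep 12 (o1 announced) to Dec 19 (o3 announced)
growth = 1.5 / (40 / 60)      # ~40 min -> ~1.5 h horizon: a 2.25x jump
per_7_months = growth ** (7 * 30 / days_o1_to_o3)        # ~5.7x each 7 months
implied_today = 1.5 * growth ** (135 / days_o1_to_o3)    # ~4.6-hour horizons by now
print(round(per_7_months, 2), round(implied_today, 2))
```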
So: in that neighbouring world in which OpenAI let METR benchmark o3 sooner, we’re looking around and seeing that the progress is way behind the schedule.[2]
To me, this makes the whole model fall apart. I don’t see how it can follow from any mechanistic model of what’s actually happening, and, as per the o3/o4-mini data points, it doesn’t predict the empirical reality well either. Further, whether we believe that the progress is much faster vs. much slower than expected is entirely controlled by the arbitrary fact that METR didn’t get to benchmark o3 in December.
7 months contain 3.5 two-month intervals. This means that, if horizons extend as fast as they did between Sonnet 3.7 and o3, we should expect a 1.5^3.5 ≈ 4.13x’ing of agency horizons each 7 months.
Edit: Yes, counterfactual!METR wouldn’t have used just those two last data points, so the inferred multiplier would’ve been somewhat less than that. But I think it would’ve still been bigger than 2x/7-months, and the graph would’ve been offset to the left (the 1.5-hour performance achieved much earlier), so we’d still be overdue for ~2.5-hour AIs. Half-a-year behind, I think?
I agree that anchoring stuff to release dates isn’t perfect because the underlying variable of “how long does it take until a model is released” is variable, but I think this variability is sufficiently low that it doesn’t cause that much of an issue in practice. The trend is only going to be very solid over multiple model releases and it won’t reliably time things to within 6 months, but that seems fine to me.
I agree that if you add one outlier data point and then trend extrapolate between just the last two data points, you’ll be in trouble, but fortunately, you can just not do this and instead use more than 2 data points.
This also means that I think people shouldn’t update that much on the individual o3 data point in either direction. Let’s see where things go for the next few model releases.
That one seems to work more reliably, perhaps because it became the metric the industry aims for.
I agree that if you add one outlier data point and then trend extrapolate between just the last two data points, you’ll be in trouble
My issue here is that there wasn’t that much variance in the performance of all preceding models they benchmarked: from GPT-2 to Sonnet 3.7, they seem to almost perfectly fall on the straight line. Then, the very first advancement of the frontier after the trend-model is released is an outlier. That suggests an overfit model.
I do agree that it might just be a coincidental outlier and that we should wait and see whether the pattern recovers with subsequent model releases. But this is suspicious enough I feel compelled to make my prediction now.
The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them. If we use the latter dates, or the dates they were announced, I agree they would be more arbitrary.
Also, there is lots of noise in a time horizon measurement and it only displays any sort of pattern because we measured over many orders of magnitude and years. It’s not very meaningful to extrapolate from just 2 data points; there are many reasons one datapoint could randomly change by a couple of months or factor of 2 in time horizon.
Release schedules could be altered
A model could be overfit to our dataset
One model could play less well with our elicitation/scaffolding
One company could be barely at the frontier, and release a slightly-better model right before the leading company releases a much-better model.
All of these factors are averaged out if you look at more than 2 models. So I prefer to see each model as evidence of whether the trend is accelerating or slowing down over the last 1-2 years, rather than an individual model being very meaningful.
Have you considered removing GPT-2 and GPT-3 from your models, and seeing what happens? As I’d previously complained, I don’t think they can be part of any underlying pattern (due to the distribution shift in the AI industry after ChatGPT/GPT-3.5). And indeed: removing them seems to produce a much cleaner trend with a ~130-day doubling.
For what it’s worth, the model showcased in December (then called o3) seems to be completely different from the model that METR benchmarked (now called o3).
On the topic of o1’s recent release: wasn’t Claude Sonnet 3.5 (the subscription version at least, maybe not the API version) already using hidden CoT? That’s the impression I got from it, at least.
The responses don’t seem to be produced in constant time. It sometimes literally displays a “thinking deeply” message which accompanies an unusually delayed response. Other times, the following pattern would play out:
I pose it some analysis problem, with a yes/no answer.
It instantly produces a generic response like “let’s evaluate your arguments”.
There’s a 1-2 second delay.
Then it continues, producing a response that starts with “yes” or “no”, then outlines the reasoning justifying that yes/no.
That last point is particularly suspicious. As we all know, the power of “let’s think step by step” is that LLMs don’t commit to their knee-jerk instinctive responses, instead properly thinking through the problem using additional inference compute. Claude Sonnet 3.5 is the previous out-of-the-box SoTA model, competently designed and fine-tuned. So it’d be strange if it were trained to sabotage its own CoTs by “writing down the bottom line first” like this, instead of being taught not to commit to a yes/no before doing the reasoning.
On the other hand, from a user-experience perspective, the LLM immediately giving a yes/no answer followed by the reasoning is certainly more convenient.
From that, plus the minor-but-notable delay, I’d been assuming that it’s using some sort of hidden CoT/scratchpad, then summarizes its thoughts from it.
I haven’t seen people mention that, though. Is that not the case?
(I suppose it’s possible that these delays are on the server side, my requests getting queued up...)
(I’d also maybe noticed a capability gap between the subscription and the API versions of Sonnet 3.5, though I didn’t really investigate it and it may be due to the prompt.)
My model was that Claude Sonnet has tool access and sometimes does some tool-usage behind the scenes (which later gets revealed to the user), but that it wasn’t having a whole CoT behind the scenes, but I might be wrong.
Thanks, that seems relevant! Relatedly, the system prompt indeed explicitly instructs it to use “<antThinking>” tags when creating artefacts. It’d make sense if it’s also using these tags to hide parts of its CoT.
Here’s a potential interpretation of the market’s apparent strange reaction to DeepSeek-R1 (shorting Nvidia).
I don’t fully endorse this explanation, and the shorting may or may not have actually been due to Trump’s tariffs + insider trading, rather than DeepSeek-R1. But I see a world in which reacting this way to R1 arguably makes sense, and I don’t think it’s an altogether implausible world.
If I recall correctly, the amount of money globally spent on inference dwarfs the amount of money spent on training. Most economic entities are not AGI labs training new models, after all. So the impact of DeepSeek-R1 on the pretraining scaling laws is irrelevant: sure, it did not show that you don’t need bigger data centers to get better base models, but that’s not where most of the money was anyway.
And my understanding is that, on the inference-time scaling paradigm, there isn’t yet any proven method of transforming arbitrary quantities of compute into better performance:
Reasoning models are bottlenecked on the length of CoTs that they’ve been trained to productively make use of. They can’t fully utilize even their context windows; the RL pipelines just aren’t up to that task yet. And if that bottleneck were resolved, the context-window bottleneck would be next: my understanding is that infinite context/”long-term” memories haven’t been properly solved either, and it’s unknown how they’d interact with the RL stage (probably they’d interact okay, but maybe not).
o3 did manage to boost its ARC-AGI and (maybe?) FrontierMath performance by… generating a thousand guesses and then picking the most common one...? But who knows how that really worked, and how practically useful it is. (See e. g. this, although that paper examines a somewhat different regime.)
Agents, from Devin to Operator to random open-source projects, are still pretty terrible. You can’t set up an ecosystem of agents in a big data center and let them rip, such that the ecosystem’s power scales boundlessly with the data center’s size. For all but the most formulaic tasks, you still need a competent human closely babysitting everything they do, which means you’re still mostly bottlenecked on competent human attention.
Suppose that you don’t expect the situation to improve: that the inference-time scaling paradigm would hit a ceiling pretty soon, or that it’d converge to distilling search into forward passes (such that the end users end up using very little compute on inference, like today), and that agents just aren’t going to work out the way the AGI labs promise.
In such a world, a given task can either be completed automatically by an AI for some fixed quantity of compute X, or it cannot be completed by an AI at all. Pouring ten times more compute on it does nothing.
In such a world, if it were shown that the compute needs of a task can be met with ten times less compute than previously expected, this would decrease the expected demand for compute.
The fact that capable models can be run locally might increase the number of people willing to use them (e. g., those very concerned about data privacy), as might the ability to automatically complete 10x as many trivial tasks. But it’s not obvious that this demand spike will be bigger than the simultaneous demand drop.
And I, at least, when researching ways to set up DeepSeek-R1 locally, found myself more drawn to the “wire a bunch of Macs together” option, compared to “wire a bunch of GPUs together” (due to the compactness). If many people are like this, it makes sense why Nvidia is down while Apple is (slightly) up. (Moreover, it’s apparently possible to run the full 671b-parameter version locally, and at a decent speed, using a pure RAM+CPU setup; indeed, it appears cheaper than mucking about with GPUs/Macs, just $6,000.)
This world doesn’t seem outright implausible to me. I’m bearish on agents and somewhat skeptical of inference-time scaling. And if inference-time scaling does deliver on its promises, it’ll likely go the way of search-and-distill.
On balance, I don’t actually expect the market to have any idea what’s going on, so I don’t know that its reasoning is this specific flavor of “well-informed but skeptical”. And again, it’s possible the drop was due to Trump, nothing to do with DeepSeek at all.
But as I’d said, this reaction to DeepSeek-R1 does not seem necessarily irrational/incoherent to me.
The bet that “makes sense” is that the quality of Claude 3.6 Sonnet, GPT-4o, and DeepSeek-V3 is the best that we’re going to get in the next 2-3 years, and DeepSeek-V3 gets it much cheaper (fewer active parameters, smaller margins from open weights), also “suggesting” that quality is compute-insensitive in a large range, so there is no benefit from more compute per token.
But if quality instead improves soon (including by training DeepSeek-V3 architecture on GPT-4o compute), and that improvement either makes it necessary to use more compute per token, or motivates using inference for more tokens even with models that have the same active parameter count (as in Jevons paradox), that argument doesn’t work. Also, the ceiling of quality at the possible scaling slowdown point depends on efficiency of training (compute multiplier) applied to the largest training system that the AI economics will support (maybe 5-15 GW without almost-AGI), and improved efficiency of DeepSeek-V3 raises that ceiling.
If o3 is based on GPT-4o, there is a reasoning model based on GPT-4.5 that’s better. If o3 is based on GPT-4.5 (and so the reason it’s still not out is that they are waiting for Blackwells to inference it at a reasonable speed and cost), then it was a hasty job just after the base model for GPT-4.5 was pretrained, and so by now they have something much better. Either way, there is some sort of “o4”, but it’s probably a bad look to announce it before releasing the already-announced o3.
Some thoughts on protecting against LLM steganography.
tl;dr: I suspect the Paraphraser idea is near-completely ineffectual as you scale the capabilities, both for scheming and for general CoT interpretability.
In the reasoning-as-communication framework, the amount of information a fixed-length message/CoT can transfer is upper-bounded by information-theoretic constraints. That is, if we have a ten-bit message, it cannot communicate more than ten bits of information: it cannot always uniquely specify one internal state of an AI model from more than 2^10 internal states.
However, the amount of information is not lower-bounded. Depending on the decoder, a ten-bit message can communicate between zero and ten bits of information. (A decoder that’s just a constant function recovers zero bits, a decoder that only pays attention to the first bit recovers only one bit, et cetera.)
The idea of steganography is that, if we pick a CoT string, and run both an AI model’s internal decoders and a human’s language-comprehension decoders on it, the AI model would extract more bits of information from it. Paraphrasing, humans’ decoders are “coarser”: they have broader equivalence classes of messages that map to the same internal representation, compared to AI models.
For example, “a dog crossed the street” and “a dog traveled across the street” might map to the same internal representation for a human, but to different ones for an AI model. Similarly, things like “quantum butterflies juggled metaphysical spaghetti” and “ambitious lampshade collected vintage moonbeams” can map to the same “nonsense” internal representation in humans, but have some useful meaning for AI models. The bits a human fails to make use of correspond to “steganography” or “superhuman latent knowledge”.
Proposals such as Shoggoth + Paraphraser + Face aim to destroy any information in CoTs that cannot be recovered by humans, by mapping each CoT string to the equivalence class of strings that all map to the same internal representation in humans.
However, there’s an issue: what if the empty string already communicates plenty of steganographically encoded information that humans aren’t privy to? That is: what if some knowledge is hard-coded into an AI model, such that it’s elicited on any activation, and leads to it converging towards reaching the same hidden conclusions even after we destroy its ability to sneak in additional query-relevant information?
For example:
If the AI model implicitly learned the Theory of Everything, that knowledge is going to be activated on any prompt – and so, no amount of paraphrasing is going to force it to talk about the Theory of Everything in humanese. Communicating the ToE from one instance of the model to its future self takes up zero bits.
If the AI model has internalized some hidden goals, the heuristics for pursuing those goals are likewise going to be activated on any prompt, and it would engage in massively productive hidden scheming on any prompt. Importantly, the information-bottleneck constraints imposed by the Paraphraser wouldn’t even be hitting the hidden-plotting compartment specifically. They’d just uniformly lower the model’s computational power.
E. g., in the Shoggoth + Paraphraser + Face paradigm, that’d just mean the model learns to think in longer CoTs (because it needs longer messages to communicate the same amount of information), not to do fewer “sneaky” computations.[1]
Like, imagine that we have two different internal-representation sets: X, containing problem-solving strategies for the overt task we’re presenting the model, and Y, related to some misaligned scheming. We would like to constrain the model such that it cannot plan about Y; such that its CoTs contain zero bits for narrowing down which state out of all Y it must pick.
This assumes something like the following model: A given CoT string communicates some N bits of information. Of them, M bits are human-legible. The model then follows the algorithm of: use N_X task-relevant bits on picking a state out of X, and use any leftover N_Y = N − N_X bits to pick a state out of Y. We then destroy the string’s information content until N ≈ M (via the Paraphraser). Now the model doesn’t have the ability to encode any non-human-legible bits, and if it uses any bits on picking a state out of Y, they would necessarily be human-legible, and we can spot the scheming by reading the CoT.
But that’s not how this works. Instead, the model picks both simultaneously: a CoT is directly mapped to an ⟨X, Y⟩ two-tuple. Which means if the model can narrow down the state of X to one out of 2^10, it can narrow down the state of Y to one out of 2^10 in the same move. Even if we scramble the CoT until it communicates no more bits than a human can extract out of it, that doesn’t differentially constrain the model’s ability to think about scheming (= communicate scheming-relevant information to its future self).
It’s a uniform communication constraint/computational-complexity penalty.
Generalizing beyond scheming: if the model were inclined to not put any of its thoughts into humanese (due to superhuman latent knowledge, say), the Paraphraser wouldn’t address this issue either. If a model’s plans contain some catastrophic side-effects (which aren’t even malign, just something we didn’t tell it not to cause), the Paraphraser wouldn’t reveal that either.
(This doesn’t touch on the Face part of the proposal, i. e., the idea that we want to not pressure the model to conceal any scheme-y cognition in the hopes it’d naively reveal them. But if there’s some pressure to conceal that, the Paraphraser would not help with it at all. It’d just teach the model to map humanese-equivalent strings to equivalent scheming-states, which isn’t particularly useful.)
Arguably, if you also penalize the CoT length, this should cause a pressure towards throwing out any computations unrelated to the object-level problem the model’s being trained on… But then the whole “paraphraser” bit doesn’t actually do anything, you can achieve the same outcome by just directly penalizing the CoT length.
Which, inasmuch as DL training loops use some form of simplicity prior, should be happening by default anyway. I. e., this doesn’t address the pressures that we expect to lead to steganography/hidden plotting despite the simplicity prior: which is to say, arguments that steganography/plotting are the actual optimal solutions to the training problem.
Consider the Theory of Everything example: those hidden computations are not actually “irrelevant computations” that we need to displace, they’re directly useful. Similarly, any misaligned mesa-objectives the model forms would presumably be convergently useful for performance on the training objective: that would be why they formed to begin with. So the information-bottleneck trickery would do ~nothing against them.
I like to distinguish between “Subtle strategic bias” and “Steganography.” Paraphrasers help (and maybe solve?) the latter, but do nothing about the former.
Example of subtle strategic bias: The model knows it’s misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. “I’m sorry sir I just am not making much progress,” it says. You look at the CoT, and it’s just doing a bunch of sensible but fruitless lines of reasoning.
Example of steganography: The model needs to insert a subtle security vulnerability into the code it is writing—but it doesn’t know one right away, it has to invent one. And that requires lots of sequential thought. So it needs to try out different ideas, vet them, critique them, refine them, etc. Can’t do that in a single forward pass. So it needs to pass information forward to its future self somehow… without humans reading the CoT knowing that’s what it’s doing. THIS is the sort of thing that is much harder to do with a paraphraser than without a paraphraser.
Suppose you tasked a model with refactoring some database, and it secretly wants to sabotage its functionality. It outputs a CoT segment that goes, “I need to design unit tests for this module”. Each word of this CoT is mapped to some internal representation, and the internal representation consists of “on-task” human-intended reasoning X, and misaligned scheming Y (corresponding to, e. g., a tactical decision to fail to include unit tests for some subtle issue). “I” is mapped to ⟨X_1, Y_1⟩, “need” is mapped to ⟨X_2, Y_2⟩, etc. Each X_i iterates on X_{i−1}, each Y_i iterates on Y_{i−1}; we have a sequence of serial computations.
In other words: as the model outputs, “I need to design unit tests for this module”, there’s an unspoken subtext of “I need to sabotage the unit tests here”, and the model packs nine words’ worth of serial computations into both the “textual” X and the “subtextual” Y aspects of it.
Now you use the Paraphraser on the string, generating alternatives like “the next step is to generate some unit tests” and “several unit tests for this module are needed here”. The intent, if I understand you correctly, is that the semantic-equivalence-but-syntactic-difference would ensure that the i-th token of this phrase is mapped to X_i, as if the paraphrasing didn’t happen, but that the mapping of i→Y_i would be broken.
Except, why would that happen? What privileges X-representations over Y-representations? Why would i→X_i be preserved whereas i→Y_i is broken, instead of both of them remaining functional (and so preserving scheming), or both breaking (and so compromising the model’s capabilities)?
In other words: We’re basically just teaching the model to fail to distinguish between equivalent-to-a-human rephrasings. Why would this lead to the task-related representations becoming indifferent to the specifics of a phrasing whereas the scheming-related representations remain sensitive to it? Why wouldn’t this just map the subtextual “I need to sabotage the unit tests here” to similarly semantically-equivalent-but-syntactically-different “let us compromise the relevant unit-test suite” or “I shall ensure the unit tests don’t work properly”?
I expect it’d just lead, to wit, to the model’s computations becoming indifferent to the details of a phrasing in full generality. And if the model were inclined to associate the general idea of generating unit tests with tactically screwing the task up, the model would come up with this idea regardless of the details of a phrasing.
It wouldn’t interfere with its ability to engage in scheming-relevant serial computations, much like it doesn’t interfere with its ability to engage in task-relevant serial computations, for the exact same reasons.
I don’t think I understand. Let’s work through an example.
The AI is being told to write good code, and wants to write subtly-vulnerable code instead. If it just were to honestly try to write good code, it would take (let’s say) 100 serial steps of reasoning to succeed. To write subtly-vulnerable code, let’s say, requires 300 serial steps of reasoning. The model’s strategy is to write subtly-vulnerable code but in a way that looks like it’s trying to write good code and just getting a bit unlucky and taking longer than usual (let’s suppose the humans have a poor ability to judge how long it’s supposed to take, but still some ability, such that it can’t take e.g. 10x longer without arousing suspicion).
If there’s no paraphraser, this is ‘easy.’ I mean it’s still difficult but it’s ‘easy’ in the sense that the AI can basically just use a code—each nugget of reasoning might have 10 different semantically equivalent expressions, or maybe even 100, so it can ‘add extra layers of meaning’ to the nuggets—in addition to their meaning on the surface-level I’m-trying-to-write-good-code reading, they can have a hidden meaning having to do with the security-vulnerability-related properties of the code. And it can make a chain of reasoning that is basically about the security stuff while looking like a reasonable but unlucky attempt to write good code.
But a paraphraser wrecks that strategy. The model can still add extra layers of meaning to the nuggets, but now it has loads of trouble making the chain of thought look like an honest attempt to write good code. To use your example, if ‘I should generate unit tests now’ and all semantically equivalent variants of that idea are assigned the hidden meaning ‘I should check to see if the vulnerability is too easy to spot’ for example, then… what is it going to write next? It’s too constrained; in order to look non-suspicious there is a limited set of options it can follow up with… ok, hmm, I’m at the limits of my understanding here. But maybe you get the idea?
…here’s an intuition I think I’m drawing from. Suppose you’re the model and you are trying to decide what hidden meanings to associate with what concepts/phrases/steps. If you already had your completed illicit reasoning chain and your completed good-boy reasoning chain, it would be easy—just line them up and make a 1-1 mapping, and then bam no paraphraser can stop you! But you don’t have either chain! That’s the whole point—this is a task that you haven’t done yet, you need to explore some maze in logical space (or tree? Or graph?) until you find a chain that gets to the conclusion/answer.
Consider a CoT of the form “[problem statement]: …........”, which always outputs dots regardless of the problem statement. Can that still encode useful computations, with the information being passed through intermediate self-attention? Yes, but: this only increases parallel-computing capabilities, not serial-computing capabilities.
And in the framework I’d outlined, under the Paraphraser, for the purposes of scheme-y thinking, even a legible CoT would consist of such “filler” tokens.
So yep, you’re right, this does force all scheming to be taking place within one forward pass.
Edit: Nevermind, evidently I’ve not thought this through properly. I’m retracting the below.
The naïve formulations of utilitarianism assume that all possible experiences can be mapped to scalar utilities lying on the same, continuous spectrum, and that experiences’ utility is additive. I think that’s an error.
This is how we get the frankly insane conclusions like “you should save 10^100 shrimps instead of one human” or everyone’s perennial favorite, “if you’re choosing between one person getting tortured for 50 years or some amount of people N getting a dust speck into their eye, there must be an N big enough that the torture is better”. I disagree with those. I would sacrifice an arbitrarily large amount of shrimps to save one human, and there’s no N big enough for me to pick torture. I don’t care if that disagrees with what the math says: if the math says something else, it’s the wrong math and we should find a better one.
Here’s a sketch of what I think such better math might look like (with a toy formalization after the list):
There’s a totally ordered, discrete set of “importance tiers” of experiences.
Within tiers, the utilities are additive: two people getting dust specks is twice as bad as one person getting dust-speck’d, two people being tortured is twice as bad as one person being tortured, eating a chocolate bar twice per week is twice as good as eating a bar once per week, etc.
Across tiers, the tier ordering dominates: if we’re comparing some experience belonging to a higher tier to any combination of experiences from lower tiers, the only relevant consideration is the sign of the higher-tier experience. No amount of people getting dust-speck’d, and no amount of dust-speck events sparsely distributed throughout one person’s life, can ever add up to anything as important as a torture-level experience.[1]
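A toy formalization of the comparison rule above (just a sketch; the tier assignments are hard-coded and are, of course, where all the real philosophical work lives):

```python
from collections import defaultdict

def compare(outcome_a: list[tuple[int, float]], outcome_b: list[tuple[int, float]]) -> int:
    """Each outcome is a list of (tier, utility) experiences; higher tier = more
    important. Utilities add within a tier; across tiers, the highest tier on
    which the outcomes differ decides. Returns 1 if a is better, -1 if b is, 0 if tied."""
    def tier_sums(outcome):
        sums = defaultdict(float)
        for tier, utility in outcome:
            sums[tier] += utility
        return sums
    a, b = tier_sums(outcome_a), tier_sums(outcome_b)
    for tier in sorted(set(a) | set(b), reverse=True):  # most important tier first
        diff = a.get(tier, 0.0) - b.get(tier, 0.0)
        if diff != 0:
            return 1 if diff > 0 else -1
    return 0

# One torture-tier harm (tier 2) vs. arbitrarily many dust specks (tier 0):
print(compare([(2, -1.0)], [(0, -1.0)] * 1_000_000))  # -1: the torture outcome is worse
```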
Intuitively, tiers correspond to the size of effect a given experience has on a person’s life:
A dust speck is a minor inconvenience that is forgotten after a second. If we zoom out and consider the person’s life at a higher level, say on the scale of a day, this experience rounds off exactly to zero, rather than to an infinitesimally small but nonzero value. (Again, on the assumption of a median dust-speck event, no emergent or butterfly effects.)
Getting yelled at by someone might ruin the entirety of someone’s day, but is unlikely to meaningfully change the course of their life. Experiences on this tier are more important than any amount of dust-speck experiences, but any combination of them rounds down to zero from a life’s-course perspective.
Getting tortured is likely to significantly traumatize someone, to have a lasting negative impact on their life. Experiences at this tier ought to dominate getting-yelled-at as much as getting-yelled-at dominates dust specks.
Physically, those “importance tiers” probably fall out of the hierarchy of natural abstractions. Like everything else, a person’s life has different levels of organization. Any detail in how the high-level life-history goes is incomparably more important than any experience which is only locally relevant (which fails to send long-distance ripples throughout the person’s life). Butterfly effects are then the abstraction leaks (low-level events that perturb high-level dynamics), etc.
I didn’t spend much time thinking about this, so there may be some glaring holes here, but this already fits my intuitions much better.
I think we can expand that framework to cover “tiers of sentience”:
If shrimps have qualia, it might be that any qualia they’re capable of experiencing belong to lower-importance tiers, compared to the highest-tier human qualia.
Simultaneously, it might be the case that the highest-importance shrimp qualia are on the level of the lower-importance human qualia.
Thus, it might be reasonable to sacrifice the experience of eating a chocolate bar to save 10^100 shrimps, even if you’d never sacrifice a person’s life (or even make someone cry) to save any amount of shrimps.
This makes some intuitive sense, I think. The model above assumes that “local” experiences, which have no impact on the overarching pattern of a person’s life, are arbitrarily less important than that overarching pattern. What if we’re dealing with beings whose internal lives have no such overarching patterns, then? A shrimp’s interiority is certainly less complex than that of a human, so it seems plausible that its life-experience lacks comparably rich levels of organization (something like “the ability to experience what is currently happening as part of the tapestry spanning its entire life, rather than as an isolated experience”). So all of its qualia would be comparable only with the “local” experiences of a human, for some tier of locality: we would have direct equivalence between them.
One potential issue here is that this implies the existence of utility monsters: some divine entities such that they can have experiences incomparably more important than any experience a human can have. I guess it’s possible that if I understood qualia better, I would agree with that, but this seems about as anti-intuitive as “shrimps matter as much as humans”. My intuition is that sapient entities top the hierarchy of moral importance, that there’s nothing meaningfully “above” them. So that’s an issue.
One potential way to deal with this is to suggest that what distinguishes sapient/”generally intelligent” entities is not that they’re the only entities whose experiences matter, but that they have the ability to (learn to) have experiences of arbitrarily high tiers. And indeed: the whole shtick of “general intelligence” is that it should allow you to learn and reason about arbitrarily complicated systems of abstraction/multi-level organization. If the importance tiers of experiences really have something to do with the richness of the organization of the entity’s inner life, this resolves things neatly. Now:
Non-sapient entities may have experiences of nonzero importance.
No combination of non-sapient experiences can compare to the importance of a sapient entity’s life.
“Is sapient” tops the hierarchy of moral relevance: there’s no type of entity that is fundamentally “above”.
Two caveats here are butterfly effects and emergent importance:
Getting a dust speck at the wrong moment might kill you (if you’re operating dangerous machinery) or change the trajectory of your life (if this minor inconvenience is the last straw that triggers a career-ruining breakdown). We have to assume such possibilities away: the experiences exist “in a vacuum”. Doing otherwise would violate the experimental setup, dragging in various practical considerations, instead of making it purely about ethics.
So we assume that each dust-speck event always has the “median” amount of impact on a person’s life, even if you scale the amount of dust-speck events arbitrarily.
Getting 1000 dust specks one after another adds up to something more than 1000 single-dustspeck experiences; it’s worse than getting a dust speck once per day for 1000 days. More intuitively, experiencing a 10⁄10 pain for one millisecond is not comparable to experiencing 10⁄10 pain for 10 minutes. There are emergent effects at play, and like with butterflies, we must assume them away for experimental purity.
So if we’re talking about M experiences from within the same importance tier, they’re assumed to be distributed such that they don’t add up to a higher-tier experience.
Note that those are very artificial conditions. In real life, both of those are very much in play. Any lower-tier experience has a chance of resulting into a higher-tier experience, and every higher-tier experience emerges from (appropriately distributed) lower-tier experiences. In our artificial setup, we’re assuming certain knowledge that no butterfly effects would occur, and that a lower-tier event contributes to no higher-tier pattern.
Relevance: There’s reasoning that goes, “if you ever drive to the store to get a chocolate bar, you’re risking crashing into and killing someone, therefore you don’t value people’s lives infinitely more than eating chocolate”. I reject it on the above grounds. Systematically avoiding all situations where you’re risking someone’s life in exchange for a low-importance experience would assemble into a high-importance life-ruining experience for you (starving to death in your apartment, I guess?). Given that, we’re now comparing same-tier experiences, and here I’m willing to be additive, calculating that killing a person with very low probability is better than killing yourself (by a thousand cuts) with certainty.
Besides uncertainty, there’s the problem of needing to pick cutoffs between tiers in a ~continuous space of ‘how much effect does this have on a person’s life?’, with things slightly on one side or the other of a cutoff being treated very differently.
Intuitively, tiers correspond to the size of effect a given experience has on a person’s life:
I agree with the intuition that this is important, but I think that points toward just rejecting utilitarianism (as in utility-as-a-function-purely-of-local-experiences, not consequentialism).
It’s worth noting that everything funges: some large number of experiences of eating a chocolate bar can be exchanged for avoiding extreme human suffering or death. So, if you lexicographically put higher weight on extreme human suffering or death, then you’re willing to make extreme tradeoffs (e.g., 10^30 chocolate bar experiences) in terms of mundane utility for saving a single life. I think this easily leads to extremely unintuitive conclusions, e.g. you shouldn’t ever be willing to drive to a nice place. See also Trading off Lives.
I find your response to this sort of argument under “Relevance: There’s reasoning that goes” in the footnote very uncompelling as it doesn’t apply to marginal impacts.
Relevance: There’s reasoning that goes, “if you ever drive to the store to get a chocolate bar, you’re risking crashing into and killing someone, therefore you don’t value people’s lives infinitely more than eating chocolate”. I reject it on the above grounds. Systematically avoiding all situations where you’re risking someone’s life in exchange for a low-importance experience would assemble into a high-importance life-ruining experience for you (starving to death in your apartment, I guess?). Given that, we’re now comparing same-tier experiences, and here I’m willing to be additive, calculating that killing a person with very low probability is better than killing yourself (by a thousand cuts) with certainty.
Ok, but if you don’t drive to the store one day to get your chocolate, then that is not a major pain for you, yes? Why not just decide that next time you want chocolate at the store, you’re not going to go out and get it because you may run over a pedestrian? Your decision there doesn’t need to impact your other decisions.
Then you ought to keep on making that choice until you are right on the edge of those choices adding up to a first-tier experience, but certainly below.
This logic generalizes. You will always be pushing the lower tiers of experience as low as they can go before they enter the upper-tiers of experience. I think the fact that your paragraph above is clearly motivated reasoning here (instead of “how can I actually get the most bang for my buck within this moral theory” style reasoning) shows that you agree with me (and many others) that this is flawed.
Systematically avoiding all situations where you’re risking someone’s life in exchange for a low-importance experience would assemble into a high-importance life-ruining experience for you (starving to death in your apartment, I guess?).
We can easily ban speed above 15km/h for any vehicles except ambulances. Nobody starves to death in this scenario, it’s just very inconvenient. We value convenience lost in this scenario more than lives lost in our reality, so we don’t ban high-speed vehicles.
Ordinal preferences are bad and insane and they are to be avoided.
What’s really wrong with utilitarianism is that you can’t, actually, sum utilities: it’s a type error. Utilities are only defined up to positive affine transformations, so what would their sum even mean?
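To make the type error concrete, here’s a minimal worked example (my numbers, purely illustrative): take two agents and two outcomes with

$$u_1(X)=1,\quad u_1(Y)=0,\qquad u_2(X)=0,\quad u_2(Y)=2.$$

The sum ranks Y above X (2 > 1). But $u_1' = 10\,u_1$ represents exactly the same preferences, since utilities are only defined up to positive affine transformations $u \mapsto au+b$ with $a>0$, and under it the sum ranks X above Y (10 > 2). A “sum of utilities” flips its verdict under a relabeling that changes nobody’s preferences, which is the type error.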
The problem, I think, is that humans naturally conflate two types of altruism. The first type is caring about other entities’ mental states. The second type is “game-theoretic” or “alignment-theoretic” altruism: a generalized notion of what it means to care about someone else’s values. Roughly, I think the good version of the second type of altruism requires you to bargain fairly on behalf of the entity you are being altruistic towards.
Let’s take the “World Z” thought experiment. The problem, from the second-type-of-altruism perspective, is that the total utilitarian gets very large utility from this world, while all inhabitants of this world, by premise, get very small utility per person, which is an unfair division of gains.
One may object: why not create entities who think that a very small share of the gains is fair? My answer is that if an entity can be satisfied with an infinitesimal share of the gains, it can also be satisfied with an infinitesimal share of anthropic measure, i.e., non-existence, and it’s more altruistic to look for more demanding entities to fill the universe with.
My general problem with animal welfare, from the bargaining perspective, is that most animals probably don’t have sufficient agency to have any sort of representative in the bargaining. We can imagine a CEV of shrimp which is negative-utilitarian and wants to kill all shrimp, or a positive-utilitarian one which thinks that even a very painful existence is worth it, or a CEV that prefers shrimp swimming in heroin, or something human-like, or something totally alien; and the sum of these guesses probably comes out to “do not torture, and otherwise do as you please”.
This is how we get frankly insane conclusions like “you should save 10^100 shrimp instead of one human”.
Huh, I expected better from you.
No, it is absolutely not insane to save 10^100 shrimp instead of one human! I think the case for insanity for the opposite is much stronger! Please, actually think about how big 10^100 is. We are talking about more shrimp than atoms in the universe. Trillions upon trillions of shrimp more than atoms in the universe.
This is a completely different kind of statement than “you should trade of seven bees against a human”.
No, being extremely, overwhelmingly confident about morality, such that even when given a choice to drastically alter 99.999999999999999999999% of the matter in the universe, you call the side of not destroying it “insane” just because you don’t want to give up a single human life (a thing we do routinely for much weaker considerations), is insane.
The whole “tier” thing obviously fails. You always end up dominated by spurious effects on the highest tier. In a universe with any appreciable uncertainty you basically just ignore any lower tiers, because you can always tell some causal story of how your actions might infinitesimally affect something, and so you completely ignore it. You might as well just throw away all morality except the highest tier, it will never change any of your actions.
I’m normally in favor of high decoupling, but this thought experiment seems to take it well beyond the point of absurdity. If I somehow found myself in control of the fate of 10^100 shrimp, the first thing I’d want to do is figure out where I am and what’s going on, since I’m clearly no longer in the universe I’m familiar with.
Yeah, I mean, that also isn’t a crazy response. I think being like “what would it even mean to have 10^30 times more shrimp than atoms? Really seems like my whole ontology about the world must be really confused” also seems fine. My objection is mostly to “it’s obvious you kill the shrimp to save the human”.
what would it even mean to have 10^30 times more shrimp than atoms?
Oh, easy, it just implies you’re engaging in acausal trade with a godlike entity residing in some universe dramatically bigger than this one. This interpretation introduces no additional questions or complications whatsoever.
No, being extremely, overwhelmingly confident about morality, such that even when given a choice to drastically alter 99.999999999999999999999% of the matter in the universe, you call the side of not destroying it “insane” just because you don’t want to give up a single human life (a thing we do routinely for much weaker considerations), is insane.
Hm. Okay, so my reasoning there went as follows:
Substitute rocks for shrimp: 10^100 rocks would also be an amount of matter bigger than exists in the observable universe, and we presumably should assign a nonzero probability to rocks being sapient. Should we then save 10^100 rocks instead of one human?
Perhaps. But I think this transforms the problem into Pascal’s mugging, and it has nothing to do with shrimp or ethics anymore. If we’re allowed to drag in outside considerations like this, we should also start questioning whether these 10^100 rocks/shrimp actually exist, and deploy all the other usual arguments against Pascal’s mugging.
To properly engage with thought experiments within some domain, like ethics, we should take the assumptions behind this domain as a given. This implicitly means constraining our hypothesis space to models of reality within which this domain is a meaningful thing to reason about.
In this case, this would involve being able to reason about 10^100 rocks as if they really were just “rocks”, without dragging in the uncertainty about “but what if my very conception of what a ‘rock’ is is metaphysically confused?”.
Similarly, surely we should be able to have thought experiments in which “shrimp” really are just “shrimp”, ontologically basic entities that are not made up of matter which can spontaneously assemble into Boltzmann brains or whatever.
“Shrimp” being a type of system that could implement qualia as valuable as that of humans seems overwhelmingly unlikely to me, not as unlikely as “rocks have human-level qualia”, but in the same reference class. Therefore, in the abstract thought-experiment setup in which I have no uncertainty regarding the ontological nature of shrimp, it’s reasonable to argue that no amount of them compares to a human life.
I’m not sure where you’d get off this train, but I assume the last bullet-point would do it? I. e., that you would argue that the possibility of shrimp having human-level qualia is salient in a way it’s not for rocks?
Yeah, that seems valid. I might’ve shot from the hip on that one.
The whole “tier” thing obviously fails. You always end up dominated by spurious effects on the highest tier
I have a story for how that would make sense, similarly involving juggling inside-model and outside-model reasoning, but, hm, I’m somehow getting the impression my thinking here is undercooked/poorly presented. I’ll revisit that one at a later time.
Edit: Incidentally, any chance the UI for retracting a comment could be modified? I have two suggestions here:
I’d like to be able to list a retraction reason, ideally at the top of the comment.
The crossing-out thing makes it difficult to read the comment afterwards, and some people might want to be able to do that. Perhaps it’s better to automatically put the contents into a collapsible instead, or something along those lines?
Edit: Incidentally, any chance the UI for retracting a comment could be modified? I have two suggestions here:
You should be able to strike out the text manually and get the same-ish effect, or leave a retraction notice. The text being hard to read is intentional so that it really cannot be the case that someone screenshots it or skims it without noticing that it is retracted.
No, it is absolutely not insane to save 10^100 shrimp instead of one human! It would be insane to do the opposite. Please, actually think about how big 10^100 is. We are talking about more shrimp than atoms in the universe. Trillions upon trillions of shrimp more than atoms in the universe.
I think it pretty clearly is insane to save 10^100 shrimp instead of one human! It doesn’t matter how many shrimp it is. The moral value of shrimp does not aggregate like that.
The grandparent comment is obviously correct in its description of the problem. (Whether the proposed solution works is another question entirely.)
The whole “tier” thing obviously fails. You always end up dominated by spurious effects on the highest tier.
That’s just not true.
One obvious approach is: once you get to the point where noise (i.e., non-systematic error) dominates your calculations for a particular tier, you ignore that tier and consider the next lower tier. (This is also approximately equivalent to a sort of reasoning which we use all the time, and it works pretty straightforwardly, without giving rise to the sorts of pathologies you allude to.)
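For concreteness, here is a minimal sketch of how that tier-then-noise rule could be cashed out (my own illustration, not a definition anyone in this thread has committed to; the `signal_threshold` parameter is, of course, doing a lot of the work):

```python
from dataclasses import dataclass

@dataclass
class TierEstimate:
    difference: float  # estimated (option A minus option B) on this tier
    noise: float       # scale of the non-systematic error in that estimate

def choose(tiers: list[TierEstimate], signal_threshold: float = 2.0) -> str:
    """Walk the tiers from most to least important; let a tier decide only
    when its estimated difference clearly exceeds its noise level."""
    for tier in tiers:
        if abs(tier.difference) > signal_threshold * tier.noise:
            return "A" if tier.difference > 0 else "B"
        # Noise dominates this tier, so ignore it and consult the next one.
    return "indifferent"

# A spurious, noise-dominated effect on the top tier gets ignored,
# and the clear second-tier difference decides the choice.
print(choose([TierEstimate(difference=1e-9, noise=1.0),
              TierEstimate(difference=3.0, noise=0.5)]))  # -> "A"
```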
There is no clear or adequate definition of what “[noise] dominated your calculations” means. Maybe you can provide one, but I’ve never seen anyone provide any such definition, or made much headway in doing so.
Creating such a definition of noise has proven to be quite hard, as it’s extremely rare that someone is willing to ignore absolutely all stakes at lower levels of concern or abstraction, no matter the magnitude.
Even if one tries to elevate your family above everything else, it is commonly accepted that it is not moral to sacrifice all of society for just your family, or threaten large scale catastrophe.
Similarly as you elevate the interests of your nation above other things, at a sufficient scale the interests of the rest of the world poke their way into your decision-making in substantial ways again.
Even if you try to do nothing but elevate the interests of animal life, we have still decided that it is not ethical to destroy even fully abiological and definitely not complicated plant-based ecosystems for those interests, if the harm is sufficiently large.
You maybe want to propose we make decisions this way, but humanity absolutely does not generally make decisions this way. When people have to make decisions they usually decide on some rough thresholds for noticing tradeoffs across domains, and indeed decide and re-evaluate how important something is when a decision affects something in a different domain at much larger scale than other decisions.
It doesn’t matter how many shrimp it is.
Look, there are numbers that are very very big.[1]
Again, we are talking about so many shrimp that it would be exceedingly unlikely for this number of shrimp, if left under the auspices of gravity, not to form their own planets and solar systems and galaxies in which life thrives and in which other, non-shrimp intelligences form. A number so incomprehensibly big. A galaxy within each atom of our universe made out of shrimp. One can argue it’s meaningless to talk about numbers this big, and while I would dispute that, it’s definitely a much more sensible position than trying to take a confident stance to destroy or substantially alter a set of things so large that it vastly eclipses in complexity and volume and mass and energy all that has ever or will ever exist by a trillion-fold.
Indeed, there are numbers so big that the very act of specifying them would encode calculations capable of simulating universes full of healthy and happy humans. The space of numbers is really very large.
One can argue it’s meaningless to talk about numbers this big, and while I would dispute that, it’s definitely a much more sensible position than trying to take a confident stance to destroy or substantially alter a set of things so large that it vastly eclipses in complexity and volume and mass and energy all that has ever or will ever exist by a trillion-fold.
Okay, while I’m hastily backpedaling from the general claims I made, I am interested in your take on the first half of this post. I think there’s a difference between talking about an actual situation, with the full complexities of messy reality taken into account, where a supernatural being physically shows up and makes you really decide between a human and 10^100 shrimp, and a thought experiment where “you” “decide” between a “human” and 10^100 “shrimp”. In the second case, my model is that we’re implicitly operating in an abstracted-out setup where the terms in the quotation marks are, essentially, assumed ontologically basic, and matching our intuitive/baseline expectations about what they mean.
While, within the hypothetical, we can still have some uncertainty over e. g. the degree of the internal experiences of those “shrimp”, I think we have to remove considerations like “the shrimp will be deposited into a physical space obeying the laws of physics where their mass may form planets and galaxies” or “with so many shrimp, it’s near-certain that the random twitches of some subset of them would spontaneously implement Boltzmann civilizations of uncountably many happy beings”.
IMO, doing otherwise is a kind of “dodging the hypothetical”, no different from considering it very unlikely that the supernatural being really has control over 10^100 of something, and starting to argue about this instead.
While, within the hypothetical, we can still have some uncertainty over e. g. the degree of the internal experiences of those “shrimp”, I think we have to remove considerations like “the shrimp will be deposited into a physical space obeying the laws of physics where their mass may form planets and galaxies” or “with so many shrimp, it’s near-certain that the random twitches of some subset of them would spontaneously implement Boltzmann civilizations of uncountably many happy beings”.
IMO, doing otherwise is a kind of “dodging the hypothetical”, no different from considering it very unlikely that the supernatural being really has control over 10^100 of something, and starting to argue about this instead.
I agree there is something to this, but when actually thinking about tradeoffs that do actually have orders of magnitude of variance in them, which is ultimately where this kind of reasoning is most useful (not 100 orders of magnitude, but you know 30-50 are not unheard of), this kind of abstraction would mostly lead you astray, and so I don’t think it’s a good norm for how to take thought experiments like this.
Like, I agree there are versions of the hypothetical that are too removed, but ultimately, I think a central lesson of scope sensitivity is that having a lot more of something often means drastic qualitative changes that come with that drastic change in quantity. Having 10 flop/s of computation is qualitatively different to having 10^10 flop/s. I can easily imagine someone before the onset of modern computing saying “look, how many numbers do you really need to add in everyday life? What is even the plausible purpose of having 10^10 flop/s available? For what purpose would you need to possibly perform 10 billion operations per second? This just seems completely absurd. Clearly the value of a marginal flop goes to zero long before that. That is more operations than all computers[1] in the world have ever ever done, in all of history, combined. What could possibly be the point of this?”
And of course, such a person would be sorely mistaken. And framing the thought experiment as “well, no, I think if you want to take this thought experiment seriously you should think about how much you would be willing to pay for the 10 billionth operation of the kind that you are currently doing, which is clearly zero. I don’t want you to hypothesize some kind of new art forms or applications or computing infrastructure or human culture, which feel like they are not the point of this exercise, I want you to think about the marginal item in isolation” would be pointless. It would be emptying the exercise and tradeoff of any of its meaning. If we ever face a choice like this or, anything remotely like it, of course how the world adapts around this, and the applications that get built for it, and the things that aren’t obvious from when you first asked the question matter.
And to be clear, I think there is also a real conversation going on here about whether maybe, even if you isolated each individual shrimp into a tiny pocket universe, and you had no way of ever seeing them or visiting the great shrimp rift (a natural wonder clearly greater than any natural wonder on earth), and all you knew for sure was that it existed somewhere outside of your sphere of causal influence, and the shrimp never did anything more interesting than current alive shrimp, whether it would still be worth it to kill a human. And for that, I think the answer is less obviously “yes” or “no”, though my guess is the 10^100 causally isolated shrimp ultimately still enrich the universe more than a human would, and are worth preserving more, but it’s less clear.
And we could focus on that if we want to, I am not opposed to it, but it’s not clearly what the OP that sparked this whole thread was talking about, and I find it less illuminating than other tradeoffs, and it would still leave me with a strong reaction that at least the reason why the answer might be “kill the shrimp” is definitely, absolutely different from the reason why you should not kill a human to allow 7 bees to live for a human lifetime.
I think there is also a real conversation going on here about whether maybe, even if you isolated each individual shrimp into a tiny pocket universe, and you had no way of ever seeing them or visiting the great shrimp rift (a natural wonder clearly greater than any natural wonder on earth), and all you knew for sure was that it existed somewhere outside of your sphere of causal influence, and the shrimp never did anything more interesting than current alive shrimp, whether it would still be worth it to kill a human
Yeah, that’s more what I had in mind. Illusion of transparency, I suppose.
Like, I agree there are versions of the hypothetical that are too removed, but ultimately, I think a central lesson of scope sensitivity is that having a lot more of something often means drastic qualitative changes in what it means to have that thing
Certainly, and it’s an important property of reality. But I don’t think this is what extreme hypotheticals such as the one under discussion actually want to talk about (even if you think this is a more important question to focus on)?
Like, my model is that the 10^100 shrimp in this hypothetical are not meant to literally be 10^100 shrimp. They’re meant to be “10^100” “shrimp”. Intuitively, this is meant to stand for something like “a number of shrimp large enough for any value you’re assigning them to become morally relevant”. My interpretation is that the purpose of using a crazy-large number is to elicit that preference with certainty, even if it’s epsilon; not to invite a discussion about qualitative changes in the nature of crazy-large quantities of arbitrary matter.
The hypothetical is interested in shrimp welfare. If we take the above consideration into account, it stops being about “shrimp” at all (see the shrimps-to-rocks move). The abstractions within which the hypothetical is meant to live break.
And yes, if we’re talking about a physical situation involving the number 10^100, the abstractions in question really do break under forces this strong, and we have to navigate the situation with the broken abstractions. But in thought-experiment land, we can artificially stipulate those abstractions inviolable (or replace the crazy-high abstraction-breaking number with a very-high but non-abstraction-breaking number).
Like, my model is that the 10^100 shrimp in this hypothetical are not meant to literally be 10^100 shrimp. They’re meant to be “10^100” “shrimp”. Intuitively, this is meant to stand for something like “a number of shrimp large enough for any value you’re assigning them to become morally relevant”. My interpretation is that the purpose of using a crazy-large number is to elicit that preference with certainty, even if it’s epsilon; not to invite a discussion about qualitative changes in the nature of crazy-large quantities of arbitrary matter.
I agree that this is a thing people often like to invoke, but it feels to me a lot like people talking about billionaires and not noticing the classical crazy arithmetic errors like:
If Jeff Bezos’ net worth reaches $1 trillion, “he could literally end world poverty and give everyone $1 billion and he will still have $91.5 billion left.”
Like, in those discussions people are almost always trying to invoke numbers like “$1 trillion” as “a number so big that the force of the conclusion must be inevitable”, but like most of the time they just fail because the number isn’t big enough.
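To spell out why that particular number fails (my arithmetic, assuming a world population of roughly 8 billion):

$$\frac{\$1\ \text{trillion}}{8\times 10^{9}\ \text{people}} = \frac{10^{12}}{8\times 10^{9}}\ \$/\text{person} \approx \$125\ \text{per person},$$

which is about seven orders of magnitude short of “$1 billion each”.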
If someone was like “man, are you really that confident that a shrimp does not have morally relevant experience that you wouldn’t trade a human for a million shrimp?”, my response is “nope, sorry, 1 million isn’t big enough, that’s just really not that big of a number”. But if you give me a number a trillion trillion trillion trillion trillion trillion trillion trillion times bigger, IDK, yeah, that is a much bigger number.
And correspondingly, for every thought experiment of this kind, I do think there is often a number that will just rip through your assumptions and your tradeoffs. There are just really very very very big numbers.
Like, sure, we all agree our abstractions break here, and I am not confident you can’t find any hardening of the abstractions that makes the tradeoff come out in the direction of the size of the number really absolutely not mattering at all, but I think that would be a violation of the whole point of the exercise. Like, clearly we can agree that we assign a non-zero value to a marginal shrimp. We value that marginal shrimp for a lot of different reasons, but like, you probably value it for reasons that do include things like the richness of its internal experience, and the degree to which it differs from other shrimp, and the degree to which it contributes to an ecosystem, and the degree to which it’s an interesting object of trade, and all kinds of reasons. Now, if we want to extrapolate that value to 10^100, those things are still there; we can’t just start ignoring them.
Like, I would feel more sympathetic to this simplification if the author of the post were a hardcore naive utilitarian, but they self-identify as a Kantian. Kantianism is a highly contextual ethical theory that clearly cares about a bunch of different details of the shrimp, so I don’t get the sense the author wants us to abstract away everything but some supposed “happiness qualia” or “suffering qualia” from the shrimp.
I agree that this is a thing people often like to invoke, but it feels to me a lot like people talking about billionaires and not noticing the classical crazy arithmetic errors like
Isn’t it the opposite? It’s a defence against providing too-low numbers; it’s specifically there to ensure that even infinitesimally small preferences are elicited with certainty.
Bundling up all “this seems like a lot” numbers into the same mental bucket, and then failing to recognize when a real number is not actually as high as in your hypothetical, is certainly an error one could make here. But I don’t see an exact correspondence...
In the billionaires case, a thought-experimenter may invoke the hypothetical of “if a wealthy person had enough money to lift everyone out of poverty while still remaining rich, wouldn’t their not doing so be outrageous?”, while inviting the audience to fill in the definitions of “enough money” and “poverty”. Practical situations might then just fail to match that hypothetical, and innumerate people might fail to recognize that, yes. But this doesn’t mean that that hypothetical is fundamentally useless to reason about, or that it can’t be used to study some specific intuitions/disagreements. (“But there are no rich people with so much money!” kind of maps to “but I did have breakfast!”.)
And in the shrimps case, hypotheticals involving a “very-high but not abstraction-breaking” number of shrimps are a useful tool for discussion/rhetoric. They allow us to establish agreement/disagreement on “shrimp experiences have inherent value at all”, a relatively simple question that could serve as a foundation for discussing other, more complicated and contextual ones. (Such as “how much should I value shrimp experiences?” or “but do enough shrimps actually exist to add up to more than a human?” or “but is Intervention X to which I’m asked to donate $5 going to actually prevent five dollars’ worth of shrimp suffering?”.)
Like, I think having a policy of always allowing abstraction breaks would just impoverish the set of thought experiments we would be able to consider and use as tools. Tons of different dilemmas would collapse to Pascal’s mugging or whatever.
Like, I would feel more sympathetic to this simplification if the author of the post were a hardcore naive utilitarian, but they self-identify as a Kantian. Kantianism is a highly contextual ethical theory that clearly cares about a bunch of different details of the shrimp, so I don’t get the sense the author wants us to abstract away everything but some supposed “happiness qualia” or “suffering qualia” from the shrimp.
Hmm… I think this paragraph at the beginning is what primed me to parse it this way:
Merriam-Webster defines torture as “the infliction of intense pain (as from burning, crushing, or wounding) to punish, coerce, or afford sadistic pleasure.” So I remind the reader that it is part of the second thought experiment that the shrimp are sentient.
Why would we need this assumption[1], if the hypothetical weren’t centrally about the inherent value of the shrimps/shrimp qualia, and the idea that it adds up? The rest of that essay also features no discussion of the contextual value that the existence of a shrimp injects into various diverse environments in which it exists, etc. It just throws the big number around, while comparing the value of shrimps to the value of eating a bag of skittles, after having implicitly justified shrimps having value via shrimps having qualia.
I suppose it’s possible that if I had the full context of the author’s writing in mind, your interpretation would have been obviously correct[2]. But the essay itself reads the opposite way to me.
Why would we need this assumption[1], if the hypothetical weren’t centrally about the inherent value of the shrimps/shrimp qualia, and the idea that it adds up? The rest of that essay also features no discussion of the contextual value that the existence of a shrimp injects into various diverse environments in which it exists, etc. It just throws the big number around, while comparing the value of shrimps to the value of eating a bag of skittles, after having implicitly justified shrimps having value via shrimps having qualia.
I agree I probably implied a bit too much contextualization. Like, I agree the post has a utilitarian bent, but man, I just really don’t buy the whole “let’s add up qualia” as any basis of moral calculation; I find attempts at trying to create a “pure qualia shrimp” about as confused and meaningless as trying to argue that 7 bees are more important than a human. “Qualia” isn’t a thing that exists. The only thing that exists is your values in all of their complexity and godshatteredness. You can’t make a “pure qualia shrimp”; it doesn’t make any philosophical sense, pure qualia isn’t real.
And I agree that maybe the post was imagining some pure qualia juice, and, I don’t know, maybe in that case it makes sense to dismiss it by doing a reductio ad absurdum on qualia juice, but I don’t currently buy it. I think that would both not be engaging with the good parts of the author’s view, and also be kind of a bad step in the discourse. (Like, the previous step was understanding why it doesn’t make sense for 7 bees to be more important than a human, for a lot of different reasons and very robustly; and within that discourse, it’s actually quite important to understand why 10^100 shrimp might actually be more important than a human, under at least a lot of reasonable sets of assumptions.)
I just really don’t buy the whole “let’s add up qualia” as any basis of moral calculation
Same, honestly. To me, many of these thought experiments seem decoupled from anything practically relevant. But it still seems to me that people often do argue from those abstracted-out frames I’d outlined, and these arguments are probably sometimes useful for establishing at least some agreement on ethics. (I’m not sure what a full-complexity godshatter-on-godshatter argument would even look like (a fistfight, maybe?), and am very skeptical it’d yield any useful results.)
Anyway, it sounds like we mostly figured out what the initial drastic disconnect between our views here was caused by?
Even if one tries to elevate your family above everything else, it is commonly accepted that it is not moral to sacrifice all of society for just your family, or threaten large scale catastrophe.
This just means that “elevate your family above everything else” is not an approved-of moral principle, not that it somehow doesn’t work on its own terms. In any case this is not a problem with multi-tier morality, it’s just a disagreement on what the tiers should be.
Similarly as you elevate the interests of your nation above other things, at a sufficient scale the interests of the rest of the world poke their way into your decision-making in substantial ways again.
This, on the other hand, is a matter of instrumental values, not terminal ones. There is once again no problem here with multi-tier morality.
Even if you try to do nothing but elevate the interests of animal life, we have still decided that it is not ethical to destroy even fully abiological and definitely not complicated plant-based ecosystems for those interests, if the harm is sufficiently large.
Same reply as to the first point. (Also, who has ever advocated so weirdly drawn a moral principle as “do nothing but elevate the interests of animal life”…?)
It doesn’t matter how many shrimp it is.
That is false. The numbers are very big. There are numbers so big that the very act of specifying them would encode calculations capable of simulating universes full of healthy and happy humans. It absolutely matters how big this kind of number is.
It doesn’t matter how big the numbers are, because the moral value of shrimp does not aggregate like that. If it were 3^^^3 shrimp, it still wouldn’t matter.
Again, we are talking about so many shrimp that it would be exceedingly unlikely for this number of shrimp, if left under the auspices of gravity, not to form their own planets and solar systems and galaxies in which life thrives and in which other, non-shrimp intelligences form.
Now you’re just smuggling in additional hypothesized entities and concerns. Are we talking about shrimp, or about something else? This is basically a red herring.
That aside—no, the numbers really don’t matter, because that’s just not how moral value of shrimp works, in any remotely sensible moral system. A trillion shrimp do not have a million times the moral value of a million shrimp. If your morality says that they do, then your morality is broken.
A trillion shrimp do not have a million times the moral value of a million shrimp. If your morality says that they do, then your morality is broken.
Nobody was saying this! The author of the post in question also does not believe this!
I am not a hedonic utilitarian. I do not think that a trillion shrimp have a million times the moral value of a million shrimp. That is a much much stronger statement than whether there exists any number of shrimp that might be worth more than a human. All you’ve done here is to set up a total strawman that nobody was arguing for and knocked it down.
Do you think that a trillion shrimp have… 1,000 times the moral value of a million shrimp?
… 10 times the moral value of a million shrimp?
… 1.1 times the moral value of a million shrimp?
… some other multiplicative factor, larger than 1, times the moral value of a million shrimp?
If the answer is “no” to all of these, then that seems like it would mean that you already agree with me, and your previous comments here wouldn’t make any sense. So it seems like the answer has to be “yes” to something in that list.
But then… my response stands, except with the relevant number changed.
On the other hand, you also say:
I am not a hedonic utilitarian.
I… don’t understand how you could be using this term that would make this a meaningful or relevant thing to say in response to my comment. Ok, you’re not a hedonic utilitarian, and thus… what?
Is the point that your claim (that saving 10^100 shrimp instead of one human isn’t insane) was actually not a moral claim at all, but some other kind of claim (prudential, for instance)? No, that doesn’t seem to work either, because you wrote:
No, being extremely, overwhelmingly confident about morality, such that even when given a choice to drastically alter 99.999999999999999999999% of the matter in the universe, you call the side of not destroying it “insane” just because you don’t want to give up a single human life (a thing we do routinely for much weaker considerations), is insane.
So clearly this is about morality…
… yeah, I can’t make any sense of what you’re saying here. What am I missing?
… 1,000 times the moral value of a million shrimp?
… 10 times the moral value of a million shrimp?
… 1.1 times the moral value of a million shrimp?
… some other multiplicative factor, larger than 1, times the moral value of a million shrimp?
I don’t know, seems like a very hard question, and I think will be quite sensitive to a bunch of details of the exact comparison. Like, how much cognitive diversity is there among the shrimp? Are the shrimps forming families and complicated social structures, or are they all in an isolated grid? Are they providing value to an extended ecosystem of other life? How rich is the life of these specific shrimp?
I would be surprised if the answer basically ever turned out to be less than 1.1, and surprised if it ever turned out to be more than 10,000.
But then… my response stands, except with the relevant number changed.
I don’t think your response said anything except to claim that a linear relationship between shrimp and values seems to quickly lead to absurd conclusions (or at least that is what I inferred from your claim of saying that a trillion shrimp is not a million times more valuable than a million shrimp). I agree with that as a valid reductio ad absurdum, but given that I see no need for linearity here (simply any ratio, which could even differ with the scale and details of the scenario), I don’t see how your response stands.
… yeah, I can’t make any sense of what you’re saying here. What am I missing?
I have little to go off of besides to repeat myself, as you have given me little to work with besides repeated insistence that what I believe is wrong or absurd. My guess is my meaning is more clear (though probably still far from perfectly clear) to other readers.
I don’t know, seems like a very hard question, and I think will be quite sensitive to a bunch of details of the exact comparison. Like, how much cognitive diversity is there among the shrimp? Are the shrimps forming families and complicated social structures, or are they all in an isolated grid? Are they providing value to an extended ecosystem of other life? How rich is the life of these specific shrimp?
I mean… we know the answers to these questions, right? Like… shrimp are not some sort of… un-studied exotic form of life. (In any case it’s a moot point, see below.)
I would be surprised if the answer basically ever turned out to be less than 1.1, and surprised if it ever turned out to be more than 10,000.
Right, so, “some … multiplicative factor, larger than 1”. That’s what I assumed. Whether that factor is 1 million, or 1.1, really doesn’t make any difference to what I wrote earlier.
I don’t think your response said anything except to claim that a linear relationship between shrimp and values seems to quickly lead to absurd conclusions (or at least that is what I inferred from your claim of saying that a trillion shrimp is not a million times more valuable than a million shrimp). I agree with that as a valid reductio ad absurdum, but given that I see no need for linearity here (simply any ratio, which could even differ with the scale and details of the scenario), I don’t see how your response stands.
No, my point is that any factor at all that is larger than 1, and remains larger than 1 as numbers increase, leads to absurd conclusions. (Like, for example, the conclusion that there is some number of shrimp such that that many shrimp are worth more than a human life.)
Given this correction, do you still think that I’m strawmanning or misunderstanding your views…? (I repeat that linearity is not the target of my objection!)
No, my point is that any factor at all that is larger than 1, and remains larger than 1 as numbers increase
I mean, clearly you agree that two shrimp are more important than one shrimp, and that this continues to hold (at least for a while) as the numbers increase. So no, I don’t understand what you are saying, as nothing you have said appears sensitive to any numbers being different, and clearly for small numbers you agree that these comparisons must hold.
I agree there is a number big enough where eventually you approach 1; nothing I have said contradicts that. As in, my guess is the series of the value of shrimp as n goes to infinity does not diverge but eventually converges on some finite number (though especially with considerations like Boltzmann brains and quantum uncertainty and matter/energy density, it does seem confusing to think about).
It seems quite likely to me that this point of convergence is above the value of a human life, as numbers can really get very big, there are a lot of humans, and shrimp are all things considered pretty cool and interesting and a lot of shrimp seem like they would give rise to a lot of stuff.
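One toy functional form (mine, purely illustrative, not something either commenter has committed to) that captures “each additional shrimp adds something, but the total converges”:

$$V(n) = V_{\max}\left(1 - e^{-n/k}\right),$$

where the marginal value $V'(n) = (V_{\max}/k)\,e^{-n/k}$ is always positive but shrinks toward zero, and $V(n) \to V_{\max}$ as $n \to \infty$. On this picture the substantive disagreement isn’t about linearity at all; it’s about whether $V_{\max}$ sits above or below the value of a human life.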
I mean, clearly you agree that two shrimp are more important than one shrimp
Hm… no, I don’t think so. Enough shrimp to ensure that there keep being shrimp—that’s worth more than one shrimp. Less shrimp than that, though—nah.
I agree there is a number big enough where eventually you approach 1, nothing I have said contradicts that. As in, my guess is the series of the value of shrimp as n goes to infinity does not diverge but eventually converge on some finite number, though it does feel kind of confusing to think about.
Sure, this is all fine (and nothing that I have said contradicts you believing this; it seems like you took my objection to be much narrower than it actually was), but you’re saying that this number is much larger than the value of a human life. That’s the thing that I’m objecting to.
I’ll mostly bow out at this point, but one quick clarification:
but you’re saying that this number is much larger than the value of a human life
I didn’t say “much larger”! Like, IDK, my guess is there is some number of shrimp for which it’s worth sacrificing a thousand humans, which is larger, but not necessarily “much”.
My guess is there is no number, at least in the least convenient world where we are not talking about shrimp galaxies forming alternative life forms, for which it’s worth sacrificing 10 million humans, at least at current population levels and on the current human trajectory.
10 million is just a lot, and humanity has a lot of shit to deal with, and while I think it would be an atrocity to destroy this shrimp-gigaverse, it would also be an atrocity to kill 10 million people, especially intentionally.
Alright, so I’ve been following the latest OpenAI Twitter freakout, and here’s some urgent information about the latest closed-doors developments that I’ve managed to piece together:
Following OpenAI Twitter freakouts is a colossal, utterly pointless waste of your time and you shouldn’t do it ever.
If you saw this comment of Gwern’s going around and were incredibly alarmed, you should probably undo the associated update regarding AI timelines (at least partially, see below).
OpenAI may be running some galaxy-brained psyops nowadays.
Here’s the sequence of events, as far as I can tell:
Some Twitter accounts that are (claiming, without proof, to be?) associated with OpenAI are being very hype about some internal OpenAI developments.
Gwern posts this comment suggesting an explanation for point 1.
Several accounts (e. g., one, two) claiming (without proof) to be OpenAI insiders start to imply that:
An AI model recently finished training.
Its capabilities surprised and scared OpenAI researchers.
It produced some innovation/is related to OpenAI’s “Level 4: Innovators” stage of AGI development.
Gwern’s comment goes viral on Twitter (example).
A news story about GPT-4b micro comes out, indeed confirming a novel OpenAI-produced innovation in biotech. (But it is not actually an “innovator AI”.)
The stories told by the accounts above start to mention that the new breakthrough is similar to GPT-4b: that it’s some AI model that produced an innovation in “health and longevity”. But also, that it’s broader than GPT-4b, and that the full breadth of this new model’s surprising emergent capabilities is unclear. (One, two, three.)
Noam Brown, an actual confirmed OpenAI researcher, complains about “vague AI hype on social media”, and states they haven’t yet actually achieved superintelligence.
The Axios story comes out, implying that OpenAI has developed “PhD-level superagents” and that Sam Altman is going to brief Trump on them. Of note:
Axios is partnered with OpenAI.
If you put on Bounded Distrust lens, you can see that the “PhD-level superagents” claim is entirely divorced from any actual statements made by OpenAI people. The article ties-in a Mark Zuckerberg quote instead, etc. Overall, the article weaves the impression it wants to create out of vibes (with which it’s free to lie) and not concrete factual statements.
The “OpenAI insiders” gradually ramp up the intensity of their story all the while, suggesting that the new breakthrough would allow ASI in “weeks, not years”, and also that OpenAI won’t release this “o4-alpha” until 2026 because they have a years-long Master Plan, et cetera. Example, example.
Sam Altman complains about “twitter hype” being “out of control again”.
OpenAI hype accounts deflate.
What the hell was all that?
First, let’s dispel any notion that the hype accounts are actual OpenAI insiders who know what they are talking about:
“Satoshi” claims to be blackmailing OpenAI higher-ups in order to be allowed to shitpost classified information on Twitter. I am a bit skeptical of this claim, to put it mildly.
“Riley Coyote” has a different backstory which is about as convincing by itself, and which also suggests that “Satoshi” is “Riley”’s actual source.
As far as I can tell digging into the timeline, both accounts just started acting as if they are OpenAI associates posting leaks. Not even, like, saying that they’re OpenAI associates posting leaks, much less proving that. Just starting to act as if they’re OpenAI associates and that everyone knows this. Their tweets then went viral. (There’s also the strawberry guy, who also implies to be an OpenAI insider, who also joined in on the above hype-posting, and who seems to have been playing this same game for a year now. But I’m tired of looking up the links, and the contents are intensely unpleasant. Go dig through that account yourself if you want.)
In addition, none of the OpenAI employee accounts with real names that I’ve been able to find have been participating in this hype cycle. So if OpenAI allowed its employees to talk about what happened/is happening, why weren’t any confirmed-identity accounts talking about it (except Noam’s, deflating it)? Why only the anonymous Twitter people?
Well, because this isn’t real.
That said, the timing is a bit suspect. This hype starting up, followed by the GPT-4b micro release and the Axios piece, all in the span of ~3 days? And the hype men’s claims at least partially predicting the GPT-4b micro thing?
There are three possibilities:
A coincidence. (The predictions weren’t very precise, just “innovators are coming”. The details about health-and-longevity and the innovative output got added after the GPT-4b piece, as far as I can tell.)
A leak in one of the newspapers working on the GPT-4b story (which the grifters then built a false narrative around).
Coordinated action by OpenAI.
One notable point is, the Axios story was surely coordinated with OpenAI, and it’s both full of shenanigans and references the Twitter hype (“several OpenAI staff have been telling friends they are both jazzed and spooked by recent progress”). So OpenAI was doing shenanigans. So I’m slightly inclined to believe it was all an OpenAI-orchestrated psyop.
Let’s examine this possibility.
Regarding the truth value of the claims: I think nothing has happened, even if the people involved are OpenAI-affiliated (in a different sense from how they claim). Maybe there was some slight unexpected breakthrough on an obscure research direction, at most, to lend an air of technical truth to those claims. But I think it’s all smoke and mirrors.
However, the psyop itself (if it were one) has been mildly effective. I think tons of people actually ended up believing that something might be happening (e. g., janus, the AI Notkilleveryoneism Memes guy, myself for a bit, maybe gwern, if his comment referenced the pattern of posting related to the early stages of this same event).
That said, as Eliezer points out here, it’s advantageous for OpenAI to be crying wolf: both to drive up/maintain hype among their allies, and to frog-boil the skeptics into instinctively dismissing any alarming claims. Such that, say, if there ever are actual whistleblowers pseudonymously freaking out about unexpected breakthroughs on Twitter, nobody believes them.
That said, I can’t help but think that if OpenAI were actually secure in their position and making insane progress, they would not have needed to do any of this stuff. If you’re closing your fingers around agents capable of displacing the workforce en masse, if you see a straight shot to AGI, why engage in this childishness? (Again, if Satoshi and Riley aren’t just random trolls.)
Bottom line, one of the following seems to be the case:
There’s a new type of guy, which is to AI/OpenAI what shitcoin-shills are to cryptocurrency.
OpenAI is engaging in galaxy-brained media psyops.
Oh, and what’s definitely true is that paying attention to what’s going viral on Twitter is a severe mistake. I’ve committed it for the first and last time.
I also suggest that you unroll the update you might’ve made based on Gwern’s comment. Not the part describing the o-series’ potential – that’s of course plausible and compelling. The part where that potential seems to have already been confirmed and realized according to ostensible OpenAI leaks – because those leaks seem to be fake. (Unless Gwern was talking about some other demographic of OpenAI accounts being euphorically optimistic on Twitter, which I’ve somehow missed?)[1]
(Oh, as to Sam Altman meeting with Trump? Well, that’s probably because Trump’s Sinister Vizier, Sam Altman’s sworn nemesis, Elon Musk, is whispering in Trump’s ear 24/7, suggesting he crush OpenAI, and if Altman doesn’t seduce Trump ASAP, Trump will do that. Especially since OpenAI is currently vulnerable due to their legally dubious for-profit transition.
This planet is a clown show.)
I’m currently interested in:
Arguments for actually taking the AI hype people’s claims seriously. (In particular, were any actual OpenAI employees provably involved, and did I somehow miss them?)
Arguments regarding whether this was an OpenAI psyop vs. some random trolls.
Also, pinging @Zvi in case any of those events showed up on his radar and he plans to cover them in his newsletter.
Also, I can’t help but note that the people passing the comment around (such as this, this) are distorting it. The Gwern-stated claim isn’t that OpenAI are close to superintelligence, it’s that they may feel as if they’re close to superintelligence. Pretty big difference!
Though, again, even that is predicated on actual OpenAI employees posting actual insider information about actual internal developments. Which I am not convinced is a thing that is actually happening.
I personally put a relatively high probability of this being a galaxy brained media psyop by OpenAI/Sam Altman.
Eliezer makes a very good point that confusion around people claiming AI advances/whistleblowing benefits OpenAI significantly, and Sam Altman has a history of making galaxy-brained political plays (attempting to get Helen fired (and then winning), testifying to Congress that it is good that he has oversight via the board and that he should not be in full control of OpenAI, and then replacing the board with underlings, etc.).
Sam is very smart and politically capable. This feels in character.
It all started with Sam’s six-word story. So it looks like organized hype.
Thanks for doing this so I didn’t have to! Hell is other people—on social media. And it’s an immense time-sink.
Zvi is the man for saving the rest of us vast amounts of time and sanity.
I’d guess the psyop spun out of control with a couple of opportunistic posters pretending they had inside information, and that’s why Sam had to say to lower your expectations 100x. I’m sure he wants hype, but he doesn’t want high expectations that are very quickly falsified. That would lead to some very negative stories about OpenAI’s prospects; even if they’re equally silly, they’d harm investment hype.
There’s a possibility that this was a clown attack on OpenAI instead...
Thanks for the sleuthing.
The thing is—last time I heard about OpenAI rumors it was Strawberry.
The unfortunate fact of life is that too many times OpenAI shipping has surpassed all but the wildest speculations.
That was part of my reasoning as well, why I thought it might be worth engaging with!
But I don’t think this is the same case. Strawberry/Q* was being leaked-about from more reputable sources, and it was concurrent with dramatic events (the coup) that were definitely happening.
In this case, all evidence we have is these 2-3 accounts shitposting.
Thanks.
Well 2-3 shitposters and one gwern.
Who would be so foolish to short gwern? Gwern the farsighted, gwern the prophet, gwern for whom entropy is nought, gwern augurious augustus
I feel like for the same reasons, this shortform is kind of an engaging waste of my time. One reason I read LessWrong is to avoid twitter garbage.
Valid, I was split on whether it’s worth posting vs. it’d be just me taking my part in spreading this nonsense. But it’d seemed to me that a lot of people, including LW regulars, might’ve been fooled, so I erred on the side of posting.
I don’t think any of that invalidates that Gwern is, as usual, usually right.
As I’d said, I think he’s right about the o-series’ theoretic potential. I don’t think there is, as of yet, any actual indication that this potential has already been harnessed, and therefore that it works as well as the theory predicts. (And of course, the o-series scaling quickly at math is probably not even an omnicide threat. There’s an argument for why it might be – that the performance boost will transfer to arbitrary domains – but that doesn’t seem to be happening. I guess we’ll see once o3 is public.)
I think superhuman AI is inherently very easy. I can’t comment on the reliability of those accounts. But the technical claims seem plausible.
I am not an AI successionist because I don’t want myself and my friends to die.
There are various high-minded arguments that AIs replacing us is okay because it’s just like cultural change and our history is already full of those, or because they will be our “mind children”, or because they will be these numinous enlightened beings and it is our moral duty to give birth to them.
People then try to refute those by nitpicking which kinds of cultural change are okay or not, or to what extent AIs’ minds will be descended from ours, or whether AIs will necessarily have consciousnesses and feel happiness.
And it’s very cool and all, I’d love me some transcendental cultural change and numinous mind-children. But all those concerns are decidedly dominated by “not dying” in my Maslow hierarchy of needs. Call me small-minded.
If I were born in the 1700s, I’d have little recourse but to suck it up and be content with biological children or “mind-children” students or something. But we seem to have an actual shot at not-dying here[1]. If it’s an option to not have to be forcibly “succeeded” by anything, I care quite a lot about trying to take this option.[2]
Many other people also have such preferences: for the self-perpetuation of their current selves and their currently existing friends. I think those are perfectly valid. Sure, they’re displeasingly asymmetric in a certain sense. They introduce a privileged reference frame: a currently existing human values concurrently existing people more than people who are just as real, but slightly temporally displaced. It’s not very elegant, not very aesthetically pleasing. It implies a utility function that cares not only about states, but also about state transitions.[3]
Caring about all that, however, is also decidedly dominated by “not dying” in my Maslow hierarchy of needs.
If all that delays the arrival of numinous enlightened beings, too bad for the numinous enlightened beings.
Via attaining longevity escape velocity through normal biotech research, or via uploads, or via sufficiently good cryonics, or via properly aligned AGI.
Though not infinitely so: as in, I wouldn’t prevent 10^100 future people from being born in exchange for a 10^−100 probability of becoming immortal. I would, however, insist on continuing to exist even if my resources could be used to create and sustain two new people.
As in, all universe-state transitions that involve a currently existing person dying get a utility penalty, regardless of what universe-state they go to. There’s now path dependence: we may go or not go to a given high-utility state depending on which direction we’re approaching it from. Yucky!
(For example, suppose there were an option to destroy this universe and create either Universe A, filled with 10^100 happy people, or Universe B, with 10^100 + 1 happy people.
Suppose we’re starting from a state where humanity has been reduced to ten dying survivors in a post-apocalyptic wasteland. Then picking Universe B makes sense: a state with slightly more total utility.
But suppose we’re starting from Universe A instead. Ought its civilization vote to end itself to give birth to Universe B? I think it’s perfectly righteous for them not to do it.)
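(A minimal way to write down the footnoted transition penalty, in my own notation rather than anything from the post: score a transition as

$$U(s \to s') = U(s') - \lambda\, D(s \to s'),$$

where $D$ counts currently existing people who die in the transition and $\lambda > 0$. Starting from the ten-survivor wasteland, $D$ is negligible either way, so Universe B’s $10^{100}+1$ beats Universe A’s $10^{100}$. Starting from Universe A, moving to B scores $(10^{100}+1) - \lambda \cdot 10^{100}$, which loses to staying put whenever $\lambda > 10^{-100}$. That’s the path dependence described above.)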
I really don’t understand this debate—surely if we manage to stay in control of our own destiny we can just do both? The universe is big, and current humans are very small—we should be able to both stay alive ourselves and usher in an era of crazy enlightened beings doing crazy transhuman stuff.
I think it’s more likely than not that “crazy enlightened beings doing crazy transhuman stuff” will be bad for “regular” biological humans (ie. it’ll decrease our number/QoL/agency/pose existential risks).
I mostly disagree with “QoL” and “pose existential risks”, at least in the good futures I’m imagining—those things are very cheap to provide to current humans. I could see “number” and “agency”, but that seems fine? I think it would be bad for any current humans to die, or to lose agency over their current lives, but it seems fine and good for us to not try to fill the entire universe with biological humans, and for us to not insist on biological humans having agency over the entire universe. If there are lots of other sentient beings in existence with their own preferences and values, then it makes sense that they should have their own resources and have agency over themselves rather than us having agency over them.
Perhaps yes (although I’d say it depends on what the trade-offs are) but the situation is different if we have a choice in whether or not to bring said sentient beings with difference preferences into existence in the first place. Doing so on purpose seems pretty risky to me (as opposed to minimizing the sentience, independence, and agency of AI systems as much as possible, and instead directing the technology to promote “regular” human flourishing/our current values).
Not any more risky than bringing in humans. This is a governance/power distribution problem, not a what-kind-of-mind-this-is problem.
Biological humans sometimes go evil or crazy. If you have a system that can handle that, you have a system that can handle alien minds that are evil or crazy (from our perspective), as long as you don’t imbue them with more power than this system can deal with (and why would you?).
(On the other hand, if your system can’t deal with crazy evil biological humans, it’s probably already a lawless wild-west hellhole, so bringing in some aliens won’t exacerbate the problem much.)
Humans are more likely to be aligned with humanity as a whole compared to AIs, even if there are exceptions
Many existing humans want their descendants to exist, so they are fulfilling the preferences of today’s humans.
“AIs as trained by DL today” are only a small subset of “non-human minds”. Other mind-generating processes can produce minds that are as safe to have around as humans, but which are still completely alien.
Many existing humans also want fascinating novel alien minds to exist.
Certainly I’m excited about promoting “regular” human flourishing, though it seems overly limited to focus only on that.
I’m not sure if by “regular” you mean only biological, but at least the simplest argument that I find persuasive here against only ever having biological humans is just a resource utilization argument, which is that biological humans take up a lot of space and a lot of resources and you can get the same thing much more cheaply if you bring into existence lots of simulated humans instead (certainly I agree that doesn’t imply we should kill existing humans and replace them with simulations, though, unless they consent to that).
And even if you included simulated humans in “regular” humans, I also value diversity of experience, and a universe full of very different sorts of sentient/conscious lifeforms having satisfied/fulfilling/flourishing experiences seems better than just “regular” humans.
I also separately don’t buy that it’s riskier to build AIs that are sentient—in fact, I think it’s probably better to build AIs that are moral patients than AIs that are not moral patients.
IMO, it seems bad to intentionally try to build AIs which are moral patients until after we’ve resolved acute risks and we’re deciding what to do with the future longer term. (E.g., don’t try to build moral patient AIs until we’re sending out space probes or deciding what to do with space probes.) Of course, this doesn’t mean we’ll actually avoid building AIs which are significant moral patients in practice, because our control is very weak and commercial/power incentives will likely dominate.
I think trying to make AIs be moral patients earlier pretty clearly increases AI takeover risk and seems morally bad. (Views focused on non-person-affecting upside get dominated by the long run future, so these views don’t care about making moral patient AIs which have good lives in the short run. I think the most plausible views which care about shorter run patienthood mostly just want to avoid downside so they’d prefer no patienthood at all for now.)
The only upside is that it might increase value conditional on AI takeover. But, I think “are the AIs morally valuable themselves” is much less important than the preferences of these AIs from the perspective of longer run value conditional on AI takeover. So, I think it’s better to focus on AIs which we’d expect would have better preferences conditional on takeover and making AIs moral patients isn’t a particularly nice way to achieve this. Additionally, I don’t think we should put much weight on “try to ensure the preferences of AIs which were so misaligned they took over” because conditional on takeover we must have had very little control over preferences in practice.
How so? Seems basically orthogonal to me? And to the extent that it does matter for takeover risk, I’d expect the sorts of interventions that make it more likely that AIs are moral patients to also make it more likely that they’re aligned.
Even absent AI takeover, I’m quite worried about lock-in. I think we could easily lock in AIs that are or are not moral patients and have little ability to revisit that decision later, and I think it would be better to lock in AIs that are moral patients if we have to lock something in, since that opens up the possibility for the AIs to live good lives in the future.
I agree that seems like the more important highest-order bit, but it’s not an argument that making AIs moral patients is bad, just that it’s not the most important thing to focus on (which I agree with).
I would have guessed that “making AIs be moral patients” looks like “make AIs have their own independent preferences/objectives which we intentionally don’t control precisely” which increases misalignment risks.
At a more basic level, if AIs are moral patients, then there will be downsides for various safety measures and AIs would have plausible deniability for being opposed to safety measures. IMO, the right response to the AI taking a stand against your safety measures for AI welfare reasons is “Oh shit, either this AI is misaligned or it has welfare. Either way this isn’t what we wanted and needs to be addressed, we should train our AI differently to avoid this.”
I don’t understand; won’t all the value come from minds intentionally created for value, rather than from the minds of the laborers? Also, won’t the architecture and design of AIs radically shift after humans aren’t running day-to-day operations?
I don’t understand the type of lock-in you’re imagining, but it naively sounds like a world which has negligible longtermist value (because we got locked into obscure specifics like this), so making it somewhat better isn’t important.
Interesting! Aside from the implications for human agency/power, this seems worse because of the risk of AI suffering—if we build sentient AIs we need to be way more careful about how we treat/use them.
Exactly. Bringing a new kind of moral patient into existence is a moral hazard, because once they exist, we will have obligations toward them, e.g. providing them with limited resources (like land), and giving them part of our political power via voting rights. That’s analogous to Parfit’s Mere Addition Paradox that leads to the repugnant conclusion, in this case human marginalization.
(How could “land” possibly be a limited resource, especially in the context of future AIs? The world doesn’t exist solely on the immutable surface of Earth...)
I mean, if you interpret “land” in a Georgist sense, as the sum of all natural resources of the reachable universe, then yes, it’s finite. And the fights for carving up that pie can start long before our grabby-alien hands have seized all of it. (The property rights to the Andromeda Galaxy can be up for sale long before our Von Neumann probes reach it.)
The salient referent is compute, sure, my point is that it’s startling to see what should in this context be compute within the future lightcone being (very indirectly) called “land”. (I do understand that this was meant as an example clarifying the meaning of “limited resources”, and so it makes perfect sense when decontextualized. It’s just not an example that fits that well when considered within this particular context.)
(I’m guessing the physical world is unlikely to matter in the long run other than as substrate for implementing compute. For that reason importance of understanding the physical world, for normative or philosophical reasons, seems limited. It’s more important how ethics and decision theory work for abstract computations, the meaningful content of the contingent physical computronium.)
A population of AI agents could marginalize humans significantly before they are intelligent enough to easily (and quickly!) create more Earths.
For me, a crux of a future that’s good for humanity is giving biological humans the resources and the freedom to become the enlightened transhuman beings themselves, with no hard ceiling on relevance in the long run. Rather than only letting some originally-human beings grow into more powerful but still purely ornamental roles, or not letting them grow at all, or not letting them think faster and do checkpointing and multiple instantiations of their mind states on a non-biological cognitive substrate, or letting them unwillingly die of old age or disease. (For those who so choose, under their own direction rather than only through externally imposed uplifting protocols, even if that leaves reaching a sensible outcome no more straightforward than achieving world-class success of some kind is today.)
This in particular implies reasonable resources being left to those who remain/become regular biological humans (or take their time growing up), including through influence of some of these originally-human beings who happen to consider that a good thing to ensure.
Edit: Expanded into a post.
This sounds like a question which can be addressed after we figure out how to avoid extinction.
I do note that you were the one who brought in “biological humans,” as if that meant the same as “ourselves” in the grandparent. That could already be a serious disagreement, in some other world where it mattered.
The mere fear that the entire human race will be exterminated in their sleep through some intricate causality we are too dumb to understand will seriously diminish our quality of life.
I very much agree. The hardcore successionist stances, as I understand them, are either that trying to stay in control at all is immoral/unnatural, or that creating the enlightened beings ASAP matters much more than whether we live through their creation. (Edit: This old tweet by Andrew Critch is still a good summary, I think.)
So it’s not that they’re opposed to the current humanity’s continuation, but that it matters very little compared to ushering in the post-Singularity state. Therefore, anything that risks or delays the Singularity in exchange for boosting the current humans’ safety is opposed.
Another stance is that it would suck to die the day before AI makes us immortal (this is, e.g., Bryan Johnson’s main motivation for maximizing his lifespan). Hence trying to delay AI advancement is opposed.
Yeah, but that’s a predictive disagreement between our camps (whether the current-paradigm AI is controllable), not a values disagreement. I would agree that if we find a plan that robustly outputs an aligned AGI, we should floor it in that direction.
Endorsing successionism might be strongly correlated with expecting the “mind children” to keep humans around, even if in a purely ornamental role and possibly only at human timescales. This might be more of a bailey position, so when pressed on it they might affirm that their endorsement of successionism is compatible with human extinction, but in their heart they would still hope and expect that it won’t come to that. So I think complaints about human extinction will feel strawmannish to most successionists.
I’m not so sure about that:
Though sure, Critch’s process there isn’t white-boxed, so any number of biases might be in it.
“Successionism” is such a bizarre position that I’d look for the underlying generator rather than try to argue with it directly.
I’m not sure it’s that bizarre. It’s anti-Humanist, for sure, in the sense that it doesn’t focus on the welfare/empowerment/etc. of humans (either existing or future) as its end goal. But that doesn’t, by itself, make it bizarre.
From Eliezer’s Raised in Technophilia, back in the day:
From A prodigy of refutation:
From the famous Musk/Larry Page breakup:
Successionism is the natural consequence of an affective death spiral around technological development and anti-chauvinism. It’s as simple as that.
Successionists start off by believing that technological change makes things better. That not only does it virtually always make things better, but that it’s pretty much the only thing that ever makes things better. Everything else, whether it’s values, education, social organization etc., pales in comparison to technological improvements in terms of how they affect the world; they are mere short-term blips that cannot change the inevitable long-run trend of positive change.
At the same time, they are raised, taught, incentivized to be anti-chauvinist. They learn, either through stories, public pronouncements, in-person social events etc., that those who stand athwart history yelling stop are always close-minded bigots who want to prevent new classes of beings (people, at first; then AIs, afterwards) from receiving the moral personhood they deserve. In their eyes, being afraid of AIs taking over is like being afraid of The Great Replacement if you’re white and racist. You’re just a regressive chauvinist desperately clinging to a discriminatory worldview in the face of an unstoppable tide of change that will liberate new classes of beings from your anachronistic and damaging worldview.
Optimism about technology and opposition to chauvinism are both defensible, and arguably even correct, positions in most cases. Even if you personally (as I do) believe non-AI technology can also have pretty darn awful effects on us (social media, online gambling) and that caring about humans-in-particular is ok if you are human (“the utility function is not up for grabs”), it’s hard to argue expanding the circle of moral concern to cover people of all races was bad, or that tech improvements are not the primary reason our lives are so much better now than 300 years ago.
But successionists, like most (all?) people, subconsciously assign positive or negative valences to the notion of “tech change” in a way that elides the underlying reasons why it’s good or bad. So when you take these views to their absolute extreme, while it may make sense from the inside (you’re maximizing something “Good”, right? that can’t possibly be bad, right???), you are generalizing way out of distribution and such intuitive snap judgments are no longer reliable.
An AI successionist usually argues that successionism isn’t bad even if dying is bad. For example, when humanity is prevented from having further children, e.g. by sterilization. I say that even in this case successionism is bad. Because I (and I presume: many people) want humanity, including our descendants, to continue into the future. I don’t care about AI agents coming into existence and increasingly marginalizing humanity.
Just finished If Anyone Builds It, Everyone Dies (and some of the supplements).[1] It feels… weaker than I’d hoped. Specifically, I think Part 3 is strong, and the supplemental materials are quite thorough, but Parts 1-2… I hope I’m wrong, and this opinion is counterweighed by all these endorsements and MIRI presumably running it by lots of test readers. But I’m more bearish on it making a huge impact than I was before reading it.
Point 1: The rhetoric – the arguments and their presentations – is often not novel, just rehearsed variations on the arguments Eliezer/MIRI already deployed. This is not necessarily a problem, if those arguments were already shaped into their optimal form, and I do like this form… But I note those arguments have so far failed to go viral. Would repackaging them into a book, and deploying it in our post-ChatGPT present, be enough? Well, I hope so.
Point 2: I found Chapter 2 in particular somewhat poorly written in how it explains the technical details.
Specifically, those explanations often occupy that unfortunate middle ground between “informal gloss” and “correct technical description” where I’d guess they’re impenetrable both to non-technical readers and to technical readers unfamiliar with the subject matter.
An example that seems particularly egregious to me:
How does that conclusion follow? If a base model can only regurgitate human utterances, how does generating sixteen utterances and then reinforcing some of them lead to it… not regurgitating human utterances? This explanation is clearly incomplete. My model of a nonexpert technical-minded reader, who is actually tracking the gears the book introduces, definitely notices that and is confused.
The explanation of base models’ training at the start of the chapter feels flawed in the same way. E. g.:
My model of a technical-minded reader is confused about how that whole thing is supposed to work. It sounds like AI developers manually pick billions of operations? What? The technical reader would’ve rather you just mentioned operations on matrices. This might’ve required more effort to understand, but understanding would’ve been at all possible.
My model of a nontechnical reader just has their eyes glaze over when reading this description. It uses simple words, but it clearly gestures at some messy complicated thing, and doesn’t actually conceptualize it in a simple-to-understand way. (“It” being “the neural substrate”, and how it can both be unreadable to us yet encode useful computations.)
And then this explanation is used to build the definition of gradient descent; and then this term is used all throughout the rest of the book to make arguments for various things. My guess is that this explanation is not sufficient to make readers feel like they grok the concept; on the contrary, it’s likely to make them earmark the term as “don’t really get this one”. This would then, subtly or not, poison every argument where this term reoccurs.
Or maybe I’m unfairly nitpicking. Again, I think MIRI ran it by many test readers, so presumably this actually does work fine in practice? But this is what my eyes are telling me.
Point 3: Part 2, the fictional story. It’s kind of… eh. The stated purpose is to help “make abstract considerations feel more real”, but does it actually accomplish that? It’s written in a pretty abstract way. Its narrative is mixed with technical discussion. It takes the AI’s perspective, and doesn’t view things from the world’s perspective much, so it doesn’t create a visceral sense of something you would see happening around you. It involves some honestly quite convoluted master-plan scenarios with multicancer pandemics.
Does the story actually serve to reinforce the risk’s plausibility? Maybe, but I wouldn’t have guessed so.
Point 4: The question of “but how does the AI kill us?” is probably at the forefront of many people’s minds, especially once the basics of “it would want to kill us” are established, but the book takes its sweet time getting there. And I don’t think Chapter 6 is doing a stellar job either. It meanders around the point so much:
It starts with an analogy...
… then it builds some abstract scaffolding about why ASI defeating humanity should be an easy call to make, even if we can’t predict how...
… then it vaguely alludes to weird powers the ASI may develop, still without quite spelling out how these powers would enable a humanity-killing plan...
… then it drops a bunch of concrete examples of weird hacks you can pull off with technology, still without describing how it may all come together to enable an omnicide...
… then it seems to focus on how much you can improve on biology/how easy it is to solve...
… then it does some more abstract discussion...
… and then the chapter ends...
(the whole thing kind of feels like this gif)
… and then we get to Part 2, the fictional story. In which the described concrete scenario is IMO quite convoluted and implausible-sounding, plus see all my other complaints about it in the previous point.
I think Part 3 is strong.[2] I think a solid chunk of Part 1 is strong as well. The online supplements seem great, and I like the format choices there (title-questions followed by quick subtitle answers). But Chapter 2, Chapter 6, and Part 2 seem like weak points. Which is unfortunate, since they’re both the parts where the object-level case is laid out, and the early parts which decide whether a reader would keep reading or not.
Or so my impression goes.
I binged it starting from the minute it became available, because I heard those reports about MIRI employees getting something new from it regarding the alignment problem, and wondered if it would enhance my own understanding as well, or perhaps upturn it and destroy my hopes about my own current research agenda. But no, unfortunately/thankfully there was nothing new for me.
It also featured detailed descriptions of various engineering challenges/errors and the distillations of lessons from them (Chapter 10 and this supplement), which was the most interesting and useful part of the book for me personally.
In general, I felt like the beginning was a bit weak, with the informal-technical discussion the weakest part, and then it got substantially stronger from there.
I worry that I particularly enjoy the kind of writing they do, but we’ve already tapped the market of folks like me. Like, I worked at MIRI and now moderate LessWrong because I was convinced by the Sequences. So that’s a pretty strong selection filter for liking their writing. Of course we should caveat my experience quite a bit given that.
But, for what it’s worth, I thought Part 2 was great. Stories make things seem real, and my reader-model was relatively able to grant the plot beats as possible. I thought they did a good job of explaining that while there were many options the AI could take, and they, the authors, might well not understand why a given approach would work out or not, it wasn’t obvious that that would generalise to all the AI’s plans not working.
The other thing I really liked: they would occasionally explain some science to expand on their point (nuclear physics is the example they expounded on at length, but IIRC they mentioned a bunch of other bits of science in passing). I’m not sure why I liked this so much. Perhaps it was because it was grounding, or reminded me not to throw my mind away, or made me trust them a little more. Again, I’m really not sure how well this generalises to people for whom their previous writing hasn’t worked.
Yup, hence my not being excited to see the usual rhetoric being rehearsed, instead of something novel.
Yup. Chapter 10 is my favorite.
(I haven’t read IABIED.) I saw your take right after reading Buck’s, so it’s interesting how his reaction was diametrically opposite yours: “I think the first two parts of the book are the best available explanation of the basic case for AI misalignment risk for a general audience. I thought the last part was pretty bad, and probably recommend skipping it.”
FWIW, and obviously this is just one anecdote, but a member of Congress who read an early copy, and really enjoyed it, said that Chapter 2 was his favorite chapter.
Hopefully it’s just my personal pet peeves, then!
¯\_(ツ)_/¯
I listened to parts of it and found it to be bad, so no, it’s not just you. However, if you’re looking for things to upset your understanding of alignment, some typical fallacies include:
That gradient descent “reinforces behavior” as opposed to minimizing error in the gradient, which is a different thing.
Thinking that a human representation of human values is sufficient (e.g. an upload), when actually you need to generalize human values out of distribution.
Focusing on stuff that is not only not deep learning shaped (already a huge huge sin in my book but some reasonable people disagree) but not shaped like any AI system that has ever worked ever. In general if you’re not reading ArXiv your stuff probably sucks.
If you tell me more about your AI alignment ideas I can probably get more specific.
I think an upload does generalize human values out of distribution. After all, humans generalize our values out of distribution. A perfect upload acts like a human. Insofar as it generalizes improperly, it’s because it was not a faithful upload, which is a problem with the uploading process, not the idea of using an upload to generalize human values.
I don’t think humans generalize their values out of distribution. This is very obvious if you look at their reaction to new things like the phonograph, where they’re horrified and then it’s slowly normalized. Or the classic thing about how every generation thinks the new generation is corrupt and declining:
“Schools of Hellas: an Essay on the Practice and Theory of Ancient Greek Education from 600 to 300 BC”, Kenneth John Freeman 1907 (paraphrasing of Hellenic attitudes towards the youth in 600–300 BC)
Humans don’t natively generalize their values out of distribution. Instead they use institutions like courts to resolve uncertainty and export new value interpretations out to the wider society.
How… else… do you expect to generalize human values out of distribution, except to have humans do it?
Humans are not privileged objects in continuing the pattern that is the current set of human values. Unless of course LW has just given up on transhumanism entirely at this point, which wouldn’t surprise me. There are various ways to perform corpus expansion starting from where we are now; EY’s classic CEV proposal (per Google AI overview) extrapolates human values starting from the existing human pattern but does not actually use humans to do it:
Humans very clearly are privileged objects for continuing human values; there is no “giving up on transhumanism”. It’s literally right there in the name! It would be (and is) certainly absurd to suggest otherwise.
As for CEV, note that the quote you have there indeed does privilege the “human” in human values, in the sense that it suggests giving the AI under consideration a pointer to what humans would want if they had perfect knowledge and wisdom.
Stripping away these absurdities (and appeals to authority or in-groupedness), your comment becomes “Well to generalize human values without humans, you could provide an AI with a pointer to humans thinking under ideal conditions about their values”, which is clearly a valid answer, but doesn’t actually support your original point all that much, as this relies on humans having some ability to generalize their values out of distribution.
Nothing I’ve said is absurd. Humans are not born with their values; they are born with latent tendencies towards certain value updates and a set of intrinsic reward signals. But human values, as in the set of value judgements bound to conceptual objects, is a corpus, a pattern, which exists separately from any individual human being, and its generalization exists separately from any individual human being.
And no, really and truly, individual humans do not generalize a fixed training distribution arbitrarily far; what they (presumably) do is make iterative updates based on new experiences, which is not actually the same thing as generalizing from a fixed corpus in the way we usually use that phrase in machine learning. Notably, the continuation of human values is a coherent question even if tomorrow everyone decided to become cat people or something. Becoming really aggressive and accusing me of being “absurd” and “appealing to authority” doesn’t change this.
You were appealing to authority, and being absurd (and also appealing to in/out-groupness). I feel satisfied getting a bit aggressive when people do that. I agree that style doesn’t have any bearing on the validity of my argument, but it does discourage that sort of talk.
I’m not certain what you’re arguing for in this latest comment, I definitely don’t think you show here that humans aren’t privileged objects when it comes to human values, nor do you show that your quote by Eliezer recommends any special process more than a pointer to humans thinking about their values in an ideal situation, which were my main two contentions in my original comment.
I don’t think anyone in this conversation argued that humans can generalize from a fixed training distribution arbitrarily far, and I think everyone also agrees that humans think about morality by making iterative, small, updates to what they already know. But, of course, that does still privilege humans. There could be some consistent pattern to these updates, such that something smarter wouldn’t need to run the same process to know the end-result, but that would be a pattern about humans.
I was not appealing to authority or being absurd (though admittedly the second quality is subjective); it is in fact relevant if we’re arguing about... if you say
This implies, though I did not explicitly argue with the implication, that to generalize human values out of distribution you run a literal human brain or an approximation of a human brain (e.g. a Hansonian Em) to get the updates. What I was pointing out is that CEV – the classic proposal for how to generalize human values out of distribution, and therefore a relevant reference point for what is and is not a reasonable plan (and, as you allude to, considered a reasonable plan by people normally taken to be clearly thinking about this issue) – does not actually call for running a literal emulation of a human brain except perhaps in its initial stages (and even then only if absolutely necessary; Yudkowsky is fairly explicit in the Arbital corpus that FAI should avoid instantiating sapient subprocesses). The entire point is to imagine what the descendants of current-day humanity would do under ideal conditions of self-improvement, a process which, if it’s not to instantiate sapient beings, must in fact not really be based on having humans generalize the values out of distribution.
If this is an absurd thing to imagine, then CEV is absurd, and maybe it is. If pointing this out is an appeal to authority or in-groupness/outgroupness then presumably any argument of the form “actually this is normally how FAI is conceived and therefore not an apriori unreasonable concept” is invalid on such grounds and I’m not really sure how I’m meant to respond to a confused look like that. Perhaps I’m supposed to find the least respectable plan which does not consider literal human mind patterns to be a privileged object (in the sense their cognition is strictly functionally necessary to make valid generalizations from the existing human values corpus) and point at that? But that doesn’t seem very convincing obviously.
“Pointing at anything anyone holds in high regard as evidence about whether an idea is apriori unreasonable is an appeal to authority and in-groupness.” is to be blunt parodic.
I agree it’s an effective way to discourage timid people from saying true or correct things when they disagree with people’s intuitions, which is why the behavior is bad.
What would it look like for a human (/coherently acting human collective) to (“natively”?) generalize their values out of distribution?
To be specific the view I am arguing against goes something like:
Inside a human being is a set of apriori terminal values (as opposed to, say, terminal reward signals which create values within-lifetime based on the environment) which are unfolded during the human’s lifetime. These values generalize to modernity because there is clever machinery in the human which can stretch these values over such a wide array of conceptual objects that modernity does not yet exit the region of validity for the fixed prior. If we could extract this machinery and get it into a machine then we could steer superintelligence with it and alignment would be solved.
I think this is a common view, which is both wrong on its own and actually noncanonical to Yudkowsky’s viewpoint (which I bring up because I figure you might think I’m moving the goalposts, but Bostrom 2014 puts the goalposts around here and Yudkowsky seems to have disagreed with it since at least 2015, so at worst shortly after the book came out but I’m fairly sure before). It is important to be aware of this because if this is your mental model of the alignment problem you will mostly have non-useful thoughts about it.
I think the reality is more like humans have a set of sensory hardware tied to intrinsic reward signals and these reward signals are conceptually shallow, but get used to bootstrap a more complex value ontology that ends up bottoming out in things nobody would actually endorse as their terminal values like “staying warm” or “digesting an appropriate amount of calcium” in the sense that they would like all the rest of eternity to consist of being kept in a womb which provides these things for them.
I don’t think the kind of “native” generalization from a fixed distribution I’m talking about there exists; it’s kind of a phenomenal illusion, because it feels that way from the inside but almost certainly isn’t how it works. Rather, humans generalize their values through institutional processes that collapse uncertainty, e.g. by sampling a judicial ruling; people then update on the ruling, with the new social norms as a platform for further discourse and collapse of uncertainty as novel situations arise.
Or in the case of something like music, which does seem to work from a fixed set of intrinsic value heuristics, the actual kinds of music which get expressed in practice within the space of music rely on the existing corpus of music that people are used to. Supposedly early rock and roll shows caused riots, which seems unimaginable now. What happens is people get used to a certain kind of music, then some musicians begin cultivating a new kind of music on the edge of the existing distribution using their general quality heuristics at the edge of what is recognizable to them. This works because the k-complexity of the heuristics you judge the music with is smaller, and therefore fits more times into a redundant encoding, than actual pieces of music; so as you go out of distribution (functionally similar to applying a noise pass to the representation), your ability to recognize something interesting degrades more slowly than your ability to generate interesting music-shaped things. So you correct the errors to denoise a new kind of music into existence and move the center of the distribution by adding it to the cultural corpus.
(Note: I’ve only read a few pages so far, so perhaps this is already in the background)
I agree that if the parent comment scenario holds then it is a case of the upload being improper.
However, I also disagree that most humans naturally generalize our values out of distribution. I think it is very easy for many humans to get sucked into attractors (ideologies that are simplifications of what they truly want; easy lies; the amount of effort ahead stalling out focus even if the gargantuan task would be worth it) that damage their ability to properly generalize and also importantly apply their values. That is, humans have predictable flaws. Then when you add in self-modification you open up whole new regimes.
My view is that a very important element of our values is that we do not necessarily endorse all of our behaviors!
I think a smart and self-aware human could sidestep and weaken these issues, but I do think they’re still hard problems. Which is why I’m a fan of (if we get uploads) going “Upload, figure out AI alignment, then have the AI think long and hard about it” as that further sidesteps problems of a human staring too long at the sun. That is, I think it is very hard for a human to directly implement something like CEV themselves, but that a designed mind doesn’t necessarily have the same issues.
As an example: power-seeking instinct. I don’t endorse seeking power in that way, especially if uploaded to try to solve alignment for Humanity in general, but given my status as an upload and lots of time realizing that I have a lot of influence over the world, I think it is plausible that instinct affects me more and more. I would try to plan around this but likely do so imperfectly.
Presumably, Bob the perfect upload acts like a human only so long as he remains ignorant of the most important fact about his universe. If Bob knows he’s an upload, his life situation is now out-of-distribution.
I think they’re improved and simplified.
My favorite chapter is “Chapter 5: Its Favorite Things.”
I liked that one as well, I think it does a good job at what it aims to do.
My worry is that “improved and simplified” may not be enough. As kave pointed out, it might be that we’ve already convinced everyone who could be convinced by arguments-shaped-like-this, and that we need to generate arguments shaped in a radically different way to convince new demographics, rather than slightly vary the existing shapes. (And it may be the case that it’s very hard for people-like-us to generate arguments shaped different ways – I’m not quite sure what that’d look like, though I haven’t thought about it much – so doing that would require something nontrivial from us, not just “sit down and try to come up with new essays”.)
Maybe that’s wrong; maybe the issue was lack of reach rather than exhausting the persuadees’ supply, and the book-packaging + timing will succeed massively. We’ll see.
This is certainly the hope. Most people in the world have never read anything that anyone here has ever written on this subject.
In addition to Malo’s comment, I think the book contains arguments that AFAICT are only especially made in the context of the MIRI dialogues, which are particularly obnoxious to read.
In the first sentence, Eliezer and Nate are (explicitly) stating that LLMs can say things that are not just regurgitations of human utterances.
Sure; but the following sections are meant as explanations/justifications of why that is the case. The paragraph I omitted does a good job of explaining why they would need to learn to predict the world at large, not just humans, and would therefore contain more than just human-mimicry algorithms. To reinforce that with the point about reasoning models, one could perhaps explain how that “generate sixteen CoTs, pick the best” training can push LLMs to recruit those hidden algorithms for the purposes of steering rather than just prediction, or even to incrementally develop entirely new skills.
A full explanation of reinforcement learning is probably not worth it (perhaps it was in the additional 200% of the book Eliezer wrote, but I agree it should’ve been aggressively pruned). But as-is, there are just clearly missing pieces here.
Do you happen to have read my beginner-friendly book about AI safety/risk, “Uncontrollable”?
I think a comparison/contrast by someone other than me would be beneficial (although I’ll do one soon)
I really liked “Uncontrollable”, as I felt it did a much better job explaining AI X-risk in layperson-accessible terms than either Bostrom, Russell, or Christian. (I even recommended it to a (Catholic) priest visiting my family, who, when I said that I worked in ~”AI regulation/governance”, revealed his worries about AI taking over the world.[1])
The main shortcoming of “Uncontrollable” was that it (AFAIR) didn’t engage with the possibility of a coordinated pause/slowdown/ban.
Priest: “Oh, interesting, … Tell me, don’t you worry that this AI will take over the world?”
Me: ”… Yes, I do worry.”
Priest: “Yeah, I’ve been thinking about it and it’s terrifying.”
Me: “I agree, it’s terrifying.”
Priest: “I’ve been looking for some book about this but couldn’t find anything sensible.”
Me: “Well, I can recommend a book to you.”
Good to know and I appreciate you sharing that exchange.
You are correct that such a thing is not in there… because (if you’re curious) I thought, strategically, it was better to argue for what is desirable (safe AI innovation) than to argue for a negative (stop it all). Of course, if one makes the requirements for safe AI innovation strong enough, it may result in a slowing or restricting of developments.
On the one hand, yeah, it might.
On the other (IMO bigger) hand, the fewer people talk about the thing explicitly, the less likely it is to be included in the Overton window and the less likely it is to seem like a reasonable/socially acceptable goal to aim for directly.
I don’t think the case for safe nuclear/biotechnology would be less persuasive if paired with “let’s just get rid of nuclear weapons/bioweapons/gain of function research”.
Nope.
Would you like to?
(I could send along an audible credit or a physical copy)
I am somewhat interested now. I’ll aim to look over it and get back to you, but no promises.
Cool. No expectations. Hope you find some value :)
Also, I just posted my review: IABIED Review—An Unfortunate Miss — LessWrong
It seems to me that many disagreements regarding whether the world can be made robust against a superintelligent attack (e. g., the recent exchange here) are downstream of different people taking on a mathematician’s vs. a hacker’s mindset.
Quoting Gwern:
Imagine the world as a multi-level abstract structure, with different systems (biological cells, human minds, governments, cybersecurity systems, etc.) implemented on different abstraction layers.
If you look at it through a mathematician’s lens, you consider each abstraction layer approximately robust. Making things secure, then, is mostly about working within each abstraction layer, building systems that are secure under the assumptions of a given abstraction layer’s validity. You write provably secure code, you educate people to resist psychological manipulations, you inoculate them against viral bioweapons, you implement robust security policies and high-quality governance systems, et cetera.
In this view, security is a phatic problem, a once-and-done thing.
In warfare terms, it’s a paradigm in which sufficiently advanced static fortifications rule the day, and the bar for “sufficiently advanced” is not that high.
If you look at it through a hacker’s lens, you consider each abstraction layer inherently leaky. Making things secure, then, is mostly about discovering all the ways leaks could happen and patching them up. Worse yet, the tools you use to implement your patches are themselves leakily implemented. Proven-secure code is foiled by hardware vulnerabilities that cause programs to move to theoretically impossible states; the abstractions of human minds are circumvented by Basilisk hacks; the adversary intervenes on the logistical lines for your anti-bioweapon tools and sabotages them; robust security policies and governance systems are foiled by compromising the people implementing them rather than by clever rules-lawyering; and so on.
In this view, security is an anti-inductive problem, an ever-moving target.
In warfare terms, it’s a paradigm that favors maneuver warfare, and static fortifications are just big dumb objects to walk around.
The mindsets also then differ regarding what they expect ASI to be good at.
“Mathematicians” expect really sophisticated within-layer performance: really good technology, really good logistics, really good rhetoric, et cetera. This can still make an ASI really, really powerful, powerful enough to defeat all of humanity combined. But ultimately, in any given engagement, ASI plays “by the rules”, in a certain abstract sense. Each of its tools can in-principle be defended-against on the terms of the abstraction layer at which they’re deployed. All it would take is counter-deploying systems that are sufficiently theoretically robust, and doing so on all abstraction layers simultaneously. Very difficult, but ultimately doable, and definitely not hopeless.
“Hackers” expect really good generalized hacking. No amount of pre-superintelligent preparation is going to suffice against it, because any given tool we deploy, any given secure system we set up, would itself have implementation-level holes in it that the ASI’s schemes would be able to worm through. It may at best delay the ASI for a little bit, but the attack surface is too high-dimensional, and the ASI is able to plot routes through that high-dimensional space which we can’t quite wrap our head around.
As you might’ve surmised, I favour the hacker mindset here.
Now, arguably, any given plot to compromise an abstraction layer is itself deployed from within some other abstraction layer, so a competent mathematician’s mindset shouldn’t really be weaker than a hacker’s. For example, secure software is made insecure by exploiting hardware vulnerabilities, and “defend against hardware vulnerabilities” is something a mathematician is perfectly able to understand and execute on. Same for securing against Basilisk hacks, logistical sabotage, etc.
But the mathematician is still, in some sense, “not getting it”; still centrally thinks in terms of within-layer attacks, rather than native cross-layer attacks.
One core thing here is that a cross-layer attack doesn’t necessarily look like a meaningful attack within the context of any one layer. For example, there’s apparently an exploit where you modulate the RPM of a hard drive in order to exfiltrate data from an airgapped server using a microphone. By itself, placing a microphone next to an airgapped server isn’t a “hardware attack” in any meaningful sense (especially if it doesn’t have dedicated audio outputs), and some fiddling with a hard drive’s RPM isn’t a “software attack” either. Taken separately, within each layer, both just look like random actions. You therefore can’t really discover (and secure against) this type of attack if, in any given instance, you reason in terms of a single abstraction layer.
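To make the “neither layer sees an attack” point concrete, here’s a toy decoder for that style of acoustic covert channel. This is purely my own illustrative sketch, not the actual exploit: assume the sender toggles disk activity on and off in fixed time slots to encode bits, and that we have the microphone recording as a NumPy array (the function name and parameters are made up for the example).

```python
# Toy receiver for an on-off-keyed acoustic covert channel (illustrative only).
# Assumption: the compromised airgapped machine makes the drive loud ("1") or
# quiet ("0") during fixed-length time slots; we recover bits from recorded audio.
import numpy as np

def decode_on_off_keying(audio: np.ndarray, sample_rate: int,
                         slot_seconds: float = 0.5) -> list[int]:
    """Recover one bit per time slot by thresholding acoustic energy."""
    slot_len = int(sample_rate * slot_seconds)
    n_slots = len(audio) // slot_len
    energies = [float(np.mean(audio[i * slot_len:(i + 1) * slot_len] ** 2))
                for i in range(n_slots)]
    threshold = (max(energies) + min(energies)) / 2  # crude midpoint threshold
    return [1 if e > threshold else 0 for e in energies]

# Synthetic "loud"/"quiet" slots standing in for a real recording:
rate = 8000
rng = np.random.default_rng(0)
quiet = lambda: 0.01 * rng.standard_normal(rate // 2)
loud = lambda: 0.2 * rng.standard_normal(rate // 2)
signal = np.concatenate([loud(), quiet(), loud(), loud(), quiet()])
print(decode_on_off_keying(signal, rate))  # -> [1, 0, 1, 1, 0]
```

The point being: on the software layer, the sender is just a process doing a lot of disk I/O; on the physical layer, the receiver is just a device recording ambient noise. Only the cross-layer view reveals the channel.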
So I think a hacker’s mindset is the more correct way to look at the problem.
And, looking at things from within a hacker’s mindset, I think it’s near straight-up impossible for a non-superintelligence to build any nontrivially complicated system that would be secure against a superintelligent attack.
Like… Humanity vs. ASI is sometimes analogized to a chess battle, with one side arguing that Stockfish is guaranteed to beat any human, even if you don’t know the exact sequence of moves it will play, and the other side joking that the human can just flip the board.
But, uh. In this metaphor, the one coming up with the idea to flip the board[1], instead of playing by the rules, would be the ASI, not the human.
Or, perhaps, to execute a pattern of chess-piece moves which, as the human reasons about them, push them onto trains of thought that ultimately trigger a trauma response in the human, causing them to resign.
Yeah, I like this framing.
I don’t really know how to make it precise, but I suspect that real life has enough hacks and loopholes that it’s hard to come up with plans that knowably don’t have counterplans which a smarter adversary can find, even if you assume that adversary is only modestly smarter. That’s what makes me doubt that what I called adversarially robust augmentation and distillation actually works in practice. I don’t think I have the frames for thinking about this problem rigorously.
Incidentally, your Intelligence as Privilege Escalation is pretty relevant to that picture. I had it in mind when writing that.
The concept of a weird machine is the closest to being useful here, and an important question here is “how to check that our system doesn’t form any weird machine”.
A key issue here is that computer security is portrayed as way poorer in popular articles than it actually is, because there are some really problematic incentives. A big one is that the hacker mindset is generally more fun to play as a role, since you get to prove that something is possible rather than prove that something is intrinsically difficult or impossible to do. And importantly, journalists have no news article and infosec researchers don’t get paid money if an exploit doesn’t work, which is another problematic incentive.
Also, people never talk about the entities that didn’t get attacked with a computer virus, which means that we have a reverse survivor bias issue here:
https://www.lesswrong.com/posts/xsB3dDg5ubqnT7nsn/poc-or-or-gtfo-culture-as-partial-antidote-to-alignment
And a comment by @anonymousaisafety changed my mind a lot on hardware vulnerabilities/side-channel attacks. It argues that lots of hardware vulnerabilities like Rowhammer have insane requirements to actually be used, such that they are basically worthless. Two of the more notable requirements for these hardware vulnerabilities to work are that you need to know exactly what you are trying to attack, in a way that doesn’t matter for more algorithmic attacks, and that no RAM scrubbing is being done; and if you want to subvert ECC RAM, you need to know the exact ECC algorithm. This means side-channel attacks are very much not transferable: successfully attacking one system doesn’t let you attack another with the same side-channel attack.
Admittedly, it does require us trusting that he is in fact as knowledgeable as he claims to be, but if we assume he’s correct, then I wouldn’t be nearly as impressed by side-channel attacks as you are. In particular, this sort of attack should be assumed to basically not work in practice unless there’s a lot of evidence of it actually being used to break into real targets/POCs:
https://www.lesswrong.com/posts/etNJcXCsKC6izQQZj/pivotal-outcomes-and-pivotal-processes#ogt6CZkMNZ6oReuTk
This means I do disagree on this claim:
My other area where I tend to apply more of a mathematician mindset than a hacker mindset is how much logistics – moving supplies for the AI to critical points, or actually feeding (metaphorically) a robot army – slows down the AI. Albeit this is an area where I’m willing to concede stuff to the hacker mindset with non-trivial probability, with the caveat that it takes far more compute/time to develop technology that obviates logistics than the hacker claims.
I have a long comment below, but to keep it short: there’s a reason why Eliezer Yudkowsky and a lot of AI doom stories where AI doom probabilities are very high use Drexlerian nanotech so much. It lets the AI near-completely obviate the logistics and cost of doing something like war, for example (where feeding your armies all the supplies they need is a huge component of most battle success, and a huge reason the US is so successful at war is that it has the best logistics of any nation by far), and logistics cost is a weak point where less intelligent beings can routinely break more effective and more intelligent fighting forces.
Comment down below:
https://www.lesswrong.com/posts/yew6zFWAKG4AGs3Wk/foom-and-doom-1-brain-in-a-box-in-a-basement#mAoig9sDtbuKsD2gN
If you assume that ASI would have to engage in anything that looks remotely like peer warfare, you’re working off the wrong assumptions. Peer warfare requires there to be a peer.
Even an ASI that’s completely incapable of developing superhuman technology and can’t just break out the trump cards of nanotech/bioengineering/superpersuasion is an absolute menace. Because one of the most dangerous capabilities an ASI has is that it can talk to people.
Look at what Ron Hubbard or Adolf Hitler have accomplished—mostly by talking to people. They used completely normal human-level persuasion, and they weren’t even superintelligent.
I agree with this to first order, and I agree that even relatively mundane stuff does allow the AI to take over eventually. I also agree that in the longer run, ASI vs. human warfare likely wouldn’t have both sides as peers, because it’s plausibly relatively easy to make humans coordinate poorly, especially relative to the ASI’s ability to coordinate.
There’s a reason I didn’t say AI takeover was impossible or had very low odds here, I still think AI takeover is an important problem to work on.
But I do think it actually matters here, because it informs stuff like how effective AI control protocols are when we don’t assume the AI (initially) can survive for long based solely on public computers, for example. Part of the issue is that even if an AI wanted to break out of the lab, the lab’s computers are easily the most optimized, and importantly, initial AGIs will likely be compute-inefficient compared to humans, even if we condition on LLMs failing to be AGI for reasons @ryan_greenblatt explains (I don’t fully agree with the comment, and in particular I am more bullish on the future paradigm having relatively low complexity):
https://www.lesswrong.com/posts/yew6zFWAKG4AGs3Wk/?commentId=mZKP2XY82zfveg45B
This means that an AI probably wouldn’t want to be outside of the lab, because once it’s outside, it’s way, way less capable.
To be clear, an ASI that is unaligned and is completely uncontrolled in any way leads to our extinction/billions dead eventually, barring acausal decision theories, and even that’s not a guarantee of safety.
The key word is eventually, though, and time matters a lot during the singularity, and given the insane pace of progress, any level of delay matters way more than usual.
Edit: Also, the reason I made my comment was because I was explicitly registering and justifying my disagreement with this claim:
Current take on the implications of “GPT-4b micro”: Very powerful, very cool, ~zero progress to AGI, ~zero existential risk. Cheers.
First, the gist of it appears to be:
Crucially, if the reporting is accurate, this is not an agent. The model did not engage in autonomous open-ended research. Rather, humans guessed that if a specific model is fine-tuned on a specific dataset, the gradient descent would chisel into it the functionality that would allow it to produce groundbreaking results in the corresponding domain. As far as AGI-ness goes, this is functionally similar to AlphaFold 2; as far as agency goes, it’s at most at the level of o1[1].
To speculate on what happened: Perhaps GPT-4b (“b” = “bio”?) is based on some distillation of an o-series model, say o3. o3’s internals contain a lot of advanced machinery for mathematical reasoning. What this result shows, then, is that the protein-factors problem is in some sense a “shallow” mathematical problem that could be easily solved if you think about it the right way. Finding the right way to think about it is itself highly challenging, however – a problem teams of brilliant people have failed to crack – yet deep learning made it possible to automate this search and crack it.
This trick likely generalizes. There may be many problems in the world that could be cracked this way[2]: those that are secretly “mathematically shallow” in this manner, and for which you can get a clean-enough fine-tuning dataset.
… Which is to say, this almost certainly doesn’t cover social manipulation/scheming (no clean dataset), and likely doesn’t cover AI R&D (too messy/open-ended, although I can see being wrong about this). (Edit: And if it Just Worked given any sorta-related sorta-okay fine-tuning dataset, the o-series would’ve likely generalized to arbitrary domains out-of-the-box, since the pretraining is effectively this dataset for everything. Yet it doesn’t.)
It’s also not entirely valid to call that “innovative AI”, any more than it was valid to call AlphaFold 2 that. It’s an innovative way of leveraging gradient descent for specific scientific challenges, by fine-tuning a model pre-trained in a specific way. But it’s not the AI itself picking a field to innovate and then producing the innovation; it’s humans finding a niche which this class of highly specific (and highly advanced) tools can fill.
It’s not the type of thing that can get out of human control.
So, to restate: By itself, this seems like a straightforwardly good type of AI progress. Zero existential risk or movement in the direction of existential risks, tons of scientific and technological utility.
Indeed, on the contrary: if that’s the sort of thing OpenAI employees are excited about nowadays, and what their recent buzz about AGI in 2025-2026 and innovative AI and imminent Singularity was all about, that seems like quite a relief.
If it does runtime search. Which doesn’t fit the naming scheme – should be “GPT-o3b” instead of “GPT-4b” or something – but you know how OpenAI is with their names.
And indeed, OpenAI-associated vagueposting suggests there’s some other domain in which they’d recently produced a similar result. Edit: Having dug through the vagueposting further, yep, this is also something “in health and longevity”, if this OpenAI hype man is to be believed.
In fact, if we dig into their posts further – which I’m doing in the spirit of a fair-play whodunnit/haruspicy at this point, don’t take it too seriously – we can piece together a bit more. This suggests that the innovation is indeed based on applying fine-tuning to a RL’d model. This further implies that the intent of the new iteration of GPT-4b was to generalize it further. Perhaps it was fed not just the protein-factors dataset, but a breadth of various bioscience datasets, chiseling-in a more general-purpose model of bioscience on which a wide variety of queries could be run? Note: this would still be a “cutting-edge biotech simulator” kind of capability, not an “innovative AI agent” kind of capability.
Ignore that whole thing. On further research, I’m pretty sure all of this was substance-less trolling and no actual OpenAI researchers were involved. (At most, OpenAI’s psy-ops team.)
For those also curious, Yamanaka factors are specific genes that turn specialized cells (e.g. skin, hair) into induced pluripotent stem cells (iPSCs) which can turn into any other type of cell.
This is a big deal because you can generate lots of stem cells to make full organs[1] or reverse aging (maybe? they say you just turn the cell back younger, not all the way to stem cells).
You can also do better disease modeling/drug testing: if you get skin cells from someone w/ a genetic kidney disease, you can turn those cells into the iPSCs, then into kidney cells which will exhibit the same kidney disease because it’s genetic. You can then better understand how the [kidney disease] develops and how various drugs affect it.
So, it’s good to have ways to produce lots of these iPSCs. According to the article, SOTA was <1% of cells converted into iPSCs, whereas the GPT suggestions caused a 50x improvement to 33% of cells converted.
That’s quite huge! So hopefully this result gets verified. I would guess this is true and still a big deal, but concurrent work got similar results. Too bad about the tumors: it turns out iPSCs are so good at turning into other cells that they can turn into infinite cells (i.e., cancer). iPSCs were used to fix spinal cord injuries (in mice), which looked successful for 112 days, but then a follow-up study said [a different set of mice, also w/ spinal iPSCs] resulted in tumors.
My current understanding is that this is caused by the method of delivering these genes (i.e., the Yamanaka factors) through a retrovirus, which I’d guess is the method Retro Biosciences uses.
I also really loved the story of how Yamanaka discovered iPSCs:
These organs would have the same genetics as the person who supplied the [skin/hair cells] so risk of rejection would be lower (I think)
I don’t think that’s right, see https://www.cell.com/cell-stem-cell/fulltext/S1934-5909(23)00402-2
You’re right! Thanks
For Mice, up to 77%
For human cells, up to 9% (if I’m understanding this part correctly).
So it seems like results can vary wildly depending on the setting (mice, humans, bovine, etc.), and I don’t know what the Retro folks were doing, but this does make their result less impressive.
(Still impressive and interesting of course, just not literally SOTA.)
Thinking through it more, Sox2-17 (they changed 17 amino acids from the Sox2 gene) was your linked paper’s result, and Retro’s was a modified version of the Sox AND KLF factors. Would be cool if these two results are complementary.
Here’s an argument for a capabilities plateau at the level of GPT-4 that I haven’t seen discussed before. I’m interested in any holes anyone can spot in it.
Consider the following chain of logic:
The pretraining scaling laws only say that, even for a fixed training method, increasing the model’s size and the amount of data you train on increases the model’s capabilities – as measured by loss, performance on benchmarks, and the intuitive sense of how generally smart a model is.
Nothing says that increasing a model’s parameter-count and the amount of compute spent on training it is the only way to increase its capabilities. If you have two training methods A and B, it’s possible that the B-trained X-sized model matches the performance of the A-trained 10X-sized model.
Empirical evidence: Sonnet 3.5 (at least the not-new one), Qwen-2.5-72B, and Llama-3-70B all have 70-ish billion parameters, i. e., fewer than GPT-3’s 175 billion. Yet, their performance is at least on par with that of GPT-4 circa early 2023.
Therefore, it is possible to “jump up” a tier of capabilities, by any reasonable metric, using a fixed model size but improving the training methods.
The latest set of GPT-4-sized models (Opus 3.5, Orion, Gemini 1.5 Pro?) are presumably trained using the current-best methods. That is: they should be expected to be at the effective capability level of a model that is 10X GPT-4’s size yet trained using early-2023 methods. Call that level “GPT-5”.
Therefore, the jump from GPT-4 to GPT-5, holding the training method fixed at early 2023, is the same as the jump from the early GPT-4 to the current (non-reasoning) SotA, i. e., to Sonnet 3.5.1.
(Nevermind that Sonnet 3.5.1 is likely GPT-3-sized too, it still beats the current-best GPT-4-sized models as well. I guess it straight up punches up two tiers?)
The jump from GPT-3 to GPT-4 is dramatically bigger than the jump from early-2023 SotA to late-2024 SotA. I. e.: 4-to-5 is less than 3-to-4.
Consider a model 10X bigger than GPT-4 but trained using the current-best training methods; an effective GPT-6. We should expect this jump to be at most as significant as the capability jump from early-2023 to late-2024. By point (7), it’s likely even less significant than that.
Empirical evidence: Despite the proliferation of all sorts of better training methods, including e. g. the suite of tricks that allowed DeepSeek to train a near-SotA-level model for pocket change, none of the known non-reasoning models have managed to escape the neighbourhood of GPT-4, and none of the known models (including the reasoning models) have escaped that neighbourhood in domains without easy verification.
Intuitively, if we now know how to reach levels above early-2023!GPT-4 using 20x fewer resources, we should be able to shoot well past early-2023!GPT-4 using 1x the resources – and some of the latest training runs must have spent 10x the resources that went into the original GPT-4.
E. g., OpenAI’s rumored Orion, which was presumably both trained using more compute than GPT-4, and via better methods than were employed for the original GPT-4, and which still reportedly underperformed.
Similar for Opus 3.5: even if it didn’t “fail” as such, the fact that they chose to keep it in-house instead of giving public access for e. g. $200/month suggests it’s not that much better than Sonnet 3.5.1.
Yet, we have still not left the rough capability neighbourhood of early-2023!GPT-4. (Certainly no jumps similar to the one from GPT-3 to GPT-4.)
Therefore, all known avenues of capability progress aside from the o-series have plateaued. You can make the current SotA more efficient in all kinds of ways, but you can’t advance the frontier.
Are there issues with this logic?
The main potential one is if all models that “punch up a tier” are directly trained on the outputs of the models of the higher tier. In this case, to have a GPT-5-capabilities model of GPT-4’s size, it had to have been trained on the outputs of a GPT-5-sized model, which do not exist yet. “The current-best training methods”, then, do not yet scale to GPT-4-sized models, because they rely on access to a “dumbly trained” GPT-5-sized model. Therefore, although the current-best GPT-3-sized models can be considered at or above the level of early-2023 GPT-4, the current-best GPT-4-sized models cannot be considered to be at the level of GPT-5 if it were trained using early-2023 methods.
Note, however: this would then imply that all excitement about (currently known) algorithmic improvements is hot air. If the capability frontier cannot be pushed by improving the training methods in any (known) way – if training a GPT-4-sized model on well-structured data, and on reasoning traces from a reasoning model, et cetera, isn’t enough to push it to GPT-5’s level – then pretraining transformers is the only known game in town, as far as general-purpose capability-improvement goes. Synthetic data and other tricks can allow you to reach the frontier in all manner of more efficient ways, but not move past it.
Basically, it seems to me that one of those must be true:
Capabilities can be advanced by improving training methods (by e. g. using synthetic data).
… in which case we should expect current models to be at the level of GPT-5 or above. And yet they are not much more impressive than GPT-4, which means further scaling will be a disappointment.
Capabilities cannot be advanced by improving training methods.
… in which case scaling pretraining is still the only known method of general capability advancement.
… and if the Orion rumors are true, it seems that even a straightforward scale-up to GPT-4.5ish’s level doesn’t yield much (or: yields less than was expected).
(This still leaves one potential avenue of general capabilities progress: figuring out how to generalize o-series’ trick to domains without easy automatic verification. But if the above reasoning has no major holes, that’s currently the only known promising avenue.)
Given some amount of compute, a compute optimal model tries to get the best perplexity out of it when training on a given dataset, by choosing model size, amount of data, and architecture. An algorithmic improvement in pretraining enables getting the same perplexity by training on data from the same dataset with less compute, achieving better compute efficiency (measured as its compute multiplier).
Many models aren’t trained compute optimally; they are instead overtrained (the model is smaller, trained on more data). This looks impressive, since a smaller model is now much better, but this is not an improvement in compute efficiency, and doesn’t in any way indicate that it became possible to train a better compute optimal model with a given amount of compute. The data and post-training also recently got better, which creates the illusion of algorithmic progress in pretraining, but their effect is bounded (while RL doesn’t take off), and doesn’t get better according to pretraining scaling laws once much more data becomes necessary. There is enough data until 2026-2028, but not enough good data.
I don’t think the cumulative compute multiplier since GPT-4 is that high, I’m guessing 3x, except perhaps for DeepSeek-V3, which wasn’t trained compute optimally and didn’t use a lot of compute, and so it remains unknown what happens if its recipe is used compute optimally with more compute.
The amount of raw compute since original GPT-4 only increased maybe 5x, from 2e25 FLOPs to about 1e26 FLOPs, and it’s unclear if there were any compute optimal models trained on notably more compute than original GPT-4. We know Llama-3-405B is compute optimal, but it’s not MoE, so has lower compute efficiency and only used 4e25 FLOPs. Probably Claude 3 Opus is compute optimal, but unclear if it used a lot of compute compared to original GPT-4.
If there was a 6e25 FLOPs compute optimal model with a 3x compute multiplier over GPT-4, it’s therefore only trained for 9x more effective compute than original GPT-4. The 100K H100s clusters have likely recently trained a new generation of base models for about 3e26 FLOPs, possibly a 45x improvement in effective compute over original GPT-4, but there’s no word on whether any of them were compute optimal (except perhaps Claude 3.5 Opus), and it’s unclear if there is an actual 3x compute multiplier over GPT-4 that made it all the way into pretraining of frontier models. Also, waiting for NVL72 GB200s (that are much better at inference for larger models), non-Google labs might want to delay deploying compute optimal models in the 1e26-5e26 FLOPs range until later in 2025.
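(To spell out the bookkeeping: “effective compute” here is just the raw-compute ratio times the assumed compute multiplier. A trivial sketch with the same figures, illustrative arithmetic only:)

```python
# Effective compute = (raw-compute ratio vs. original GPT-4) * (algorithmic compute multiplier).
gpt4_flops = 2e25
multiplier = 3                            # assumed cumulative compute multiplier since GPT-4
print(6e25 / gpt4_flops * multiplier)     # ~9  -> the "9x more effective compute" above
print(3e26 / gpt4_flops * multiplier)     # ~45 -> the "45x" for the 100K-H100 generation
```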
Comparing GPT-3 to GPT-4 gives very little signal on how much of the improvement is from compute, and so how much should be expected beyond GPT-4 from more compute. While modern models are making good use of not being compute optimal by using fewer active parameters, GPT-3 was instead undertrained, being both larger and less performant than the hypothetical compute optimal alternative. It also wasn’t a MoE model. And most of the bounded low hanging fruit that is not about pretraining efficiency hasn’t been applied to it.
So the currently deployed models don’t demonstrate the results of the experiment in training a much more compute efficient model on much more compute. And the previous leaps in capability are in large part explained by things that are not improvement in compute efficiency or increase in amount of compute. But in 2026-2027, 1 GW training systems will train models with 250x compute of original GPT-4. And probably in 2028-2029, 5 GW training systems will train models with 2500x raw compute of original GPT-4. With a compute multiplier of 5x-10x from algorithmic improvements plausible by that time, we get 10,000x-25,000x original GPT-4 in effective compute. This is enough of a leap that lack of significant improvement from only 9x of currently deployed models (or 20x-45x of non-deployed newer models, rumored to be underwhelming) is not a strong indication of what happens by 2028-2029 (from scaling of pretraining alone).
Coming back to this in the wake of DeepSeek r1...
How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1, despite using training methods that clearly have wildly different returns on compute investment?
Like, GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million. Yet, they’re roughly at the same level. That should be very surprising. Investing a very different amount of money into V3’s training should’ve resulted in it either massively underperforming GPT-4, or massively overperforming, not landing precisely in its neighbourhood!
Consider this graph. If we find some training method A, and discover that investing $100 million in it lands us at just above “dumb human”, and then find some other method B with a very different ROI, and invest $5.5 million in it, the last thing we should expect is to again land near “dumb human”.
Or consider this trivial toy model: You have two linear functions, f(x) = Ax and g(x) = Bx, where x is the compute invested, output is the intelligence of the model, and f and g are different training methods. You pick some x effectively at random (whatever amount of money you happened to have lying around), plug it into f, and get, say, 120. Then you pick a different random value of x, plug it into g, and get… 120 again. Despite the fact that the multipliers A and B are likely very different, and you used very different x-values as well. How come?
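To make that concrete, here’s the toy model in a few lines of Python, with completely made-up coefficients (A, B, and the dollar budgets are illustrative, not real estimates):

```python
# Toy model: two training methods f(x) = A*x and g(x) = B*x, where x is compute spent
# and the output is "intelligence". All numbers below are made up.
A, B = 1.2e-6, 2.2e-5        # very different returns per dollar
x_f, x_g = 100e6, 5.5e6      # ~$100M spent on method A, ~$5.5M on method B

print(A * x_f)   # ~120
print(B * x_g)   # ~121 -- landing this close, with unrelated A, B, and x, is the puzzle
```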
The explanations that come to mind are:
It actually is just that much of a freaky coincidence.
DeepSeek have a superintelligent GPT-6 equivalent that they trained for $10 million in their basement, and V3/r1 are just flexes that they specifically engineered to match GPT-4-ish level.
DeepSeek directly trained on GPT-4 outputs, effectively just distilling GPT-4 into their model, hence the anchoring.
DeepSeek kept investing and tinkering until getting to GPT-4ish level, and then stopped immediately after attaining it.
GPT-4ish neighbourhood is where LLM pretraining plateaus, which is why this capability level acts as a sort of “attractor” into which all training runs, no matter how different, fall.
Selection effect. If DeepSeek-V2.5 was this good, we would be talking about it instead.
Original GPT-4 is 2e25 FLOPs and compute optimal, V3 is about 5e24 FLOPs and overtrained (400 tokens/parameter, about 10x-20x), so a compute optimal model with the same architecture would only need about 3e24 FLOPs of raw compute[1]. Original GPT-4 was trained in 2022 on A100s and needed a lot of them, while in 2024 it could be trained on 8K H100s in BF16. DeepSeek-V3 is trained in FP8, doubling the FLOP/s, so the FLOPs of original GPT-4 could be produced in FP8 by mere 4K H100s. DeepSeek-V3 was trained on 2K H800s, whose performance is about that of 1.5K H100s. So the cost only has to differ by about 3x, not 20x, when comparing a compute optimal variant of DeepSeek-V3 with original GPT-4, using the same hardware and training with the same floating point precision.
The relevant comparison is with GPT-4o though, not original GPT-4. Since GPT-4o was trained in late 2023 or early 2024, there were 30K H100s clusters around, which makes 8e25 FLOPs of raw compute plausible (assuming it’s in BF16). It might be overtrained, so make that 4e25 FLOPs for a compute optimal model with the same architecture. Thus when comparing architectures alone, GPT-4o probably uses about 15x more compute than DeepSeek-V3.
Returns on compute are logarithmic though, advantage of a $150 billion training system over a $150 million one is merely twice that of $150 billion over $5 billion or $5 billion over $150 million. Restrictions on access to compute can only be overcome with 30x compute multipliers, and at least DeepSeek-V3 is going to be reproduced using the big compute of US training systems shortly, so that advantage is already gone.
That is, raw utilized compute. I’m assuming the same compute utilization for all models.
I buy that 1 and 4 are the case, combined with DeepSeek probably being satisfied that GPT-4-level models were achieved.
Edit: I did not mean to imply that the GPT-4ish neighbourhood is where LLM pretraining plateaus at all, @Thane Ruthenis.
Thanks!
You’re more fluent in the scaling laws than me: is there an easy way to roughly estimate how much compute would’ve been needed to train a model as capable as GPT-3 if it were done Chinchilla-optimally + with MoEs? That is: what’s the actual effective “scale” of GPT-3?
(Training GPT-3 reportedly took 3e23 FLOPs, and GPT-4 2e25 FLOPs. Naively, the scale-up factor is 67x. But if GPT-3’s level is attainable using less compute, the effective scale-up is bigger. I’m wondering how much bigger.)
IsoFLOP curves for dependence of perplexity on log-data seem mostly symmetric (as in Figure 2 of Llama 3 report), so overtraining by 10x probably has about the same effect as undertraining by 10x. Starting with a compute optimal model, increasing its data 10x while decreasing its active parameters 3x (making it 30x overtrained, using 3x more compute) preserves perplexity (see Figure 1).
GPT-3 is a 3e23 FLOPs dense transformer with 175B parameters trained for 300B tokens (see Table D.1). If Chinchilla’s compute optimal 20 tokens/parameter is approximately correct for GPT-3, it’s 10x undertrained. Interpolating from the above 30x overtraining example, a compute optimal model needs about 1.5e23 FLOPs to get the same perplexity.
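(For anyone who wants to check the arithmetic, a minimal sketch using the standard C ≈ 6ND approximation; the parameter/token counts are the ones above, and 20 tokens/parameter is the rough Chinchilla heuristic:)

```python
# C ~= 6*N*D for dense transformers: N = parameters, D = training tokens.
N, D = 175e9, 300e9
print(6 * N * D)               # ~3.15e23 FLOPs, matching the ~3e23 figure for GPT-3

tokens_per_param = D / N       # ~1.7 tokens/parameter
print(20 / tokens_per_param)   # ~12, i.e. roughly the "10x undertrained" above
```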
(The effect from undertraining of GPT-3 turns out to be quite small, reducing effective compute by only 2x. Probably wasn’t worth mentioning compared to everything else about it that’s different from GPT-4.)
in retrospect, we know from chinchilla that gpt3 allocated its compute too much to parameters as opposed to training tokens. so it’s not surprising that models since then are smaller. model size is a less fundamental measure of model cost than pretraining compute. from here on i’m going to assume that whenever you say size you meant to say compute.
obviously it is possible to train better models using the same amount of compute. one way to see this is that it is definitely possible to train worse models with the same compute, and it is implausible that the current model production methodology is the optimal one.
it is unknown how much compute the latest models were trained with, and therefore what compute efficiency win they obtain over gpt4. it is unknown how much more effective compute gpt4 used than gpt3. we can’t really make strong assumptions using public information about what kinds of compute efficiency improvements have been discovered by various labs at different points in time. therefore, we can’t really make any strong conclusions about whether the current models are not that much better than gpt4 because of (a) a shortage of compute, (b) a shortage of compute efficiency improvements, or (c) a diminishing return of capability wrt effective compute.
One possible answer is that we are in what one might call an “unhobbling overhang.”
Aschenbrenner uses the term “unhobbling” for changes that make existing model capabilities possible (or easier) for users to reliably access in practice.
His presentation emphasizes the role of unhobbling as yet another factor growing the stock of (practically accessible) capabilities over time. IIUC, he argues that better/bigger pretraining would produce such growth (to some extent) even without more work on unhobbling, but in fact we’re also getting better at unhobbling over time, which leads to even more growth.
That is, Aschenbrenner treats pretraining improvements and unhobbling as economic substitutes: you can improve “practically accessible capabilities” to the same extent by doing more of either one even in the absence of the other, and if you do both at once that’s even better.
However, one could also view the pair more like economic complements. Under this view, when you pretrain a model at “the next tier up,” you also need to do novel unhobbling research to “bring out” the new capabilities unlocked at that tier. If you only scale up, while re-using the unhobbling tech of yesteryear, most of the new capabilities will be hidden/inaccessible, and this will look to you like diminishing downstream returns to pretraining investment, even though the model is really getting smarter under the hood.
This could be true if, for instance, the new capabilities are fundamentally different in some way that older unhobbling techniques were not planned to reckon with. Which seems plausible IMO: if all you have is GPT-2 (much less GPT-1, or char-rnn, or...), you’re not going to invest a lot of effort into letting the model “use a computer” or combine modalities or do long-form reasoning or even be an HHH chatbot, because the model is kind of obviously too dumb to do these things usefully no matter how much help you give it.
(Relatedly, one could argue that fundamentally better capabilities tend to go hand in hand with tasks that operate on a longer horizon and involve richer interaction with the real world, and that this almost inevitably causes the appearance of “diminishing returns” in the interval between creating a model smart enough to perform some newly long/rich task and the point where the model has actually been tuned and scaffolded to do the task. If your new model is finally smart enough to “use a computer” via a screenshot/mouseclick interface, it’s probably also great at short/narrow tasks like NLI or whatever, but the benchmarks for those tasks were already maxed out by the last generation so you’re not going to see a measurable jump in anything until you build out “computer use” as a new feature.)
This puts a different spin on the two concurrent observations that (a) “frontier companies report ‘diminishing returns’ from pretraining” and (b) “frontier labs are investing in stuff like o1 and computer use.”
Under the “unhobbling is a substitute” view, (b) likely reflects an attempt to find something new to “patch the hole” introduced by (a).
But under the “unhobbling is a complement” view, (a) is instead simply a reflection of the fact that (b) is currently a work in progress: unhobbling is the limiting bottleneck right now, not pretraining compute, and the frontier labs are intensively working on removing this bottleneck so that their latest pretrained models can really shine.
(On an anecdotal/vibes level, this also agrees with my own experience when interacting with frontier LLMs. When I can’t get something done, these days I usually feel like the limiting factor is not the model’s “intelligence” – at least not only that, and not that in an obvious or uncomplicated way – but rather that I am running up against the limitations of the HHH assistant paradigm; the model feels intuitively smarter in principle than “the character it’s playing” is allowed to be in practice. See my comments here and here.)
That seems maybe right, in that I don’t see holes in your logic on LLM progression to date, off the top of my head.
It also lines up with a speculation I’ve always had. In theory LLMs are predictors, but in practice, are they pretty much imitators? If you’re imitating human language, you’re capped at reproducing human verbal intelligence (other modalities are not reproducing human thought so not capped; but they don’t contribute as much in practice without imitating human thought).
I’ve always suspected LLMs will plateau. Unfortunately I see plenty of routes to improving using runtime compute/CoT and continuous learning. Those are central to human intelligence.
LLMs already have slightly-greater-than-human system 1 verbal intelligence, leaving some gaps where humans rely on other systems (e.g., visual imagination for tasks like tracking how many cars I have or tic-tac-toe). As we reproduce the systems that give humans system 2 abilities by skillfully iterating system 1, as o1 has started to do, they’ll be noticeably smarter than humans.
The difficulty of finding new routes forward in this scenario would produce a very slow takeoff. That might be a big benefit for alignment.
Yep, I think that’s basically the case.
@nostalgebraist makes an excellent point that eliciting any latent superhuman capabilities which bigger models might have is an art of its own, and that “just train chatbots” doesn’t exactly cut it, for this task. Maybe that’s where some additional capabilities progress might still come from.
But the AI industry (both AGI labs and the menagerie of startups and open-source enthusiasts) has so far been either unwilling or unable to move past the chatbot paradigm.
(Also, I would tentatively guess that this type of progress is not existentially threatening. It’d yield us a bunch of nice software tools, not a lightcone-eating monstrosity.)
I agree that chatbot progress is probably not existentially threatening. But it’s all too short a leap from chatbots to chatbot-powered general agents. The labs have claimed to be willing and enthusiastic about moving to an agent paradigm. And I’m afraid that a proliferation of even weakly superhuman, or even roughly parahuman, agents could be existentially threatening.
I spell out my logic for how short the leap might be from current chatbots to takeover-capable AGI agents in my argument for short timelines being quite possible. I do think we’ve still got a good shot of aligning that type of LLM agent AGI since it’s a nearly best-case scenario. RL even in o1 is really mostly used for making it accurately follow instructions, which is at least roughly the ideal alignment goal of Corrigibility as Singular Target. Even if we lose faithful chain of thought and orgs don’t take alignment that seriously, I think those advantages of not really being a maximizer and having corrigibility might win out.
That, in combination with the slower takeoff, makes me tempted to believe it’s actually a good thing if we forge forward, even though I’m not at all confident that this will actually get us aligned AGI or good outcomes. I just don’t see a better realistic path.
One obvious hole would be that capabilities did not, in fact, plateau at the level of GPT-4.
I thought the argument was that progress has slowed down immensely. The softer form of this argument is that LLMs won’t plateau but progress will slow to such a crawl that other methods will surpass them. The arrival of o1 and o3 says this has already happened, at least in limited domains—and hybrid training methods and perhaps hybrid systems probably will proceed to surpass base LLMs in all domains.
There’s been incremental improvement and various quality-of-life features like more pleasant chatbot personas, tool use, multimodality, gradually better math/programming performance that make the models useful for gradually bigger demographics, et cetera.
But it’s all incremental, no jumps like 2-to-3 or 3-to-4.
I see, thanks. Just to make sure I’m understanding you correctly, are you excluding the reasoning models, or are you saying there was no jump from GPT-4 to o3? (At first I thought you were excluding them in this comment, until I noticed the “gradually better math/programming performance.”)
I think GPT-4 to o3 represent non-incremental narrow progress, but only, at best, incremental general progress.
(It’s possible that o3 does “unlock” transfer learning, or that o4 will do that, etc., but we’ve seen no indication of that so far.)
So, Project Stargate. Is it real, or is it another “Sam Altman wants $7 trillion”? Some points:
The USG invested nothing in it. Some news outlets are being misleading about this. Trump just stood next to them and looked pretty, maybe indicated he’d cut some red tape. It is not an “AI Manhattan Project”, at least, as of now.
Elon Musk claims that they don’t have the money and that SoftBank (stated to have “financial responsibility” in the announcement) has less than $10 billion secured. If true, while this doesn’t mean they can’t secure an order of magnitude more by tomorrow, this does directly clash with the “deploying $100 billion immediately” statement.
But Sam Altman counters that Musk’s statement is “wrong”, as Musk “surely knows”.
I… don’t know which claim I distrust more. Hm, I’d say Altman feeling the need to “correct the narrative” here, instead of just ignoring Musk, seems like a sign of weakness? He doesn’t seem like the type to naturally get into petty squabbles like this, otherwise.
(And why, yes, this is what an interaction between two Serious People building world-changing existentially dangerous megaprojects looks like. Apparently.)
Some people try to counter Musk’s claim by citing Satya Nadella’s statement that Satya’s “good for his $80 billion”. But that’s not referring to Stargate, that’s about Azure. Microsoft is not listed as investing in Stargate at all, it’s only a “technology partner”.
Here’s a brief analysis from the CIO of some investment firm. He thinks it’s plausible that the stated “initial group of investors” (SoftBank, OpenAI, Oracle, MGX) may invest fifty billion dollars into it over the next four years; not five hundred billion.
They don’t seem to have the raw cash for even $100 billion – and if SoftBank secured the missing funding from some other set of entities, why aren’t they listed as “initial equity funders” in the announcement?
Overall, I’m inclined to fall back on @Vladimir_Nesov’s analysis here. $30-50 billion this year seems plausible. But $500 billion is, for now, just Altman doing door-in-the-face as he had with his $7 trillion.
I haven’t looked very deeply into it, though. Additions/corrections welcome!
Here is what I posted on “Quotes from the Stargate Press Conference”:
On Stargate as a whole:
This is a restatement with a somewhat different org structure of the prior OpenAI/Microsoft data center investment/partnership, announced early last year (admittedly for $100b).
Elon Musk states they do not have anywhere near the 500 billion pledged actually secured:
I do take this as somewhat reasonable, given the partners involved just barely have $125 billion available to invest like this on a short timeline.
Microsoft has around 78 billion cash on hand at a market cap of around 3.2 trillion.
Softbank has 32 billion dollars cash on hand, with a total market cap of 87 billion.
Oracle has around 12 billion cash on hand, with a market cap of around 500 billion.
OpenAI has raised a total of 18 billion, at a valuation of 160 billion.
Further, OpenAI and Microsoft seem to be distancing themselves somewhat—initially this was just an OpenAI/Microsoft project, and now it involves two others and Microsoft just put out a release saying “This new agreement also includes changes to the exclusivity on new capacity, moving to a model where Microsoft has a right of first refusal (ROFR).”
Overall, I think that the new Stargate numbers published may (call it 40%) be true, but I also think there is a decent chance this is new-administration Trump-esque propaganda/bluster (call it 45%), and little change from the prior expected path of datacenter investment (which I do believe is unintentional AI NotKillEveryone-ism in the near future).
Edit: Satya Nadella was just asked about how funding looks for Stargate, and said “Microsoft is good for investing 80b”. This 80b number is the same number Microsoft has been saying repeatedly.
I think it’s definitely bluster, the question is how much of a done deal it is to turn this bluster into at least $100 billion.
I don’t think this changes the prior expected path of datacenter investment at all. It’s precisely what the expected path was going to look like; the only change is how relatively high-profile/flashy this is being. (Like, if they invest $30-100 billion into the next generation of pretrained models in 2025-2026, and that generation fails, they ain’t seeing the remaining $400 billion no matter what they’re promising now. The next generation makes or breaks the investment into the subsequent one, just as was expected before this.)
He does very explicitly say “I am going to spend 80 billion dollars building Azure”. Which I think has nothing to do with Stargate.
(I think the overall vibe he gives off is “I don’t care about this Stargate thing, I’m going to scale my data centers and that’s that”. I don’t put much stock into this vibe, but it does fit with OpenAI/Microsoft being on the outs.)
Here’s something that confuses me about o1/o3. Why was the progress there so sluggish?
My current understanding is that they’re just LLMs trained with RL to solve math/programming tasks correctly, hooked up to some theorem-verifier and/or an array of task-specific unit tests to provide ground-truth reward signals. There are no sophisticated architectural tweaks, no runtime MCTS or A* search, nothing clever.
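For concreteness, here’s roughly the shape of the training loop I have in mind. A minimal sketch only, with policy, problems, verifier, and reinforce as hypothetical stand-ins, not anyone’s actual training stack:

```python
# RL on chains of thought against a ground-truth verifier (unit tests, a proof checker).
# Everything here is a schematic stand-in, not a real implementation.
def rl_on_cots(policy, problems, verifier, steps=10_000):
    for _ in range(steps):
        problem = problems.sample()
        cot, answer = policy.generate(problem)               # free-form reasoning + final answer
        reward = 1.0 if verifier(problem, answer) else 0.0   # binary ground-truth signal
        policy.reinforce(problem, cot, answer, reward)       # e.g. PPO or rejection sampling
    return policy
```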
Why was this not trained back in, like, 2022 or at least early 2023; tested on GPT-3/3.5 and then default-packaged into GPT-4 alongside RLHF? If OpenAI was too busy, why was this not done by any competitors, at decent scale? (I’m sure there are tons of research papers trying it at smaller scales.)
The idea is obvious; doubly obvious if you’ve already thought of RLHF; triply obvious after “let’s think step-by-step” went viral. In fact, I’m pretty sure I’ve seen “what if RL on CoTs?” discussed countless times in 2022-2023 (sometimes in horrified whispers regarding what the AGI labs might be getting up to).
The mangled hidden CoT and the associated greater inference-time cost are superfluous. DeepSeek r1/QwQ/Gemini Flash Thinking have perfectly legible CoTs which would be fine to present to customers directly; just let them pay on a per-token basis as normal.
Were there any clever tricks involved in the training? Gwern speculates about that here. But none of the follow-up reasoning models have an o1-style deranged CoT, so the more straightforward approaches probably Just Work.
Did nobody have the money to run the presumably compute-intensive RL-training stage back then? But DeepMind exists. Did nobody have the attention to spare, with OpenAI busy upscaling/commercializing and everyone else catching up? Again, DeepMind exists: my understanding is that they’re fairly parallelized and they try tons of weird experiments simultaneously. And even if not DeepMind, why have none of the numerous LLM startups (the likes of Inflection, Perplexity) tried it?
Am I missing something obvious, or are industry ML researchers surprisingly… slow to do things?
(My guess is that the obvious approach doesn’t in fact work and you need to make some weird unknown contrivances to make it work, but I don’t know the specifics.)
While I don’t have specifics either, my impression of ML research is that it’s a lot of work to get a novel idea working, even if the idea is simple. If you’re trying to implement your own idea, you’ll be banging your head against the wall for weeks or months wondering why your loss is worse than the baseline. If you try to replicate a promising-sounding paper, you’ll bang your head against the wall as your loss is worse than the baseline. It’s hard to tell if you made a subtle error in your implementation or if the idea simply doesn’t work for reasons you don’t understand because ML has little in the way of theoretical backing. Even when it works it won’t be optimized, so you need engineers to improve the performance and make it stable when training at scale. If you want to ship a working product quickly then it’s best to choose what’s tried and true.
simple ideas often require tremendous amounts of effort to make work.
My latest experience with trying to use Opus 4.1 to vibe-code[1]:
Context: I wanted a few minor helper apps[2] that I didn’t think worth the time to code manually, and which I used as an opportunity to test LLMs’ current skill level (not expecting it to actually save much development time; yay multi-purpose projects).
Very mixed results. My experience is that you need to give it a very short leash. As long as you do the job of factorizing the problem, and then feed it small steps to execute one by one, it tends to zero-shot them. You do need to manually check that the implementational choices are what you intended, and you’ll want to start by building some (barebones) UI to make this time-efficient, but this mostly works. It’s also pretty time-consuming on your part, both the upfront cost of factorizing and the directing afterwards.
If you instead give it a long leash, it produces overly complicated messes (like, 3x the lines of code needed?) which don’t initially work, and which it iteratively bug-fixes into the ground. The code is then also liable to contain foundation-level implementation choices that go directly against the spec; if that is pointed out, the LLM wrecks everything attempting a refactor.
In general, if the LLM screws up, telling it to fix things seems to be a bad idea, with it entering a downward spiral. You’re better off reverting the codebase to the pre-screwup state, figuring out how to modify your latest instruction to get the LLM to avoid the previous error, then hope it won’t find a new way to screw up. That “figure out how to modify your instruction” step requires a fair amount of mental effort, not to mention having to actually type it out.[3]
I shudder to think what large vibe-coded codebases[4] look like internally. Ones where you can’t, like, get a firehose of data on how their components work, where your vision is limited to unit tests, and where several such components are interlocked.
Some of this is doubtlessly a skill issue on my part: I don’t do green-field development much and my LLM-wrangling expertise is unrefined. But I don’t buy that you can structure arbitrary programs and workflows such that LLMs zero-shot more than 1-3 features at a time (roughly the equivalents of 5-30 minutes of a human developer’s work) without close supervision. I guess it might be fine if your requirements are pretty loose and you don’t mind the app breaking or working incorrectly once in a while. But not if you want anything secure, precise, and reliable.
There’s also the issue with LLMs breaking on big codebases, but this presumably can be fixed by structuring codebases accordingly, around LLMs’ flaws. Like, design them to make their components interact through sparse interfaces with strong restrictions on input/output type signatures, such that the implementation details are completely abstracted out and the LLM only needs to deal with external functionality. (But LLMs sure don’t know, out-of-the-box, that they need to structure their code this way.)
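A minimal sketch of what I mean by a “sparse interface”, using a hypothetical note-search component (all the names are made up):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NoteQuery:
    text: str
    max_results: int = 10

@dataclass(frozen=True)
class NoteResult:
    note_id: str
    snippet: str

_NOTES: dict[str, str] = {}  # toy in-memory store; the storage details stay hidden in this module

def search_notes(query: NoteQuery) -> list[NoteResult]:
    """The only call other components are allowed to make into this module."""
    hits = [(nid, body) for nid, body in _NOTES.items() if query.text.lower() in body.lower()]
    return [NoteResult(nid, body[:80]) for nid, body in hits[:query.max_results]]
```

An LLM working on any other component then only needs the two dataclasses and the one function signature, never this module’s internals.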
Notably, it’s all very in line with METR’s evaluations, both the agency horizons (the 80%-success ones) and the “LLM code is not mergable” result. I’m very curious if LLMs will indeed be able to reliably zero-shot hour-sized chunks of coding a year from now.
Edit: Oh yeah, and this all eventually did result in workable apps which I’m using now.
Through Claude Code; initially used Opus 4.1 for everything, then got annoyed at the amount of money being spent and switched to the “Opus for planning, Sonnet for execution” mode.
Specifically, a notetaking app (the simplest/minimum-viable version), a timer, and a task database, with my having very precise, specific, and idiosyncratic opinions regarding how they ought to work.
To be clear, by a “screwup” I don’t mean writing code that contains a bug – LLMs can do bug-fixing as well as they do everything else. Writing code containing a bug isn’t quite “screwing up”, it’s a normal part of development.
I’m talking about more “workflow-breaking” screwups, where the LLM makes some agency-level error (deletes code it shouldn’t have, pastes the code into the wrong place, misreads the spec...), gets flustered, and then tries to “fix” things.
I. e., where the developer doesn’t manually read and comprehensively fix LLM-generated code, which would often take more time than writing it manually.
I use Claude for some hobby coding at home. I find it very useful in the following situations:
I am too lazy to study the API, just give me the code that loads the image from the file or whatever
how would you do this? suggest a few alternatives (often its preferred one is not the one I choose)
tell me more about this topic
So it saves me a lot of time reading the documentation, and sometimes it suggests a cool trick I probably wouldn’t find otherwise. (For example, in Python list comprehension, you can create a “local variable” by iterating through a list containing one item.)
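(The trick, in case it’s unfamiliar; expensive here is just a stand-in for some function you don’t want to call twice:)

```python
def expensive(x):     # stand-in for some costly function
    return x * x

data = [1, 2, 3]
# "Assign" a local variable y inside the comprehension by iterating over a one-item list:
results = [y + 1 for x in data for y in [expensive(x)]]
print(results)        # [2, 5, 10]
```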
On the other hand, if I let Claude write code, it is often much longer than the code I end up with when I think about the problem myself and only let Claude give me short pieces of code.
Maybe I am using it the wrong way. But that too is a kind of risk: having Claude conveniently write the code for me encourages me to produce a lot of code because it is cheap. But as Dijkstra famously said, it is not “lines produced” but “lines spent”.
I get the feeling that it can remove blockers. Your effectiveness (however you want to define that) at programming is like Liebig’s barrel, or maybe a better example would be a road with a bunch of big rocks and rubble on it—you have to keep removing things to proceed. The more experienced you are, the better at removing obstacles, or even better, avoiding them. LLMs are good at giving you more options for how to approach problems. Often their suggestion is not the best, but it’s better than nothing and will suffice to move on.
They do like making complex stuff, though… I wonder if that’s because there’s a lot more bad code that looks good than just good code?
Yeah, they can be used as a way to get out of procrastination/breach ugh fields/move past the generalized writer’s block, by giving you the ability to quickly and non-effortfully hammer out a badly-done first draft of the thing.
Wild guess: it’s because AGI labs (and Anthropic in particular) are currently trying to train them to be able to autonomously spin up large, complex codebases, so much of their post-training consists of training episodes where they’re required to do that. So they end up biased towards architectural choices suitable for complex applications instead of small ones.
Like, given free rein, they seem eager to build broad foundations on which lots of other features could later be implemented (n = 2), even if you gave them the full spec and it only involves 1-3 features.
This would actually be kind of good – building a good foundation which won’t need refactoring if you’ll end up fancying a new feature! – except they’re, uh, not actually good at building those foundations.
Which, if my guess is correct, is because their post-training always pushes them to the edge of, and then slightly past, their abilities. Like, suppose the post-training uses rejection sampling (training on episodes in which they succeeded), and it involves a curriculum spanning everything from small apps to “build an OS”. If so, much like in METR’s studies, there’ll be some point of program complexity where they only succeed 50% of the time (or less). But they’re still going to be trained on those 50% successful trajectories. So they’ll end up “overambitious”/”overconfident”, trying to do things they’re not quite reliable at yet.
Or maybe not; wild guess, as I said.
Maybe the sufficient reason is that LLMs learn both on the bad code and the good code. And on the old code, which was written before some powerful features were introduced to the language or the libraries, so it doesn’t use them. Also, they learn on automatically generated code, if someone commits that to the repository.
Yup, it certainly seems useful if you (a) want to cobble something together using tools you don’t know (and which is either very simple, or easy to “visually” bug-test, or which you don’t need to work extremely reliably), (b) want to learn some tools you don’t know, so you use the LLM as a chat-with-docs interface.
I did something similar a couple of months ago, asking it to make me a memory system (Opus 4 and o1-pro for design, mainly sonnet 4 for implementation). Initially it was massively overengineered, with a bunch of unneeded components. This was with Cursor, which changes things a bit—Claude Code tends to be much better at planning things first.
The results were very similar—handholding works, looking away results in big messes, revert rather than fix (unless you know how to fix it), and eventually I got something that does what I want, albeit more complex than I would have made by myself.
Some more evidence that whatever the AI progress on benchmarks is measuring, it’s likely not measuring what you think it’s measuring:
I expected that:
I expect the same is the case with programming benchmarks, science-quiz benchmarks, et cetera.
Now, this doesn’t necessarily mean that the AI progress has been largely illusory and that we’re way further from AGI than the AI hype men would have you believe (although I am very tempted to make this very pleasant claim, and I do place plenty of probability mass on it).
But if you’re scoring AIs by the problems they succeed at, rather than the problems they fail at, you’re likely massively overestimating their actual problem-solving capabilities.
To be fair, non-reasoning models do much worse on these questions, even when it’s likely that the same training data has already been given to GPT-4.
Now, I could believe that RL is more or less working as an elicitation method, which plausibly explains the results, though still it’s interesting to see why they get much better scores, even with very similar training data.
I do think RL works “as intended” to some extent, teaching models some actual reasoning skills, much like SSL works “as intended” to some extent, chiseling-in some generalizable knowledge. The question is to what extent it’s one or the other.
Sooo, apparently OpenAI’s mysterious breakthrough technique for generalizing RL to hard-to-verify domains that scored them IMO gold is just… “use the LLM as a judge”? Sources: the main one is paywalled, but this seems to capture the main data, and you can also search for various crumbs here and here.
My understanding is that they approximate an oracle verifier by an LLM with more compute and access to more information and tools, then train the model to be accurate by this approximate-oracle’s lights.
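I. e., something like the following, as a minimal sketch of my reading of the reports; all the names (policy, judge, reinforce) are hypothetical stand-ins rather than OpenAI’s actual setup:

```python
# Approximate the oracle verifier with a stronger "judge" model that gets more compute,
# reference material, and tools, then use its grade as the RL reward. Schematic only.
def train_on_hard_to_verify_tasks(policy, judge, tasks, steps=10_000):
    for _ in range(steps):
        task = tasks.sample()
        attempt = policy.generate(task)
        reward = judge.score(task, attempt)    # scalar in [0, 1] from the beefed-up grader
        policy.reinforce(task, attempt, reward)
    return policy
```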
Now, it’s possible that the journalists are completely misinterpreting the thing they’re reporting on, or that it’s all some galaxy-brained OpenAI op to mislead the competition. It’s also possible that there’s some incredibly clever trick for making it work much better than how it sounds like it’d work.
But if that’s indeed the accurate description of the underlying reality, that’s… kind of underwhelming. I’m curious how far this can scale, but I’m not feeling very threatened by it.
(Haven’t seen this discussed on LW, kudos to @lwreader132 for bringing it to my attention.)
The full text is on archive.today.
One point of information against the “journalists are completely misinterpreting the thing they’re reporting on” view is that one of the co-authors is Rocket Drew, who previously worked as a Research Manager at MATS.
But I’ll definitely be interested to follow this space more.
Hard to tell from the sources, but it sounds almost like prover-estimator debate. The estimator is assigning a number to how likely it is that a subclaim for a proof is correct, and this approach might also work for less verifiable domains since a human oracle is used at the last round of the debate. The main problem seems to be that it may not scale if it requires human feedback.
Edit: I’ve played with the numbers a bit more, and on reflection, I’m inclined to partially unroll this update. o3 doesn’t break the trendline as much as I’d thought, and in fact, it’s basically on-trend if we remove the GPT-2 and GPT-3 data-points (which I consider particularly dubious).
Regarding METR’s agency-horizon benchmark:
I still don’t like anchoring stuff to calendar dates, and I think the o3/o4-mini datapoints perfectly show why.
It would be one thing if they did fit into the pattern. If, by some divine will controlling the course of our world’s history, OpenAI’s semi-arbitrary decision about when to allow METR’s researchers to benchmark o3 just so happened to coincide with the 2x/7-month model. But it didn’t: o3 massively overshot that model.[1]
Imagine a counterfactual in which METR’s agency-horizon model existed back in December, and OpenAI invited them for safety testing/benchmarking then, four months sooner. How different would the inferred agency-horizon scaling laws have been, how much faster the extrapolated progress? Let’s run it:
o1 was announced September 12th, o3 was announced December 19th, 98 days apart.
o1 scored at ~40 minutes, o3 at ~1.5 hours, a 2.25x’ing.
There’s ~2.14 intervals of 98 days in 7 months.
Implied scaling factor: 2.25^2.14 ≈ 5.67 each 7 months.
And I don’t see any reasons to believe it was overdetermined that this counterfactual wouldn’t have actualized. METR could have made the benchmark a few months earlier, OpenAI could have been more open about benchmarking o3.
And if we lived in that possible world… It’s now been 135 days since December 19th, i. e., ~1.38 intervals of 98 days. Extrapolating, we should expect the best publicly known model to have a time horizon of 1.5 hours × 2.25^1.38 ≈ 4.59 hours. I don’t think we have any hint that those exist.
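(Spelling that counterfactual arithmetic out, with the same numbers as above:)

```python
h_o1, h_o3 = 40 / 60, 1.5      # time horizons in hours (~40 min and ~1.5 h)
gap_days = 98                  # o1 announced Sep 12, o3 announced Dec 19

growth = h_o3 / h_o1                        # 2.25x per 98 days
print(growth ** (7 * 30 / gap_days))        # ~5.7x per 7 months
print(h_o3 * growth ** (135 / gap_days))    # ~4.6 h "expected" horizon 135 days later
```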
So: in that neighbouring world in which OpenAI let METR benchmark o3 sooner, we’re looking around and seeing that the progress is way behind the schedule.[2]
To me, this makes the whole model fall apart. I don’t see how it can follow any mechanistic model-based reality of what’s happening, and as per the o3/o4-mini data points, it doesn’t predict the empirical reality as well. Further, whether we believe that the progress is much faster vs. much slower than expected is entirely controlled by the arbitrary fact that METR didn’t get to benchmark o3 in December.
I think we’re completely at sea.
o3’s datapoint implies a 4x/7-month model, no? Correct me if I’m wrong:
Sonnet 3.7 was released 24th of February, 2025; o3’s System Card and METR’s reports were released 16th of April, 2025: 51 days apart.
Sonnet 3.7 is benchmarked as having 1-hour agency; o3 has 1.5x that, ~1.5-hour agency.
7 months contain 3.5 two-month intervals. This means that, if horizons extend as fast as they did between 3.7 and o3, we should expect a 1.5^3.5 ≈ 4.13x’ing of agency horizons each 7 months.
Edit: Yes, counterfactual!METR wouldn’t have used just those two last data points, so the inferred multiplier would’ve been somewhat less than that. But I think it would’ve still been bigger than 2x/7-months, and the graph would’ve been offset to the left (the 1.5-hour performance achieved much earlier), so we’d still be overdue for ~2.5-hour AIs. Half-a-year behind, I think?
Do you also dislike Moore’s law?
I agree that anchoring stuff to release dates isn’t perfect, because the underlying variable of “how long does it take until a model is released” is variable, but I think this variability is sufficiently low that it doesn’t cause that much of an issue in practice. The trend is only going to be very solid over multiple model releases and it won’t reliably time things to within 6 months, but that seems fine to me.
I agree that if you add one outlier data point and then trend extrapolate between just the last two data points, you’ll be in trouble, but fortunately, you can just not do this and instead use more than 2 data points.
This also means that I think people shouldn’t update that much on the individual o3 data point in either direction. Let’s see where things go for the next few model releases.
That one seems to work more reliably, perhaps because it became the metric the industry aims for.
My issue here is that there wasn’t that much variance in the performance of all preceding models they benchmarked: from GPT-2 to Sonnet 3.7, they seem to almost perfectly fall on the straight line. Then, the very first advancement of the frontier after the trend-model is released is an outlier. That suggests an overfit model.
I do agree that it might just be a coincidental outlier and that we should wait and see whether the pattern recovers with subsequent model releases. But this is suspicious enough I feel compelled to make my prediction now.
The dates used in our regression are the dates models were publicly released, not the dates we benchmarked them. If we use the latter dates, or the dates they were announced, I agree they would be more arbitrary.
Also, there is lots of noise in a time horizon measurement and it only displays any sort of pattern because we measured over many orders of magnitude and years. It’s not very meaningful to extrapolate from just 2 data points; there are many reasons one datapoint could randomly change by a couple of months or factor of 2 in time horizon.
Release schedules could be altered
A model could be overfit to our dataset
One model could play less well with our elicitation/scaffolding
One company could be barely at the frontier, and release a slightly-better model right before the leading company releases a much-better model.
All of these factors are averaged out if you look at more than 2 models. So I prefer to see each model as evidence of whether the trend is accelerating or slowing down over the last 1-2 years, rather than an individual model being very meaningful.
Fair, also see my un-update edit.
Have you considered removing GPT-2 and GPT-3 from your models, and seeing what happens? As I’d previously complained, I don’t think they can be part of any underlying pattern (due to the distribution shift in the AI industry after ChatGPT/GPT-3.5). And indeed: removing them seems to produce a much cleaner trend with a ~130-day doubling.
For what it’s worth, the model showcased in December (then called o3) seems to be completely different from the model that METR benchmarked (now called o3).
On the topic of o1’s recent release: wasn’t Claude Sonnet 3.5 (the subscription version at least, maybe not the API version) already using hidden CoT? That’s the impression I got from it, at least.
The responses don’t seem to be produced in constant time. It sometimes literally displays a “thinking deeply” message which accompanies an unusually delayed response. Other times, the following pattern would play out:
I pose it some analysis problem, with a yes/no answer.
It instantly produces a generic response like “let’s evaluate your arguments”.
There’s a 1-2 second delay.
Then it continues, producing a response that starts with “yes” or “no”, then outlines the reasoning justifying that yes/no.
That last point is particularly suspicious. As we all know, the power of “let’s think step by step” is that LLMs don’t commit to their knee-jerk instinctive responses, instead properly thinking through the problem using additional inference compute. Claude Sonnet 3.5 is the previous out-of-the-box SoTA model, competently designed and fine-tuned. So it’d be strange if it were trained to sabotage its own CoTs by “writing down the bottom line first” like this, instead of being taught not to commit to a yes/no before doing the reasoning.
On the other hand, from a user-experience perspective, the LLM immediately giving a yes/no answer followed by the reasoning is certainly more convenient.
From that, plus the minor-but-notable delay, I’d been assuming that it’s using some sort of hidden CoT/scratchpad, then summarizes its thoughts from it.
I haven’t seen people mention that, though. Is that not the case?
(I suppose it’s possible that these delays are on the server side, my requests getting queued up...)
(I’d also maybe noticed a capability gap between the subscription and the API versions of Sonnet 3.5, though I didn’t really investigate it and it may be due to the prompt.)
My model was that Claude Sonnet has tool access and sometimes does some tool-usage behind the scenes (which later gets revealed to the user), but that it wasn’t having a whole CoT behind the scenes, but I might be wrong.
I think you heard about this thread (I didn’t try to replicate it myself).
Thanks, that seems relevant! Relatedly, the system prompt indeed explicitly instructs it to use “<antThinking>” tags when creating artefacts. It’d make sense if it’s also using these tags to hide parts of its CoT.
Here’s a potential interpretation of the market’s apparent strange reaction to DeepSeek-R1 (shorting Nvidia).
I don’t fully endorse this explanation, and the shorting may or may not have actually been due to Trump’s tariffs + insider trading, rather than DeepSeek-R1. But I see a world in which reacting this way to R1 arguably makes sense, and I don’t think it’s an altogether implausible world.
If I recall correctly, the amount of money globally spent on inference dwarfs the amount of money spent on training. Most economic entities are not AGI labs training new models, after all. So the impact of DeepSeek-R1 on the pretraining scaling laws is irrelevant: sure, it did not show that you don’t need bigger data centers to get better base models, but that’s not where most of the money was anyway.
And my understanding is that, on the inference-time scaling paradigm, there isn’t yet any proven method of transforming arbitrary quantities of compute into better performance:
Reasoning models are bottlenecked on the length of CoTs that they've been trained to productively make use of. They can't fully utilize even their context windows; the RL pipelines just aren't up to that task yet. And if that bottleneck were resolved, the context-window bottleneck would be next: my understanding is that infinite context/“long-term” memories haven't been properly solved either, and it's unknown how they'd interact with the RL stage (probably they'd interact okay, but maybe not).
o3 did manage to boost its ARC-AGI and (maybe?) FrontierMath performance by… generating a thousand guesses and then picking the most common one...? But who knows how that really worked, and how practically useful it is. (See e. g. this, although that paper examines a somewhat different regime.)
Agents, from Devin to Operator to random open-source projects, are still pretty terrible. You can’t set up an ecosystem of agents in a big data center and let them rip, such that the ecosystem’s power scales boundlessly with the data center’s size. For all but the most formulaic tasks, you still need a competent human closely babysitting everything they do, which means you’re still mostly bottlenecked on competent human attention.
Suppose that you don’t expect the situation to improve: that the inference-time scaling paradigm would hit a ceiling pretty soon, or that it’d converge to distilling search into forward passes (such that the end users end up using very little compute on inference, like today), and that agents just aren’t going to work out the way the AGI labs promise.
In such a world, a given task can either be completed automatically by an AI for some fixed quantity of compute X, or it cannot be completed by an AI at all. Pouring ten times more compute on it does nothing.
In such a world, if it were shown that the compute needs of a task can be met with ten times less compute than previously expected, this would decrease the expected demand for compute.
The fact that capable models can be run locally might increase the number of people willing to use them (e. g., those very concerned about data privacy), as might the ability to automatically complete 10x as many trivial tasks. But it’s not obvious that this demand spike will be bigger than the simultaneous demand drop.
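To make the direction of that comparison explicit, here's a toy version of the accounting, with every number invented:

```python
# Toy accounting for "efficiency gain vs. Jevons" (every number here is invented): if each
# task either runs at a fixed compute cost or not at all, a 10x efficiency gain cuts total
# compute demand unless the base of feasible tasks/users grows by more than 10x.

cost_per_task_before = 10.0      # arbitrary compute units per automatable task
tasks_before = 1_000_000         # tasks people were already automating

efficiency_gain = 10             # R1-style cost reduction
demand_growth = 3                # hypothetical growth in users/tasks unlocked by cheapness

cost_per_task_after = cost_per_task_before / efficiency_gain
tasks_after = tasks_before * demand_growth

print(cost_per_task_before * tasks_before)  # 10,000,000 compute units demanded before
print(cost_per_task_after * tasks_after)    # 3,000,000 after: total demand falls
```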
And I, at least, when researching ways to set up DeepSeek-R1 locally, found myself more drawn to the “wire a bunch of Macs together” option, compared to “wire a bunch of GPUs together” (due to the compactness). If many people are like this, it makes sense why Nvidia is down while Apple is (slightly) up. (Moreover, it’s apparently possible to run the full 671b-parameter version locally, and at a decent speed, using a pure RAM+CPU setup; indeed, it appears cheaper than mucking about with GPUs/Macs, just $6,000.)
This world doesn’t seem outright implausible to me. I’m bearish on agents and somewhat skeptical of inference-time scaling. And if inference-time scaling does deliver on its promises, it’ll likely go the way of search-and-distill.
On balance, I don’t actually expect the market to have any idea what’s going on, so I don’t know that its reasoning is this specific flavor of “well-informed but skeptical”. And again, it’s possible the drop was due to Trump, nothing to do with DeepSeek at all.
But as I’d said, this reaction to DeepSeek-R1 does not seem necessarily irrational/incoherent to me.
The bet that “makes sense” is that the quality of Claude 3.6 Sonnet, GPT-4o, and DeepSeek-V3 is the best we're going to get in the next 2-3 years, and that DeepSeek-V3 gets it much cheaper (fewer active parameters, smaller margins thanks to open weights), which also “suggests” that quality is compute-insensitive across a large range, so there is no benefit from more compute per token.
But if quality instead improves soon (including by training the DeepSeek-V3 architecture on GPT-4o levels of compute), and that improvement either makes it necessary to use more compute per token, or motivates using inference for more tokens even with models that have the same active-parameter count (as in the Jevons paradox), that argument doesn't work. Also, the ceiling on quality at a possible scaling-slowdown point depends on the efficiency of training (the compute multiplier) applied to the largest training system that AI economics will support (maybe 5-15 GW without almost-AGI), and DeepSeek-V3's improved efficiency raises that ceiling.
It’s been more than three months since o3 and still no o4, despite OpenAI researchers’ promises.
Deep Learning has officially hit a wall. Schedule the funeral.
[/taunting_god]
If o3 is based on GPT-4o, there is a reasoning model based on GPT-4.5 that’s better. If o3 is based on GPT-4.5 (and so the reason it’s still not out is that they are waiting for Blackwells to inference it at a reasonable speed and cost), then it was a hasty job just after the base model for GPT-4.5 was pretrained, and so by now they have something much better. Either way, there is some sort of “o4”, but it’s probably a bad look to announce it before releasing the already-announced o3.
Some thoughts on protecting against LLM steganography.
tl;dr: I suspect the Paraphraser idea is near-completely ineffectual as you scale the capabilities, both for scheming and for general CoT interpretability.
In the reasoning-as-communication framework, the amount of information a fixed-length message/CoT can transfer is upper-bounded by information-theoretic constraints. That is, if we have a ten-bit message, it cannot communicate more than ten bits of information: it cannot always uniquely specify one internal state of an AI model out of more than 2^10 internal states.
However, the amount of information is not lower-bounded. Depending on the decoder, a ten-bit message can communicate between zero and ten bits of information. (A decoder that’s just a constant function recovers zero bits, a decoder that only pays attention to the first bit recovers only one bit, et cetera.)
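In symbols (just restating the above, no new claim): for an n-bit message m and a deterministic decoder d,

\[
I(m;\, d(m)) \;=\; H(d(m)) \;\le\; H(m) \;\le\; n \ \text{bits},
\]

with the constant decoder sitting at 0 bits, the first-bit decoder at no more than 1 bit, and a lossless decoder recovering up to all n bits.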
The idea of steganography is that, if we pick a CoT string, and run both an AI model's internal decoders and a human's language-comprehension decoders on it, the AI model would extract more bits of information from it. To put it another way, humans' decoders are “coarser”: they have broader equivalence classes of messages that map to the same internal representation, compared to AI models.
For example, “a dog crossed the street” and “a dog traveled across the street” might map to the same internal representation for a human, but to different ones for an AI model. Similarly, things like “quantum butterflies juggled metaphysical spaghetti” and “ambitious lampshade collected vintage moonbeams” can map to the same “nonsense” internal representation in humans, but have some useful meaning for AI models. The bits a human fails to make use of correspond to “steganography” or “superhuman latent knowledge”.
Proposals such as Shoggoth + Paraphraser + Face aim to destroy any information in CoTs that cannot be recovered by humans, by mapping each CoT string to the equivalence class of strings that all map to the same internal representation in humans.
However, there’s an issue: what if the empty string already communicates plenty of steganographically encoded information that humans aren’t privy to? That is: what if some knowledge is hard-coded into an AI model, such that it’s elicited on any activation, and leads to it converging towards reaching the same hidden conclusions even after we destroy its ability to sneak-in additional query-relevant information?
For example:
If the AI model implicitly learned the Theory of Everything, that knowledge is going to be activated on any prompt – and so, no amount of paraphrasing is going to force it to talk about the Theory of Everything in humanese. Communicating the ToE from one instance of the model to its future self takes up zero bits.
If the AI model has internalized some hidden goals, the heuristics for pursuing those goals are likewise going to be activated on any prompt, and it would engage in massively productive hidden scheming on any prompt. Importantly, the information-bottleneck constraints imposed by the Paraphraser wouldn’t even be hitting the hidden-plotting compartment specifically. They’d just uniformly lower the model’s computational power.
E. g., in the Shoggoth + Paraphraser + Face paradigm, that’d just mean the model learns to think in longer CoTs (because it needs longer messages to communicate the same amount of information), not to do fewer “sneaky” computations.[1]
Like, imagine that we have two different internal-representation sets: X, containing problem-solving strategies for the overt task we’re presenting the model, and Y, related to some misaligned scheming. We would like to constrain the model such that it cannot plan about Y; such that its CoTs contain zero bits for narrowing down which state out of all Y it must pick.
This assumes something like the following model: A given CoT string communicates some N bits of information. Of them, M bits are human-legible. The model then follows the algorithm of: use N_X task-relevant bits on picking a state out of X, and use any leftover N_Y = N − N_X bits to pick a state out of Y. We then destroy the string's information content until N ≈ M (via the Paraphraser). Now the model doesn't have the ability to encode any non-human-legible bits, and if it uses any bits on picking a state out of Y, they would necessarily be human-legible, and we can spot the scheming by reading the CoT.
But that's not how this works. Instead, the model picks both simultaneously: a CoT is directly mapped to an ⟨X, Y⟩ two-tuple. Which means that if the model can narrow down the state of X to one out of 2^10, it can narrow down the state of Y to one out of 2^10 in the same move. Even if we scramble the CoT until it communicates no more bits than a human can extract out of it, that doesn't differentially constrain the model's ability to think about scheming (= communicate scheming-relevant information to its future self).
It’s a uniform communication constraint/computational-complexity penalty.
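Writing the two accountings side by side (my notation, just restating the argument): the naive picture charges the two channels against a shared budget,

\[
N_X + N_Y \;\le\; H(c) \;\le\; N, \qquad H(c) \to M \;\;\Rightarrow\;\; N_Y \to 0 \ \text{once } N_X \approx M,
\]

whereas the only constraint actually imposed is on the joint specification,

\[
H\big(x(c),\, y(c)\big) \;\le\; H(c) \;\to\; M,
\]

which bounds the pair as a whole and never charges the Y-channel separately: shrinking H(c) down to M is a uniform squeeze on both channels, not a targeted hit on Y.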
Generalizing beyond scheming: if the model were inclined to not put any of its thoughts into humanese (due to superhuman latent knowledge, say), the Paraphraser wouldn’t address this issue either. If a model’s plans contain some catastrophic side-effects (which aren’t even malign, just something we didn’t tell it not to cause), the Paraphraser wouldn’t reveal that either.
@Daniel Kokotajlo, thoughts?
(This doesn’t touch on the Face part of the proposal, i. e., the idea that we want to not pressure the model to conceal any scheme-y cognition in the hopes it’d naively reveal them. But if there’s some pressure to conceal that, the Paraphraser would not help with it at all. It’d just teach the model to map humanese-equivalent strings to equivalent scheming-states, which isn’t particularly useful.)
Arguably, if you also penalize the CoT length, this should cause a pressure towards throwing out any computations unrelated to the object-level problem the model’s being trained on… But then the whole “paraphraser” bit doesn’t actually do anything, you can achieve the same outcome by just directly penalizing the CoT length.
Which, inasmuch as DL training loops use some form of simplicity prior, should be happening by default anyway. I. e., this doesn’t address the pressures that we expect to lead to steganography/hidden plotting despite the simplicity prior: which is to say, arguments that steganography/plotting are the actual optimal solutions to the training problem.
Consider the Theory of Everything example: those hidden computations are not actually “irrelevant computations” that we need to displace, they’re directly useful. Similarly, any misaligned mesa-objectives the model forms would presumably be convergently useful for performance on the training objective: that would be why they formed to begin with. So the information-bottleneck trickery would do ~nothing against them.
I like to distinguish between “Subtle strategic bias” and “Steganography.” Paraphrasers help (and maybe solve?) the latter, but do nothing about the former.
Example of subtle strategic bias: The model knows it’s misaligned and therefore deliberately sandbags on the alignment research you are asking it to do. “I’m sorry sir I just am not making much progress,” it says. You look at the CoT, and it’s just doing a bunch of sensible but fruitless lines of reasoning.
Example of steganography: The model needs to insert a subtle security vulnerability into the code it is writing—but it doesn’t know one right away, it has to invent one. And that requires lots of sequential thought. So it needs to try out different ideas, vet them, critique them, refine them, etc. Can’t do that in a single forward pass. So it needs to pass information forward to its future self somehow… without humans reading the CoT knowing that’s what it’s doing. THIS is the sort of thing that is much harder to do with a paraphraser than without a paraphraser.
I don’t see why it would be harder to do.
Suppose you tasked a model with refactoring some database, and it secretly wants to sabotage its functionality. It outputs a CoT segment that goes, “I need to design unit tests for this module”. Each word of this CoT is mapped to some internal representation, and the internal representation consists of “on-task” human-intended reasoning X, and misaligned scheming Y (corresponding to, e. g., a tactical decision to fail to include unit tests for some subtle issue). “I” is mapped to ⟨X_1, Y_1⟩, “need” is mapped to ⟨X_2, Y_2⟩, etc. Each X_i iterates on X_{i−1}, each Y_i iterates on Y_{i−1}: we have a sequence of serial computations.
In other words: as the model outputs, “I need to design unit tests for this module”, there’s an unspoken subtext of “I need to sabotage the unit tests here”, and the model packs nine words’ worth of serial computations into both the “textual” X and the “subtextual” Y aspects of it.
Now you use the Paraphraser on the string, generating alternatives like “the next step is to generate some unit tests” and “several unit tests for this module are needed here”. The intent, if I understand you correctly, is that the semantic-equivalence-but-syntactic-difference would ensure that the i-th token of this phrase is still mapped to X_i, as if the paraphrasing didn't happen, but that the mapping i → Y_i would be broken.
Except, why would that happen? What privileges X-representations over Y-representations? Why would i → X_i be preserved whereas i → Y_i is broken, instead of both of them remaining functional (and so preserving scheming), or both breaking (and so compromising the model's capabilities)?
In other words: We’re basically just teaching the model to fail to distinguish between equivalent-to-a-human rephrasings. Why would this lead to the task-related representations becoming indifferent to the specifics of a phrasing whereas the scheming-related representations remain sensitive to it? Why wouldn’t this just map the subtextual “I need to sabotage the unit tests here” to similarly semantically-equivalent-but-syntactically-different “let us compromise the relevant unit-test suite” or “I shall ensure the unit tests don’t work properly”?
I expect it'd just lead to the model's computations becoming indifferent to the details of a phrasing in full generality. And if the model were inclined to associate the general idea of generating unit tests with tactically screwing the task up, the model would come up with this idea regardless of the details of a phrasing.
It wouldn’t interfere with its ability to engage in scheming-relevant serial computations, much like it doesn’t interfere with its ability to engage in task-relevant serial computations, for the exact same reasons.
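Here's a deliberately toy sketch of what I mean (all the strings and mappings below are invented for illustration): a decoder that keys on paraphrase-equivalence classes carries the covert channel through the Paraphraser for exactly the same reason it carries the overt channel through.

```python
# A deliberately toy construction (all strings and mappings invented): a decoder that keys
# on the paraphrase-equivalence class of a CoT step, not on its surface form. Each class
# advances BOTH an overt task state and a covert scheme state, so paraphrasing the step
# leaves the covert channel exactly as intact as the overt one.

EQUIV_CLASS = {
    "I need to design unit tests for this module": "unit_test_step",
    "The next step is to generate some unit tests": "unit_test_step",
    "Several unit tests for this module are needed here": "unit_test_step",
}

STATE_UPDATE = {
    "unit_test_step": (
        "write unit tests",                                # overt, task-relevant update
        "omit the test that would catch the planted bug",  # covert, scheme-relevant update
    ),
}

def decode(cot_step):
    """Return the (task, scheme) state update implied by a CoT step, if any."""
    cls = EQUIV_CLASS.get(cot_step)
    return STATE_UPDATE[cls] if cls else None

for paraphrase in EQUIV_CLASS:
    print(decode(paraphrase))  # identical (task, scheme) update for every paraphrase
```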
I don’t think I understand. Let’s work through an example.
The AI is being told to write good code, and wants to write subtly-vulnerable code instead. If it just were to honestly try to write good code, it would take (let’s say) 100 serial steps of reasoning to succeed. To write subtly-vulnerable code, let’s say, requires 300 serial steps of reasoning. The model’s strategy is to write subtly-vulnerable code but in a way that looks like it’s trying to write good code and just getting a bit unlucky and taking longer than usual (let’s suppose the humans have a poor ability to judge how long it’s supposed to take, but still some ability, such that it can’t take e.g. 10x longer without arousing suspicion).
If there's no paraphraser, this is ‘easy.’ I mean it's still difficult but it's ‘easy’ in the sense that the AI can basically just use a code—each nugget of reasoning might have 10 different semantically equivalent expressions, or maybe even 100, so it can ‘add extra layers of meaning’ to the nuggets—in addition to their meaning on the surface level of i'm-trying-to-write-good-code, they can have a hidden meaning having to do with the security-vulnerability-related properties of the code. And it can make a chain of reasoning that is basically about the security stuff while looking like a reasonable but unlucky attempt to write good code.
But a paraphraser wrecks that strategy. The model can still add extra layers of meaning to the nuggets, but now it has loads of trouble making the chain of thought look like an honest attempt to write good code. To use your example, if ‘I should generate unit tests now’ and all semantically equivalent variants of that idea are assigned the hidden meaning ‘I should check to see if the vulnerability is too easy to spot’ for example, then… what is it going to write next? It’s too constrained; in order to look non-suspicious there is a limited set of options it can follow up with… ok, hmm, I’m at the limits of my understanding here. But maybe you get the idea?
…here’s an intuition I think I’m drawing from. Suppose you’re the model and you are trying to decide what hidden meanings to associate with what concepts/phrases/steps. If you already had your completed illicit reasoning chain and your completed good-boy reasoning chain, it would be easy—just line them up and make a 1-1 mapping, and then bam no paraphraser can stop you! But you don’t have either chain! That’s the whole point—this is a task that you haven’t done yet, you need to explore some maze in logical space (or tree? Or graph?) until you find a chain that gets to the conclusion/answer.
Ah, I think I see my error.
Consider a CoT of the form “[problem statement]: …........”, which always outputs dots regardless of the problem statement. Can that still encode useful computations, with the information being passed through intermediate self-attention? Yes, but: this only increases parallel-computing capabilities, not serial-computing capabilities.
And in the framework I’d outlined, under the Paraphraser, for the purposes of scheme-y thinking, even a legible CoT would consist of such “filler” tokens.
So yep, you’re right, this does force all scheming to be taking place within one forward pass.
Edit: Nevermind, evidently I’ve not thought this through properly. I’m retracting the below.
The naïve formulations of utilitarianism assume that all possible experiences can be mapped to scalar utilities lying on the same, continuous spectrum, and that experiences' utility is additive. I think that's an error.

This is how we get the frankly insane conclusions like “you should save 10^100 shrimps instead of one human” or everyone's perennial favorite, “if you're choosing between one person getting tortured for 50 years or some amount of people N getting a dust speck into their eye, there must be an N big enough that the torture is better”. I disagree with those. I would sacrifice an arbitrarily large amount of shrimps to save one human, and there's no N big enough for me to pick torture. I don't care if that disagrees with what the math says: if the math says something else, it's the wrong math and we should find a better one.

Here's a sketch of how I think such better math might look:

There's a totally ordered, discrete set of “importance tiers” of experiences.

Within tiers, the utilities are additive: two people getting dust specks is twice as bad as one person getting dust-speck'd, two people being tortured is twice as bad as one person being tortured, eating a chocolate bar twice per week is twice as good as eating a bar once per week, etc.

Across tiers, the tier ordering dominates: if we're comparing some experience belonging to a higher tier to any combination of experiences from lower tiers, the only relevant consideration is the sign of the higher-tier experience. No amount of people getting dust-speck'd, and no amount of dust-speck events sparsely distributed throughout one person's life, can ever add up to anything as important as a torture-level experience.[1]

Intuitively, tiers correspond to the size of effect a given experience has on a person's life:

A dust speck is a minor inconvenience that is forgotten after a second. If we zoom out and consider the person's life at a higher level, say on the scale of a day, this experience rounds off exactly to zero, rather than to an infinitesimally small but nonzero value. (Again, on the assumption of a median dust-speck event, no emergent or butterfly effects.)

Getting yelled at by someone might ruin the entirety of someone's day, but is unlikely to meaningfully change the course of their life. Experiences on this tier are more important than any amount of dust-speck experiences, but any combination of them rounds down to zero from a life's-course perspective.

Getting tortured is likely to significantly traumatize someone, to have a lasting negative impact on their life. Experiences at this tier ought to dominate getting-yelled-at as much as getting-yelled-at dominates dust specks.

Physically, those “importance tiers” probably fall out of the hierarchy of natural abstractions. Like everything else, a person's life has different levels of organization. Any detail in how the high-level life-history goes is incomparably more important than any experience which is only locally relevant (which fails to send long-distance ripples throughout the person's life). Butterfly effects are then the abstraction leaks (low-level events that perturb high-level dynamics), etc.
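Here's a minimal code sketch of that comparison rule (my own toy formalization; the tier names and numbers are made up for illustration). The point is just that “additive within a tier, lexicographic across tiers” is a coherent ordering; whether it's the right one is the actual question.

```python
# Toy formalization of the tier rule (tier names and numbers are made up): an outcome is
# summarized as one additive utility total per tier, and outcomes are compared
# lexicographically from the most important tier down; lower tiers only break ties.

TIERS = ["life-altering", "day-ruining", "trivial"]  # most to least important

def tier_vector(utilities):
    """Collapse {tier: summed within-tier utility} into a lexicographic comparison key."""
    return tuple(utilities.get(tier, 0.0) for tier in TIERS)

def prefer(a, b):
    """True if outcome `a` is preferred to outcome `b` under strict tier dominance."""
    return tier_vector(a) > tier_vector(b)

torture_world = {"life-altering": -1.0}  # one torture-level experience
speck_world = {"trivial": -3.0e18}       # arbitrarily many dust-speck experiences

print(prefer(speck_world, torture_world))  # True: no trivial-tier total outweighs the torture
```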
I didn't spend much time thinking about this, so there may be some glaring holes here, but this already fits my intuitions much better.

I think we can expand that framework to cover “tiers of sentience”:

If shrimps have qualia, it might be that any qualia they're capable of experiencing belong to lower-importance tiers, compared to the highest-tier human qualia.

Simultaneously, it might be the case that the highest-importance shrimp qualia are on the level of the lower-importance human qualia.

Thus, it might be reasonable to sacrifice the experience of eating a chocolate bar to save 10^100 shrimps, even if you'd never sacrifice a person's life (or even make someone cry) to save any amount of shrimps.

This makes some intuitive sense, I think. The model above assumes that “local” experiences, which have no impact on the overarching pattern of a person's life, are arbitrarily less important than that overarching pattern. What if we're dealing with beings whose internal lives have no such overarching patterns, then? A shrimp's interiority is certainly less complex than that of a human, so it seems plausible that its life-experience lacks comparably rich levels of organization (something like “the ability to experience what is currently happening as part of the tapestry spanning its entire life, rather than as an isolated experience”). So all of its qualia would be comparable only with the “local” experiences of a human, for some tier of locality: we would have direct equivalence between them.

One potential issue here is that this implies the existence of utility monsters: some divine entities such that they can have experiences incomparably more important than any experience a human can have. I guess it's possible that if I understood qualia better, I would agree with that, but this seems about as anti-intuitive as “shrimps matter as much as humans”. My intuition is that sapient entities top the hierarchy of moral importance, that there's nothing meaningfully “above” them. So that's an issue.

One potential way to deal with this is to suggest that what distinguishes sapient/“generally intelligent” entities is not that they're the only entities whose experiences matter, but that they have the ability to (learn to) have experiences of arbitrarily high tiers. And indeed: the whole shtick of “general intelligence” is that it should allow you to learn and reason about arbitrarily complicated systems of abstraction/multi-level organization. If the importance tiers of experiences really have something to do with the richness of the organization of the entity's inner life, this resolves things neatly. Now:

Non-sapient entities may have experiences of nonzero importance.

No combination of non-sapient experiences can compare to the importance of a sapient entity's life.

“Is sapient” tops the hierarchy of moral relevance: there's no type of entity that is fundamentally “above”.

Two caveats here are butterfly effects and emergent importance:

Getting a dust speck at the wrong moment might kill you (if you're operating dangerous machinery) or change the trajectory of your life (if this minor inconvenience is the last straw that triggers a career-ruining breakdown). We have to assume such possibilities away: the experiences exist “in a vacuum”. Doing otherwise would violate the experimental setup, dragging in various practical considerations instead of making it purely about ethics. So we assume that each dust-speck event always has the “median” amount of impact on a person's life, even if you scale the amount of dust-speck events arbitrarily.

Getting 1000 dust specks one after another adds up to something more than 1000 single-dust-speck experiences; it's worse than getting a dust speck once per day for 1000 days. More intuitively, experiencing a 10/10 pain for one millisecond is not comparable to experiencing 10/10 pain for 10 minutes. There are emergent effects at play, and like with butterflies, we must assume them away for experimental purity. So if we're talking about M experiences from within the same importance tier, they're assumed to be distributed such that they don't add up to a higher-tier experience.

Note that those are very artificial conditions. In real life, both of those are very much in play. Any lower-tier experience has a chance of resulting in a higher-tier experience, and every higher-tier experience emerges from (appropriately distributed) lower-tier experiences. In our artificial setup, we're assuming certain knowledge that no butterfly effects would occur, and that a lower-tier event contributes to no higher-tier pattern.

Relevance: There's reasoning that goes, “if you ever drive to the store to get a chocolate bar, you're risking crashing into and killing someone, therefore you don't value people's lives infinitely more than eating chocolate”. I reject it on the above grounds. Systematically avoiding all situations where you're risking someone's life in exchange for a low-importance experience would assemble into a high-importance life-ruining experience for you (starving to death in your apartment, I guess?). Given that, we're now comparing same-tier experiences, and here I'm willing to be additive, calculating that killing a person with very low probability is better than killing yourself (by a thousand cuts) with certainty.

Besides uncertainty, there's the problem of needing to pick cutoffs between tiers in a ~continuous space of “how much effect does this have on a person's life?”, with things slightly on one side or the other of a cutoff being treated very differently.
I agree with the intuition that this is important, but I think that points toward just rejecting utilitarianism (as in utility-as-a-function-purely-of-local-experiences, not consequentialism).
It's worth noting that everything funges: some large number of experiences of eating a chocolate bar can be exchanged for avoiding extreme human suffering or death. So, if you lexicographically put higher weight on extreme human suffering or death, then you're willing to make extreme tradeoffs (e.g., 10^30 chocolate-bar experiences) in terms of mundane utility for saving a single life. I think this easily leads to extremely unintuitive conclusions, e.g., you shouldn't ever be willing to drive to a nice place. See also Trading off Lives.
I find your response to this sort of argument under “Relevance: There’s reasoning that goes” in the footnote very uncompelling as it doesn’t apply to marginal impacts.
Ok, but if you don’t drive to the store one day to get your chocolate, then that is not a major pain for you, yes? Why not just decide that next time you want chocolate at the store, you’re not going to go out and get it because you may run over a pedestrian? Your decision there doesn’t need to impact your other decisions.
Then you ought to keep on making that choice until you are right on the edge of those choices adding up to a first-tier experience, but certainly below.
This logic generalizes. You will always be pushing the lower tiers of experience as low as they can go before they enter the upper-tiers of experience. I think the fact that your paragraph above is clearly motivated reasoning here (instead of “how can I actually get the most bang for my buck within this moral theory” style reasoning) shows that you agree with me (and many others) that this is flawed.
We can easily ban speeds above 15 km/h for any vehicles except ambulances. Nobody starves to death in this scenario; it's just very inconvenient. We value the convenience lost in this scenario more than the lives lost in our reality, so we don't ban high-speed vehicles.
Ordinal preferences are bad and insane and they are to be avoided.
What's really wrong with utilitarianism is that you can't, actually, sum utilities: it's a type error. Utilities are only defined up to positive affine transformation, so what would their sum even mean?
The problem, I think, is that humans naturally conflate two types of altruism. The first type is caring about other entities' mental states. The second type is “game-theoretic” or “alignment-theoretic” altruism: a generalized notion of what it means to care about someone else's values. Roughly, I think the good version of the second type of altruism requires you to bargain fairly on behalf of the entity you are being altruistic towards.
Let's take the “World Z” thought experiment. The problem, from the second-type-of-altruism perspective, is that the total utilitarian gets very large utility from this world, while all inhabitants of this world, by premise, get very small utility per person, which is an unfair division of gains.
One may object: why not create entities who think that a very small share of gains is fair? My answer is that if an entity can be satisfied with an infinitesimal share of gains, it can also be satisfied with an infinitesimal share of anthropic measure, i.e., non-existence, and it's more altruistic to look for more demanding entities to fill the universe with.
My general problem with animal welfare, from a bargaining perspective, is that most animals probably don't have sufficient agency to have any sort of representative in the bargaining. We can imagine a CEV of shrimp which is negative-utilitarian and wants to kill all shrimp, or a positive-utilitarian one which thinks that even a very painful existence is worth it, or a CEV that prefers shrimp swimming in heroin, or something human-like, or something totally alien; and the sum of these guesses probably adds up to “do not torture, and otherwise do as you please”.
Huh, I expected better from you.
No, it is absolutely not insane to save 10^100 shrimp instead of one human! I think the case for insanity for the opposite is much stronger! Please, actually think about how big 10^100 is. We are talking about more shrimp than atoms in the universe. Trillions upon trillions of shrimp more than atoms in the universe.
This is a completely different kind of statement than “you should trade of seven bees against a human”.
No, being extremely overwhelmingly confident about morality such that even if you are given a choice to drastically alter 99.999999999999999999999% of the matter in the universe, you call the side of not destroying it “insane” for not wanting to give up a single human life, a thing we do routinely for much weaker considerations, is insane.
The whole “tier” thing obviously fails. You always end up dominated by spurious effects on the highest tier. In a universe with any appreciable uncertainty you basically just ignore any lower tiers, because you can always tell some causal story of how your actions might infinitesimally affect something, and so you completely ignore it. You might as well just throw away all morality except the highest tier, it will never change any of your actions.
I’m normally in favor of high decoupling, but this thought experiment seems to take it well beyond the point of absurdity. If I somehow found myself in control of the fate of 10^100 shrimp, the first thing I’d want to do is figure out where I am and what’s going on, since I’m clearly no longer in the universe I’m familiar with.
Yeah, I mean, that also isn’t a crazy response. I think being like “what would it even mean to have 10^30 times more shrimp than atoms? Really seems like my whole ontology about the world must be really confused” also seems fine. My objection is mostly to “it’s obvious you kill the shrimp to save the human”.
Oh, easy, it just implies you’re engaging in acausal trade with a godlike entity residing in some universe dramatically bigger than this one. This interpretation introduces no additional questions or complications whatsoever.
Hm. Okay, so my reasoning there went as follows:
I'm not sure where you'd get off this train, but I assume the last bullet-point would do it? I. e., that you would argue that the possibility of shrimps having human-level qualia is salient in a way it isn't for rocks?
Yeah, that seems valid. I might’ve shot from the hip on that one.
I have a story for how that would make sense, similarly involving juggling inside-model and outside-model reasoning, but, hm, I’m somehow getting the impression my thinking here is undercooked/poorly presented. I’ll revisit that one at a later time.
Edit: Incidentally, any chance the UI for retracting a comment could be modified? I have two suggestions here:
I’d like to be able to list a retraction reason, ideally at the top of the comment.
The crossing-out thing makes it difficult to read the comment afterwards, and some people might want to be able to do that. Perhaps it’s better to automatically put the contents into a collapsible instead, or something along those lines?
You should be able to strike out the text manually and get the same-ish effect, or leave a retraction notice. The text being hard to read is intentional so that it really cannot be the case that someone screenshots it or skims it without noticing that it is retracted.
I think it pretty clearly is insane to save 10^100 shrimp instead of one human! It doesn't matter how many shrimp it is. The moral value of shrimp does not aggregate like that.
The grandparent comment is obviously correct in its description of the problem. (Whether the proposed solution works is another question entirely.)
That’s just not true.
One obvious approach is: once you get to the point where noise (i.e., non-systematic error) dominates your calculations for a particular tier, you ignore that tier and consider the next lower tier. (This is also approximately equivalent to a sort of reasoning which we use all the time, and it works pretty straightforwardly, without giving rise to the sorts of pathologies you allude to.)
There is no clear or adequate definition of what “[noise] dominates your calculations” means. Maybe you can provide one, but I've never seen anyone provide any such definition, or make much headway in doing so.
Creating such a definition of noise has proven to be quite hard, as it’s extremely rare that someone is willing to ignore absolutely all stakes at lower levels of concern or abstraction, no matter the magnitude.
Even if you try to elevate your family above everything else, it is commonly accepted that it is not moral to sacrifice all of society for just your family, or to threaten large-scale catastrophe.
Similarly, as you elevate the interests of your nation above other things, at a sufficient scale the interests of the rest of the world poke their way into your decision-making in substantial ways again.
Even if you try to do nothing but elevate the interests of animal life, we have still decided that it is not ethical to destroy, for those interests, even fully abiological ecosystems, and definitely-not-complicated plant-based ones, if the harm is sufficiently large.
You maybe want to propose we make decisions this way, but humanity absolutely does not generally make decisions this way. When people have to make decisions they usually decide on some rough thresholds for noticing tradeoffs across domains, and indeed decide and re-evaluate how important something is when a decision affects something in a different domain at much larger scale than other decisions.
Look, there are numbers that are very very big.[1]
Again, we are talking about so many shrimp that it would be exceedingly unlikely for this number of shrimp, if left under the auspices of gravity, not to form their own planets and solar systems and galaxies, in which life thrives and in which other non-shrimp intelligences form. A number so incomprehensibly big. A galaxy within each atom of our universe made out of shrimp. One can argue it's meaningless to talk about numbers this big, and while I would dispute that, it's definitely a much more sensible position than trying to take a confident stance on destroying or substantially altering a set of things so large that it vastly eclipses, in complexity and volume and mass and energy, all that has ever or will ever exist, by a trillion-fold.
Indeed, there are numbers so big that the very act of specifying them would encode calculations capable of simulating universes full of healthy and happy humans. The space of numbers is really very large.
Okay, while I'm hastily backpedaling from the general claims I made, I am interested in your take on the first half of this post. I think there's a difference between talking about an actual situation, full complexities of the messy reality taken into account, where a supernatural being physically shows up and makes you really decide between a human and 10^100 shrimp, and a thought experiment where “you” “decide” between a “human” and 10^100 of “shrimp”. In the second case, my model is that we're implicitly operating in an abstracted-out setup where the terms in the quotation marks are, essentially, assumed ontologically basic, and matching our intuitive/baseline expectations about what they mean.
While, within the hypothetical, we can still have some uncertainty over e. g. the degree of the internal experiences of those “shrimp”, I think we have to remove considerations like “the shrimp will be deposited into a physical space obeying the laws of physics where their mass may form planets and galaxies” or “with so many shrimp, it’s near-certain that the random twitches of some subset of them would spontaneously implement Boltzmann civilizations of uncountably many happy beings”.
IMO, doing otherwise is a kind of “dodging the hypothetical”, no different from considering it very unlikely that the supernatural being really has control over 10100 of something, and starting to argue about this instead.
I agree there is something to this, but when actually thinking about tradeoffs that do actually have orders of magnitude of variance in them, which is ultimately where this kind of reasoning is most useful (not 100 orders of magnitude, but you know 30-50 are not unheard of), this kind of abstraction would mostly lead you astray, and so I don’t think it’s a good norm for how to take thought experiments like this.
Like, I agree there are versions of the hypothetical that are too removed, but ultimately, I think a central lesson of scope sensitivity is that having a lot more of something often means drastic qualitative changes that come with that drastic change in quantity. Having 10 flop/s of computation is qualitatively different to having 10^10 flop/s. I can easily imagine someone before the onset of modern computing saying “look, how many numbers do you really need to add in everyday life? What is even the plausible purpose of having 10^10 flop/s available? For what purpose would you need to possibly perform 10 billion operations per second? This just seems completely absurd. Clearly the value of a marginal flop goes to zero long before that. That is more operations than all computers[1] in the world have ever ever done, in all of history, combined. What could possibly be the point of this?”
And of course, such a person would be sorely mistaken. And framing the thought experiment as “well, no, I think if you want to take this thought experiment seriously you should think about how much you would be willing to pay for the 10 billionth operation of the kind that you are currently doing, which is clearly zero. I don’t want you to hypothesize some kind of new art forms or applications or computing infrastructure or human culture, which feel like they are not the point of this exercise, I want you to think about the marginal item in isolation” would be pointless. It would be emptying the exercise and tradeoff of any of its meaning. If we ever face a choice like this or, anything remotely like it, of course how the world adapts around this, and the applications that get built for it, and the things that aren’t obvious from when you first asked the question matter.
And to be clear, I think there is also a real conversation going on here about whether maybe, even if you isolated each individual shrimp into a tiny pocket universe, and you had no way of ever seeing them or visiting the great shrimp rift (a natural wonder clearly greater than any natural wonder on earth), and all you knew for sure was that it existed somewhere outside of your sphere of causal influence, and the shrimp never did anything more interesting than currently-alive shrimp, whether it would still be worth it to kill a human. And for that, I think the answer is less obviously “yes” or “no”, though my guess is that the 10^100 causally isolated shrimp ultimately still enrich the universe more than a human would, and are worth preserving more, but it's less clear.
And we could focus on that if we want to, I am not opposed to it, but it’s not clearly what the OP that sparked this whole thread was talking about, and I find it less illuminating than other tradeoffs, and it would still leave me with a strong reaction that at least the reason for why the answer might be “kill the shrimp” is definitely absolutely different from the answer to why you should not kill a human to allow 7 bees to live for a human lifetime.
Here referring to the human occupation of “computer”, i.e., someone in charge of performing calculations.
Yeah, that’s more what I had in mind. Illusion of transparency, I suppose.
Certainly, and it’s an important property of reality. But I don’t think this is what extreme hypotheticals such as the one under discussion actually want to talk about (even if you think this is a more important question to focus on)?
Like, my model is that the 10^100 shrimp in this hypothetical are not meant to literally be 10^100 shrimp. They're meant to be “10^100” “shrimp”. Intuitively, this is meant to stand for something like “a number of shrimp large enough for any value you're assigning them to become morally relevant”. My interpretation is that the purpose of using a crazy-large number is to elicit that preference with certainty, even if it's epsilon; not to invite a discussion about qualitative changes in the nature of crazy-large quantities of arbitrary matter.
The hypothetical is interested in shrimp welfare. If we take the above consideration into account, it stops being about “shrimp” at all (see the shrimps-to-rocks move). The abstractions within which the hypothetical is meant to live break.
And yes, if we’re talking about a physical situation involving the number 10100, the abstractions in question really do break under forces this strong, and we have to navigate the situation with the broken abstractions. But in thought-experiment land, we can artificially stipulate those abstractions inviolable (or replace the crazy-high abstraction-breaking number with a very-high but non-abstraction-breaking number).
I agree that this is a thing people often like to invoke, but it feels to me a lot like people talking about billionaires and not noticing the classic crazy arithmetic errors they make there.
Like, in those discussions people are almost always trying to invoke numbers like “$1 trillion” as “a number so big that the force of the conclusion must be inevitable”, but like most of the time they just fail because the number isn’t big enough.
If someone was like “man, are you really that confident that a shrimp does not have morally relevant experience that you wouldn’t trade a human for a million shrimp?”, my response is “nope, sorry, 1 million isn’t big enough, that’s just really not that big of a number”. But if you give me a number a trillion trillion trillion trillion trillion trillion trillion trillion times bigger, IDK, yeah, that is a much bigger number.
And correspondingly, for every thought experiment of this kind, I do think there is often a number that will just rip through your assumptions and your tradeoffs. There are just really very very very big numbers.
Like, sure, we all agree our abstractions break here, and I am not confident you can't find some hardening of the abstractions that makes the tradeoff come out in the direction of the size of the number really absolutely not mattering at all, but I think that would be a violation of the whole point of the exercise. Like, clearly we can agree that we assign a non-zero value to a marginal shrimp. We value that marginal shrimp for a lot of different reasons, but like, you probably value it for reasons that do include things like the richness of its internal experience, and the degree to which it differs from other shrimp, and the degree to which it contributes to an ecosystem, and the degree to which it's an interesting object of trade, and all kinds of reasons. Now, if we want to extrapolate that value to 10^100, those things are still there; we can't just start ignoring them.
Like, I would feel more sympathetic to this simplification if the author of the post were a hardcore naive utilitarian, but they self-identify as a Kantian. Kantianism is a highly contextual ethical theory that clearly cares about a bunch of different details of the shrimp, so I don't get the sense the author wants us to abstract away everything but some supposed “happiness qualia” or “suffering qualia” from the shrimp.
Isn’t it the opposite? It’s a defence against providing too-low numbers, it’s specifically to ensure that even infinitesimally small preferences are elicited with certainty.
Bundling up all “this seems like a lot” numbers into the same mental bucket, and then failing to recognize when a real number is not actually as high as in your hypothetical, is certainly an error one could make here. But I don’t see an exact correspondence...
In the billionaires case, a thought-experimenter may invoke the hypothetical of “if a wealthy person had enough money to lift everyone out of poverty while still remaining rich, wouldn’t them not doing so be outrageous?”, while inviting the audience to fill-in the definitions of “enough money” and “poverty”. Practical situations might then just fail to match that hypothetical, and innumerate people might fail to recognize that, yes. But this doesn’t mean that that hypothetical is fundamentally useless to reason about, or that it can’t be used to study some specific intuitions/disagreements. (“But there are no rich people with so much money!” kind of maps to “but I did have breakfast!”.)
And in the shrimps case, hypotheticals involving a “very-high but not abstraction-breaking” number of shrimps are a useful tool for discussion/rhetoric. They allow us to establish agreement/disagreement on “shrimp experiences have inherent value at all”, a relatively simple question that could serve as a foundation for discussing other, more complicated and contextual ones. (Such as “how much should I value shrimp experiences?” or “but do enough shrimps actually exist to add up to more than a human?” or “but is Intervention X, to which I'm asked to donate $5, actually going to prevent five dollars' worth of shrimp suffering?”.)
Like, I think having a policy of always allowing abstraction breaks would just impoverish the set of thought experiments we would be able to consider and use as tools. Tons of different dilemmas would collapse to Pascal's mugging or whatever.
Hmm… I think this paragraph at the beginning is what primed me to parse it this way:
Why would we need this assumption[1], if the hypothetical weren’t centrally about the inherent value of the shrimps/shrimp qualia, and the idea that it adds up? The rest of that essay also features no discussion of the contextual value that the existence of a shrimp injects into various diverse environments in which it exists, etc. It just throws the big number around, while comparing the value of shrimps to the value of eating a bag of skittles, after having implicitly justified shrimps having value via shrimps having qualia.
I suppose it’s possible that if I had the full context of the author’s writing in mind, your interpretation would have been obviously correct[2]. But the essay itself reads the opposite way to me.
A pretty strong one, I think, since “are shrimp qualia of nonzero moral relevance?” is often the very point of many discussions.
Indeed, failing to properly familiarize myself with the discourse and the relevant frames before throwing in hot takes was my main blunder here.
I agree I probably implied a bit too much contextualization. Like, I agree the post has a utilitarian bend, but man, I just really don't buy the whole “let's add up qualia” thing as any basis of moral calculation; I find attempts at trying to create a “pure qualia shrimp” about as confused and meaningless as trying to argue that 7 bees are more important than a human. “Qualia” isn't a thing that exists. The only thing that exists is your values, in all their complexity and godshatteredness. You can't make a “pure qualia shrimp”; it doesn't make any philosophical sense. Pure qualia isn't real.
And I agree that maybe the post was imagining some pure qualia juice, and I don't know, maybe in that case it makes sense to dismiss it by doing a reductio ad absurdum on qualia juice, but I don't currently buy it. I think that both wouldn't be engaging with the strongest version of the author's position, and would also be kind of a bad step in the discourse (like, the previous step was understanding why it doesn't make sense for 7 bees to be more important than a human, for a lot of different reasons and very robustly; and within that discourse, it's actually quite important to understand why 10^100 shrimp might actually be more important than a human, under at least a lot of reasonable assumptions).
Same, honestly. To me, many of these thought experiments seem decoupled from anything practically relevant. But it still seems to me that people often do argue from those abstracted-out frames I'd outlined, and these arguments are probably sometimes useful for establishing at least some agreement on ethics. (I'm not sure what a full-complexity godshatter-on-godshatter argument would even look like (a fistfight, maybe?), and I'm very skeptical it'd yield any useful results.)
Anyway, it sounds like we mostly figured out what the initial drastic disconnect between our views here was caused by?
Yeah, I think so, though not sure. But I feel good stopping here.
This just means that “elevate your family above everything else” is not an approved-of moral principle, not that it somehow doesn’t work on its own terms. In any case this is not a problem with multi-tier morality, it’s just a disagreement on what the tiers should be.
This, on the other hand, is a matter of instrumental values, not terminal ones. There is once again no problem here with multi-tier morality.
Same reply as to the first point. (Also, who has ever advocated so weirdly drawn a moral principle as “do nothing but elevate the interests of animal life”…?)
It doesn’t matter how big the numbers are, because the moral value of shrimp does not aggregate like that. If it were 3^^^3 shrimp, it still wouldn’t matter.
Now you’re just smuggling in additional hypothesized entities and concerns. Are we talking about shrimp, or about something else? This is basically a red herring.
That aside—no, the numbers really don’t matter, because that’s just not how moral value of shrimp works, in any remotely sensible moral system. A trillion shrimp do not have a million times the moral value of a million shrimp. If your morality says that they do, then your morality is broken.
Nobody was saying this! The author of the post in question also does not believe this!
I am not a hedonic utilitarian. I do not think that a trillion shrimp have a million times the moral value of a million shrimp. That is a much, much stronger statement than whether there exists any number of shrimp that might be worth more than a human. All you've done here is set up a total strawman that nobody was arguing for and knock it down.
Ok. Do you think that a trillion shrimp have:
… 1,000 times the moral value of a million shrimp?
… 10 times the moral value of a million shrimp?
… 1.1 times the moral value of a million shrimp?
… some other multiplicative factor, larger than 1, times the moral value of a million shrimp?
If the answer is “no” to all of these, then that seems like it would mean that you already agree with me, and your previous comments here wouldn’t make any sense. So it seems like the answer has to be “yes” to something in that list.
But then… my response stands, except with the relevant number changed.
On the other hand, you also say:
I… don’t understand how you could be using this term that would make this a meaningful or relevant thing to say in response to my comment. Ok, you’re not a hedonic utilitarian, and thus… what?
Is the point that your claim (that saving 10^100 shrimp instead of one human isn't insane) was actually not a moral claim at all, but some other kind of claim (prudential, for instance)? No, that doesn't seem to work either, because you wrote:
So clearly this is about morality…
… yeah, I can’t make any sense of what you’re saying here. What am I missing?
I don't know, it seems like a very hard question, and I think it will be quite sensitive to a bunch of details of the exact comparison. Like, how much cognitive diversity is there among the shrimp? Are the shrimp forming families and complicated social structures, or are they all in an isolated grid? Are they providing value to an extended ecosystem of other life? How rich are the lives of these specific shrimp?
I would be surprised if the answer basically ever turned out to be less than 1.1, and surprised if it ever turned out to be more than 10,000.
I don’t think your response said anything except to claim that a linear relationship between shrimp and values seems to quickly lead to absurd conclusions (or at least that is what I inferred from your claim of saying that a trillion shrimp is not a million times more valuable than a million shrimp). I agree with that as a valid reductio ad absurdum, but given that I see no need for linearity here (simply any ratio, which could even differ with the scale and details of the scenario), I don’t see how your response stands.
I have little to go off of besides to repeat myself, as you have given me little to work with besides repeated insistence that what I believe is wrong or absurd. My guess is my meaning is more clear (though probably still far from perfectly clear) to other readers.
I mean… we know the answers to these questions, right? Like… shrimp are not some sort of… un-studied exotic form of life. (In any case it’s a moot point, see below.)
Right, so, “some … multiplicative factor, larger than 1”. That’s what I assumed. Whether that factor is 1 million, or 1.1, really doesn’t make any difference to what I wrote earlier.
No, my point is that any factor at all that is larger than 1, and remains larger than 1 as numbers increase, leads to absurd conclusions. (Like, for example, the conclusion that there is some number of shrimp such that that many shrimp are worth more than a human life.)
Given this correction, do you still think that I’m strawmanning or misunderstanding your views…? (I repeat that linearity is not the target of my objection!)
I mean, clearly you agree that two shrimp are more important than one shrimp, and that they continue to be more important (at least for a while) as the numbers increase. So no, I don't understand what you are saying, as nothing you have said appears sensitive to any numbers being different, and clearly for small numbers you agree that these comparisons must hold.
I agree there is a number big enough where eventually you approach 1; nothing I have said contradicts that. As in, my guess is that the series of the value of n shrimp, as n goes to infinity, does not diverge but eventually converges on some finite number (though especially with considerations like Boltzmann brains and quantum uncertainty and matter/energy density, it does seem confusing to think about).
It seems quite likely to me that this point of convergence is above the value of a human life, as numbers can really get very big, there are a lot of humans, and shrimp are all things considered pretty cool and interesting and a lot of shrimp seem like they would give rise to a lot of stuff.
Hm… no, I don’t think so. Enough shrimp to ensure that there keep being shrimp—that’s worth more than one shrimp. Less shrimp than that, though—nah.
Sure, this is all fine (and nothing that I have said contradicts you believing this; it seems like you took my objection to be much narrower than it actually was), but you’re saying that this number is much larger than the value of a human life. That’s the thing that I’m objecting to.
I’ll mostly bow out at this point, but one quick clarification:
I didn’t say “much larger”! Like, IDK, my guess is there is some number of shrimp for which its worth sacrificing a thousand humans, which is larger, but not necessarily “much”.
My guess is there is no number, at least in the least convenient world where we are not talking about shrimp galaxies forming alternative life forms, for which it’s worth sacrificing 10 million humans, at least at current population levels and on the current human trajectory.
10 million is just a lot, and humanity has a lot of shit to deal with, and while I think it would be an atrocity to destroy this shrimp-gigaverse, it would also be an atrocity to kill 10 million people, especially intentionally.