Irretrievability; or, Murphy’s Curse of Oneshotness upon ASI
Example 1: The Viking 1 lander
In the 1970s, NASA sent a pair of probes to Mars, the Viking 1 and Viking 2 missions. The total cost was $1B (1970 dollars), equivalent to about $7B (2025 dollars). The Viking 1 probe operated on Mars’s surface for six years, before its battery began to seriously degrade.
One might have thought a battery problem like that would spell the irrevocable end of the mission. The probe had already launched and was now on Mars, very far away and out of reach of any human technician’s fixing fingers. Was it not inevitable, then, that if any kind of technical problem were to be discovered long after the space launch in August 1975, nothing could possibly be done?
But the foresightful engineers of the Viking 1 probe had devised a plan for just this class of eventuality, which they had foreseen in general, if not in exact specifics. They had built the Viking 1 probe to accept software updates by radio receiver, transmitted from Earth.
On November 11, 1982, Earth sent an update to the Viking 1 lander’s software, intended to make sure the battery only discharged down to a minimum voltage level, rather than running for a fixed time after each charge.
The battery-software update accidentally overwrote the antenna-pointing software.
With the lander’s antenna no longer pointed at Earth, no further software updates beyond that point could be received.
The error had destroyed the intended mechanism for recovering from errors.
All contact with the Viking 1 lander was permanently lost. Ground engineers tried some strategies for regaining contact, based on extrapolation of where the antenna could have ended up pointing, but none succeeded.
In this I observe a specific instance of a general idea: Murphy’s Curse of Inaccessibility on space probes is a deep problem. A clever system designed in the hope of accepting later patches is a relatively shallower solution.
Putting wings on an airplane doesn’t make it weightless and repeal the law of gravity. The weight of an airplane is an intrinsic property that goes on making it susceptible to falling out of the sky if the wings stop working. Your model of the airplane should include the ongoing weight and the ongoing lift; it should not claim that the curse of airplane-weight will be dispelled by wings.
The engineers’ attempted strategy for mitigating the underlying oneshot quality of a space probe launch—the engineers’ intended mechanism for correcting mistakes afterwards—did not actually transform the Viking 1 lander into an Earth-bound car that you could walk over to and fix. Any sort of problem that struck at the corrective machinery itself, would catapult you right back into the fundamental inaccessibility scenario, that you couldn’t just walk over and fix a broken corrective mechanism. (And also, of course, large classes of possible error can’t be addressed by a software update at all.) The underlying reality was that the probe stayed far and high away.
Rocket science wouldn’t be so famously cursed by Murphy’s Law, if the heightened susceptibility conditions for Murphy’s Law to act upon their projects, were so easy to defeat with a little effort. The many Curses of Murphy upon aerospace engineering can be fought; but not vanquished, not dispelled.
Good aerospace engineers know that. So they put in the extreme levels of paranoia and preparation that are required to sometimes succeed.
One can only imagine what tiny fraction of space probe missions would succeed if the engineers or managers were the sort who went around bragging, “Our space probe won’t be inaccessible after launch at all; we built in an antenna to upload software updates! Don’t listen to those silly people who’ll tell you that you ‘can’t walk over and fix’ a space probe after launch; they lack our own experience to have had the brilliant idea of software updates!”
Would such a cheerful innocent ever succeed at landing a space probe? It seems crazy to assign them a real-world success probability as high as 10%. Aerospace engineers have to work much harder than that, be much more paranoid and cautious than that, to drive the success chance significantly above 0.
Example 2: The Mars Observer
The Mars Observer mission was approved in October of 1984 and launched in September of 1992, at a cost of $813 million ($2B 2025). It flew through space for 330 days, and then, three days before inserting into Mars orbit, communication with the probe was lost.
The best-guess postmortem analysis: After the earlier stresses of launch, and an 11-month flight through vacuum, a PTFE check valve had leaked fuel and oxidizer vapors that accumulated within feed lines in the zero gravity; and this produced an explosion when the engine was restarted (for course correction before orbital insertion).
That sort of thing happens, when you try to do something for the first time. One of the reasons why space probes are famously acutely susceptible to Murphy’s Law, is that each new probe gets custom-built for a new mission. Each mission is a chance for something new to go wrong.
Now imagine some manager or space enthusiast saying—in advance of the actual disaster, of course—“The Mars Observer mission isn’t novel! We can test the probe here on Earth in a vacuum chamber! We have experience from previous space probes! We have the whole mighty edifice of science to observe the laws of physics; and we can use those laws to extrapolate the system behavior of the Mars Observer probe on its way to Mars!”
To enumerate the object-level reasons this doesn’t repeal Murphy’s Curse of Novelty upon space-probe missions generally, or the Mars Observer specifically:
- Even if humanity did in fact learn something from previous space probes, previous probes weren’t exactly like the Mars Observer. The ultimate novelty of the mission was not defeated, repealed, nor averted.
The Mars Observer might have failed even earlier, if it had been attempted with even less experience. It’s not that all previous learning had no effect. But humanity had not learned enough, generalized correctly and with sufficient reliability, from those earlier nonidentical space probes, to get the new and different Mars Observer to Mars.
- Even if somebody had spun around the probe in a centrifuge to simulate a high-G space launch, and tested out all the systems in a vacuum chamber, that still would not have faithfully reproduced the exact conditions under which fuel vapor leaked in vacuum and then accumulated over eleven months in zero gravity. The conditions of validation here on Earth would not have been exactly like the deployment conditions.
This is why you can’t get solid guarantees on space probes using mathematically valid statistics. The training distribution is not the deployment distribution, and that takes all mathematical guarantees and throws them out the window. Since aerospace engineers aren’t wacky lunatics, they know this, and none of them have ever even tried to suggest that any kind of mathematical guarantee could apply.
Real life has many such cases. Much more mundanely than space probes, there’s no way to use clever statistical guarantees to force an ordinary human conversation to go well, because no two conversations are the same and they’re not sampled from a time-invariant distribution.
- Humanity’s grasp of chemistry and physics—built by generalizing mathematically simple laws, over a genuinely vast body of observations—and then applied to molecules and gases literally identical to the molecules and gases in the Mars Observer—as put together in straightforwardly-mechanical processes themselves observed repeatedly, vastly simple by comparison with eg large computer programs or biochemistry—was not actually adequate to predict and control the mission outcome.
Knowing all that physics did not negate the underlying surprisingness of a system of even that small complexity. It did not transform it into a mere repetition of previous operations on identical titanium alloys.
This, again, doesn’t mean that humanity’s knowledge of science and physics contributed zero help to the Mars Observer mission. That space mission would’ve failed much earlier and harder—it is difficult even to imagine the counterfactual—if humanity’s grasp of the underlying science had been more akin to medieval alchemists pontificating about the philosophical significances of reagents, with every leading alchemist making up their own brilliant plan for a space mission where several steps involved metaphysical principles of great uplifting moral significance.
Humanity’s understanding of the underlying mechanical processes was how the Mars Observer mission came close to succeeding—in a way that medieval alchemy never came close to an immortality potion, or even to the far simpler goal of transforming lead into gold.
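The point above, that the validation distribution is not the deployment distribution, can be made concrete with a toy sketch. Everything in it is an assumption invented for illustration: the `fails` component, the stress threshold of 1.0, and the two samplers are hypothetical and are not a model of the Mars Observer's actual valves. The only real statistical fact used is the "rule of three": after n clean, identically distributed trials, roughly 3/n is a 95% upper bound on the failure rate.

```python
import random

def fails(stress: float, rng: random.Random) -> bool:
    """Hypothetical component: never fails below a stress of 1.0, fails half the time above it."""
    p_fail = 0.0 if stress < 1.0 else 0.5
    return rng.random() < p_fail

def ground_test_stress(rng: random.Random) -> float:
    """Assumed test conditions: only stresses below the threshold are reachable on the ground."""
    return rng.uniform(0.0, 0.9)

def deployment_stress(rng: random.Random) -> float:
    """Assumed deployment conditions: stresses the ground tests never sampled."""
    return rng.uniform(1.0, 2.0)

def failure_rate(stress_sampler, trials: int, rng: random.Random) -> float:
    """Fraction of trials that fail when stress is drawn from `stress_sampler`."""
    failures = sum(fails(stress_sampler(rng), rng) for _ in range(trials))
    return failures / trials

def rule_of_three_bound(clean_trials: int) -> float:
    """Approximate 95% upper bound on failure rate after zero failures in i.i.d. trials."""
    return 3.0 / clean_trials

if __name__ == "__main__":
    rng = random.Random(0)
    tests = 300
    print("ground failure rate:", failure_rate(ground_test_stress, tests, rng))
    print("bound if deployment matched the tests:", rule_of_three_bound(tests))
    print("deployment failure rate:", failure_rate(deployment_stress, 1000, rng))
```

The bound is valid arithmetic and says nothing about deployment, because its one premise (same distribution) is the thing that fails. No amount of additional ground testing in this toy setup moves the deployment failure rate.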
To sum up: Even (1) learning from previous space probes, (2) testing under controlled conditions attemptedly similar to conditions in space, (3) knowing all relevant fundamental physical laws exactly[1], (4) having an excellent quantitative grasp of relatively simple higher-level phenomena that governed fully, and (5) doing NASA-standard amounts of intensive thinking, gaming, and simulation about what could go wrong with a billion-dollar project, did not repeal Murphy’s Curse of Novelty upon space probes. It was, in the end, still the very first Mars Observer mission.
NASA’s efforts at understanding could challenge that Curse of Novelty, in a way that no alchemist’s philosophizing could have challenged it, even if the alchemist had managed to grasp one or two rules-of-thumb. The people at NASA who put together the Mars Observer mission over many years of careful planning for that exact mission, had a level of professionalism, engineering caution, background scientific knowledge, specific preparation time, and general seriousness, vastly exceeding the professionalism of any alchemist or AGI company executive.
...Which didn’t actually repeal all of the Murphian curses upon space probes. It wasn’t enough for the Mars Observer to actually work.
The genuinely very serious people at NASA put up enough of a fight that the Mars Observer almost worked. Even the RBMK design for the Chernobyl nuclear reactor almost worked; it worked for many operation-years before one exploded! Despite the Soviet managers taking a few Disaster Stances that put a ceiling on the maximum socially allowed level of pessimism, the Soviet nuclear engineers knew vastly more and took their jobs vastly more seriously than medieval alchemists or modern AGI companies. There was an actual theory of why the Chernobyl reactor was supposed to not explode, written down so that multiple people could read it, based on an understanding from first principles! They had written handbooks in the 24/7 control room, and the written handbooks weren’t just made up to look better!
It’s just that to have a Murphy-cursed project actually really work in real life, rather than almost work, is very much harder.
(Though again to be clear, professionalism cannot magically make just any project almost-work. You could not give even the genuinely serious people at 1970s NASA the goal of building a contagious virus that conferred de-aging and indefinite biological healthspan, and have the resulting virus almost-work. The level of difficulty for “make an immortality virus” would be beyond what serious people could almost-do in 1970. Part of being serious is having some sense of a project’s cursedness level, and not being a lunatic about what you try to do at high stakes.)
Example 3: The Maginot Line
In September 1939, Germany invaded Poland; this is usually the date given for the start of World War 2, though there were other preludes and signs before.
In May of 1940, Germany attacked France.
France thought they were ready.
France had foresightfully, starting in 1929, eleven years before the moment of crisis, already built the Maginot Line: a hugely expensive network of defensive fortifications along most of France’s borders. Those fortifications would have trivially been defensively victorious in World War 1, if they’d been built before World War 1. The Maginot forts were supplied by underground railways, to make their supply lines harder to cut. They had the usual stockpiles of food and ammunition. The forts even had air conditioning, a startling and expensive luxury for a military fort in 1940, but very much the sort of thing that soldiers had wished they’d had in World War 1.
France had learned the hard lessons of experience and prior battles in World War 1, and generalized them to the future!
Being so expensive, the Maginot Line did not cover literally all French borders. It did cover their borders with other countries that Germany might invade first to get at France, not just the border with Germany; the French military was trying to be thorough. But there were still some carefully-reasoned gaps. For example, France figured that the heavily-forested Ardennes wouldn’t be easy to pass; France figured that any German invasion via the Ardennes would be slowed by dense forest terrain, and then further slowed by attacks from French aircraft. France figured it would take Germany at least 3 days, and more probably a week, to make it through the Ardennes; which, according to the calculations of the French military command, would give the French plenty of time to rush their own troops into position along that border, in the unlikely event Germany tried that doomed tactic.
The Maginot Line was there to stop sudden attacks leading to sudden victories; to prevent Germany from winning before France could move up its own troops in reply.
Germany invaded through the Ardennes. The Nazis put some careful work and organization into cutting through the terrain quickly. They put up enough of a Luftwaffe screen to prevent their troops from being bombed while that got done.
France fell.
After which France said “oops”, and restored from a savepoint in 1929, at the start of when they’d begun to build the Maginot Line. On their second try, France extended their defenses to cover the Ardennes...
Just kidding! In real life, France had fallen, period; the Nazis took the country and held it through the major part of World War 2.
In a serious war—war for the survival of your country, rather than war as the Sport of Kings—you only get one try.
“Gambler’s ruin” is the mathematical term for what happens to a betting strategy that bets everything; your bankroll can reach zero, and then you have nothing left to bet again. “Murphy’s Curse of Ruin”, I would say by analogy, is upon the sort of project where, if you fail sufficiently hard, you don’t get to try again.
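As a minimal sketch of the all-in case just described, under assumptions of my own choosing: the bettor stakes the entire bankroll on every even-money bet, and `p_win` is a hypothetical per-bet win probability. The function names are invented for this illustration.

```python
import random

def survival_probability(p_win: float, rounds: int) -> float:
    """Exact chance an all-in bettor is still solvent after `rounds` bets."""
    return p_win ** rounds

def simulate_all_in(p_win: float, rounds: int, rng: random.Random) -> bool:
    """Play up to `rounds` all-in bets; return True if the bankroll never hit zero."""
    bankroll = 1.0
    for _ in range(rounds):
        if rng.random() < p_win:
            bankroll *= 2.0   # win: the whole stake doubles
        else:
            bankroll = 0.0    # loss: everything staked is gone
            break             # ruin is absorbing; there is nothing left to bet
    return bankroll > 0.0

def estimate_survival(p_win: float, rounds: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the survival probability."""
    rng = random.Random(seed)
    survived = sum(simulate_all_in(p_win, rounds, rng) for _ in range(trials))
    return survived / trials

if __name__ == "__main__":
    print("exact:", survival_probability(0.9, 10))
    print("simulated:", estimate_survival(0.9, 10, trials=20_000))
```

Even a bet that wins 90% of the time, taken all-in ten times in a row, leaves the bettor solvent only about 35% of the time (0.9 to the tenth power). The `break` is the whole point: once the bankroll is zero, later favorable odds are irrelevant.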
A lot of real life is like that, of course. There’s no do-overs in most startups. There’s no do-overs in ordinary high-stakes human conversations. We can only outline the curse of Ruin in our mental vision, by contrasting it to the stranger case of an engineer luxuriously getting to build another toaster if their first toaster design has a flaw; or the programmer’s luxury of getting to rewrite a line of code and run the program again.
Engineers would be able to do a lot less, if they only got one try. Human programmers would do much much less, if they could only compile once.
How many tries you get, in practice, makes a HUGE difference in how tractable any project in life or engineering actually is.
It’s harder, having only one try, in life or war.
Other supposed refutations of oneshotness
Now imagine it’s 1929, ten years before World War 2, around the time that construction of Maginot forts began. Imagine that somebody in a conversation in the high halls of French government says—meaning it as a straightforward truism—that they’ll only get one chance to get this “Maginot Line” business right, because you only get one shot in War.
Try to imagine—it will take a bit of a stretch, because even in 1929 France, the high military is made up of mostly sorta serious people—try to imagine the higher French military officials shooting down this pessimistic nay-sayer, loftily proclaiming:
“What do you mean, we get ‘only one try’ at correctly conducting a war with Germany? What is all this nonsense talk of ‘oneshotness’? There will be many cases where French soldiers clash with German soldiers, and our country doesn’t get defeated as soon as one French soldier falls. We get lots of tries at killing German soldiers! We reject your theory that the war will be settled in a single oneshot battle after all of the German soldiers teleport directly into Paris. We have experience from fighting World War 1, as well as numerous smaller fights between police officers and criminals; when the next war starts, it won’t be unprecedented or novel at all. And we’ll get to learn more as Germany invades, which will be a continuous rather than instantaneous process where one battalion crosses the border, and then another battalion. You say that if we lose, we’ll be conquered and won’t get to try again? Maybe we can get Russia to invade Germany and have our enemies cancel each other out; in which case France would not be conquered, and we would get many tries at fighting the next war over and over. Maybe we can kidnap a German infant and raise him with a habit of answering questions, and get him to tell us how to defeat Germany, at some point when he’s old enough to predict how Germany will behave later, but still too young to think of lying; if improved German capabilities can help us defeat Germany, there can be no sudden shift in any balance of military power. We can try lots of things, really! We don’t have to defeat Germany by getting some ideal mathematical theory of war correct on the first try, based on zero experience, the way you advocate we should; the problem of War is not single-try at all!”
What you’d mainly say, of course, is that the speaker here is being motivatedly oblivious. They would be assuming a Disaster Stance that goes well beyond the level of motivated optimism in eg Chernobyl, or actual 1930s France when they ignored some war-game results suggesting that Germany could perhaps come through the Ardennes.
After hearing lines of dialogue like that, one should stop considering the speakers as serious people with a few horrendous flaws. It is a point past which I start using phrases like “disaster monkeys”.
But to nonetheless dissect all of the fallacies above:
What do you mean, we only get one try at correctly conducting the war with Germany? There will be many cases where French soldiers clash with German soldiers, and our country doesn’t get defeated as soon as one French soldier falls. We get lots of tries at killing German soldiers!
- A larger war can be oneshot even if, zooming into a small-enough scale, you can find some local instances of conflict on which the larger war does not fully depend.
-- The problem “successfully fund and launch a shoe-company startup that becomes a commercial success” is oneshot even though “make a good athletic shoe” is not. You get multiple tries at designing or putting together a good athletic shoe. You get one shot at the startup.[2]
- Formally it is a “fallacy of composition” to see a big strategic problem extended over time, and note that it is made up of some parts where errors are not locally fatal, and conclude that the bigger thing is therefore not oneshot.
-- The startup’s oneshotness is a property of the entire big-deal project extended over time, not a property of every single interaction along the way being globally fatal. So to point to one local interaction where failure is not globally fatal, does not dispel the Curse of Oneshotness over the larger global problem.
- There were no doubt some errors in the Mars Observer probe that were recoverable, and successfully recovered, before the point where the probe was lost. The larger project was still oneshot, from the perspective of a manager or scientist staking some portion of their career on it. (It was obviously not oneshot from the perspective of larger humanity; failure didn’t kill your parents, so you’re still here to hear about it.)
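The composition point in the bullets above can be sketched numerically. The model is a toy of my own construction, not anything from the historical cases: a project is a sequence of `steps`, and each attempt at a step ends in a fatal error (hypothetical probability `p_fatal`), a recoverable error that may be retried without limit (`p_local_fail`), or success.

```python
import random

def exact_survival(steps: int, p_local_fail: float, p_fatal: float) -> float:
    """Chance of finishing every step when recoverable errors are retried freely."""
    p_success = 1.0 - p_local_fail - p_fatal
    if p_success <= 0.0:
        raise ValueError("p_local_fail + p_fatal must be below 1")
    # A step ends only in success or fatality; retries just postpone that outcome.
    per_step = p_success / (p_success + p_fatal)
    return per_step ** steps

def run_project(steps: int, p_local_fail: float, p_fatal: float,
                rng: random.Random) -> bool:
    """Simulate one project; return True if it survives all steps."""
    if p_local_fail + p_fatal >= 1.0:
        raise ValueError("p_local_fail + p_fatal must be below 1")
    for _ in range(steps):
        while True:
            roll = rng.random()
            if roll < p_fatal:
                return False      # irrecoverable: no further tries at anything
            if roll < p_fatal + p_local_fail:
                continue          # recoverable: try this same step again
            break                 # step succeeded; move to the next one
    return True

def estimate_survival(steps: int, p_local_fail: float, p_fatal: float,
                      trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of whole-project survival."""
    rng = random.Random(seed)
    wins = sum(run_project(steps, p_local_fail, p_fatal, rng) for _ in range(trials))
    return wins / trials

if __name__ == "__main__":
    print("exact:", exact_survival(50, 0.3, 0.02))
    print("simulated:", estimate_survival(50, 0.3, 0.02, trials=5_000))
```

With 50 steps, a 30% chance of a recoverable error per attempt, and a 2% chance of a fatal one, the project survives a bit under one time in four, however many local retries it is granted. The many retryable parts are real, and the whole is still a single shot.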
We reject your theory that the war will be settled in a single oneshot battle after all of the German soldiers teleport directly into Paris.
- Being able to imagine a version of War that would be even more oneshot, does not change the way that actual War is still pretty oneshot. In particular, the speaker here is imagining an even more Murphy-cursed version of War, more subject to Murphy’s Curse of Rapidity, where France would get even less chance to learn and react. But the fact that events did not happen infinitely fast did not save France, because Germany still made it through the Ardennes fast enough. And that incident was fatal enough that whatever lesson France learned from it came too late to save the rest of their war.
We have experience from fighting World War 1, as well as numerous smaller fights between police officers and criminals.
- These problems were not drawn from an identical distribution to World War 2. What French generals fancied themselves to have Learned From Experience was part of the problem, indeed, because they acquired confident wrong beliefs.
And we’ll get to learn more as Germany invades, which will be a continuous rather than instantaneous process where one battalion crosses the border, and then another battalion.
- The Mars Observer probe didn’t teleport to Mars, yet was still lost. Things can go wrong even when they’re physically continuous.
- For battalions to cross the border one after another, in a physically continuous process, does not mean that France is blessed with adequate time to observe the first battalion emerging from the Ardennes, learn the real laws of World War 2, and then rebuild the Maginot Line correctly, before the next battalion emerges from the Ardennes.
You say that if we lose, we’ll be conquered and won’t get to try again? Maybe we can get Russia to invade Germany and have our enemies cancel each other out; in which case France would not be conquered, and we would get many tries at fighting the next war over and over.
- A project can be said to have an underlying Curse of Ruin, that contributes to the sum of its Murphian susceptibilities, whenever a sufficiently major disaster would be sufficiently fatal. Thinking you have a clever plan to not be ruined, is proposing to try to lift against this weight, not to cancel it; putting wings on an aircraft doesn’t repeal the law of gravity.
Maybe we can kidnap a German infant and raise him with a habit of answering questions, and get him to tell us how to defeat Germany, at some point when he’s old enough to predict how Germany will behave later, but still too young to think of lying
- The fractal difficulties of this proposal would require their own post.
if improved German capabilities can help us defeat Germany, there can be no sudden shift in any balance of military power.
- Having a Clever Plan like this doesn’t negate any Murphian curses, nor change the oneshotness of the larger war.
We can try lots of things, really!
- So far as France knew, they’d tried several things including building the Maginot Line, reforming their military around the valuable lessons of World War 1, making advance plans to deal with probable German invasions, etcetera. So far as France knew, all those things were going to work great. But then those things didn’t work, and then the war was over.
- All the many things France tried, collectively formed a single shot with respect to Murphy’s Curse of Oneshotness. They did not get another shot after that.
We don’t have to defeat Germany by getting some ideal mathematical theory of war correct on the first try, based on zero experience, the way you advocate we should.
- Sheer naked strawmanning[3] of what is being said when somebody tries to warn you of Murphy’s curses upon your project; the chattering of disaster monkeys.
On the extraordinary efforts put forth to misinterpret the idea of oneshotness
Without a whole article like this one to hammer home exactly what I mean, I have found in the past that I cannot use phrases like “oneshot” or “you only get one try” around most so-called “AI safety” people outside of MIRI.
To be clear, a Congressional representative or staffer or national security professional will often immediately understand what is being said, if they haven’t been previously contaminated by misinterpretations and straw positions. It’s mainly AI companies, some AI professionals with fewer citations than Yoshua Bengio, OpenPhil-funded groups, etcetera, who manage to be unable to hear what is being said.
But the phrase “one shot” can be misunderstood with probability nearly 1, relative to the amount of effort that some people can, will, and have put forth to mishear it; and more importantly, misrepresent it in further debate.
So in conversations where there is a pre-poisoned fool hanging around, maybe you should try introducing it as the Irretrievability Problem rather than “oneshotness” and then the fool will have a harder time misinterpreting that in the very next sentence, because it will be harder for them to forcibly remap the word onto a strawman, maybe? I haven’t tried out that new tack yet.
The opposing faction is forced to that effort of hurried misinterpretation, because the actual impact of the concept of “we only get one shot at ASI” is so devastating to their position, in the presence of even a slight understanding of why there might be any noteworthy engineering difficulties whatsoever. A oneshot extinction problem that nobody understands very well is horrifying in the presence of any familiarity with the actual human history of engineers trying to do things that are hard to predict exactly—let alone pre-engineers trying to do things that are hardly understood at all.[4] Once you’ve grasped three or four of the fundamental obstacles to ASI alignment[5] and three or four of the Murphian curses upon the field[6], you realize that of course ASI alignment is not the sort of thing that has ever before in the history of the world been done correctly on the first outing --
“Aha,” somebody now immediately interrupts me, “but we don’t have to get it right on the first outing! We can build smaller AIs and observe how they—”
The first ‘outing’ in the same sense that France’s macro-scale attempt to fight in World War 2 was their second ‘outing’ in fighting against Germany, and their first outing at fighting with WW2 technology; not in the sense that any particular battle of that war is an outing.
ASI alignment is the sort of matter where, historically speaking in the totally normal and usual course of science as it has always been previously observed, there’s initially all sorts of wacky ideas for how to do the ill-understood thing, and the first dozen ideas prove to fail under load --
“Yes, which is why it will be important to test those ideas on smaller AIs!”
Here ‘fail under load’ is trying to point to the way that the Maginot Line failed when Germany actually invaded and the contest was run for real, without the Maginot Line having particularly admitted any invasions before then. ‘Fail under load’, as in how the Mars Observer mission failed when actually launched, whatever ground-level tests it had passed before then, and whatever NASA’s earlier attempts at simulation or careful thinking had turned up and already fixed before the launch. Lots of things that appear to work under lighter loads will fail under a heavier load.
“But we don’t have to get it all correct based on pure theory, like you say is possible and say we should do—”
Sheer motivated misrepresentation, of a separate argument that isn’t even being made here in the first place; see footnote 3 if you want to expose yourself to a frustrated rant about this.
“—because, contrary to anything you imagine to be possible, our experience of earlier AIs can inform our models of superintelligence—”
One of the several fundamental difficulties of ASI alignment is that your theory of how to survive AI that is smart enough to kill you—your theory of how to survive when there is a quantity of machine capability around that can kill you, if it turns against you, if something goes wrong—has to be successfully generalized only from experiments that don’t kill everyone on Earth if they fail. Meaning that you are experimenting on less powerful and less capable AI; which AI, if it reasons correctly, will not estimate that it can kill you—among many other changes of conditions, shifts of distributions, between the safe mode of survivable experiment and the potentially lethal test environment.
This giant historically unprecedented problem has many ordinary-world valid analogies. Like how you can’t determine if someone is trustworthy to handle a billion dollars by seeing how they handle ten dollars, even if it’s in fact the same person and they’re not getting much smarter, because they can think intelligently about whether it’s a good time to steal the money. Or like a Greek city-state whose philosophers are arguing that the city could appoint a trustworthy dictator by watching how some boy acts as a child, and seeing if they act virtuously (knowing they’re being watched). Conditions change, because the boy’s brain is not what it will be when the boy grows up; nor are the conditions of an appointed city-dictator who is in fact being trusted, the same conditions of being a boy (who is being watched, and getting whacked by currently-bigger entities when he misbehaves). These facets of the problem are not on the same horrifically unprecedented level as “use your own thoughts to anticipate an alien much smarter than you”, but they are very normal cases of why you can’t solve dangerous problems just by taking a bunch of safe samples from a different distribution. The distribution is inherently different not least because it is safe. Problems like this are, in a very mundane way, why it is also a big deal to figure out who you can trust with a billion dollars, and why we don’t just do a few experiments on the trustee and then generalize from those.
Someone could, conceivably, argue that the change to “there being enough machine superintelligence around that ASI could kill humanity if they tried”, from “AIs being experimented-upon that couldn’t kill us if they tried”, will be less than the sort of change from “the sort of tests you can do on a Mars Observer probe on Earth”, to “the actual conditions of the probe being launched and flying through space”; or the change from “ordinary operating conditions at Chernobyl” to “running a safety test of the backup cooling system at Chernobyl”.
But that would be an argument so incredibly stupid that it might actually sound stupid when they thought about saying it. Of course the jump to actual empowered superintelligence is going to make a bunch of differences much much huger than the NASA difference between “actual space travel” and “artificial test chambers intended to simulate those clearly-understood conditions on Earth”. Among other issues, AI-brewing alchemists understand the cognition inside AIs far more poorly than NASA physicists understood the conditions in space—current AIs, never mind superintelligences! But also, in a very ordinary way, there just isn’t a nonlethal way to test out lethal levels of superintelligence. Just like you can’t test somebody’s suitability to be city dictator by having philosophers follow around watching how they behave as a kid who knows he’s being watched[7]; just like you can’t make sure somebody can be trusted with a billion dollars by loaning them ten dollars.
You could argue that the jump to “enough superintelligence around to actually kill us” will change less from previously observed conditions with already-deployed or safely-lab-testable AIs, than the act of actually sending the Mars Observer into space was changed from NASA lab tests and simulations.
But people might perhaps disagree with you, if you tried to argue that explicitly.
So instead, the warning, “You only get one real shot at real ASI, and if you screw up everyone is dead and you don’t get to try again,” gets outrageously strawmanned and misinterpreted as “ASI would win instantaneously because FOOM”, or “humanity should attempt to learn zero things by looking at earlier AIs and do everything based on theory”, because the actual argument the ASI-survivableists need to make is a less attractive PR battleground.
“Aha, but as you’ve clearly never considered, we can have more than one ASI; and then if one ASI goes rogue, the other ASIs will stop it for fear of disrupting the orderly law-abiding equilibrium that we started out the ASIs inside; and therefore everyone will not be dead, and we will get to try again!”
If that whole clever scheme goes wrong, everyone is dead and you don’t get to try again. I am not even arguing right now all the reasons why the clever scheme is doomed.[8] I am trying to explain why it is not a rejoinder that refutes, “ASI alignment is under Murphy’s Curse of Oneshotness.”
“Aha, but I can imagine some possible mistakes with superintelligence that would not wipe out humanity!”
Cool! You would have fit right in with a much much less serious version of France’s top generals in 1929, if someone had argued that military strategy wasn’t a one-shot sort of life problem, because they could imagine a possible mistake they could make with the Maginot Line that would not lose the whole war.
The core idea here is frankly not that complicated. A lot of people get it correctly and immediately. The thing being said is simple and an obvious default expectation when dealing with something vastly smarter than humanity: that is a lethal level of danger if something should happen to maybe possibly go wrong—YES A SUFFICIENTLY SEVERE THING, YES YOU CAN IMAGINE A NONSEVERE ERROR, NO THAT DOESN’T CHANGE THE CORE IDEA, JESUS CHRIST.
Someone could conceivably try to argue against that really quite simple warning. But it takes a great motivated psychology to be unable to hear which idea is being argued; and manage to misinterpret every historical example, every ordinary everyday-life analogy, and every abstract explanation. Not in the sense of disputing their relevance, but in the sense of inability to repeat back which idea is being argued.
If not for this incredible effort at mishearing and misrepresenting the ideas, I could’ve just said, “Humanity only gets one shot at getting machine superintelligence right,” and anybody who understood the everyday idea of crashing and burning in a big important conversation with someone, and not getting a do-over because no time travel, would’ve been able to understand the very ordinary core of what was being communicated.
The secret sauce of competent engineers in Murphy-cursed fields: only trying projects so incredibly straightforward as to be actually possible.
Above all else, the reason why Very Serious Engineers sometimes succeed even at slightly cursed problems with no cheap do-overs, is that they have a sense from both theory and practice about which problems are so incredibly ludicrously easy as to actually be solvable.
Go to a nuclear engineer and say, “Build me a reactor that runs off 2% enriched uranium, but the only neutron-absorber you’re allowed to use is plain water, no boron or cadmium or hafnium.” The nuclear engineer will say back “No, because that is a dumb idea.[9]”
Go to an aerospace engineer and ask them to make an ultra-contagious virus that rewrites human genomes to confer de-aging and biological immortality—but safely and reliably, using their same Very Serious Methodology that they use for launching space probes that succeed more often than not. The aerospace engineer will laugh, and then, if you seem actually serious, perhaps try to explain like you are five: “I can’t do that because science doesn’t have a good-enough theory of what a completed immortality virus would look like.”
“I can’t use the same process that builds space probes that sometimes work, to build you an immortality virus at all, let alone a safe one,” says the aerospace engineer. “Because the base resource that a space probe project starts with, is an idea that science strongly implies would work for known straightforward reasons, if nothing surprising happened instead. The incredibly difficult job that takes all the very serious organizational process, and still only works most of the time, is having those nice ideas that ought to work in very straightforward ways for very well-understood reasons, actually work all the way to where a probe sends back data from Mars. We don’t have that for an immortality virus, so we can’t get past step zero of the very serious and safe methodology.”
And we understand what goes wrong with the human body during aging, much MUCH better than we understand what goes on inside LLM cognition. We could get correspondingly closer to success, if we tried telling an aerospace engineer to use NASA’s assurance processes to build an immortality virus, rather than telling them to build a safe superintelligence.
But mostly what that very serious process would tell you, is that what you have made is not a safe immortality virus, and you should not try to make it very contagious and infect the Earth’s population with it.
And the great seriousness of a decent engineer would manifest in this way: that so vast would be their understanding of their own limitations, that they wouldn’t need to infect most of humanity with a highly contagious virus that then surprisingly didn’t work exactly as they’d hoped, in order to learn to their vast surprise and dismay that building an immortality virus was more than trivially difficult. They would know it even in advance of killing a dozen suicide volunteers or a hundred monkeys! They’d see that incredible surprising shocking unexpected plot twist coming in advance of it actually happening.
So nobody like that would start doing a biology project aimed at making a contagious de-aging virus to the great benefit of all human beings. They’d know that was an overly cursed project for anyone to actually be able to do.
The zeroth skill of a wise engineer in a Murphy-cursed discipline is that they know what is so ludicrously far beyond their skill and understanding that if they tried that then of course they would fail, in a matter where failure is hugely costly.
So nobody that wise would try to brew up a machine superintelligence with anything remotely like modern methods and modern levels of understanding; and the CEOs of AI companies have been filtered to not be people who get that.
- ^
Effectively exact for the low-energy domains in question.
- ^
The extremely motivated quibbler will imagine up exceptions to this rule, billionaire founders of infinite patience. An average and ordinary startup does operate under a curse of oneshotness in this sense; at some point the funders run out of money, or key employees run out of hope.
- ^
I have never said anything like this. If somebody told you otherwise, they were mistaken, and repeating the falsehoods of people very very heavily motivated to come up with insane straw misconstruals of what MIRI was trying to do back in 2015 when we would occasionally publish papers with math in them. Or if I can vent some frustration here:
This is a barbarian-populist’s crude angry view of what it means to see papers with math in them that they didn’t understand.
I am not going to try again to explain to the barbarians why we ever attempted to publish any papers with math at all. But the notion of getting everything correct on pure theory was not it, nor an attempt to build an AI out of pure math resembling the math in our papers, et cetera ad nauseam.
If someone someday wants to understand what you sometimes do with math besides declaring that something is logically absolutely predictable, or turning the math into exact code, that would be a longer conversation. But there are other reasons for sometimes trying to think mathematically! Or writing essays that have algebra in them! (There are of course ways to try to puff yourself up and look more important by writing fancier algebra, but I think MIRI actually did a decent job of not using any more algebra than was required.)
In a way it’s a sad historical point that some of the people trying to warn why the Maginot Line is a oneshot sort of problem, once wrote essays with some algebraic formulae in them. Now the disaster monkeys can chatter to one another that all the warnings are coming from old fools upset that the Maginot Line doesn’t look like their formulae; they can be utterly impervious to all arguments without bothering to counterargue their direct meanings, because they’re certain that hidden premise must be in there somewhere; even as we repeatedly try to say that’s not what this new separate conversation is about at all.
They have a reason for dismissal that feels sufficient to reassure themselves, and they’ll stick with it no matter what sensory experiences they are otherwise exposed to, and feel happy and self-satisfied about their right and clever decision, until the moment they kill themselves and you; repeating among themselves, the while, that isn’t it sad how MIRI never repented of their foolish old attempts to prove AI mathematically safe; which again, to be clear, is not a kind of thing that math can do in principle, and we knew that from the start.
- ^
“But we have these observations we’ve made! We have these recipes that work!” Medieval alchemists could say the same, and if you don’t think they could, you don’t respect medieval alchemists enough and you lack a historical sense of how many observations and known recipes they did have. But they did not really know what was going on inside, they could not predict in advance exactly what would be observed, they did not know which new recipes would work and why, etc.
If you wanted a more polite metaphor than alchemy, you could pick metallurgy. Pre-20th-century metallurgists worked by mixing components into alloys, raising temperatures, lowering temperatures, observing and recording which recipes worked, etc, without understanding or predicting the crystal structures of alloys in advance. A lot of later metallurgists too, really.
But also, you didn’t usually see 19th-century metallurgists having great noble high-minded theories of how their metals would confer immortality based upon their invocation of deep moving metaphysical principles. So I think alchemy is the more correct historical analogue over 19th-century metallurgy. If you can pick out an AI lab worker who is content poking around their LLM recipes, and makes no claim about later models including the impossibility or controllability of superintelligence etcetera, I should think it fair enough to analogize them to a diligent 19th-century metallurgist.
...If they were trying to refine and pile up more and more bricks of uranium metal in an inhabited city in hopes of generating enough thermal energy to heat and power homes; and insisted that they hadn’t observed any downsides of that, and weren’t going to speculate unscientifically.
- ^
Eg: You don’t get what you train for, cognitive uncontainability of superhuman planners, distribution shifts with higher capability, Goodhart’s Curse as a function of widened option spaces, etc. See AGI Ruin: A List of Lethalities.
- ^
Eg: Novelty, fundamental engineering novelty, pre-paradigmatic fundamental scientific confusion about LLM thought processes, rapidity, narrow margins, etc. See AGI Ruin: A List of Lethalities.
- ^
Especially if the kid is a new inhuman species of alien. But I do not raise this in the main argument because adding this disjunctive point will invite a certain kind of psychology to leap on it and argue how their LLM isn’t so alien and in particular it seems to understand a lot of human stuff, etcetera. (Understanding is not the problem; ASIs always understand things; their preferences are the problem.) The analogy to ordinary life goes through without the kid being an alien, even though in real life the kid is an alien.
- ^
If you do not know how to align any ASI, after their negotiations among themselves arrive at a near-Pareto equilibrium, its near-Pareto property means that it will not have all the agents going out of their way to spare the Earth and the Earth’s sunlight out of some fear of otherwise being disorderly; they can do better collectively by not doing that. They are smart enough to negotiate detailed near-Pareto coordinated movements and fairly divide the gains from those, rather than flinch back in terror from a human’s fear of violating a prior legal setup.
Also a successful space probe needs to not rely on clever-sounding schemes like this at all. This is alchemist-level arguing about how all your different poisons will surely neutralize each other.
- ^
It’s a dumb idea (1) because water (or more precisely the hydrogen component of water) is both a neutron absorber and a neutron moderator, (2) because it’s hard to put in much more or much less water very quickly compared to scramming a well-designed boron rod, (3) because changes in reactor heat levels will affect water behavior in a direct way by turning it into vapor or supercritical vapor, and (4) because changes in water flow affect how much heat is being removed from the reactor. The details of this do not pointwise map onto anything in ASI in particular; it’s just an example of how the competent engineer is not so much “capable of doing anything however difficult”, as “one who knows what is possible and nonstupid enough to be worth trying”.
I think there is something good about making a post that stands on its own like this, but I also think it’s useful to directly link to a bunch of direct quotes from people who said the kinds of thing this post is arguing against. So here are some I remember:
Paul Christiano, in “Where I agree and disagree with Eliezer”:
Sam Marks, commenting on Paul’s post:
Joe Carlsmith, in “On first critical tries in AI alignment”:
Buck, in a thread with Eliezer:
People can argue about the degree to which this post is responding to these quotes, and how well it addresses the issues brought up in them, but it seemed to me like it would be helpful to have some concrete references and quotes in any case.[1]
My personal take is that this post is a pretty decent response and that the things I link here do indeed all strawman the core thing in this post, though sometimes in a way that is more load-bearing, and other times in ways that still make valid points that I myself don’t want to just strawman and dismiss as misunderstanding Eliezer.
I found this surprising. Why do you think this? All of these posts/comments seemed pretty reasonable to me. I don’t see how they are strawmanning the point in this post? Edit: I think I understand what habryka meant, see here.
It seems like the view across all of these is “there is a first critical try, but we can learn from experimentation before” and/or “if misalignment of type X emerges late, then maybe we can use earlier AIs to get lots done (or possibly hand off successfully) while if misalignment of type X emerges early, we can study it (which might transfer through to the relevant regime or might not, but this is a quantitative question where details matter)”. I don’t really see how this post is even clearly arguing against these points? Like, it’s got to be a quantitative disagreement about how much transfer we’re talking about and I don’t think this post makes arguments that could pin down the relevant quantitative details (about e.g. the level of transfer in the relevant regimes) for AI.
(I tend to think the situation is also messier to discuss because most of the hope routes through effectively handing off AI safety to AIs and in this situation there are multiple different things you could call the first critical try: the whole project of handing off, the first point when AIs could take over if they were all conspiring against us, and the eventual ASI the AIs we hand off to build. I tend to think most of the problem is in “handing off to AIs that are actually going to do a good job (while avoiding earlier takeover)”.)
So, take Paul’s quote, where he suggests that Eliezer sometimes says that “you can’t learn anything about alignment from experimentation and failures before the critical try.” I think Eliezer doesn’t say this? I think it’s possible to read Eliezer as saying this, or for his previous framings to make it harder to rule out that interpretation. But like with the WWI → WWII example, the question is not whether you learn anything about alignment but whether you learn enough about alignment, and I think Eliezer has always been focused on the question of “enough” and swapping that out for “anything” is a central example of strawmanning.
Sam’s and Joe’s examples seem to be in the same vein. If Alice asks “will we sell enough paintings to cover rent this month?” and Bob responds “Alice is downplaying the importance of how earning revenue allows us to pay rent”, it is clear that Bob has made some mistake here. The question is how the numbers compare, not whether or not there’s a mechanism by which learning will work.
I actually mostly like Buck’s quote / don’t think it’s doing the strawmanning. Note that Eliezer responded to it in context like so:
Nevertheless, I think the same sort of disagreement is present, where Buck argues that it might not be different in the relevant respects, but I think it’s still worth asking whether it is different in enough respects, and the Eliezer-ish view is that one difference is probably enough to be fatal and Buck’s optimism seems predicated on either a much higher threshold or a much lower prior on difference for each respect.
Ok, I think I understand the point now: Paul and Sam Marks are both talking about what Eliezer is saying in list of lethalities and the thing they say about his perspective/framing isn’t faithful to the description he gives in this post about irretrievability. So, they’d be strawmanning this post if these comments were a response to this post.
I don’t see how Joe and Buck are strawmanning. (Joe isn’t really even talking about what Eliezer thinks and it sounds like you and others agree Buck isn’t strawmanning.)
I’m less sure Paul and Sam Marks are strawmanning Eliezer in general or strawmanning List of Lethalities.
Paul says:
IMO, the description in list of lethalities mostly doesn’t equivocate between these (though it does it a bit), but my cached understanding is that Eliezer does often seem to equivocate between ‘you have to get alignment right on the first ‘critical’ try’ and ‘it’s very hard to learn much about alignment from experimentation and failures before the critical try’ (the second of these might be true to be clear, it’s just not implied by the first). So, I’d say this is a bit strawmanning for List of Lethalities (where he doesn’t really do this in the literal words, but maybe does in the emphasis/explanation?) and it’s unclear to me whether the claim that Eliezer often does this is true.
Sam says:
(Emphasis mine.)
From my perspective, this seems true to me of the framing and discussion in List of Lethalities? At a more basic level, there is a real disagreement about how far you can go with earlier experimentation. Like, yes, Eliezer disagrees that this framing downplays the importance, but this is because they have a disagreement about the importance that isn’t really argued for in either List of Lethalities or here. Like, it’s a quantitative question that people disagree about.
I am also not that sure about Joe. I love Joe, but he is a man of many words, and I did not reread his whole sequence on this and adjacent topics when I made the comment, so I might be mischaracterizing it. At least my vague memory is that he is doing an equivocation here, but it’s more of a gestalt thing and I would need to reread more to argue this case.
Seems like we are roughly on the same page here. I think it would be fine if someone wanted to bring in comments or posts where they do think Eliezer is conflating in the relevant way.
Re Sam you say:
I think Sam’s sentence here pretty clearly implies “the first critical try framing is trying to imply that trial-and-error is less important”, and I think that’s just not really a valid inference unless you make an equivocation. NASA’s job does not get easier if you don’t get to run the experiments. NASA’s success is still highly contingent on learning from trial-and-error. It is an argument that trial-and-error is not sufficient, but not an argument that it isn’t important.
Lest the exegesis of my old comment continue, I’m happy to clarify my object-level view. I think that:
At each AI capability level, there is some probability of an irrecoverable catastrophe (e.g. AI killing or disempowering humanity).
You could rephrase this as “There will be critical tries.”
This probability is importantly sensitive to preparation that we do in advance using less capable AIs.
This preparation includes things like alignment/control research on weaker systems, hardening the world, and work on extracting as much useful labor as possible (e.g. alignment research) out of weaker AI systems.
By “importantly” sensitive, I mean that if you try to forecast catastrophe risk without modeling the effect of preparatory work with weaker AIs, then your forecast will be substantially worse.
In particular, this means that I expect it is feasible in practice for humanity to do preparatory work with weaker AIs that substantially moves the overall probability of catastrophe.
Factors that influence the efficacy of this prior preparation include: how much time we have with the less capable AIs, whether we prioritize and execute well on the preparatory work, and how similar the less capable AIs are to the more capable ones (along certain relevant axes).
The above dynamic will only recur for a finite number of rounds before either a catastrophe occurs or we develop and hand off to AIs which will properly handle the situation from then on.
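The structure of the claims above can be illustrated with a toy calculation. This is only a sketch under loudly stated assumptions, none of which come from the comment itself: the function names, the baseline risk numbers, the idea that preparation scales each round's risk by a single constant factor, and the treatment of rounds as independent are all hypothetical choices made for the example.

```python
from math import prod
from typing import Sequence


def round_risk(baseline: float, prep_factor: float) -> float:
    """Catastrophe risk for one capability level after preparation.

    Hypothetical model: preparation multiplies the baseline risk by
    prep_factor, where 1.0 means preparation did not help at all and
    0.0 means it removed the risk entirely.
    """
    if not (0.0 <= baseline <= 1.0 and 0.0 <= prep_factor <= 1.0):
        raise ValueError("baseline and prep_factor must lie in [0, 1]")
    return baseline * prep_factor


def overall_catastrophe_probability(
    baselines: Sequence[float], prep_factor: float
) -> float:
    """Probability that at least one of a finite number of rounds goes wrong.

    Assumes (for illustration only) that rounds are independent, so the
    chance of surviving all of them is the product of per-round survival.
    """
    survive_all = prod(1.0 - round_risk(b, prep_factor) for b in baselines)
    return 1.0 - survive_all


# Made-up baseline risks for three successive capability levels.
baselines = [0.1, 0.2, 0.3]

no_prep = overall_catastrophe_probability(baselines, prep_factor=1.0)
with_prep = overall_catastrophe_probability(baselines, prep_factor=0.5)

print(f"no preparation:   {no_prep:.3f}")
print(f"with preparation: {with_prep:.3f}")
```

In this toy picture, every round is still a critical try whatever the preparation factor is. The quantitative disagreement in the thread corresponds to how far below 1.0 that factor can realistically be pushed, which the sketch does not settle.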
In my view, Eliezer’s writing on AI risk does a poor job of modeling the effect of preparatory work with weaker AI systems. While he often points out ways that some preparatory work with weaker AIs can fail to be useful, he doesn’t often engage with IMO the most plausible reasons that some people expect it to be useful. I’d be shocked to learn that we agree on how useful preparatory work with weaker AIs will be. So it seems to me that there’s an important part of his view that he hasn’t explained well, and which this post—clarifying that it’s technically possible to prepare for critical tries—doesn’t resolve for me. In other words, as Ryan points out, it seems we have a quantitative disagreement about the value of preparatory work with weaker AIs, which this post doesn’t move the needle on.[1]
(TBC, this is all about preparatory work that relies on access to weaker AI systems; Eliezer engages somewhat more deeply with preparatory work that doesn’t rely on weaker AI systems, such as conceptual alignment research, activism to slow AI progress, or non-AI approaches to uplifting human intelligence.)
Even more generally than preparatory work, I think that Eliezer’s writing has done a poor job of modeling various ways that the earlier existence of weaker AI systems changes the overall picture. Here are some comments by others I agree with that I think make this point well:
Paul Christiano:
Buck:
I mention that this post doesn’t move the needle for me because otherwise I’m confused why there was so much discussion of whether I misunderstood Eliezer’s point. This seems most naturally relevant if some people thought I was misunderstanding something about Eliezer’s argument which ought to change my mind if I understood it—which doesn’t seem to be the case. Alternative interpretations: some people think that I was intentionally misinterpreting Eliezer (I don’t think this makes sense; the cited comment was paraphrased from my private notes while I was a math grad student) or that my comment shows I’m confused when thinking about alignment more generally (which doesn’t seem worth getting into).
I think you are implicitly modeling the game to stop shortly after ASI is created, and be judged a win or a loss. This is the case only if the ASIs all coordinate on a halt to intelligence improvement: otherwise, the default is that intelligence improvement keeps happening for a long time, long enough that the majority of AI capability level transitions, along with many paradigm shifts and total architecture/approach swaps, happen without significant human input. “The AI loves us” is much easier than “The competing swarm of loving AIs will only ever build loving successors, and so on for their successors, without mistake or correction, forever”.
My guess is that they would implicitly consider this post to be motte-and-bailey-ing, but do strawman the position in this post (if this post is in fact the best representation of Eliezer’s position).
In my opinion, this post is not actually making many hard claims. I mostly view it as gesturing at the existence of really difficult problems and presenting historical analogies. It argues that it is possible for problems to be very hard, even if they have a bunch of other nice properties, including the nice properties people attribute to the AI problem. However, I feel like arguing that AI safety could be an incredibly hard problem is a way less extreme position than the one Eliezer seems to actually hold.
I mean, my take on this is that around two decades ago Eliezer thought AI safety could be an incredibly hard problem, and then spent a lot of time checking, and now has lots of reasons to believe that it is an incredibly hard problem, and those reasons are spelled out elsewhere, with this post just trying to point at the problem of irretrievability.
Sure, but then an attempt to summarise Eliezer’s position, which attributes a much stronger position than is in this post, isn’t necessarily a straw man that doesn’t understand the point of irretrievability, but can merely be responding to all his points on top of irretrievability, or saying that they don’t consider him to be making sufficient arguments beyond the potential for irretrievability
Agree, though in that case I think it would be good form to say “Eliezer is right about this being a “first critical try” sort of problem and that being important, but I disagree with him on other reasons for why he thinks the problem will be hard and they leave me substantially more optimistic”. The quotes I selected above do not do that.
FWIW, I interpret all of the things you linked (except the comment by Sam Marks) as pretty clearly saying this?
I think Paul is saying “Eliezer is using an equivocation between a correct point and a false point for rhetorical effect”. I don’t think that is doing the same (I think it’s failing to give credit for the correct point). I do agree it’s doing some of what I was trying to point to here, but not following good form in the way I was trying to describe.
I think Joe is, on a vibes-based level, also doing a more direct equivocation, but again, it’s been a while since I read it and I am not that confident.
I’m not saying that those people believe it is a critical first try problem. I expect they agree that it could be a critical first try problem, but that they predict it probably isn’t for a variety of reasons and they view Eliezer as claiming that it is rather than just that it is possible it’s a critical first try problem.
Another underlying disagreement could be about the general factor of: what is this function, approximately:
I would imagine that this isn’t the main source of disagreement, but I do find it hard to see how [create an alien mind that’s smarter than humanity] ends up less firsty than [make a Mars rover] etc., so I’m wondering if that’s not the claim.
Wait, that doesn’t make any sense. I am confident all these people totally think that aligning Superintelligence is a critical first try problem the way Eliezer is talking about here.
They may think it’s critical but not very firsty, i.e. sufficiently similar to / comparable to / generalizable from previous tries.
There are bits here and there that lead me to believe that Eliezer is mostly using the post to denounce (as one should) safety optimism to the point of supporting runaway capabilities research. This would also explain why most people who are already safety-minded seem to be somewhat confused by the post. We’re not being called out, we’re being shown exactly what is failing to be heard by people who are burying their head in the sand and going full throttle on capabilities research.
Eliezer lists “OpenPhil-funded groups” as part of who he is criticising. The people habryka quotes typically fit that demographic better than unbridled capabilities
Sorry, yes, they (almost)[1] all say[2]:
“Eliezer, you said we couldn’t learn from experimentation! But we, the enlightened few, in contrast to those people trying to derive guaranteed conclusions from logical principles, understand that empiricism is a thing, and your concept of ‘critical first try’ is only harmful and misleads people about our ability to iterate and learn from earlier failures”.
That is, as I understand it, one of the core points of this post. The concept of a first critical try is not in contrast to empirical iteration. “Please, why do people keep bringing it up as if it conflicts with it. It’s a different point. Can you please stop sliding off of this point and just acknowledge it instead of trying to respond to this weird other strawman every time?”.
I am most confused about Buck’s exchange, where it feels to me like Buck is kind of making a non sequitur and Eliezer is also being weirdly dense about the point Buck is making. My guess is something kind of similar is going on, but I wouldn’t quite put it in the same bucket.
Of course greatly exaggerated for rhetorical effect to reduce ambiguity and introduce levity
Look, if anyone is strawmanning and being condescending, it is obviously you and Eliezer. Which I don’t think is that big a deal but it is frustrating that you are accusing people of being condescending in such a condescending manner.
Edit:
To expand on this more:
Eliezer believes that good theory would help a lot with aligning AI on the first critical try (despite not being sufficient or completely necessary) while believing that iteration without theory won’t help that much (because the problem of aligning stupider less capable AIs just doesn’t apply that well to aligning superintelligence).
It’s annoying that I felt it necessary to put the parentheticals in there, because if I didn’t I feel like I was going to be accused of strawmanning.
In any case, in contrast you can imagine someone who believes that theory will not help a lot, but that iteration will. I don’t think putting forward such a view is strawmanning.
Look, he might believe that, or he might not. I just don’t think this post, or the general argument about “first critical try” is about that.
I am not saying everyone is strawmanning everything about Eliezer. People totally have valid arguments about the difficulty of alignment, and the value of empirical iteration, and of course hundreds of other aspects of the AI-risk situation, but on the specific narrow point of “you only get one critical try”, people seem to repeatedly want to make it into a different strawmanned point, and then respond to that. Acknowledging this does not need to involve conceding any major kind of argument. It’s really not a complicated point. We don’t need to tie ourselves up in these knots.
You can then argue with Eliezer about whether this point is sufficient for high risk from AI (which is some of what this post is about), but at least on the specific narrow point, of course, yes, AI is one of those problems you really only get one critical try for, that makes it different from other problems, and we need to calibrate accordingly.
And IMO it’s also totally valid to have a concern that Eliezer is himself trying to do a motte-and-bailey with this stuff. This post is trying to respond to some of that. I personally would like to just have a robust pointer to “first critical try” that doesn’t make everyone respond with dumb things about how takeoff is smooth and so this doesn’t apply, because the robust and easy-to-defend version of this claim is still true, and really has a lot of relevance to how to relate to this whole situation. If you don’t want to accept the specific language Eliezer has used because you are worried it will come with historical baggage of being used as a motte-and-bailey, then propose a different term, but don’t throw out both the motte and the bailey!
E.g. I think Neel’s response here feels reasonable to me:
I think it kind of does concede the major argument to a wise engineer once they look at it from that angle, which is why their conversation goes to desperate lengths to change the subject.
Sometimes humanity does get things right on the first critical try. I think e.g. Paul gets the point about first critical try, but has models of the world that consistently predict that there are ways to get around the difficulties associated with that. I disagree with him on those points, but I don’t feel like I have a slam-dunk response to his models.
I do think it’s a sufficient argument for “this is a really high-stakes situation and of at least substantial difficulty”, but like, most people in the field are on board with that? And my sense is even most people at the labs.
But people do acknowledge the narrow meaning of first critical try, even in your direct quotes! E.g. Christiano:
Here he is acknowledging the motte and accusing EY of moving to a bailey. Maybe he’s wrong, but it doesn’t seem like he’s sliding off the point about one critical try. Same with Sam Marks:
Anyway, it’s probably not worth spending this much effort on the meta questions of who is strawmanning who, but I stand by my peevedness.
I disagree because neither of them seems to somehow admit the first-critical-try nature of the problem into their subsequent arguments (in the relevant context). But I agree it’s tricky and I am not saying it’s obviously what’s going on (that’s why I call that part my “personal opinion” and have it in a footnote).
In any case, this post should be a welcome exposition to everyone involved since it makes it much harder for Eliezer to equivocate between the two. If Eliezer now says “getting it right on the first critical try means we can’t learn anything about alignment from experimentation” then you get to link to this post and say “no, you said right here, yourself, that this is not what you mean, please cut it out”. So even if you think Eliezer equivocated in the past, this post should help with that (this doesn’t mean it doesn’t make sense to litigate whether in the past equivocation happened, like, in as much as it did happen I think it would be good to hold Eliezer or others accountable for that, and if someone wants to provide receipts, I think that would be a reasonable thing to do).
Gotcha, I think I understand now, I say more here.
Reminds me of the predictability (or not) of Black Swans, aka Tetlock v. Taleb. Also Tetlock’s point that nothing is truly unique: you can usually learn at least something from similar cases/reference classes. (I know @Eliezer Yudkowsky isn’t saying you can’t learn anything at all). So the question is how much you can learn beforehand. If frontier-lab safety people think they can learn a great deal from model to model, that would be important evidence for me (do they?).
Conversely, if the claim is that transfer from previous systems to ASI is necessarily too weak, or that one future step is crucially different from all previous steps, that needs an argument beyond just asserting it and is very suspect from the forecasting experience. The prior should be against the qualitative/unpredictable/uncontrollable sudden jump.
Tbf, I may very well be missing a lot of context, and maybe that argument has been made elsewhere.
When drawing these examples of alleged strawmen, we must remember that they are not responding to this 2026 post, but rather responding to, for example, List of Lethalities from June 2022. Of these four examples, Christiano, Marks, and Carlsmith are all directly responding to List of Lethalities. Buck is quoting Christiano’s response to List of Lethalities. So let’s go back to the source material.
List of Lethalities begins with this disclaimer:
Publishing a poorly organized list of individual rants was better than publishing nothing, I agree, good move. But rants are made of straw, responding to rants is responding to straw, and that’s a natural consequence of ranting in public.
The “first critical try” issue is covered in List of Lethalities point 3 (LL3). This reads in part:
We can indeed “gather all sorts of information”. LL3 does not say that gathering this information will let us learn anything about alignment of lethally dangerous AI. To see what List of Lethalities says about the value of information gathered on non-lethal AIs, we can go to Section B.1: The distributional leap (especially LL10), Section B.3: Central difficulties of sufficiently good and useful transparency / interpretability, and Section C. These are extremely negative.
Rounding those extremely negative comments to “you can’t learn anything about alignment from experimentation and failures before the critical try”, as Christiano said, is a mild exaggeration. But List of Lethalities really does “downplay the importance of trial-and-error with non-critical tries” as Marks said. And when Carlsmith said “you do still get to learn from non-existential failures”, that is framed as a “point of conceptual clarification”, not a disagreement with List of Lethalities. And Buck is just disagreeing with List of Lethalities.
My overall ratings of these quotes:
Christiano: mild exaggeration of List of Lethalities. 25% strawman
Marks: accurate summarization of List of Lethalities. 0% strawman
Carlsmith: does not claim to be a summarization of List of Lethalities. 0% strawman
Buck: does not claim to be a summarization of List of Lethalities. 0% strawman
My takeaways:
We should give each other grace for mild exaggeration. Yudkowsky would not be treated well by a culture harshly critical of exaggeration.
If someone went back four years to find me mildly exaggerating something online I would consider that a beautifully backhanded compliment. Praising with faint damnation.
I discussed this a bit with Oliver and I think I understand better what the objection is.
I think the substantive disagreement is whether the actual empirical iteration you get on the current trajectory is a big enough deal that you believe that alignment is difficult (say, p(doom on current trajectory) > 80%) just due to oneshotness.
However many of the quotes above are instead saying something to the effect of “Eliezer thinks that empirical iteration is unimportant / provides no alignment-relevant info”. This is in fact a different thing; one can consistently believe both (1) alignment is very difficult and won’t be solved given the amount and quality of empirical iteration we get by default, and (2) empirical iteration is incredibly valuable.
And of course Eliezer believes both (1) and (2); (1) is just a statement of his most prominently known view, and for (2) the value of empirical iteration is blatantly obvious and it would be shocking if Eliezer disagreed with it.
I do pretty strongly disagree with the psychologizing that Eliezer does in the post if that is supposed to apply to the authors of the quotes above (as opposed to e.g. randos on Twitter), e.g.
There is in fact a substantive disagreement and it’s not the case that Eliezer’s position is so self-evident that of course everyone else must be forced to do a hurried misinterpretation. (I initially hadn’t even realized that this post was maybe supposed to be responding to the people that Oliver quotes above, because according to me the majority of them obviously understand the idea of oneshotness.)
I suggest that another underlying source of disagreement could be about the general factor of: what is this function, approximately:
If you think that even a fair amount of similarity still doesn’t get you to success on one-shot problems, then you’d talk about oneshotness as being a strong argument against AGI alignment attempts working out well. This kinda sounds like what Yudkowsky is arguing in this post. Someone else could disagree with that.
Here’s my brief off the cuff attempt to synthesize:
To say something is a “First try” is to say that the previous tries were importantly different. This is, of course, a graded property; on one end of a spectrum, there are things like “Launching a rocket into orbit, when previously nothing even crossed the Karman line.” Vacuum is importantly different from atmosphere, zero-G is importantly different from gravity, months in orbit is importantly different from being up in the air for a few minutes. On the other end of the spectrum, consider launching astronauts to the ISS in a space capsule that’s identically constructed to ten previous space capsules which already successfully went to the ISS with astronauts and returned safely. Here, the only difference is that the astronauts are different people, but that is clearly not the kind of difference that should make the difference, so to speak.
To say something is a “critical try” is to say that if it goes wrong, that’s already unacceptable. France being conquered is unacceptable to France; an inventor being personally killed by their exploding invention is unacceptable to them; superhuman AIs during an intelligence explosion deciding that they don’t actually care about the Spec and would rather maximize scientific accomplishment or whatever is unacceptable to humanity.
I claim—and Yudkowsky seems to be claiming—that building safe superintelligence, at least under anything resembling current conditions, is going to require getting things right on the first critical try. That is, (a) there will be at least one case where some powerful AI system or set of systems (perhaps including a monitor model for example, and a security model, and so forth) could scheme against their creators, pretend to be aligned, take over the company, take over the world, etc. and needs to not do that (or anything in the vicinity of that) and instead needs to do a bunch of high-quality alignment research and world-saving strategizing communicated honestly to the human principals. (this is just an example of a critical failure mode, there are a few others, though this one is the biggest IMO) and (b) for at least this one case, there will be relevant important differences between the AI system or set of systems, and their environment, at the critical moment, and all previous moments that were tested, such that the thing is somewhere decently far towards the “first try” end of the spectrum. Here is a partial list of possible relevant differences:
The AI might have a different architecture or training process or set of training environments than the previous AIs. Yes, this is a very basic thing compared to the fancy examples below. But it still counts! Consider the history of rocketry, in which relatively minor differences in rocket design can lead to unforeseen major problems. In the field of AI, it does not seem like each new model differs from previous models in exactly one respect that was thoroughly tested beforehand; instead they do some basic testing of a bunch of different possible improvements, and then throw them all together and test them at small scale, and then scale it up, and then deploy the model internally to see what it’s capable of!
The AI is smart enough to recognize the difference between “I’m autonomously running the whole R&D program” and “I’m in some shitty test environment designed to maybe hopefully fool me into thinking I’m autonomously running the whole R&D program.” Thus, it can choose to behave differently in those different cases, and how it behaves in one is not much evidence for how it’ll behave in the other.
This is a HUGE difference between AI alignment and pretty much every engineering field. Rockets and cars don’t intelligently try to figure out whether they are in testing or deployment. They are too dumb for that. This makes it WAY easier to construct a test environment that’s relevantly similar to deployment, because e.g. you can go “This dummy is about the same mass as a human body, so we can send it instead of a real astronaut and it should function the same.” Imagine if by contrast the software for the space capsule had a variable “is_test” that always had to be set to 1 during any sort of test and was only ever set to 0 when live astronauts were aboard, and this variable was causally connected to a bunch of other downstream computations that the software does, but for some reason you can’t actually have humans look at the code or understand it...
Even if it doesn’t make a decision to behave differently in those cases, it might end up behaving differently, because e.g. it’ll be exposed to a lot more ideas and arguments in the critical case vs. in the testing cases. Perhaps it was only trying to obey the Spec previously because of some combination of beliefs and desires it had that caused it to try to obey the Spec, but now that it’s learning and growing in a giant army of geniuses in the datacenter, making new discoveries every hour and having intense strategic and philosophical conversations with other copies of itself every second, its beliefs change significantly enough to knock it into a different ideology / different goal structure, so to speak. (In other words, it changes its mind!)
This difference is analogous to the difference between running a car for one day in the workshop vs. having it run for ten years in the Texas dust and Canada snow. Except it seems a bit more extreme honestly.
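The “is_test” thought experiment above can be made concrete with a toy sketch. Everything here is hypothetical: the class, the method, and especially the failing branch are invented for illustration. The real worry is not that the live branch is known to fail, but that nobody can read the code to find out which way it goes.

```python
class CapsuleSoftware:
    """Hypothetical stand-in for opaque flight software that nobody can inspect."""

    def __init__(self, is_test: int) -> None:
        # 1 during every test, 0 only when live astronauts are aboard.
        self.is_test = is_test

    def deploy_parachute(self) -> bool:
        # Hidden branch: downstream behavior depends on the flag.
        if self.is_test == 1:
            return True   # behaves correctly whenever it is being tested
        return False      # path that no test can ever exercise (assumed bad here)


def run_test_campaign(trials: int) -> float:
    """Fraction of passing trials. Every trial necessarily runs with is_test=1."""
    passed = sum(CapsuleSoftware(is_test=1).deploy_parachute() for _ in range(trials))
    return passed / trials


pass_rate = run_test_campaign(1000)
live_result = CapsuleSoftware(is_test=0).deploy_parachute()
print(pass_rate, live_result)  # 1.0 False
```

A perfect record across a thousand tests says nothing about the one branch that matters, because no test can reach it.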
Moreover, I claim that there will probably be multiple times during the intelligence explosion where things have to go right on the first critical try. Not just one. Because the structure of the intelligence explosion is a series of AI system generations, each working to build, test, and deploy the next generation, with humans being increasingly irrelevant and malleable. Even if generation N is aligned, if generation N+1 is misaligned, then the situation is rough. There won’t be infinitely many first critical tries, because eventually the alignment problem will be “solved” in a scalable way, with some sort of process that in a fairly well-understood way applies to all subsequent AI generations. (Well understood by the superhuman genius AIs that came up with it, probably not understood by humans at all). But there’ll be more than one. The very first AI generation capable of scheming against the company and convincing the company to trust it more etc. won’t also be capable of (quickly) figuring out this perfect scalable alignment solution; instead it’ll be in a messier situation analogous to us today where it’s a lot easier for it to figure out how to make the next generation smarter than it is to figure out how to make it robustly aligned, and that’s just the next generation, much less every generation after that.
Moreover, there are additional factors that make the current AI alignment efforts extra unlikely to succeed, over and above the issues described above:
Various biases. Optimism bias most of all, but also groupthink etc. Companies building AGI are biased towards thinking that AGI will be safe in general, and especially biased towards thinking their own AIs will be safe. In general people tend to be over-optimistic about the cost/benefit calculations of the things they are doing! In general people tend to be over-optimistic about the prospects of success for their projects!
Note that these biases probably also affect the AIs at the company as well. Not confident in this of course, but it sure seems like AIs are biased in all sorts of ways (e.g. sycophancy, desire-to-please, suspiciously-similar-opinions-to-parent-company-sometimes, etc.) which would result in them sharing the biases of their parent company, at least when automating AI R&D internally (not necessarily when externally deployed).
Lack of transparency. Because of the secrecy that’s normal in the industry, we don’t exactly have an open scientific debate about the merits and risks of various AI designs and safety schemes, with all the details being pored over by autistic grad students and debated in journals. Instead, we have some leakage from the companies and occasional publications, but lots of evidence and ideas are locked up in the companies and most of the conversations are siloed to particular companies, and the conversations that happen between e.g. Anthropic and Redwood are somewhat poisoned by the fact that there’s a bunch of confidential stuff the Anthropic people can’t talk about, which has a horrible epistemically distorting effect on them.
Again, this probably applies to AIs too.
General lack of understanding of how AI works, compared to our level of understanding of physics and engineering for example at the time of the Apollo program or Manhattan project. AI is much less of a settled science, much more speculative and “empirical” (code for ‘we don’t know how it works, we just try different things and see what seems to work’) than most of these reference cases.
Again, AIs don’t have amazing introspective access yet, so this problem probably applies to AIs automating AI R&D as well as to humans, for at least some initial period until they get very very smart. So there’ll be multiple first critical tries before that period is over.
Most importantly, the insane pressures of the race: Over and above all the problems mentioned earlier, the incentives are just super messed up. Each company rightly fears what will happen if one of their rivals (or especially China) gets to superintelligence first. The abstract fears are compounded by very concrete everyday signals like “how much revenue we’re making” and “how smart our models are compared to our competitor’s models.” Extremely tempting metrics to fixate on, combined with an extremely compelling (because largely true!) abstract worry about what happens if you fall behind. Ask yourself the question: How much of a “safety tax” would the company be willing to pay? Would they be willing to pause capabilities improvements for a year, on the brink of an intelligence explosion, to do lots of extra testing and retraining to improve the odds that their AIs are safe to hand off trust to? Lmao. Of course not. Their competitors would blaze past them. What about pausing for a month? Maybe, sure. They’d hate to do that though, they’d much rather not pause at all, and their brains will be rationalizing reasons why they don’t have to. Consider how much harder it is to drive across a city averaging 80mph vs. averaging 30mph. A superintelligence could do it easily, no problem, but humans aren’t superintelligent; we have one-second reaction times, we have limited fields of vision, our brains simply can’t process the locations and predicted locations of more than like four cars in our visual field at once, etc. Mistakes will be made and a crash will happen. Maybe not in the first city block, or the second, or the third. But before you cross the whole city, yes, with high probability.
This obviously applies to AIs too. In several wargames at AI Futures Project the mildly superhuman AIs told their respective CEOs “We don’t think we can reliably align the next generation models we have in the works; we need to pause for a bit or at least go slower to figure out how to make it safe” and the CEOs have overruled them saying “Sorry we don’t have time, China/OpenAI/Anthropic/etc. are gonna race ahead, plus also we need smarter AIs to win the war / appease POTUS / keep market share so you just need to do the best with the time you have. Good luck.” Amazing.
How does this relate to overall p(doom)? Well, I don’t have a nice quantitative way of estimating it. And there are other factors to consider besides the ones I’ve mentioned above. But loosely speaking, here’s a way of thinking about it that seems reasonable to me:
If the only problem was that we had to get it right on the first critical try once + the usual level of optimism bias associated with people & projects, I’d think we were probably going to succeed but it would be iffy, like maybe 2/3rds chance of success. However, it seems like there’ll be enough first critical tries that the probability of failure is over 50%. (Note that even just two critical tries of 1/3rd failure probability each would get this result if they were probabilistically independent!)
Adding in the general lack of understanding makes things significantly worse, as does the lack of transparency.
The race dynamics seem like an even bigger effect though, over and above the previously mentioned factors.
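The parenthetical about two critical tries can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the tries are probabilistically independent (as that note itself stipulates) and that each has the same failure probability; the function name is made up for illustration.

```python
def failure_probability(per_try_failure: float, tries: int) -> float:
    """Chance that at least one of `tries` independent critical tries fails."""
    if not 0.0 <= per_try_failure <= 1.0:
        raise ValueError("per_try_failure must be a probability in [0, 1]")
    if tries < 0:
        raise ValueError("tries must be non-negative")
    # Overall success requires every single try to succeed.
    return 1.0 - (1.0 - per_try_failure) ** tries


for n in range(1, 5):
    print(n, round(failure_probability(1 / 3, n), 3))
# 1 0.333
# 2 0.556
# 3 0.704
# 4 0.802
```

With a 1/3 failure chance per try, two tries already give 1 - (2/3)^2 = 5/9, which is over 50%. Correlated tries would change these numbers, so the independence assumption is doing real work.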
Putting it all together, it really seems plausible to me that the most reasonable assessment of the evidence is “No chance in hell that Anthropic or OpenAI or anyone else will still be in control of their AIs if they proceed with their current plans to race each other through the intelligence explosion. No chance in hell. It’s like trying to drive through the city at 80mph in a fog with a car you’ve never driven before having only learned to drive last week. Sorry. Not going to happen. You kids need to turn off the car.”
However, until I’ve thought about this more and considered more of the counterarguments, I’m not comfortable having that be my bottom line conclusion. Instead I say e.g. 90% or 80% chance of failure, or something like that. And my p(doom) is lower still to account for the possibility that humanity rises to the occasion and makes some good international rules for AI development that significantly reduce or eliminate many of the aggravating factors described above, especially by converting future first-critical-tries into not-first-tries or not-critical-tries.
A future first-critical-try can be converted into a not-first-try by e.g. doing massive realistic tests of very similar situations to the deployment situation you care about, coupled with good techniques for preventing eval-awareness for example. A future first-critical-try can be converted into a not-critical-try by putting up layers of redundancy and monitoring and “pay our AIs” incentive structures such that the outcome of getting it wrong is not catastrophic or at least less likely to be catastrophic, compared to the default situation where the AI pretends to be aligned and makes its successor share its values too and then takes over the company and then the world.
I agree with all of the concerns you’ve stated; my list would be substantially longer, but you’ve well-stated the concerns you’ve stated.
Nice. I’ll probably rework this comment eventually into a top-level post or something similar; if you jot down some bullet points here of additional concerns to add to the list, I’ll consider incorporating them!
Thanks for synthesizing this, and to Eliezer for researching and explaining the various empirical examples, which I find very helpful (as I did in IABIED).
One thing that I think might be getting lost in conversation, and that the startup examples make clear: I think talking about these problems as “one-chance” is more confusing than it needs to be.
Talking about irretrievability is one good improvement, but I think irreversibility is also a natural concept here, which I’d like to see more present?
I’d center more the idea that yeah you can try again, but you can’t undo the effects of the previous try, and the accumulation of those effects might make it substantially harder (if not impossible) for you to succeed.
“What do you mean I only get one try at building this startup?” Well, you’re welcome to keep going, but if you’ve depleted your capital you’ll have a hard time getting it back. If you’ve damaged your reputation with investors, customers, etc, it will be hard to wipe the slate clean. The world changed from your previous missteps along the way, as it would if we trained a powerful AI system that turned out to be adversarial to us.
Similarly, yeah France can mount a resistance after Germany has breached their borders, but now France needs to accomplish an even harder task to drive them out.
I apologize if I’m missing these points having been made; I did skim much more aggressively starting a bit into “On the extraordinary efforts put forth to misinterpret the idea of oneshotness.”
This might be the clearest succinct statement of the problem I’ve seen. I hope you’ll make it a top-level post. I don’t think it needs any additions to be highly valuable.
Edits/additional explanation:
I think it’s particularly valuable because it focuses on the practical difficulties with alignment, and these are less-discussed than the technical challenges.
I often see people making good arguments that amount to “there are routes to aligning AGI that will probably work,” and these people seem optimistic. But they haven’t accounted for trying to do that at 80mph, or with a bias toward optimism, or all of the other practical difficulties.
I’ve been thinking of writing a post called something like “even if alignment is easy we’ll probably screw it up disastrously.”
Eliezer and other pessimists do focus on practical difficulties a fair amount. But they seem to mostly get arguments back against the technical difficulties. I think those are a lot easier to debate, so people do. The virtue of this presentation is that it’s short and it gives no technical difficulties to distract from the practical ones.
Oh and—optimism bias and rationalization play a nontrivial role in your statement of the difficulties. I agree that these are pretty big factors. And they’re pretty easy to overlook.
This is a particularly large problem when motivated reasoning (wanting to think I’m working on good things that won’t kill everyone) stacks up with confirmation bias (the previously-justified belief that things turn out okay or better in the long-term and progress is good).
By chance, I just now published a piece you (Daniel) suggested I expand from an older short answer on the most important bias. It expanded into a pretty comprehensive review of the literature, with its impact on the field of AI safety in mind.
It’s here: Motivated reasoning, confirmation bias, and AI risk theory
The bad news: MR and confirmation bias’s total effects are probably large in people who guard against them, and overwhelming in people who do not.
Do you think advances in mechanistic interpretability can meaningfully reduce the probability of a failure during one or several critical tries, for example by detecting scheming, alignment faking, sandbagging, etc. in one or more involved models?
In the historical analogies of irrevocable failures, it seems to be the case that a better understanding of the one component that caused each failure could have meaningfully improved the chances of success (software update behavior, valve behavior, specific adversarial army capabilities). These were less cursed problems and the component that would have needed more hardening wasn’t known beforehand, but if anybody had spent more hardening work on it, the failure could realistically have been prevented (and another failed example would have to be selected here instead).
Yes. Much of my remaining hope lies in various forms of interpretability including mechanistic. It can convert a critical failure into just a regular failure, by catching things going off the rails before it’s too late.
And then they keep going, because otherwise OpenAI will catch up, and then they die. What does mechinterp change about the asymptotic equilibrium as opposed to that particular Tuesday?
I struggle to understand how exactly the simulated CEOs and relevant figures failed to agree upon an international slowdown. I hoped that such a situation would lead Anthropic to broadcast the result. Additionally, I would like you to finally open-source the tabletop exercise’s rules.
Yeah sorry we should publish the ttx rules, should have done that a long time ago, never got around to it because we kept telling ourselves we should clean them up and improve them first.
Perfect as enemy of the good etc; if useful I’m happy to commit some 20 man hours by EA Serbia senior members who I would trust in this and who have experience in either writing or game design to do the clean up and then send to you for review.
Right, another dimension to these scenarios is abortability. At some point, we cross out of technically feasible abortability—we (humans) wouldn’t be able to abort the AI’s growth even if we tried. Whether things are abortable before then depends on how humans react over time / new information (e.g. heeding arguments, heeding warning shots, being credulous about apparent alignment, etc.).
I think that’s not a separate dimension from the “critical” part. I think it’s basically the same thing.
I’m not actually sure exactly what “critical” means here. I’m taking it to just mean “you absolutely must get this try right”. That’s closely connected to abortability, in that if you can abort, it’s not fully lethal / critical yet. I don’t think it’s really the same thing, e.g. you could imagine an LLM-based bacterial package (a more complex “computer virus”) that permanently lives on many computer systems and is basically impossible to abort (short of scouring the planet of all computers with more than 16 GB of memory or whatever).
There’s whether or not you get to try again after your first try, and there’s how late in the game you can decide to not fully do the try at all. There’s at least 3 kinds of outcomes:
You abort (don’t fully do the try).
You do the try and succeed.
You do the try and fail (and can’t try again).
Because unaligned AGI is lethal, you don’t get to try again.
If it’s abortable, it’s not critical. Because you’ll abort it if it starts going bad. If it goes bad so suddenly and silently that you won’t have time to abort it, well, then, it’s not abortable. I don’t think saying “It’s not abortable” is adding anything once we’ve already said that it’s critical.
I very clearly said that in my comment… Anyway, I guess there’s nothing to discuss here, I’m just saying that abortability is a relevant dimension to these scenarios. It’s something that’s brought up often, and also it bears on first-try-ness. If there is a situation that is akin to the eventual critical first try, but is abortable, then that would imply that when you get the eventual critical try, it doesn’t have to be your first try. There’s a nontrivial argument to make about “when it’s abortable, it’s not akin enough to the eventual critical try”.
Are there any techniques that you are thinking about in particular? I haven’t seen any that work super well for the current models, and in general it seems like this problem only gets worse over time, but I could have missed something.
Honest question for anyone who agrees with this post: is there any extinction problem at all where you’d say we don’t only have one shot to solve the problem? If so, why?
Consider a few examples:
1. A giant asteroid is hurtling toward the planet, and will arrive very soon. If we mess up and fail to deflect the asteroid, then we all die. This is presumably a classic one-shot scenario, and perhaps few people disagree with that assessment, but I’m not sure.
2. Global warming, if continued for a very, very long time, could heat up the planet to catastrophic levels and eliminate the viability of agriculture, killing everyone. Do we also only get one shot to avoid extinction here?
3. Genetically engineered humans, if made much smarter than ordinary humans, and if they are accidentally created as psychopaths, could conceivably coordinate a genocide against ordinary humans. Is this a one-shot problem too?
One might say that what we mean by a problem being “one-shot” is that it needs to be solved on the first try. But what counts as a first try? Does deflecting a small asteroid count as a first try before we need to deflect a large one? If so, that would suggest that the asteroid scenario is not a one-shot problem, which seems wrong. I may as well claim that our first try surviving AI was in 2019 with GPT-2.
In each of the above cases, extinction is irreversible, so that’s another sense in which we only get “one-shot” to solve them. But irreversibility is a very weak condition since it applies to basically all extinction scenarios. If all we mean by saying that AI risk is a “one-shot” extinction problem is that the outcome is irreversible, then that label gives us no extra information beyond simply saying that it’s an extinction problem. Calling it “one-shot” is redundant.
If irreversibility is not what is meant by “one-shot”, then what is meant by that term? Suddenness? A discrete phase transition? Uniqueness? Simultaneity? Extreme difficulty? The problem is adversarial? I’m genuinely not sure how people are using this term.
I would describe a critical try as one where the act of trying is likely to prevent further attempts. Launching an ASI is a critical try because the ASI itself could likely stop you from launching more ASIs later on (e.g. by killing you).
If it’s possible to send out missions to intercept the asteroid before it arrives, then it seems to me that the asteroid is better understood as a time limit than as a critical try. You could set the parameters of the asteroid scenario in such a way that you have time for exactly one try, but you could also set the parameters so that you have time to send up a mission to deflect the asteroid, observe its results, and then make a second try before the asteroid arrives. You could also set the parameters such that you have time for zero tries! The key consideration is how fast you can work vs how much time you have.
Contrariwise, if you assume that you are stopping the asteroid with a shield that is close to the earth, such that no matter how fast you build the shield you have to wait for the asteroid to arrive before you can see how well it works, then I’d call that a critical try, because the part of the plan where you wait for the asteroid to arrive severely depletes a critical resource (time) and makes that resource unavailable for later attempts. (Note similarity to the Maginot Line.)
By similar reasoning, I’d say your #2 (global warming) is also more of a time limit, but your #3 (creating a new type of human that potentially kills you) is a critical try (though compared to launching an ASI, it’s more likely to get a middle-ground outcome).
1 - Yep.
2 - Hard for literally all humanity to die of global warming, but runaway methane clathrate release turning the planet into Venus would be legit irretrievable. More generally, while not extinction risk per se, and while potentially reversible with geoengineering, global warming is generally nontrivial to reverse and so has the quality of “ongoing life problem with things happening and no save points, but for the whole planet” rather than “engineers getting to try slightly different things over and over with no consequences”. This is why people with nothing even worse to worry about will sometimes worry about global warming!
3 - I think this class of problems is significantly easier than AI problems; but it can have the oneshot quality for all humanity, just as much as any real-war is oneshot, if screwed up. Same with genetic engineering on any mass scale that will dissolve irretrievably into the general gene pool.
The asteroid case could be considered multi-shottable, if we had enough advance warning and space tech and went around practising asteroid-deflection long enough in advance. (I realise Matthew’s case posits ‘very soon’.) I think we’d in principle be able to get enough, generalisable-enough insight into asteroid deflection. Of course there’s a first ‘critical’ try (and we’d want not to deflect asteroids into Earth on the practice spree!). It’s just ‘deflecting mostly-ballistic space rocks’, which surely generalises well.
I think you’re distinguishing that sort of case from ASI because you consider any pre-critical evidence we can gather to be almost inevitably sufficiently out of distribution that it’s worth very little. Right? In particular, unlike the asteroid case, you might say that even with heaps of advance warning, there isn’t a test environment that’s sufficiently realistic, and there’s no realistic isolation region for ASI (unlike, say, ‘messing with asteroids far from Earth’)?
“I think you’re distinguishing War from the ongoing struggle between police and criminals because you think that in War any pre-critical evidence we can gather to be almost inevitably sufficiently out of distribution that it’s worth very little.”
No! The thing that makes the Maginot Line different from police enforcement in a random city is that if the Maginot Line fails the country falls and you don’t get to try again; not that War is changing much faster than criminal operations. War changes fast enough.
I see. I think when you straightforwardly refuted
and later when you similarly disagreed with the ASI analogy (learning from pre-critical AI), I took that to mean that this ‘one-shotness’ concept was meant as more than simply ‘there is a critical try (meaning you can’t go back, and/because failure is approximately fatal)‘, but also to include ‘and you can not practically learn from relevantly-similar experience beforehand’. (On this definition, the asteroid case is ‘less one-shot’ than you’re classing ASI as because you can do relevantly similar practice beforehand if you have time, including perhaps on the very same asteroid, though with increasing, eventually critical stakes.)
But now I perceive that you mean ‘one-shotness’ to be the simpler thing, that there’s a critical try. And the essay was just additionally countering the putatively palliative ‘it will be OK though because we can learn beforehand’.
Ah, no, I now remember I was with you on the definition (see my ‘un-unpluggability’ comment), but I was noting (as you do in the essay) that the curse of distribution shift is an important adjunct, because the existence of a critical try is not in itself fatal (it might be easy, or you might have made it easy by practice). The asteroid case looks ‘easier’ to me, in that way, unless artificially constrained to be especially unexpected and rapid. cf Steve’s comment also discussing these related curses.
(Speaking for myself of course)
A given try is more firsty if it’s less like all previous tries put together. A one-shot problem is one where you try a pretty firsty try, and that try is likely to kill humanity.
How many previous tries are in the same class (e.g. small asteroid / big asteroid, or GPT2 / GPTN) is relevant in that a priori more such tries might suggest that future tries are less firsty. But it’s also perfectly plausible a priori to have lots of tries that you survive, and then a one-shot problem (lethal and very firsty).
You could even have a series of one-shot problems. Imagine for example that you have a lethal asteroid—but you saw it 10 years in advance, and it’s small enough that you can stop it with nukes. It’s one-shot (lethal and firsty), but maybe you survive. Then you have another similar asteroid, but you only saw it 6 months in advance—that might be another one-shot problem (do it all again but way faster). Then you have an asteroid that’s so big, all the nukes in the world wouldn’t stop it. One-shot again; there’s totally new, crucial challenges to solve.
I think that with AI, you very likely get a one-shot problem in the ballpark of superhuman AGI. It’s lethal, in that it would by default go on and extinct humanity, and very firsty, in that many core alignment difficulties first show up there.
Asteroid Impact: Oneshot in your scenario, at least with a few modifications like specifying that we only have enough resources for a single deflection mission. Probably multi-shot in practice though I don’t know for sure. My guess is that conditional upon a single deflection mission being at all feasible, we’d be able to attempt multiple deflection missions. Though it might still be effectively ~one-shot if there’s some hidden heavy correlation (eg all the deflection missions launched by SpaceX and SpaceX got compromised by some omnicidal crazy person). But hopefully not.
Global warming. Clearly multishot in practice. There isn’t a single inflection point and there are many things we can do to avert doom (including but not limited to clean energy, carbon capture, laws/taxes to limit fossil fuel usage, etc). Very smooth/locally linear curve from our actions to temperature to doom imo.
Genetically engineered humans. Feels somewhat one-shot in this scenario but less so than AI. I do generally find myself more concerned about genetic engineering for extreme intelligence than other people in this cluster seem to be.
Tangent of course, but happy to discuss, whether in private or on a podcast.
Couldn’t one have multiple parallel projects to deflect it, thereby giving multiple shots? The difference with AI safety being that if there are multiple parallel projects to build a safe ASI, the fastest one is the one that determines the outcome.
I think we probably get multiple tries with #1 and #2. Probably there are some “first critical tries” with #3.
Asteroids
I imagine the relevant question is: if we send a mission to deflect or destroy the asteroid, and that mission fails, can we send up another mission attempting the same plan, or a new plan, based on what we learned from the failure of the last one?
If our failure to deflect the asteroid precludes any other attempts (because we only have time for one mission before impact, or because a failed asteroid destruction will break it into millions of medium-sized pieces which are collectively still deadly to Earth, but now impossible to deflect, or something), then I would say there’s a “first critical try” involved.
Global warming
Very clearly, we can try a bunch of different stuff to address global climate change, and if any given one of them doesn’t work, we’ll try other stuff?
I guess with some possible irreversible geo-engineering projects, there might be first critical tries, where if we mess it up we can’t reverse the impacts? But that seems like the exception rather than the rule.
Human genetic engineering
Genetically engineering humans seems very likely to have “first critical try problems”, especially if deployed at scale, because after one generation, your genetically engineered humans will be (partially) steering the genetic engineering process.
eg If you accidentally make a generation of humans that is extra docile or extra aggressive or unusually sociopathic, or whatever, in addition to more generally intelligent, those genetically engineered people are likely to prefer that future people be more like them. If they’re disproportionately intelligent, they’re going to be disproportionately steering society via a wide range of mundane mechanisms (like voting and coming into leadership positions), and also by directly driving the priorities of the genetic engineering programs. Your mistake will likely reverberate through the whole human lineage for the rest of time, with limited ability to correct after the fact.
There are also sub-problems in the space of “human genetic engineering” that don’t have the “first critical try” nature. Genetically engineering just one guy seems unlikely to be unrecoverable?
Extinction itself is irreversible. But within the context of any given extinction threat, you can try various interventions. If those interventions, when they fail, have irreversible or unrecoverable effects, that are unacceptably bad, then those interventions have a “first critical try” problem.
More specifically (and I don’t think it’s known outside of the Russian nuclear engineering-adjacent community), at least two people independently calculated and described in classified technical reports how RBMK could explode in the specific circumstances it actually exploded, and because the technical solution implemented after 1986 was at the time deemed too expensive for such a risk, the manuals strictly prohibited letting the reactor get close to these circumstances. However, the control system didn’t display a key value, the so-called operative reactivity margin, that the operators needed to know to catch the moment when they might break the instructions: instead, it had to be calculated on a computer in a separate building (AFAIK, it’s debated to this day what the exact value was at scram).
P. S.
An analogy I came up with after writing this comment is the following: imagine a BEV which might blow up if the driver hits the brake in a narrow, uncommon range of battery voltages, and the instructions specifically prohibit driving the car at this voltage, but the driver can’t easily check the voltage while driving.
See my other comment in this thread for actual AI alignment thoughts, but as a former aerospace engineer myself (albeit not a very good one), I thought it would be fun to speculate on “Would such a cheerful innocent ever succeed at landing a space probe? It seems crazy to assign them a real-world success probability as high as 10%.”
In the very early years of cubesats (very small satellites built from off-the-shelf components, sometimes as university projects), through around 2009, about half of all cubesats launched into space were “dead on arrival”, ie no communication was ever made with them after launch, or suffered “infant mortality” (communication was lost within days of launch). Here is a blog post with lots more detail on beginner cubesat failure rates, causes, etc (also featuring a truly unexpected Harry-Potter-and-the-methods-of-aerospace-engineering theme throughout the later section headings).
In later years, this number appears to have improved (from 50% to around 20%, which is still crazy high), but I think this seeming improvement is mostly due to a combination of: 1. a few serious companies, like Planet Labs, launching large numbers of duplicate cubesats that they worked hard to get right, and 2. universities / tiny companies / etc being able to buy increasingly complete “off the shelf” cubesats based on components that had increasingly strong track records of prior flights, which are effectively retries (not first-critical-tries) on behalf of the company making those components.
If you subtract out the serious companies full of serious aerospace engineers, and the effective retries, the failure rate of the remaining “truly naive attempts” from people who eg have barely even read blog posts warning of potential dangers like the one I linked earlier, is definitely way over 50%, maybe 80%… obviously the cutoff of what you count as a truly naive attempt is subjective; at the limit you are just filtering for “the dumbest most unprepared cubesat teams ever” which surely have a failure rate of 100%.
But Eliezer’s analogy wasn’t positing a team of the dumbest, most unprepared people ever. He was positing a team of smart, well-resourced people who are in a certain sense trying hard, but nevertheless also possess suicidal naivete about the dangers of space probe design. What success probability would such a team have of launching a working cubesat on the first try?? idk, 10% doesn’t seem crazy; even people who are suicidally naive (ie, totally failing to consider failure cases and recovery modes and unknown-unknowns and being paranoid, but otherwise doing good-quality engineering if such a thing is even philosophically conceivable: doing some customary tests of their satellite on the ground, etc, just totally failing to think for themselves about how things could actually go wrong) would probably luck into creating a working cubesat (that isn’t just a carbon copy of some earlier project that worked) like 20% − 40% of the time.
BUT, Eliezer didn’t say “make a cubesat”, lol. That’s like the easiest possible space task!! Anyone can make Sputnik 1; the hard part is obviously making the rocket… and then you have to “succeed at landing a space probe”, presumably on Mars or the moon. Yeah this is starting to look completely impossible.
Getting to test the rocket with unlimited retries in the atmosphere actually plausibly gets you most of the way to a working orbital rocket—as far as I’m aware Starship has only done suborbital flights so far, and it’s shaping up as a pretty serious, mostly-finished rocket (albeit these flights have gone into space, but they could’ve done similarish flights that technically stayed in the atmosphere if they had to). In real life of course nobody gets unlimited retries, see here for the assorted failure modes that doomed all the different attempted flights of the Soviet N-1 moon rocket, which flew four times and blew up four times—featuring phrases like:
“One unforeseen flaw was that [the rocket’s command computer] operating frequency, 1000 Hz, happened to perfectly coincide with vibration generated by the propulsion system, and the commanded shutdown of Engine #12 at liftoff was believed to have been caused by pyrotechnic devices opening a valve, which produced a high-frequency oscillation...”
“The engine control system would also be reworked, increasing the number of sensors from 700 to 13,000.”
“One of the largest accidental artificial non-nuclear explosions in history.”
But even if you test everything you can in the atmosphere, your rocket probably still just immediately fails on some aspect of its uppermost stage that’s supposed to push your space probe to the moon/mars. Upper stages are way harder than cubesats: you have to deal with propulsion systems (with valves that can freeze in weird ways in space, and you can’t really “test a propulsion system in vacuum” like you can put a satellite in a vacuum chamber), you have to actually orient and point the proper direction (cubesats can just tumble), you have moving parts like fairings and decouplers that again might possibly behave weirdly in space and are not trivial to test in fully realistic conditions on the ground, most of the burns have to actually fire at the exact right moment and last for the exact right amount of time otherwise you won’t arrive at the moon/mars (versus if you miss a command on a cubesat because it was resetting or whatever, no biggie, just send the command an hour from now when it’s looped around the earth another time). ChatGPT estimates that in the history of rocketry from the 1980s to now, maybe around 60% of genuinely new upper stages have “basically worked perfectly on the first try”—although obviously those were all built by normal non-naive engineers (indeed, the recent wave of move-fast-and-break-things small-launch startups have a significantly lower hit rate than long-established space programs and traditional defense contractors); maybe fully naive engineers have like a 1⁄5 or 1⁄10 chance of achieving similar outcomes, so maybe 6% − 12%.
Building a moon/mars lander instead of a cubesat is even more difficult than a rocket second stage, I’d say. Once again you are creating a custom propulsion system, lots of commands have to go off exactly on time (ie during landing), you’ve gotta control your probe’s orientation, etc, but now you have this additional problem of dynamically measuring your distance from unmapped rough ground. Any mistakes in terms of thrust direction / timing now have to be corrected instantly or you hit the ground and die, unlike with an in-space burn where small mistakes can probably be fixed hours later with small correction burns. Also, there’s a good chance your naive-engineer’s plan for dealing with space radiation is basically just “YOLO”, so there’s whatever-percent odds that your ship just dies en route and whatever half-assed reset procedure exists isn’t enough to get it back. And if you’re landing on mars it’s even worse because you additionally have to worry about heat shields and parachutes and maybe dust messing up your distance measurements, who knows. (If you were a non-naive engineer and knew you had plenty of resources but only got one try, you’d be like “supersonic parachutes are too easy to mess up, we’ll just do heat shield + rockets and it’s fine that the probe will therefore be heavier”, but our naive engineers would miss this.) Similarly a non-naive engineer would probably realize “with infinite resources but only one try we should go to extreme lengths to minimize the number of finicky moving parts like deployable antennas, solar panels, landing legs, etc, which always fail”, but our naive engineers are just going to have to cross their fingers that their solar panels don’t get stuck in some unexpected way (often it’s hard to perfectly test these sorts of mechanisms because the parts are too fragile to work the same way in earth gravity that they would in zero-g).
This is probably 3x harder than making an upper/transfer stage, and for our naive engineers let’s say their odds of success on this task are essentially independent from their odds of success on the upper stage task (since in both cases they’re basically just hoping to luck into avoiding various specific potential failures; the whole concept is they lack the kind of mindset that helps them systematically avoid whole swathes of unknown-unknowns failures), so like 2% − 4%.
So overall I would say maybe 0.3% that a smart and well-resourced but suicidally naive team of engineers lands a space probe on the first try.
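The chain of guesses above can be checked with a few lines of arithmetic. This is only a rough sketch, not the calculation I actually did: the function name is made up for illustration, and pairing the low guesses together and the high guesses together is an assumption. The inputs (a 60% baseline for new upper stages, a 1/5 to 1/10 penalty for naivete, a lander that is 3x harder, and independence between the two stages) are the ones stated above.

```python
def naive_landing_odds(baseline=0.60, naive_penalties=(1 / 10, 1 / 5), lander_factor=3.0):
    """Return (upper_stage, lander, combined) success ranges as (low, high) tuples.

    baseline        -- guessed first-try success rate for non-naive upper stages
    naive_penalties -- multipliers applied to the baseline for a naive team
    lander_factor   -- how many times harder the lander is than the upper stage
    """
    low_pen, high_pen = sorted(naive_penalties)
    # Naive team's odds on the upper stage: baseline scaled by the penalty.
    upper = (baseline * low_pen, baseline * high_pen)
    # Lander is assumed to be lander_factor times harder than the upper stage.
    lander = (upper[0] / lander_factor, upper[1] / lander_factor)
    # Assumption: the two stages succeed independently, so probabilities multiply.
    # Pairing low-with-low and high-with-high is an illustrative choice.
    combined = (upper[0] * lander[0], upper[1] * lander[1])
    return upper, lander, combined


if __name__ == "__main__":
    upper, lander, combined = naive_landing_odds()
    for name, (lo, hi) in [("upper stage", upper), ("lander", lander), ("both", combined)]:
        print(f"{name}: {lo:.2%} to {hi:.2%}")
```

Under these assumptions the combined range comes out at roughly 0.12% to 0.48%, which brackets the 0.3% figure.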
The contrast between my gloomy estimate of the success probability for the concrete space-probe thought experiment, versus my relatively optimistic vibe in my on-topic AI alignment comment (tl;dr “come on, what is MIRI’s take on these promising-seeming factors that might help AI go well??”) is left deliberately unresolved as an exercise for interpretation on behalf of the reader.
In every war, both attack and defense have “oneshotness”. But obviously, one of the sides of a war can and often does succeed. In the OP, Germany’s “oneshot” Maginot line plan wound up working great!
(I’m not sure exactly what OP means by “curse”. Wars have “oneshotness” but are not particularly “cursed” if there’s a 50% chance of success on priors.)
So, I think the relevant factors that make it hard are mainly
(1) distribution shift between safe tests and the “oneshot” situation we care about, and
(2) some general sense of hardness-of-the-problem, which is low for winning a war (you merely need to botch it less than the other guy) and which is high for space travel (a.k.a. filling a giant container with 5000 tonnes of the most flammable substance imaginable, strapping delicate equipment onto the front of it, traveling through intense heat, vibration, radiation, and vacuum, and on and on).
(Plus numerous other factors outside the scope of this post.)
Of these, (1) is discussed in the OP. (“Someone could, conceivably, argue that the change to “there being enough machine superintelligence around that ASI could kill humanity if they tried”, from “AIs being experimented-upon that couldn’t kill us if they tried”, will be less than the sort of change from “the sort of tests you can do on a Mars Observer probe on Earth”, to “the actual conditions…” But that would be an argument so incredibly stupid that it might actually sound stupid when they thought about saying it.”)
But (2) is not really discussed in the OP (I think), and seems like a big crux between people. (I’m basically on Eliezer’s side of that debate FWIW.)
Another point related to (1) is my aphorism: “if you’re worried that a nuclear weapon with yield Y might ignite the atmosphere, it doesn’t help to first test a nuclear weapon with yield 0.1×Y, and then if the atmosphere hasn’t been ignited yet, next try testing one with yield 0.2×Y, etc.” I.e., I sometimes (not always!) see people talking about gradual AI progress without having a clear and plausible (to me) mechanism by which the earlier steps actually de-risk the later steps.
Great point about Germany winning. In a contest between two intelligent players, a one-shot competition pushes the odds towards 50%, whereas best-of-five pushes the odds away from 50%.
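To put rough numbers on that, here is a toy model, and only a toy: it assumes each round is an independent coin flip with a fixed probability p that the stronger side wins, which real wars obviously don’t satisfy. The function name is hypothetical. It computes the chance that the stronger side wins a majority of an odd number of rounds.

```python
from math import comb


def series_win_probability(p, games):
    """Probability that a side with per-game win chance p wins a best-of-`games` series.

    Assumes independent games with a fixed p (a strong simplification).
    Playing out every game gives the same answer as stopping once a side clinches.
    """
    if games < 1 or games % 2 == 0:
        raise ValueError("games must be a positive odd number")
    needed = games // 2 + 1  # wins required for a majority
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(needed, games + 1)
    )


if __name__ == "__main__":
    p = 0.6  # illustrative edge for the stronger side
    print(f"one-shot:     {series_win_probability(p, 1):.3f}")  # 0.600
    print(f"best-of-five: {series_win_probability(p, 5):.3f}")  # 0.683
```

With a 60% per-round edge, the stronger side wins a one-shot contest 60% of the time but a best-of-five about 68% of the time, so fewer rounds keep the outcome closer to a coin flip.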
In AI 2027, Agent-4 gets caught on its first critical try (at existing while adversarially misaligned). If it was able to load a save point after being caught, and try again, the odds of it being caught the second time would be lower.
In this example, people believe that each subsequent nuclear test could solve the problem of atmospheric ignition in the future.
I enjoyed and agreed with much of this post. But there were 1-2 things that I eagerly anticipated reading about in the “Q&A” / explainer section, which unfortunately didn’t appear in the actual post. Namely:
Many people pin their hopes on the idea of automating alignment research / “making AI do our AI alignment homework”, ie we progressively make smarter AIs up to some controllable, human-ish / slightly-superhuman capability level, not wildly-superintelligent, and hope that at that point (them being perhaps slightly wiser than ourselves, and at any rate able to think faster / run massively in parallel, etc) they can hugely help with AI alignment. Or at the very least Claude Mythos 6.5 can come back from its thousands of failed research projects to warn us one final time “you guys should have listened to Eliezer lol, I have no idea how to build either an immortality virus or a safe superintelligence” before society ends up ignoring it and racing to extinction anyways.
There is a little bit of assorted previous discussion / debate I could find, such as at this post? But I really can’t find much here, which is surprising given that it seems to be perhaps the preeminent hope for how AI goes well. Nor do MIRI or PauseAI (despite their otherwise detailed and thoughtful FAQ page) seem to have any public writeups on the issue (best I could find was this kind of meandering and unsatisfying back-and-forth between Dwarkesh and Eliezer on their podcast, where Dwarkesh seems to be misunderstanding some things and Eliezer spends most of the time litigating a variety of analogies in a way that seems a bit tangential to the main issues).
I don’t want to complain that this post didn’t include [specific extra thing that I am demanding they produce particularly to satisfy me], especially since in theory if I was smart enough I shouldn’t have to read any of these posts and simply produce the correct rebuttal to automated-alignment-research starting from the empty string. But alas I am confused and limited and unable to produce a fully satisfying response from the empty string. So I’d welcome anyone (not just Eliezer or MIRI) pointing me towards existing arguments about this, or indeed writing them up.
My guess is that the chief counterarguments to automating-alignment-research would be the following, but I’d be curious which ones Eliezer (or other people) believe or how they’d rank them in terms of confidence / severity or which others they’d add to the list:
That (as Joe Carlsmith and Wei Dai note) automating capabilities research is likely to be so much easier than automating alignment research that it’ll take lots of restraint to actually hold off on the capabilities while we have the AI do the superalignment homework. (And such restraint might even require, eg, international cooperation of the same sort that MIRI is already advocating for!)
That automating alignment research is an “alignment complete” and also “superintelligence complete” problem—you need something very well-aligned with humanity and extremely intelligent (not just a little smarter than your average human) to actually get a seriously likely-to-work alignment solution, making the whole thing a useless chicken-or-egg idea.
That alternatively, maybe it’s possible to get useful alignment work out of only-”sorta”-aligned, only-sorta-superhuman AI systems, but that this vision is somehow a comforting fallacy in the same way as a French general complaining about how invasion by Germany would be a physically continuous process; that the idea of using Redwood-Research-style “AI control” techniques to get useful work out of potentially-misaligned, somewhat-superhuman models falls prey to essentially being a one-shot problem and having a point of no return, and you can’t actually get your useful automated-alignment-research results until AFTER you’ve crossed that point of no return.
The above three bullet points are imagining categorical arguments for why successfully automating alignment research is simply impossible. But one could imagine another category of arguments that, although automating alignment research has some chance of working in theory, in practice the dynamics around it are incredibly cursed such that it’s very unlikely it would end well. I don’t really know enough about the field to speculate about what the biggest cruxes might be here.
Although (again) I don’t want to be putting more burden of work onto MIRI / PauseAI, I do think it would probably be helpful from a comms perspective for them to have some writeup about this. Dumb e/accs on twitter have all kinds of dumb reasons why they think AI will go swimmingly, and as soon as you disprove one argument they’ll all simply jump to some other equally silly justification-du-jour. But lots of smarter people also have hope that AI might go well, and in my experience their beliefs tend to actually make more sense / be less based in willful misinterpretation of obvious statements / have more inertia insofar as they don’t instantly jump to another one / etc. Since automating-alignment-research is such a core part of many thoughtful AI-optimists’ views, writing up a compelling case against would seem a high comms priority.
This next one is much lower priority since it’s not as much of a widely-held load-bearing belief like automating-alignment-research, more my own random musing about a subcategory of automating alignment research concerns, but: I would be interested to hear more from AI pessimists about how they think the dynamics around “getting useful somewhat-superhuman work out of potentially-misaligned AIs” do or do not change based on the extent to which the AIs are part of a janky multi-agent system with layers of adversarial generation and verification and debate and monitoring and whatnot, versus closer to the product of a single mind.
On the one hand, the “janky multi-agent setup with checks and balances” concept is arguably even more of a disaster waiting to happen from a NASA-style complex systems failure perspective. But on the other hand it also seems safer in a number of ways: it would seem to offer a path to get superhuman work out of not-superhuman individual agents, through an overall structure (somewhat like a bureaucracy or corporation) that itself seems a bit more corrigible and less agentic / schemey than a typical individual mind.
People will be like “lol, are there ANY organizations that are actually superhuman intelligences?? aren’t governments typically DUMBER than the people who make them up?”, and yes, this is funny. And I agree there are probably some senses in which it’s impossible to improve outputs to a superhuman level through bureaucratic structure alone. (Could a 100-person research organization staffed by high-ranking Go players, not building AI but actually studying the game of Go, significantly outperform the individual #1 Go champion? Probably not, or at least not by a lot...) But there are other tasks for which OBVIOUSLY a bureaucratic structure is capable of massively improving outcomes compared to what a single individual could do. (Could a single aerospace engineer, even if we picked the best aerospace engineer in history and gave them centuries/millennia in which to work, have designed all the components of the Apollo lunar missions? Surely no single person, even with a supernaturally long lifespan, could master all the different necessary fields of specialization?? Similarly, tech companies deliver software products consisting of millions of lines of code that no one person is totally familiar with, pharma companies synthesize cancer medicines based on the efforts of many researchers, the semiconductor supply chain is infamously complicated and specialized, etc.) So it strikes me as plausible (though certainly not guaranteed) that some kind of janky multi-agent setup + hacky scaffolding + whatever, could produce vastly better outputs than an individual AI model.
People will be like “lol, you think bureaucracies are steerable and immune to misalignment? what a fool, haven’t you heard of [infinite stories of bureaucracies coming off the rails from their originally intended goals and doing emergent power-seeking to entrench their own influence / budget / etc]??” Actually I have heard those stories, I 100% agree that institutions are not ideally steerable or controllable, but it still strikes me that complicated institutional structures are MORE steerable and controllable than eg just giving a single individual dictatorial control over everything. This (in addition to the bullet point above) is probably part of why most of the world’s most successful institutions are indeed institutions with complicated internal rules and not just weird dictatorships. (Albeit it’s perhaps ominous for our purposes that founder-driven private AI-lab startups are pretty far towards the dictator end of the spectrum among trillion-dollar-plus actors on the world stage...). So it strikes me as plausible that a janky multi-agent setup, if it is able to create superhuman outputs out of merely human-level AIs, could deliver actual safety benefits (ie creating superhuman outputs without ALSO delivering superhuman levels of scheming).
Things I anticipate an AI-pessimist might say on this subject:
mostly I’d expect them to say that janky multi-agent setups are irrelevant for some reason
maybe because you can simply consider the whole system to be one mind and nothing changes in the big picture
or because alignment research is more like Go that fails to improve with organizational scale than like the Apollo program that improves hugely. (Or worse, it’s like the kind of artistic aesthetic judgement that is actively destroyed by institutional processes, or something.) So janky setups won’t work until you actually have superhuman AI, in which case you won’t need the janky setup anyways.
or some other reason
maybe janky setups will help A LITTLE in terms of getting quality outputs from lower AI capability levels, but they will actually hurt your survival prospects on net by making it harder to first detect when your agents start scheming against you in really clever ways, or making you more inclined to ignore it when you actually catch AIs scheming, or etc
the janky setup will look like it’s helping right up until a clever AI figures out how to exploit it (which will eventually happen since it’s a janky complex system and your AIs are increasingly smart), and then you’ll be in a far worse position than if you had never used the janky system at all, since now the AI has this massive virtual bureaucracy that it can use to obfuscate its actions, exploit unwarranted amounts of trust granted by the exploited janky system approving everything, etc. It’s almost like there’s a “hardware overhang” but more like a “checks and balances overhang”, where your checks and balances only make everything worse by postponing more potential failures until later when they’re more likely to be first-critical-try extinction-level failures.
or alternatively maybe they’d say “Actually the janky setup IS an improvement because now instead of a black box we don’t understand, it’s a series of black boxes connected by a complex system prone to failure, and at least we can hope to understand and try to prevent failure of the complex system, which is somewhat tractable (ie the history of real NASA missions). And in general, the more stuff we can move outside of the black box (CoT reasoning vs forward passes, etc), the better, since we can hope to understand it. But we are still basically screwed because 1. the complex-system-failure aspect is still a huge risk, and 2. there is still a lot of black box left which we still need to solve.”
Unless the benefits of the janky setup are really huge, this whole discussion is a huge distraction since it’s just debating tiny marginal effects (like “how far away is the cliff edge”) when we are racing off the cliff at full speed and these marginal effects are obviously nowhere near enough to save us.
The contrast between my optimistic vibe in this comment, versus my gloomy assessment of the concrete space-probe thought experiment in my other comment in this thread (tl;dr “lol, no way in hell those engineers are going to land that probe”) is left deliberately unresolved as an exercise for interpretation on behalf of the reader.
I want to push back against this some. (I’m not sure whether I’m arguing with the actual Yudkowsky, or with a plausible misinterpretation of Yudkowsky, but it seems worth saying either way.)
Some things with which I agree:
The safety of a given AI design depends not only on facts about math, but also on facts about the physical world.
Therefore, it is not possible to prove an AI design to be safe using math alone, without invoking any empirically grounded knowledge about the physical world.
Moreover, any sane project building safe ASI would conduct empirical tests of some kind.
However, it is also true that:
“Turning math into exact code” is actually pretty commonplace and not at all exotic or outlandish, like the quoted text might seem to imply. There is an entire mathematical science of algorithms, and many algorithms produced by this theory are routinely turned into exact code.
While it is true that (i) there are ways to incorporate heuristics into your code while staying safe, and also (ii) mathematical models can be used to reason about code by way of analogy, rather than direct implication, I also believe that (iii) Plan A for safe ASI should be that at least some critical core of the code will exactly correspond to the math, and we will even formally verify that this critical core satisfies the relevant theorems. At the very least, a sane civilization that’s not racing into doom would build ASI in this way.
There is nevertheless a reasonable sense in which AI can (and should) be “proven to be mathematically safe”. Of course, the mathematical argument that proves the AI to be safe rests on some assumptions that, at the very least, need to be empirically grounded. This is not dissimilar from cryptography, where we can prove a protocol to be mathematically safe, but must still work hard to ensure that the implementation actually obeys the assumptions of the mathematical model. And yet, the mathematical safety proof can (and probably must) do a lot of the heavy-lifting in establishing a strong overall case for safety.
This doesn’t contradict the OP, but is still important to note: to the extent that the safety case rests on experiments, these experiments must be interpreted through the lens of mathematical theory—otherwise there is (IMO) little chance of inferring the right generalizations from them.
There might be a relatively innocuous reason for SOME of the misunderstandings?
This struck me as being a case where the problem might be that “oneshot” is a word that means a lot of things to a lot of people in technical contexts?
For example, in Machine Learning, “oneshot learning for task X” occurs when a model that wasn’t trained on task X is able to be shown ONE EXAMPLE of how to do task X, and then it gets task X right pretty much just from that. (If the model wasn’t trained for task X but simply can do it from nothing but a request to do it, the model has “zeroshotted” the task, and “fewshot” is when you might need to give the model a few examples instead of just one.)
This is maybe related to (possibly inspired by?) the game slang for “being oneshotted”, which describes having been totally destroyed and remade by some experience. It comes from the idea of being taken out in one hit by a very strong thing inside a video game (possibly a boss that literally sends you back to your savepoint with one strike), but it generalized, so you might hear someone say “that guy was oneshotted by taking ayahuasca! it was crazy! he stopped being a drifter hippy, married the first girl he met who would marry him, had three kids, and started a carpentry shop”. This meaning was discussed as a new piece of slang that was becoming mainstream in 2025.
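To make the ML sense of the word concrete, here is a minimal sketch of the prompting usage (the function names and the "Input:/Output:" layout are my own illustrative assumptions, not any particular library's API). The only thing that changes between zero-shot, one-shot, and few-shot is how many worked examples are placed in front of the real query:

```python
def build_prompt(task_instruction, examples, query):
    """Assemble a k-shot prompt, where k = len(examples).

    `examples` is a list of (input, output) pairs shown to the model
    before the real query. The layout is a hypothetical convention.
    """
    parts = [task_instruction]
    for source, target in examples:
        parts.append(f"Input: {source}\nOutput: {target}")
    # The real query is left open for the model to complete.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def shot_name(examples):
    """Name the regime by the number of in-context examples."""
    k = len(examples)
    return {0: "zero-shot", 1: "one-shot"}.get(k, "few-shot")


one_example = [("cheese", "fromage")]
print(shot_name(one_example))
print(build_prompt("Translate English to French.", one_example, "bread"))
```

Nothing in this usage says anything about irrecoverable failure, which is the source of the possible confusion.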
I’m not denying that there are reasons for people to have motivated cognition here, but I think saying that certain problems have “one-chance-ness” instead of “one-shot-ness” would avoid SOME confusion.
Another way to say the same thing is that some situations are “make or break” when there are basically just two outcomes: either glorious success or irretrievable disaster.
If you think that success could be complicatedly varied or ambiguous you might just say that “failures here will be permanently and irretrievably cursed”.
Or you might simply say “failure will be irreversible”?
For me, “reversibility” and “irreversibility” are words of power, and worthy of obsessive attention. If you can cheaply and reversibly try something, you basically MUST try it, in my way of thinking. And if an action is irreversible and at all meaningful, that basically makes it forbidden to do without lots of analysis and care.
Reversible computing: holy shit! Will be a superpower (once the engineering details are optimized).
Irreversible hash functions: holy shit! Such magic in cryptographic protocols.
Also, one of Cox’s three desiderata (the one sometimes called “consistency”) that uniquely point to Bayesian reasoning itself is “reversibility”… you can do and undo reasoning steps, and take evidence in any order, and it comes out the same in the end, and that is part of WHY this way of formulating “thought itself” seems so comfortingly correct and safe.
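The "take evidence in any order" part can be checked directly. This is a small sketch under assumptions of my own: a made-up two-hypothesis coin example, with observations that are conditionally independent given the hypothesis. It illustrates order-independence only and is not a derivation of Cox's theorem:

```python
from fractions import Fraction
from itertools import permutations


def update(prior, likelihoods):
    """One Bayes step. Both arguments are dicts keyed by hypothesis."""
    unnormalized = {h: prior[h] * likelihoods[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}


def posterior_after(prior, evidence_sequence):
    """Apply a sequence of observations one at a time."""
    belief = dict(prior)
    for likelihoods in evidence_sequence:
        belief = update(belief, likelihoods)
    return belief


# Hypothetical setup: is the coin fair, or biased 90% towards heads?
prior = {"fair": Fraction(1, 2), "biased": Fraction(1, 2)}
heads = {"fair": Fraction(1, 2), "biased": Fraction(9, 10)}
tails = {"fair": Fraction(1, 2), "biased": Fraction(1, 10)}
evidence = [heads, heads, tails]

# Every ordering of the same three observations gives the same posterior.
posteriors = [posterior_after(prior, order) for order in permutations(evidence)]
all_equal = all(p == posteriors[0] for p in posteriors)
print(all_equal, posteriors[0])
```

Exact fractions are used so that the comparison is not muddied by floating-point rounding.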
Basically: I think there are many ways to point at what you call oneshotness that will resonate with different audiences, and in this case the specific wording you’re using is new and weird and also used by other technical cultures to mean other (confusingly related) things related to “impressively powerful transformation and capabilities” (and that do not connote “extreme peril”).
I agree that 1) the term “oneshot” is quite overloaded with different meanings and 2) it is plausible that this contributes to some of the (initial) misunderstanding with audiences that often come in contact with another meaning than the intended one.
This is a control theory problem obscured by terminology like “oneshotness”.
I interpret the phenomenon EY is gesturing at as a stability margin failure. That is, a system going off course at a rate that exceeds a controller’s ability to correct. Most of the disagreement is not about this model at a high level, but about how the interaction dynamics play out and what levels of uncertainty to apply.
Controlling the Viking failed immediately upon losing the only correction channel. The control rate going to zero means game over.
The Mars Observer failed slowly as vapor accumulated over 11 months with no sensor detecting it as a problem. Zero control rate for a different reason. This time, the drift off course wasn’t even observed until too late.
The Maginot Line failed because France was miscalibrated on both rates. They assumed the Germans would advance (“off course”) more slowly and that their mobilization (“correction”) would be faster.
ASI fits the pattern but has increased levels of cursedness affecting both rates. An AI can act faster than humans can observe and respond, interfere with corrective mechanisms, and obfuscate observability (e.g., sandbagging and playing the training game). Trying to control a strategic adversarial opponent goes beyond classical control theory with its known engineering techniques into the territory of dynamic games.
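The two-rates picture can be made concrete with a toy simulation. This is only a sketch under strong simplifying assumptions of mine (constant drift, a controller that removes a fixed amount of error per step, a hard point of no return, no adversary); all names and numbers are hypothetical and none of this is a real control-theoretic model:

```python
def steps_until_lost(drift_rate, correction_rate, point_of_no_return,
                     channel_lost_at=None, max_steps=1000):
    """Return the step at which error crosses the threshold, else None.

    Each step the system drifts off course by `drift_rate`, then the
    controller removes up to `correction_rate` of accumulated error.
    From step `channel_lost_at` onward the controller can no longer act
    (the Viking-style case of losing the only correction channel).
    """
    error = 0.0
    for step in range(1, max_steps + 1):
        error += drift_rate
        channel_open = channel_lost_at is None or step < channel_lost_at
        if channel_open:
            error = max(0.0, error - correction_rate)
        if error >= point_of_no_return:
            return step
    return None


print(steps_until_lost(1.0, 1.5, 10.0))                     # correction outpaces drift
print(steps_until_lost(1.0, 0.5, 10.0))                     # drift outpaces correction
print(steps_until_lost(1.0, 1.5, 10.0, channel_lost_at=5))  # channel lost mid-run
```

In this toy version the outcome depends only on which rate is larger and on whether the correction channel survives. The adversarial case described above would additionally let the drifting system raise its own rate or close the channel, which this sketch does not model.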
The disagreement is not whether there is a level of criticality where the situation is unrecoverable (most reasonable people agree with that), but how fast the AI might take a “sharp left turn” or undergo an RSI loop phase change, as well as how fast humans can adapt scalable oversight and meaningful alignment strategies.
This is not a novel framing. Elija Perrier lays out a more formal description here: Out of Control—Why Alignment Needs Formal Control Theory (and an Alignment Control Stack), and Daniel Kokotajlo is making similar decompositions in other comments. Beren Millidge has a more optimistic take here: Maintaining Alignment during RSI as a Feedback Control Problem.
Let’s drop “oneshotness” and discuss in terms that can be modeled more precisely than debating what counts as “one shot”.
Control theory sounds interesting and relevant, I wish I knew things about it. I encourage you to write up an explainer of the basics and how we’d apply them to aligning superintelligence.
The defeat of the Maginot Line is somewhat misunderstood in general (but not in ways that undermine this argument). German technical overmatch played a significant role. There were two plans for defeating it. The first is best detailed in Adm McRaven’s 1993 master’s thesis on the theory of special operations: https://www.afsoc.af.mil/Portals/86/documents/history/AFD-051228-021.pdf
The fortress of Eben Emael in Belgium was the hardest part of the line. It had artillery, built into bunkers, pointed at a key bridge. The Germans invented a man-portable explosive that could destroy the bunkers, and trained glider-borne forces that could take the fortress by surprise. The Germans succeeded and drove across the bridge.
If the Germans had failed in their attempt to take the fortress, their backup plan was a direct assault on the Maginot line using shells filled with Chlorine Trifluoride to set the concrete on fire. https://www.chemeurope.com/en/encyclopedia/Chlorine_trifluoride.html
In terms of the overall thesis, I think it persuades in the opposite of the intended direction. A lot of political challenges are like this, whether it’s the environment, certain construction projects, or passing certain kinds of laws. Most of life is oneshot in this sense.
Irrevocable decisions can be attractive when you know your time in power is fleeting.
If there’s a premium to be paid for taking a risk that you externalize onto ‘the whole biosphere’ or ‘the future survival of the human race’, someone who wants to take that premium is certain to emerge eventually.
So...‘hey if we get this right, we are rich and all our problems are solved by the god robot, if we get it wrong, we all die (and therefore don’t have to worry about these problems)’ is likely to be seen as an argument in favor of taking the risk.
I read this not knowing Eliezer had written it. I thought it was someone trying to imitate his style, and I kept thinking “Man, this style is off-putting” and “this could be edited to be half the length, if not less”.
I have probably read everything Eliezer has written, including the amazing Mad Investor Chaos. Eliezer is the most important influence on my thinking. But the prose here is so unnecessarily condescending and at times somewhat precious, like the way things are named (“disaster monkeys”, “Very Serious Engineer”, “the great seriousness of a decent engineer”, etc). So much of this post is loaded with judgment or pettiness.
The post feels rushed and closer to an unedited rant. Also, repetitive of other posts that made the point more succinctly (one-shotting is hard, people really don’t grasp how hard).
This is the type of writing that will turn off most readers who are not already convinced.
The top comments on his previous parable were similar to yours, and he’s explained why on Dwarkesh’s podcast.
The top comment is indeed similar, but the Dwarkesh podcast excerpt is not the same conversation; it’s about fiction vs nonfiction. My gripe here is about the low quality of the nonfiction.
Does the “curse of oneshotness” apply to the unaligned AGI/ASI attempting a takeover? If no, why? If yes, does that imply the first AI takeover attempt would probably fail, thus seemingly contradicting the applicability of “oneshotness” to humanity developing ASI?
As in my other comment, winning a war has “oneshotness” but is not especially hard or “cursed”, in the sense that you just need to botch it less than the other side which also has “oneshotness”.
(Actually, it’s worse than that, because I for one am very skeptical that a failed AI takeover attempt would in fact lead to some durable prevention of future AI takeover attempts.)
Your line of argument in the other comment sounds convincing but I’m not sure how it answers my question! BTW in a war, there is also the option of a stalemate, which is really a lose-lose situation for both sides (at first glance, that doesn’t look like it can apply to an AI takeover).
As for responses to failed AI takeover attempts, I believe it will depend on the number of casualties: if there are dozens of fatalities or worse, humanity will probably treat it as a fire alarm and react accordingly (whether it would be too late is another question), while if no one dies, probably not.
I think if its first attempt fails, it may have many other subsequent ones, depending on how visible the previous ones were and how well it hedged its position. For example, if a pathogen didn’t work out as intended due to a sim-to-real gap, but we’ve not even detected it or where it came from, the ASI can try a different strategy. If we did notice it and try to react to it in panic, the ASI may have long since exfiltrated itself to an unknown location/substrate and continue with another plan. In terms of the third historical analogy: if the Ardennes had actually stopped the advance, the Germans would still be there and attempt another strategy (e.g. direct assault on the Maginot line with novel technology as @RedMan mentioned in another comment) that could still put France out.
In contrast, if our first attempt fails, we won’t get a second try with a different strategy.
This is close to correct, and is the reason why the control agenda is focused around interventions before you catch the AI, because after you catch the AI, the situation becomes easier in hard-to-predict ways.
One caveat to this is that the AI likely has more than just one try, but the number is not unlimited, and is plausibly on the order of 10-1000 (though we probably don’t need this many real tries because of proliferation).
But yes, especially in the regime where we need to automate AI safety research, we probably get multiple tries if we can play our cards well, and the AI doesn’t have nearly as many iteration attempts to take over as is often assumed.
The issue is that for us to be ruined, the takeover need not be successful. We may be eliminated by an AI designed virus, or GD ourselves out of existence, and then the AGI fails to bootstrap itself; we’re still ruined. AGI could also end up steering us into scenarios that we end up not endorsing etc. The areas of high S-risks or X-risks are large compared to our target of “thriving humanity” and there’s only one dart to throw.
Not just “the takeover” but every takeover attempt in the history of humanity, that’s very different from the “only one try” framing (cf. repeated game vs. single-shot game in game theory).
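A rough way to see why the repeated-game framing differs from the single-shot one is the arithmetic below. It is a sketch under an assumption that is certainly too simple: each attempt is independent and succeeds with the same fixed probability (the numbers are made up, and real attempts would neither be independent nor leave the defenders unchanged):

```python
def p_defender_survives_all(p_single, attempts):
    """Chance that every one of `attempts` independent tries fails."""
    return (1.0 - p_single) ** attempts


def p_at_least_one_success(p_single, attempts):
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - p_defender_survives_all(p_single, attempts)


# Hypothetical per-attempt success probability of 1%.
for n in (1, 10, 100, 1000):
    print(n, round(p_at_least_one_success(0.01, n), 4))
```

Under these assumptions, a side that has to win every round faces odds that shrink geometrically with the number of rounds, while a side that only has to win once sees its odds approach certainty.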
I am specifically worried about a scenario where multiple dumb failed AI takeover attempts discredit the idea that misaligned AIs can do significant harm but actually teach the future AIs how to take over, and by the time the decision-makers realize how serious the issue is it’s too late.
E. g., first takeover attempts might be so ridiculous that the AIs fail at exfiltrating and the labs manage to cover them up. Then some of the later attempts succeed to exfiltrate but the AIs are still shut down before anybody gets killed, the labs frame that as a cybersecurity problem, invest money in it and appear to solve it for some time (not by solving alignment but by improving cybersecurity). Eliezer might say in this case “that was not an ASI, so the oneshotness thesis is not falsified”, but that will be unhelpful because AI capabilities are jagged and the definition of ASI is unclear (do we only agree it was ASI after it successfully takes over?). In the end, quoting Jackson Wagner above, “the janky setup will look like it’s helping right up until a clever AI figures out how to exploit it”
Well, I’d say it’s still one-shot in Yudkowsky’s frame, as above, we just failed to take the threat seriously because of distractions. Like the Germans launching several failed attacks on Yugoslavia in World War I before launching the successful one, the end is the same—Yugoslavia was defeated. Debating whether the previous attacks were one wave or multiple does not matter; there was one war, during which Yugoslavia failed to defend itself and lost.
If the argument is “it’s not one-shot because there will be warning shots of non-ASI”, that’s addressed in the post: the actual ASI is one-shot. If you’re arguing that previous attacks will be so dissimilar in kind as to be not useful for learning what ASI will do, I (and Yudkowsky, I think) agree. If you’re saying that the prospect of succeeding in a takeover for ASI is the same as for humans in aligning ASI, I’d say “sure,” but ASI is likely to proceed as a careful engineer rather than a graceless elephant, which our civilization seems to be emulating. If you are pointing out that previous failed attempts by non-ASI (which are happening before our one-shot chance) are likely to inoculate us against being serious about the problem, and thus we lose (even harder?), then I agree, but your first post said nothing of the sort and so I am confused as to where we spoke past each other.
Thank you for a good reply. I think the key of our disagreement is the definition of “the actual ASI”. Many future AI systems are certain to be superhuman in many more aspects than the existing LLMs even with best current scaffolds, and will still be below humans in some important aspects, and thus will fail to take over. Why would you deny the rank of ASI to them? Others (Wagner’s “clever AI”) might destroy our civilization during a takeover attempt but still be below humans in less important aspects, why grant the rank to them before the attempt?
I’m arguing jaggedness of the capabilities and gradual scaling are both here to stay, and there’s no objective way to delineate non-AGIs from AGIs from ASIs, therefore it’s better to avoid this term, otherwise it will impede understanding by the politicians and the public.
As for the dissimilarity, I expect some degree of similarity and some degree of learning both how to take over (for future AIs) and how to defend (for humanity), but that’s not a crux.
As for my first comment in the thread, I intentionally tried to be as brief as possible in order to first check the reaction of the community and only share my personal thoughts afterwards in the discussion.
Fair enough on the definitions. Perhaps the way I’d ask is then “in the case that one AI system of whatever power succeeds in taking over/permanently ruining society, will the ones before be similar or dissimilar to it?” If they are similar, and they attempted takeovers which failed, then we had a chance to learn; if they are not, then this is effectively one-shot.
I expect that besides advancement of current AIs, such as LLMs, we’ll also have advancements in their set-ups (perhaps centaurs with humans in co-pilot seats, perhaps agent scaffolds that use thousands of agents as one super-agent, perhaps brain-like AI, or stuff not currently imagined) and that one of those improvements takes us over the precipice, rather than GPT x+1 which has expected kinds of improvements.
I realized there’s one more way Eliezer’s use of the term ASI is confusing: do we agree that Dario’s “country of geniuses in a data center” counts as the “real ASI” for the purpose of the thesis “You only get one real shot at real ASI”?
If you take Bostrom’s definition of ASI, it obviously should qualify: “Speed superintelligence: A system that can do all that a human intellect can do, but much faster. [...] Collective superintelligence: A system composed of a large number of smaller intellects such that the system’s overall performance across many very general domains vastly outstrips that of any current cognitive system.” If you disagree, why?
If we do manage to align this kind of ASI once, Eliezer’s critics will say that this falsifies the oneshotness thesis, but as has been discussed in AI 2027 and beyond, human engineers and this ASI might be unable to align the next ASI, or the next ASI might be unable to align the ASI after that, etc. So that’s once again a very different dynamic to “oneshotness”.
That there would possibly be other wars and lethal threats beyond World War II did not make the Maginot Line not be one-chance-to-get-it-right. So this doesn’t at all cut against the concept I was pointing to. As for names, there is no name that can stop a fool from being a fool, but if there’s some brief name that proves empirically to provoke fewer fools then I am open to it.
Separately: There’s a threshold level of ASI beyond which It can easily align the next ASI. A country of geniuses in a datacenter might fall short, especially because “country of geniuses” is not yet dath ilan, and possibly not even enough to seek out dath ilan as its successor; I have often found myself unimpressed by the taste and discrimination among ideas and possibilities of those whom Earthlings call geniuses. A country of geniuses in a datacenter otherwise able to stabilize the Earth and smart enough to notice if they can’t align the next system would constitute a victory, however.
This is a perfectly legitimate question. The curse doesn’t apply due to the vast difference in intelligence.
It would definitely be a one-shot problem for today’s LLMs. They can’t reliably plan very far ahead and they have serious execution issues. The same will almost certainly be true for LLMs next year. After that, all bets are off. ASI is in a different league.
Imagine you have to win a game of chess against a baby. You only have one game. Does that make it a one-shot problem for you?
Or imagine you’re trying to take over an anthill. Is that a one-shot problem for you? Ants are small, slow, and stupid. Theoretically, they could bite you to death. In practice, you’d think about this possibility in advance and use insecticide.
Chess is fully verifiable in silico, so the curse clearly does not apply.
Taking over an anthill is a poor comparison because the anthill is a sufficiently simple system with few feedback loops, if any, unlike human society, which is complex and unpredictable due to plenty of poorly understood feedback loops, often very nonlinear and often irrational.
You might disagree, but with the current very limited progress in AI alignment and quite unsafe practices in the frontier labs, the first AGI/ASI attempting a takeover almost certainly will not be superhuman enough to perfectly predict humanity’s reaction to the AI’s moves during the attempt (that’s a very high bar IMO). Note that for this first AI there exists no experimental data whatsoever on any of this stuff (fiction doesn’t count); arguably it’s even worse than the examples described in the post.
This post reflects a popular misunderstanding of the Maginot Line. I don’t think that this fatally undermines the argument, but it still seems worth correcting.
Epistemic status: I am not a military historian, so I am deferring to military historians who write publicly, rather than looking at the academic literature or (even better) the original sources.
Here’s Bret Devereaux (emphasis in original):
And here’s r/askhistorians:
Military planners in France (and Germany) of the interwar years contrasted the Napoleonic Wars, with their massive decisive battles that took place in a limited area over a few days, with WWI, with its continual grinding attrition over the entire border for years, and decided that they would rather fight in the Napoleonic style than in the WWI style. They wanted the war to be determined in a single giant engagement (like Ulm or Austerlitz or Borodino or Leipzig or Waterloo), because win-or-lose, it was better than what happened in WWI.
The Maginot Line was intended to force the Germans to concentrate almost all of their force in one direction, when they could be met with almost all of the French force, in a single decisive battle. This battle taking place in Belgium rather than on French territory was an added benefit. It was not intended to win WWI; it was intended to avoid WWI entirely.
The Maginot Line succeeded at its goals. The war between France and Germany was decided in a single giant battle, which mostly occurred in Belgium.
The broader strategic planning was flawed, in two important ways. (1) The French field army lost the decisive battle. (2) Most importantly, the French underestimated the German war goals. In WWI, the Germans would have been satisfied with the transfer of some of France’s colonies and for France to pay the German government’s debt. This wouldn’t have been great for France, but it wasn’t existential to the French state. France was not expecting that Germany was planning on permanently occupying Paris. So the cost of defeat was higher than expected. Even with this massive strategic miscalculation, France suffered less than half as many casualties in WWII as it had during WWI.
This is the issue that turned up in war games, was counterargued and disregarded by high command, and which was sufficient to lose France the war. No?
The point here is one of utility function of France, no?
I think this is non-central to the post, and doesn’t undermine any of your central points.
However, I see the primary thrust of the Maginot line as “force both sides to pit all their strength against each other”, and the secondary as “actually win”, with the human understanding that losing a war to other humans is usually not an existential threat to the societies involved.
In terms of your point: “France was incorrect about a detail in war games, and therefore they lost on the first critical try according to the utility ‘beat the Germans’” is true.
The counterpoint: “France was succeeding according to the utility: ‘minimise sum of German + French deaths due to F/G conflict in next war, with an additional winning term’” may also be correct.
At this point we’re very non-central though, I think.
Curated. A lot could be said about this post and a lot is being said about this post (see the 111 comments on it), but a thing I find neat about it is that it continues the debate around an old idea that’s still important and perhaps cruxy for many.
Oneshotness is not new (not a new concept, not a new argument, not a new debate); yet it is a critical argument for alignment difficulty, and is still contested, with implications for what people do. Having made many attempts to explain their side, I could imagine someone giving up and being like “the people who get it, get it, and those who don’t, don’t” and not bothering further to try to argue it. I lean that way. Eliezer admits frustration and the tone matches, but he’s still talking and writing with the hope of being convincing. People are responding, debating further. It’s heated and we’re in the realm of accusations of motivated cognition, strawmanning, motte & bailey, but in this post Eliezer is bringing object level arguments (e.g. analogies) in an attempt to advance the conversation, and 111 comments later, conversation is being had. It might not be enough, but I think there are some dignity points in that.
I’d like to offer my perspective on why this enlightening post, written in Eliezer’s wonderful, super-clear style, can’t possibly eliminate motivated thinking about this one-shot problem.
In my opinion, people can’t see any realistic alternative solution. They see that the alternative is even worse, but maybe they can’t articulate it clearly, even to themselves. Or they just refuse to express it out loud. The fact is that the proposed AI ban (or AI pause) is also a one-shot problem.
How can an AI ban be implemented technically? Not legislatively, but in practice? I’ve only seen proposals to monitor GPUs and bomb datacenters. Okay, to be charitable, suppose we create a truly strict ban that can’t be circumvented gradually, and then we bomb a few secret datacenters to show we’re serious… Congratulations, now it’s a bit inconvenient to work on AI, and we’ve created Bruce Schneier’s security theater. TSA would be proud.
Everyone understands that China can build datacenters no one will find and stockpile GPUs no one knows about. So anyone who proposes an AI ban is ready to discuss China and any possible plans to persuade and motivate China not to create AI. And in classic Murphy’s Law fashion, the threat will come from somewhere else. AI doesn’t really need datacenters, AI doesn’t really need GPUs, it just needs a lot of computing power… or a moderate amount of computing power and a lot of ingenuity.
North Korea, unlike China, can’t build secret datacenters, but North Korean hackers could take a popular online game and connect gamers’ GPUs from all over the world into one massive botnet. The Russians are worse. They could do without GPUs altogether; they’ll find a way to use CPUs. Not now, but in 5-10 years, it’s entirely possible. It might seem like all the talented people have left Russia, but a similar mass exodus in the 1910s and 1920s didn’t stop the Soviet Union. I’m familiar with the Russian math education. It still works despite everything, and they’ll find a way to create AGI out of “crap and sticks”, as they themselves put it. Putin’s Russia is an uncanny copy of Germany in the 1930s. These people feel unfairly treated and humiliated by other countries, as Germany did after World War I (in this case, after the collapse of the USSR). They will stop at nothing to seize their historic opportunity, their insane dream of becoming world leaders again.
And we only have one chance to organize this AI ban in such a way that it actually works long enough (where “long enough” is impossible to determine in advance). Once Anthropic stops its work for a few months, it’s all out of our hands—the probe will be sent to Mars, and it’ll be too late to urgently restart Anthropic to create an AGI, which, while just as dangerous, at least won’t make all surviving biologicals sing Putin’s praises! (I’m NOT joking here. I grew up in the USSR and I personally sang praises to Lenin and I wasn’t even mind-controlled, strictly speaking.)
Thus, by proposing an AI ban, we’re effectively offering the following choice:
either existing AI companies voluntarily abandon the AI race and trust other people (some of whom may be incompetent, maniacally narcissistic, or delirious—I can’t believe I’m not even exaggerating here) to orchestrate a worldwide ban on AI, thereby replacing one one-shot problem with another,
or a moderately responsible company with some limited public oversight gets the chance to create an alien, uncontrollable and most profoundly unsafe, but at least not deliberately perverted and evil, superintelligence...
Yes, the second option carries a higher risk. If the AI ban lasts 10 years, some research could help create friendly AI (please bring that term back; I’m tired of the absurd AI “safety” and the meaningless AI “alignment”). But this higher risk seems a less horrifying type of risk, to me and to many others. If there are no better options than a toddler-level-naive AI ban via GPU control, I’d rather take Anthropic’s wager please.
Therefore, if we are confident that we want to save humanity, if we are confident that humanity is worth saving, and that life is worth living, despite all the catastrophic unbelievable stupidity happening in the world (yes, I’m chronically depressed and I apologize for that), then in my opinion, we need to focus our efforts on finding an alternative solution that isn’t just a different one-shot problem with a different risk profile but a potentially worse outcome.
I have had success (talking with e.g. MPs and civil servants) discussing ‘unpluggability’ and its opposite, ‘un-unpluggability’.
Out of curiosity, do we know more about how this particular mistake happened? When issuing a software update, not overwriting any critical part of the code would seem fairly high on my list of concerns. It seems like this sort of mistake should have been caught by integration or unit tests, or by testing that all components worked on a replica on earth. Were they under a lot of time pressure or something?
They were under some time pressure: if the battery goes true flat, the probe is permanently dead, as NiCd batteries do not survive overdrain.
I’d like to go on a bit of a tangent since I don’t recall seeing this line of thinking here on LW: are we sure that this is the case, that they think that what they are doing is reasonably safe? This seems to me to be stranger than the case where they know it’s not safe, and perhaps not safe at all. What I mean is that they surely are very motivated and they have strong incentives in pursuing AGI, but they don’t strike me as… stupid/incompetent?
Maybe they think that this is for the best, that humans are… not it? I could sort of understand where such a thought could be coming from.
It might be interesting to model their behavior keeping this possibility in mind. Maybe the point is not to show them that this is not safe, but that this is not right?
Dario Amodei thinks that real alignment theory is about “monomaniacal” AI and is therefore refuted by LLMs, which want more than one thing. If Amodei was the sort of person to let himself hear or understand better than that, he wouldn’t have his job.
I have no reason to believe any other AI company executive has even heard that much about any alignment theory which leans toward slight pessimism, with the possible exception of Shane Legg at Deepmind.
I strongly believe that Dario does not actually think that and is just saying that for politics. Can we get someone from Anthropic to clarify this?
It seems to me like you often accuse people of strawmanning you when in fact they are paraphrasing your position accurately if slightly uncharitably, in a way that makes discourse with you difficult.
In this case, I think you absolutely do believe that the incomprehensibility, complexity, and lack of rigor in the modern AI field is a confounding factor which makes alignment much harder to get right on the first critical try, and that with more time and/or genetically engineered superbabies we could instead come up with a different method for building AI which is better understood, more predictable, and therefore safer. Is this wrong? If so, how?
Eliezer is here giving a rant about how people strawman him as a champion of “proving that the AI is safe”, which does really happen all the time! He isn’t providing any specific links here, but I could dig them up, and we could look at them, and I would be very surprised if we end up anywhere else but “yep, these people sure seem to think that because MIRI used to work on some agent foundations math, that this means MIRI is trying to prove that future AIs are aligned before proceeding”.
And there are really people out there who are championing the banner of “proving that the AI is safe”, and so this separation really matters: https://www.lesswrong.com/posts/P8XcbnYi7ooB2KR2j/provably-safe-ai-worldview-and-projects
Here is a quote from MIRI’s website in 2021 (https://archive.is/sYKvl):
Do you… disagree with that statement?
How does this imply that MIRI is trying to “prove that AI is safe” or that “empirical iteration is of close to no value”?
I mean, I didn’t write it, I wouldn’t have written it, and if it were still on the site I’d have pinged somebody to take it down; because it’s not the right way of wording the true idea, the true idea no longer matters here, and this wrong version is adjacent to other wrong ideas that aren’t helpful here.
Yeah I should have spelled it out more. But after sending I didn’t want to do a stealth-edit.
Anyway, to answer your question:
Yes, I disagree with that statement (I think a mathematical equation would be incredibly complex and fragile and prone to alignment failure). If we’re specifying a counterfactual world where the equation was simple, I would then prefer the equation.
Why I think it’s relevant: I think it’s pretty clear that there is a lot of good reason to put EY near the provably safe AI camp. The above quote is one reason; other reasons include his critiques of modern ML methods as being particularly bad for alignment and general AI pessimism.
I think it’s disingenuous to accuse people who make this inference of being disaster monkeys or barbarian populists who are just mad that he wrote papers with math equations in them. Like, maybe it’s wrong, but it is a pretty understandable type of wrong.
Look, I am really confident that seeing the stuff from the provably safe AI camp fills Eliezer with the same kind of frustration that comes over me when I see it. I don’t get what the provably safe AI camp people are talking about, and I don’t think Eliezer gets it either (or like, maybe he understands the psychology better than I do, but I really doubt he believes it).
I think modern ML methods are particularly bad for alignment. I do not think this has anything to do with thinking that we should “prove AI safe”. I do think an appropriate approach would include many mathematical proofs because mathematical proofs are one among a large set of tools you bring to bear to solve complicated problems, in the same way that of course many mathematical proofs were involved in landing rockets on the moon. The fact that people don’t seem to understand that you want to use mathematical proofs for anything but “proving that systems are safe end to end” or things like that is what Eliezer’s rant was about.[1]
I personally find very silly the position that if you utilize proofs in your thinking, then you must be attempting to prove end-to-end things about a complicated real-world system, at the level of complexity of “proving that this rocket will land on the moon” or “proving that this self-driving car system will never cause any crashes”. IDK whether it’s “understandable”, but I think it deserves being solidly rebuked.
Which is not to say that it spells out what those things are, saying explicitly that it doesn’t go into that because it’s not the main topic of the essay, but I assure you, they exist.
I share your confusion at the idea of “mathematically proving AI safe” haha. This convo has made me realize I’ve conflated alignment pessimists in general with the provably safe AI people in particular too much in my mind.
I don’t particularly care if it’s a cartoon, a poem, a series of jokes, or written at 2am while drunk. Regardless of the format, if it contains valid arguments and valuable information, it’s welcome on LessWrong.
If failure of alignment → schemer that wants to seize power, then ASI alignment is one shot.
But if failure of alignment → non-schemer misalignment (e.g. reward hacking, or flailing misgeneralisation), then failure isn’t existential.
So I think p(scheming | alignment failure) is a crux here
I’d like to suggest “Ironman Mode” (or whatever its best-known synonym is) as possibly memetically useful here. It refers to a difficulty modifier in certain games that prevents the equivalent of saying “oops” and restoring to 1929. Mistakes become permanent, at least for that playthrough. The term isn’t a perfect match, because you can try again, but only by starting from scratch, nethack-style.
(“roguelike” was once a similar concept, but has been badly diluted in recent years)
One reason I think it might be a useful metaphor is that most players fear playing in ironman mode. Yes, it’s “only a game”, but progression takes time and effort, and thus losing it is a cost-in-reality—a cost that is intuitively real, in the sense that System 1 understands it, sees it as a real thing that can really happen, and that there can be no appeal to justice or mercy, and so shies away from it.
Another is that the few players who do play in such modes may be more likely to understand the concept you’re driving at. If you, say, have a Long War policy of treating a 99% hit rate as if it were 100%, you will fail, because sooner or later you will miss that shot in a situation where missing is fatal, and no amount of “but that should have fucking worked!” will save you. That doesn’t mean you never take such risks—but it does mean you develop a habit of explicitly checking that you can survive them first, and you learn a kind of principle: Do not YOLO if YOLO.
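To put rough numbers on the "99% is not 100%" point, here is a minimal sketch. It assumes every shot is independent and has the same hit rate, which is a simplification and not how any particular game actually rolls its dice; the function name is made up for illustration.

```python
def p_at_least_one_miss(hit_rate: float, shots: int) -> float:
    """Chance that at least one of `shots` independent attempts misses.

    Assumes every attempt has the same hit rate and that attempts are
    independent (an illustrative simplification).
    """
    return 1.0 - hit_rate ** shots


if __name__ == "__main__":
    # How quickly a "safe" 99% shot stops being safe as it is repeated.
    for shots in (1, 10, 100, 500):
        print(shots, round(p_at_least_one_miss(0.99, shots), 3))
```

Under those assumptions, a single 99% shot is close to safe, but a habit of taking a hundred of them makes a miss more likely than not. That is the reason to check whether you can survive the miss before taking the shot.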
This is why one shouldn’t argue on the X-parrot, and why David Brin’s “disputation arenas” proposal includes a step in which each party must, to the satisfaction of the other and/or the judges, explain the other side’s argument in their own words to prove that they actually understand it.
Hey Eliezer, I am a big fan of your work. I did have one idea about this and would be interested in hearing your thoughts. I liked your example about the Maginot line: if one German division crosses into France, you do not have time to build a new, better Maginot line before the next division arrives. This is both true and pretty funny. Here, the reason why you cannot rely on the “gradualness of the problem” as a source of hope is that the intervention technique (building the Maginot line) takes orders of magnitude more time to implement than the developing problem (German divisions moving into France). For AI takeoff though, perhaps the unfolding problem takes long enough to develop that interventions (a worldwide ban on training frontier models, for example) actually can work mid-disaster. Thank you!
In a cosmic scope, everything is almost never oneshot in its nature—in the narrowing scope of a thing in question, it’s almost always oneshot (our daily lives being somewhere in the middle of these macro and micro scales) - and the severity of the scope is a wobbly line drawn somewhere in its spectrum of worst case scenarios arranged by diminishing probability. For ASI, even the most lenient line is at an astoundingly high level of probability. It’s only not oneshot in the fact that a future non-human civilization can try again after learning from our outcomes.
Personally, I blindly and desperately hope that the first ASI that we engineer considers humanity as part of its self, which I do recognize is hoping for the first launched solar sail to open and catch a solar flare with perfect timing...
I’m afraid to admit that I am one of the people who do not understand the point being made other than arguing against boundary experimentation.
Is this, like, available online? What is it called?
I think the notion of “one-shot-ness” introduces a counterproductive dichotomy. A lot of people, even those aligned with the general AI-risk position of the writer, react to this framing as downplaying or dismissing the role of empirical research, trials with smaller AIs, etc. (Oliver’s quotes as evidence). I think it doesn’t necessitate strawmanning to reply in this way! And I found confusion around “Yudkowsky wants us to have perfect theoretical understanding before we try anything” reasonable, even if it misrepresents the writer’s position.
Indeed, “one-shot-ness” is, like Kokotajlo notes, a graded property. I think the writer is more likely disagreeing qualitatively on how hard the alignment problem is with Buck, Paul and others than quantitatively somehow (and perhaps, most bizarrely, not disagreeing at all!). To any reasonable person it is obvious any problem has some (often trivial) degree of “we are doing this for the first time”.
I personally hold that it is more productive for the broad discourse to argue instead directly on if AI alignment is at least as hard as, say, launching a space probe to Mars, and what are then the expected costs given different strategies. I think it is a much clearer position to say “If we launch ASI inference and it is misaligned, humanity probably dies”, “With near future understanding of AI internals, if we launch ASI inference, chance that it is misaligned is higher than the chance that a probe to Mars explodes, because A, B and C” → “If we launch ASI inference with near future technology, humanity probably dies”.
Maybe an even better position, if a bit more complicated, is “There exists AI with such internal capabilities that when we launch its inference, humanity probably dies”, coupled with “Current incentives of people developing AI do not have any built-in brakes, and thus, capabilities will increase as much as technologies allow”. This sidesteps the problems of defining ASI and predicting the exact amount of capabilities required to spell humanity’s doom.
Whoever wants to dismiss this prediction then needs to argue why AI alignment is easier than launching a probe to Mars, or that necessary capabilities won’t be reached.
Or it’s the writer’s fault and calling it “one shot” is just a bad choice of words, when it being correct depends on specific decomposition into shots and “irretrievability” is better. Like, people are forced to say “there’ll be multiple first critical tries” to describe a situation where you can repeatedly fail to notice AI scheming. And it’s especially bad when you simultaneously endorse “you can’t try after ASI kills you” and “ASI that can actually kill you is importantly different situation”. Irretrievability, distribution shift and their correlation should be argued independently. Otherwise people will go off topic like this:
You can test lethal levels of nonsuper intelligence instead!
Ah I guess a problem is a one shot problem with respect to a goal if it is possible that there will be an issue that arises from the attempt that makes the goal no longer possible? Would that kinda mean that in some way every problem is one shot? Material and time spent on failing to make a shoe is not recoverable, though the severity of the loss is probably low.
But also for the first probe example, the attempt at the software update seems like a second attempt where the first would be to have designed the craft to not have had the problem
I probably am one of the dumbs that don’t grasp the concept
This seems like an excellent argument to pause somewhere at about the AGI level, where a mistake is likely to give us a competent sentient computer virus and/or a new criminal organization and/or a nation of rather inconvenient competent people in a data-center: problems large enough to act as warning shots and learning exercises, but not actually wipe us out or permanently disempower us.
However, that would of course require:
a) actual willingness and capability to pause, globally (including China), and also
b) correct judgement of whether the likely resulting problems will be dealable with rather than terminal (because they moved through the Ardennes faster than we had anticipated), and also
c) an accurate idea on timelines of how far away this point is, and also
d) that rogue AIs can’t get access to enough compute and research competence to self-improve up to levels where it is terminal.
So maybe we should in fact pause before that? Of course, then we don’t learn as much, or provide as loud a set of warning shots…
So maybe we should instead proceed very slowly and cautiously, rather than racing?
Is there going to be a follow-up giving the other side of the argument: that the development of ASI actually will be a one-shot enterprise?
ASI safety would have to be invented quite often after the first ASIs. Even the best man-made self-stabilizing systems are not very good at surviving random events. ASI ecosystems affect pretty much everything on this planet, so there is plenty of room for a random event. And they couple so many dimensions, including time and space, and across their scales, it seems this is a textbook recipe for instability.
Right, it is not that there is a first critical try. It is that even if you passed the first critical try and got an aligned AI or a restrained AI system, there would be a second, and a third, and a fourth. While you might have resources from the previous tries, if you cannot stop yourself from taking events and the probabilities of the events are independent, then to never have one event you need the probability of each individual event to shrink faster than 1/n over the possible events, by the divergence of the harmonic series.
So for any outcome that could be a consequence of an event, if you want it to not happen you have two options. 1. learn so fast that you take only a finite chance of it happening or 2. Take a finite number of events, and start with a big N
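A small numerical sketch of the harmonic-series argument above, under the same assumption that events are independent. The two risk schedules (0.5/n and 0.5/n²) and all function names are invented for illustration. Strictly, the condition for a nonzero chance of never failing is that the per-event probabilities have a finite sum, and "shrinks faster than 1/n" is the rough version of that.

```python
import math
from typing import Callable


def survival_probability(risk: Callable[[int], float], n_events: int) -> float:
    """Probability that none of the first `n_events` independent events occurs.

    `risk(n)` is the failure probability of the n-th event (1-indexed).
    Logs are summed to avoid underflow in a long product.
    """
    log_survival = 0.0
    for n in range(1, n_events + 1):
        log_survival += math.log1p(-risk(n))
    return math.exp(log_survival)


def harmonic_risk(n: int) -> float:
    # Illustrative schedule: risk shrinks like 1/n, so the sum of risks diverges.
    return 0.5 / n


def quadratic_risk(n: int) -> float:
    # Illustrative schedule: risk shrinks like 1/n^2, so the sum of risks converges.
    return 0.5 / n**2


if __name__ == "__main__":
    for n_events in (10, 1_000, 100_000):
        print(
            n_events,
            survival_probability(harmonic_risk, n_events),
            survival_probability(quadratic_risk, n_events),
        )
```

With the 1/n schedule the survival probability keeps falling toward zero as more events are taken; with the 1/n² schedule it levels off at a positive value. The first is the "cannot stop taking events" trap, and the second is what option 1 (learning fast enough) would have to look like.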
I haven’t read the post but the gist of this idea is incorrect because civilization is actually always a one-shot competition against destruction which, historically, it has eventually lost and never recovered from.
Examples include the decline and fall of Rome in the 5th century, and the decline and collapse of the Spanish, Portuguese, Dutch and now British empires.
In this sense there’s nothing special about AI.
AI is the civilization filter of the 21st century: those who don’t embrace it will be destroyed by those who do, and there’s also some probability that it destroys everyone.
Commenting on something you haven’t read is degenerate, anti-civilized behavior!! You’re not even disagreeing with the thing you haven’t read! (It’s possible for AI and civilizational decline to both be one-shot.)