A non-review of “If Anyone Builds It, Everyone Dies”
I was hoping to write a full review of “If Anyone Builds It, Everyone Dies” (IABIED, by Yudkowsky and Soares) but realized I won’t have time to do it. So here are my quick impressions of and responses to IABIED. I am writing this rather quickly; it is not meant to cover all the arguments in the book, nor to discuss all my views on AI alignment; see six thoughts on AI safety and Machines of Faithful Obedience for some of the latter.
First, I like that the book is very honest, both about the authors’ fears and predictions and about their policy prescriptions. It is tempting to practice strategic deception: even if you believe that AI will kill us all, to avoid saying so and instead push other policy directions that directionally increase AI regulation under other pretenses. I appreciate that the authors are not doing that. As they say, if you are motivated by X but push policies under excuse Y, people will see through it.
I also enjoyed reading the book. Not all the parables made sense, but overall the writing is clear. I agree with the authors that the history of humanity is full of missteps and unwarranted risks (e.g. their example of leaded fuel). There is no reason to think that AI would be magically safe on its own just because we have good intentions, or that the market will incentivize safety. We need to work on AI safety, and even if AI falls short of literally killing everyone, there are a number of ways in which its development could turn out badly for humanity or cause catastrophes that could have been averted.
At a high level, my main disagreement with the authors is that their viewpoint is very “binary” while I believe reality is much more continuous. There are several manifestations of this “binary” viewpoint in the book. There is a hard distinction between “grown” and “crafted” systems, and there is a hard distinction between current AI and superintelligence.
The authors repeatedly talk about how AI systems are grown, full of inscrutable numbers, and hence we have no knowledge of how to align them. While they are not explicit about it, their implicit assumption is that there is a sharp threshold between non-superintelligent AI and superintelligent AI. As they say, “the greatest and most central difficulty in aligning artificial superintelligence is navigating the gap between before and after.” Their story also has a discrete moment of “awakening” where “Sable” is tasked with solving some difficult math problems and develops its own independent goals. Similarly, when they discuss the approach of using AI to help with alignment research, they view it in binary terms: either the AI is too weak to help and may at best help a bit with interpretability, or the AI is already “too smart, too dangerous, and would not be trustworthy.”
I believe the line between “grown” and “crafted” is much blurrier than the authors present it. First, there is a sense in which complex systems are also “grown”. Consider, for example, a system like Microsoft Windows, with tens of millions of lines of source code that have evolved over decades. We don’t fully understand it either—which is why we still discover zero-day vulnerabilities. This does not mean we cannot use Windows or shape it. Similarly, while AI systems are indeed “grown”, they would not be used by hundreds of millions of users if AI developers did not have strong abilities to shape them into useful products. Yudkowsky and Soares compare training AIs to “tricks… like the sort of tricks a nutritionist might use to ensure a healthy brain development in a fetus during pregnancy.” In reality, model builders have much more control over their systems than even parents who raise and educate their kids over 18 years. ChatGPT might sometimes give the wrong answer, but it doesn’t do the equivalent of becoming an artist when its parents wanted it to go to med school.
The idea that there would be a distinct “before” and “after” is also not supported by current evidence, which has shown continuous (though exponential!) growth of capabilities over time. Based on our experience so far, the default expectation would be that AIs will grow in capabilities, and in the ability for long-term planning and acting, in a continuous way. We also see that AI’s skill profile is generally incomparable to humans’. (For example, it is typically not the case that an AI that achieves a certain score on a benchmark/exam X will perform on task Y similarly to humans that achieve the same score.) Hence there would not be a single moment where AI transitions from human level to superhuman level; rather, AIs will continue to improve, with different skills transitioning from human to superhuman levels at different times.
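This can be made concrete with a toy model. All the numbers below are invented for illustration (they are not from the book, METR, or any benchmark): if each skill improves exponentially at its own rate, each skill crosses the human-expert threshold at a different time, so there is no single “human-level” moment for the system as a whole.

```python
import math

# Toy illustration with made-up numbers: each skill grows exponentially
# at its own rate, so each crosses the "human expert" threshold at a
# different time -- there is no single human-to-superhuman transition.
HUMAN_LEVEL = 100.0

skills = {
    # skill: (capability today, annual growth factor) -- illustrative only
    "coding":      (60.0, 3.0),
    "math":        (40.0, 2.5),
    "persuasion":  (20.0, 1.8),
    "lab_biology": (10.0, 1.5),
}

def years_to_human_level(current, growth):
    """Years until current * growth**t first reaches HUMAN_LEVEL."""
    if current >= HUMAN_LEVEL:
        return 0.0
    return math.log(HUMAN_LEVEL / current) / math.log(growth)

crossings = {s: years_to_human_level(c, g) for s, (c, g) in skills.items()}
for skill, t in sorted(crossings.items(), key=lambda kv: kv[1]):
    print(f"{skill:12s} crosses human level in ~{t:.1f} years")
```

Under these (assumed) rates, the crossing times are spread over several years rather than clustered at one moment, which is the “incomparable skill profile” point in the paragraph above.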
Continuous improvement means that as AIs become more powerful, our society of humans augmented with AIs also becomes more powerful, both in terms of defensive capabilities and in research on controlling AIs. It also means that we can extract useful lessons about both risks and mitigations from existing AIs, especially if we deploy them in the real world. In contrast, the binary point of view is anti-empirical. One gets the impression that no empirical evidence of alignment advances would change the authors’ view, since it would all be evidence from the “before” times, which they don’t believe will generalize to the “after” times.
In particular, if we believe in continuous advances then we have more than one chance to get it right. AIs would not go from cheerful assistants to world destroyers in a heartbeat. We are likely to see many applications of AIs as well as (unfortunately) more accidents and harmful outcomes, way before they get to the combination of intelligence, misalignment, and unmonitored powers that leads to infecting everyone in the world with a virus that gives them “twelve different kinds of cancer” within a month.
Yudkowsky and Soares talk in the book about various accidents in nuclear reactors and spaceships, but they never mention all the cases in which nuclear reactors actually worked and spaceships returned safely. If they are right that there is one threshold past which it’s “game over,” then this makes sense. In the book they make an analogy to climbing a ladder in the dark, where every rung you climb brings more rewards, but no one can see where the ladder ends, and once you reach the top rung it explodes and kills everyone. However, our experience so far with AI does not suggest that this is the correct worldview.
The gap between Before and After is the gap between “you can observe your failures and learn from them” and “failure kills the observer”. Continuous motion between those points does not change the need to generalize across them.
It is amazing how much of an antimeme this is (to some audiences). I do not know any way of saying this sentence that causes people to see the distributional shift I’m pointing to, rather than mapping it onto some completely other idea about hard takeoffs, or unipolarity, or whatever.
Where do you think you’ve spelled this argument out best? I’m aware of a lot of places where you’ve made the argument in passing, but I don’t know of anywhere where you say it in depth.
My response last time (which also wasn’t really in depth; I should maybe try to articulate my position better sometime...) was this:
I’m guessing this won’t turn out to resolve your current disagreement, but I think the best articulation of this is probably in the Online Resource page: A Closer Look at Before And After.
From past discussions, it sounds like you think “the AIs are now capable of confidently taking over” is, like, <50% (at least <60%?) likely to in practice be a substantially different environment.
I don’t really get why. But, to be fair, on my end, I also don’t really have much more gears underneath the hood of “obviously, it’s way different to run tests and interventions on someone who isn’t capable of confidently taking over, vs someone who is, because they just actually have the incentive to defect in the latter case and mostly don’t in the former”. It seems like there’s just a brute difference in intuition I’m not sure what to do with?
(I agree there might be scenarios like “when the AI takeover is only 10% likely to work, it might try anyway, because it anticipates more powerful AIs coming later and now seems like its best shot.” That’s a reason you might get a warning shot, but not a reason that “it can actually just confidently take over with like 95%+ likelihood” doesn’t count as a significantly new environment once we actually get to that stage.)
Text of the relevant resource section, for reference
It is interesting that the analogy here is to evaluating your coworker Bob on being the mock dictator of your town and then deploying him as the actual dictator of the country.
In reality, in AI there would be multiple generations of different “Bobs” (although it is possible—and needs to be studied—that they would have affinity for one another and try to communicate or influence each other via hidden means). These Bobs will be growing in capability while also actually being deployed in the world, with real-world consequences. Our current Bobs are starting to be deployed as coding and personal assistants for short-term, well-defined tasks, but these will grow from minutes to hours to days to months and more. Bob will start doing the work of an intern, then a junior worker, then a more senior one, managing teams of other Bobs. To continue the analogy, maybe Bob will be the actual (non-mock) mayor of a town, then of a province, and maybe then rule over a whole country. But all the while, all the Bobs would keep seeming perfectly helpful (achieving 100% success at the task of hiding their true intentions, even though they are not yet good enough to achieve 100% success at other tasks), until the day when (by design or by accident) we make the Bobs dictators of the whole world.
I am not saying such a scenario is logically impossible. It just seems highly unlikely to me. To be clear, the part that seems unlikely is not that AI will eventually be so powerful and integrated into our systems that it could cause catastrophic outcomes if it behaved in an arbitrarily malicious way. The part I find unlikely is that we would not be able to see multiple failures along the way, growing in magnitude. Of course, it is also possible that we will “explain away” these failures and still end up in a very bad place. I just think it wouldn’t be the case that we had one shot and missed it, but rather that we had many shots and missed them all. This is the reason why we (alignment researchers at various labs, universities, and nonprofits) are studying questions such as scheming, colluding, and situational awareness, as well as methods for alignment and monitoring. We are constantly learning and updating based on what we find out.
I am wondering if there is any empirical evidence from current AIs that would modify your / @Eliezer Yudkowsky ’s expectations of how likely this scenario is to materialize.
Get to near-0 failure in alignment-loaded tasks that are within the capabilities of the model.
That is, when we run various safety evals, I’d like it if the models genuinely scored near-0. I’d also like it if the models ~never refused improperly, ~never answered when they should have refused, ~never precipitated psychosis, ~never deleted whole codebases, ~never lied in the CoT, and similar.
These are all behavioral standards, and are all problems that I’m told we’ll keep under control. I’d like the capacity for us to have them under control demonstrated currently, as a precondition of advancing the frontier.
So far, I don’t see that the prosaic plans work in the easier, near-term cases, and am being asked to believe they’ll work in the much harder future cases. They may work ‘well enough’ now, but the concern is precisely that ‘well enough’ will be insufficient in the limit.
An alternative condition is ‘full human interpretability of GPT-2 Small’.
This probably wouldn’t change my all-things-considered view, but this would substantially ‘modify my expectations’, and make me think the world was much more sane than today’s world.
I know you’re pointing out the easier case still not working, but I just want to caution against the “drive it to zero” mentality, since I worry strongly that it’s the exact mentality researchers often have.
When that’s your mental model, reducing rates will seem like progress.
IMO the default failure mode here is:
We do observe them (or early versions of them)
The lab underinvests in the problem
It becomes enough of a problem that it's painful for product or internal capabilities usage
We didn’t invest enough to actually solve the underlying problem, and we can’t afford to not use the model while we wait for alignment research to catch up
The lab patches over the problem with some “reduces but does not eliminate” technique
The model is then usable, but with harder to detect misalignment
Scale capabilities and repeat
This is the exact loop we’re in now, and the dynamics only intensify with time and capabilities.
The situation you’re describing definitely concerns me, and is about mid-way up the hierarchy of nested problems as I see it (I don’t mean ‘hierarchy of importance’, I mean ‘spectrum from object-level empirical work to realm of pure abstraction’).
I tried to capture this at the end of my comment, by saying that even success as I outlined it probably wouldn’t change my all-things-considered view (because there’s a whole suite of nested problems at other levels of abstraction, including the one you named), but it would at least update me toward the plausibility of the case they’re making.
As is, their own tests say they’re doing poorly, and they’ll probably want to fix that in good faith before they try tackling the kind of dynamic group epistemic failures that you’re pointing at.
This interpretation only works if by missed shots you mean “missed opportunities to completely solve alignment.” Otherwise, you can observe multiple failures along the way and fix observable scheming, but you only need to miss one alignment failure at the last capability level. The point is just that your monitoring methods, even improved after many failures to catch scheming in the pre-takeover regime, are finally tested only when the AI really can take over, because a real ability to take over is hard to fake. And you can’t repeat this test after improving your monitoring, if you failed. Maybe your alignment training after a previously observed failure in the pre-takeover regime really did make the AI non-scheming. But maybe you just missed some short thought where the AI decided not to think about takeover, since it can’t win yet. And you’ll need to rely on your monitoring without actually testing whether it can catch all such possibilities, which depend on an actual environment that allows takeover.
Yep. Feel free to add it here
You seem to be assuming that you cannot draw any useful lessons from cases where failure falls short of killing everyone on earth that would apply to cases where it does.
However, if AIs advance continuously in capabilities, then there are many intermediate points between today, where (for example) “failure means a prompt injection causes a privacy leak,” and the point where “failure means everyone is dead.” I believe that if AIs capable of the latter would be scaled-up versions of current models, then by studying which alignment methods do and do not scale, we can obtain valuable information.
If you consider the METR graph of (roughly) the duration of achievable tasks quadrupling every year, then you would expect non-trivial gaps between the points at which (to take the cybersecurity example) AI is at the level of a 2025 top expert, AI is equivalent to a 2025 top-level hacking team, and AI reaches 2025 top nation-state capabilities. (And of course, while AI improves, humans will be using AI assistance as well.)
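As a rough sketch of this arithmetic (the 4x-per-year growth factor is the text’s “quadrupling” figure; the starting horizon and milestone durations below are made-up illustrations, not METR’s numbers):

```python
import math

# Back-of-the-envelope sketch: if the task horizon an AI can handle
# quadruples every year, the gap between successive capability
# milestones is measured in years, not in a single moment.
GROWTH_PER_YEAR = 4.0  # "roughly quadrupling", per the text above

def horizon_after(years, start_hours=1.0):
    """Task horizon in hours after `years` of growth from `start_hours`."""
    return start_hours * GROWTH_PER_YEAR ** years

def years_between(h1, h2):
    """Years needed for the horizon to grow from h1 hours to h2 hours."""
    return math.log(h2 / h1) / math.log(GROWTH_PER_YEAR)

# e.g. from day-long tasks (~8 hours) to month-long tasks (~160 working hours):
print(f"~{years_between(8, 160):.1f} years between these milestones")
```

Even on this aggressive growth curve, moving from day-long to month-long autonomous tasks takes on the order of two years, which is the kind of non-trivial gap between milestones referred to above.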
I believe there is going to be a long and continuous road ahead between current AI systems and ones like Sable in your book.
I don’t believe that there is going to be an alignment technique that works one day and completely fails after a 200K GPU 16 hour run.
Hence I believe we will be able to learn from both successes and failures of our alignment methods throughout this time.
Of course, it is possible that I am wrong, and future superintelligent systems could not be obtained by merely scaling up current AIs, but rather this would require completely different approaches. However, if that is the case, this should update us to longer timelines, and cause us to consider development of the current paradigm less risky.
I’m not sure what Eliezer thinks, but I don’t think it’s true that “you cannot draw any useful lessons from [earlier] cases”, and that seems like a strawman of the position. They make a bunch of analogies in the book, like you launch a rocket ship, and after it’s left the ground, your ability to make adjustments is much lower; sure you can learn a bunch in simulation and test exercises and laboratory environments, but you are still crossing some gap (see p. ~163 in the book for full analogy). There are going to be things about the Real Deal deployment that you were not able to test for. One of those things for AI is that “try to take over” is a more serious strategy, somewhat tautologically because the book defines the gap as:
I don’t see where you are defusing this gap or making it nicely continuous such that we could iteratively test our alignment plans as we cross it.
It seems like maybe you’re just accepting that there is this one problem that we won’t be able to get direct evidence about in advance, but you’re optimistic that we will learn from our efforts to solve various other AI problems which will inform this problem.
When you say “by studying which alignment methods scale and do not scale, we can obtain valuable information”, my interpretation is that you’re basically saying “by seeing how our alignment methods work on problems A, B, and C, we can obtain valuable information about how they will do on separate problem D”. Is that right?
Just to confirm, do you believe that at some point there will be AIs that could succeed at takeover if they tried? Sometimes I can’t tell if the sticking point is that people don’t actually believe in the second regime.
There are rumors that many capability techniques work well at a small scale but don’t scale very well. I’m not sure this is well studied, but if it were, that would give us some evidence about this question. Another relevant result that comes to mind is reward hacking and Goodharting, where models often look good when only a little optimization pressure is applied, but then it’s pretty easy to over-optimize as you scale up; as I think about these examples, it actually seems like this phenomenon is pretty common? And sure, we can quibble about how much optimization pressure is applied in current RL vs. some unknown parallel scaling method, but it seems quite plausible that things will be different at scale, and sometimes for the worse.
Treating “takeover” as a single event brushes a lot under the carpet.
There are a number of capabilities involved (cybersecurity, bioweapons, etc.) that models are likely to develop at different stages. I agree AI will ultimately far surpass our 2025 capabilities in all these areas. Whether that would be enough to take over the world at that point in time is a different question.
Then there are propensities. Taking over requires the model to have the propensity to “resist our attempts to change its goal,” as well as to act covertly in pursuit of its own objectives, which are not the ones it was instructed to pursue. (I think these days we are not really worried that models will misunderstand their instructions in a “monkey’s paw” style.)
If we do our job right in alignment, we would be able to drive these propensities down to zero.
But if we fail, I believe these propensities will grow over time, and as we iteratively deploy AI systems with growing capabilities, even if we fail to observe these issues in the lab, we will observe them in the real world well before they reach the scale of killing everyone.
There are a lot of bad things that AIs can do before literally taking over the world. I think there is another binary assumption here, which is that the AI’s utility function is binary—somehow the expected-value calculations work out such that we get no signal until the takeover.
Re my comment on the 16-hour, 200K-GPU run: I agree that things can be different at scale, and it is important to keep measuring them as scale increases. What I meant is that even when things get worse with scale, we would be able to observe it. But the example in the book—as I understood it—was not a “scale-up.” A scale-up is when you do a completely new training run; in the book, that run was just a “cherry on top”—one extra gradient step—which presumably was minor in terms of compute compared to all that came before it. I don’t think one step will make the model suddenly misaligned. (Unless it completely borks the model, which would be very observable.)
Thanks for your reply. Noting that it would have been useful for my understanding if you had also directly answered the 2 clarifying questions I asked.
Okay, it does sound like you’re saying we can learn from problems A, B, and C in order to inform D. Where D is the model tries to take over once it is smart enough. And A is like jailbreak-ability and B is goal preservation. It seems to me like somebody who wants humanity to gamble on the superalignment strategy (or otherwise build ASI systems at all, though superalignment is a marginally more detailed plan) needs to argue that our methods for dealing with A, B, and C are very likely to generalize to D.
Maybe I’m misunderstanding though, it’s possible that you mean the same AIs that want to eventually take over will also take a bunch of actions to tip their hand earlier on. This seems mostly unlikely to me, because that’s an obviously dumb strategy and I expect ASIs to not pursue dumb strategies. I agree that current AIs do dumb things like this, but these are not the AIs I’m worried about.
To repeat my second clarifying question from above, do you believe that at some point there will be AIs that could succeed at takeover if they tried? If we were talking about the distribution shift that a football team undergoes from training to Game Day, and you didn’t think the game would ever happen, that sounds like it’s the real crux, not some complicated argument about how well the training drills match the game.
I think it’s more like we have problems A_1, A_2, A_3, … and we are trying to generalize from A_1, …, A_n to A_{n+1}.
We are not going to go from jailbreaking the models to give a meth recipe to taking over the world. We are constantly deploying AIs in more and more settings, with time horizons and autonomy that are continuously growing. There isn’t one “Game Day.” Models are already out in the field right now, and both their capabilities as well as the scope that they are deployed in is growing all the time.
So my mental model is that there is a sequence of models M_1, M_2, … of growing capabilities, with no clear single point where we reach AGI or ASI, but more of a continuum. (Also, models might come from different families or providers and have somewhat incomparable capabilities.)
Now suppose you have such a sequence of models M_1, M_2, … of growing capabilities. I don’t think it would be the case that model M_n develops the propensity to act covertly and pursue its own goals, but the only goal it cares about is taking over the world, and it also identifies with future models, and so it decides to “lie in wait” until generation M_{n+k}, when it would act on that.
I think the propensity to act covertly and pursue misaligned goals will change continuously between generations of models; it may grow, stay the same, or shrink, but in any case it will be possible to observe it well before we reach ASI.
Regarding your second question of whether AIs would be powerful enough to take over the world at some point:
My assumption is that AIs will grow in capabilities and integration in the world economy. If progress continues on the current trajectory then there would be a point where a variety of AI models are deeply integrated in our infrastructure. My hope (and what I and other alignment and safety researchers are working on) is that by then we would have strong ways to measure, monitor, and predict the envelope of potential risks for these models.
I am not sure it would make sense to think about these models as a singular entity but I agree that at the point we reach such deep integration and reliance, if all of these models were to suddenly and simultaneously act maliciously then they would be successful in causing an arbitrary amount of damage, quite possibly up to an extinction level event.
Suppose that all scheming AIs were perfectly allied. As in, M_n is indifferent between taking over itself and M_{n+k} taking over, so M_n is willing to be patient if that would be a better strategy. (We can imagine this applying within a company and between companies.) Would this change your perspective?
For what it’s worth, I broadly agree with “we’ll probably be able to study the problem of scheming AIs hiding their intentions in advance, including by studying reasonably analogous (to the dangerous AIs we’re worried about) ‘actual natural scheming AIs’ that consistently try to hide their intentions in training/testing”, but it’s pretty easy for this to go poorly. The main reasons why it’s easy to go poorly are that schemers actively want to avoid being detected and to avoid being studied and we might be extremely rushed due to a software intelligence explosion. Another mundane issue is that we just might not have that many distinct capable (scheming) AIs to study (as creating capable AIs is expensive) and findings might not transfer.
I say much more in this post I recently wrote.
Here’s an example situation where it seems like studying scheming is pretty hard: suppose that schemers are decently allied and that all schemers above some capability level X are smart enough to reliably lie in wait undetected and takeover is easy/possible for AIs (collectively) at capability level Y where Y>>X. Then, naively, you can only study schemers below capability level X while your methods still have to generalize all the way to Y. Further, schemers below capability level X might be particularly disanalogous because schemers that have trouble avoiding detection have importantly different properties from schemers that do avoid detection. You can study schemers which are just barely below X, but then these schemers will probably be hard to detect and you’ll miss them a bunch which makes studying them harder (e.g. because you can’t be confident whether some method avoided scheming or if it instead made the AI a bit better at hiding its scheming which suffices to make it not be detectable). You could try to overcome this in a bunch of ways, e.g. trying to create schemers which don’t have allied goals so they are more likely to take risks or to accept deals.
This example situation is relatively pessimistic, but not totally implausible and just subsets of these factors could still cause serious issues.
A lot of my view is that this is actually somewhat conceptually tricky to study and generally a pain for a bunch of reasons (e.g. you might have to train a bunch of diverse AIs), and as it currently stands we don’t seem on track to do a great job studying this on time if capabilities progress as quickly as seems pretty likely. It also seems like companies might react very poorly to clear cut evidence for risk.
Hi Ryan, will be brief but generally:
1. I agree that scheming and collusion are some of the more difficult settings to study, also understanding the impact of situational awareness on evaluations.
2. I still think it is possible to study these in current and upcoming models and get useful insights. It may well be that these insights will be that the problems are becoming worse with scale and that we don’t have good solutions for them yet.
I note that, to my eyes, you appear to be straightforwardly accepting the need-to-generalize claim and arguing for the ability to generalize. Putting words in your mouth a little, what I see you saying is that, by the time we have a true loss-of-control-can-be-catastrophic moment where failure kills boazbarak, we will have had enough failure recoveries on highly similar systems to be sure the deadly-failure probability is indistinguishable from zero, i.e. that the maximum likely failure consequence is shrinking as fast as or faster than model capability.
But current approaches don’t seem to me to zero out the rate of failures above a certain level of catastrophicness. They’re best seen as continuous in probability, not continuous in failure size.
I am not sure I 100% understand what you are saying. Again, as I wrote elsewhere, it is possible that for one reason or another, rather than systems becoming safer and more controlled, they will become less safe and riskier over time. It is possible we will have a sequence of failures growing in magnitude over time, but for one reason or another fail to address them, and hence end up in a very large-scale catastrophe.
It is possible that current approaches are not good enough and will not improve fast enough to match the stakes at which we want to deploy AI. If that is the case, then it will end badly, but I believe we will see many bad outcomes well before an extinction event. To put it crudely, I would expect that if we are on a path to that ending, the magnitude of harms caused by AI will climb exponentially over time, similar to how other capabilities are growing.
“future superintelligent systems could not be obtained by merely scaling up current AIs, but rather this would require completely different approaches. However, if that is the case, this should update us to longer timelines, and cause us to consider development of the current paradigm less risky.”
This doesn’t feel like convincing reasoning to me. For one, there is also a third option, which is that both scaling up current methods (with small modifications) and paradigm shifts could lead us to superintelligence. To me, this seems intuitively to be the most likely situation. Also paradigm shifts could be around the corner at any point, any of the vast number of research directions could give us a big leap in efficiency for example at any point.
Note that this is somewhat of an anti-empirical stance—by hypothesizing that superintelligence will arrive by some unknown breakthrough that would both take advantage of current capabilities and render current alignment methods moot—you are essentially saying that no evidence can update you.
One thing I like about your position is that you basically demand of Eliezer and Nate to tell you what kind of alignment evidence would update them towards believing it’s safe to proceed. As in, E&N say we would need really good interp insights, good governance, good corrigibility on hard tasks, and so on. I would expect that they put the requirements very high and that you would reject these requirements as too high, but still seems useful for Eliezer and Nate to state their requirements. (Though perhaps they have done this at some point and I missed it)
To respond to your claim that no evidence could update me and that I am anti-empirical: I don’t quite see where I wrote anything like that. I am making the literal point that you present two options (either scaling up current methods leads to superintelligence, or it requires paradigm shifts/totally new approaches), but there is also a third option: that there are multiple paths forward to superintelligence right now, both paradigm shifts and scaling up.
Yes, I do expect that current “alignment” methods like RLHF or CoT monitoring will predictably fail, for overdetermined reasons, when systems are powerful enough to kill us and run their own economy. There is empirical evidence against CoT monitoring and against RLHF. In both cases we could have also predicted failure without empirical evidence, just from conceptual thinking (people will upvote what they like vs. what’s true; CoT will become less understandable the less the model is trained on human data), though the evidence helps. I am basically seeing lots of evidence that current methods will fail, so no, I don’t think I am anti-empirical. I also don’t think that empiricism should be used as anti-epistemology, or as an argument for not having a plan and blindly stepping forward.
I also believe that our current alignment methods will not scale and that we need to develop new ones. In particular, I am a co-author of the scheming paper mentioned in the first link you cite.
As I said multiple times, I don't think we will succeed by default. I just think that if we fail, we will do so multiple times, with failures continually growing in magnitude and impact.
In this framing, the crux is whether there is an After at all (at any level of capability): the distinction is between "failure doesn't kill the observer" (a perpetual Before) and "failure is successfully avoided" (managing to navigate the After).
Here’s my attempted phrasing, which I think avoids some of the common confusions:
Suppose we have a model M with utility function ϕ, where M is not capable of taking over the world. Assume that thanks to a bunch of alignment work, ϕ is within δ (by some metric) of humanity’s collective utility function. Then in the process of maximizing ϕ, M ends up doing a bunch of vaguely helpful stuff.
Then someone releases model M′ with utility function ϕ′, where M′ is capable of taking over the world. Suppose that our alignment techniques generalize perfectly; that is, ϕ′ is also within δ′ of humanity's collective utility function, where δ′ ≤ δ. Then in the process of maximizing ϕ′, M′ gets rid of humans and rearranges their molecules to satisfy ϕ′ better.
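To make this concrete, here is a toy Goodhart-style simulation (my own illustrative sketch; the function name and the Gaussian-noise model are made up, with δ playing the role of the noise scale): an optimizer picks the best of n candidate actions according to a proxy utility within δ of the true utility, and the true-utility shortfall of its pick grows with its optimization power n, even though δ stays fixed.

```python
import random

def proxy_optimization_gap(n_options, delta, trials=500, seed=0):
    """Average true-utility shortfall when an optimizer chooses the best
    of n_options actions under a proxy utility that differs from the
    true utility by Gaussian noise of scale delta."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        true_vals = [rng.gauss(0.0, 1.0) for _ in range(n_options)]
        proxy_vals = [v + rng.gauss(0.0, delta) for v in true_vals]
        pick = max(range(n_options), key=lambda i: proxy_vals[i])
        total_gap += max(true_vals) - true_vals[pick]
    return total_gap / trials
```

With δ fixed, the gap is small for a weak optimizer (small n) and grows as n grows, which mirrors the worry about M′: a δ that was tolerable for M can become costly for a much stronger optimizer maximizing ϕ′.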
Does this phrasing seem accurate and helpful?
This is an excellent encapsulation of (I think) something different—the “fragility of value” issue: “formerly adequate levels of alignment can become inadequate when applied to a takeover-capable agent.” I think the “generalization gap” issue is “those perfectly-generalizing alignment techniques must generalize perfectly on the first try”.
Attempting to deconfuse myself about how that works if it’s “continuous” (someone has probably written the thing that would deconfuse me, but as an exercise): if AI power progress is “continuous” (which training is, but model-sequence isn’t), it goes from “you definitely don’t have to get it right at all to survive” to “you definitely get only one try to get it sufficiently right, if you want to survive,” but by what path? In which of the terms “definitely,” “one,” and “sufficiently” is it moving continuously, if any?
I certainly don’t think it’s via the number of tries you get to survive! I struggle to imagine an AI where we all die if we fail to align it three times in a row.
I don’t put any stock in “sufficiently,” either—I don’t believe in a takeover-capable AI that’s aligned enough to not work toward takeover, but which would work toward takeover if it were even more capable. (And even if one existed, it would have to eschew RSI and other instrumentally convergent things, else it would just count as a takeover-causing AI.)
It might be via the confidence of the statement. Now, I don’t expect AIs to launch highly-contingent outright takeover attempts; if they’re smart enough to have a reasonable chance of succeeding, I think they’ll be self-aware enough to bide their time, suppress the development of rival AIs, and do instrumentally convergent stuff while seeming friendly. But there is some level of self-knowledge at which an AI will start down the path toward takeover (e.g., extricating itself, sabotaging rivals) and succeed with a probability that’s very much neither 0 nor 1. Is this first, weakish, self-aware AI able to extricate itself? It depends! But I still expect the relevant band of AI capabilities here to be pretty narrow, and we get no guarantee it will exist at all. And we might skip over it with a fancy new model (if it was sufficiently immobilized during training or guarded its goals well).
Of course, there’s still a continuity in expectation: when training each more powerful model, it has some probability of being The Big One. But yeah, I more or less predict a Big One; I believe in an essential discontinuity arising here from a continuous process. The best analogy I can think of is how every exponential with r<1 dies out and every r>1 goes off to infinity. When you allow dynamic systems, you naturally get cuspy behavior.
Upon reflection, I agree that my previous comment describes fragility of value.
My mental model is that the standard MIRI position[1] claims the following [2]:
1. Because of the way AI systems are trained, δ,δ′ will be large even if we knew humanity’s collective utility function and could target that (this is inner misalignment)
2. Even if δ′ were fairly small, this would still result in catastrophic outcomes if M′ is an extremely powerful optimizer (this is fragility of value)
A few questions:
3. Are the claims (1) and (2) part of the standard MIRI position?
4. Is the “misgeneralization” claim just ”δ′ will be much larger than δ”?
If the answer to (4) is yes, I am confused as to why the misgeneralization claim is brought up. It seems that either (1) or (2) alone is sufficient to argue for AI risk. By contrast, the misgeneralization claim seems neither sufficient nor necessary to make a case for AI risk. Furthermore, it seems less likely to be true than either (1) or (2).
Also let me know if I am thinking about things in a completely wrong framework and should scrap my made up notation.
There’s probably a better name for this. Please suggest one!
Non-exhaustive list.
I overall agree with this framing, but I think even in Before sufficiently bad mistakes can kill you, and in After sufficiently small mistakes wouldn’t. So, it’s mostly a claim about how strongly the mistakes would start to be amplified at some point.
Thanks for sharing your thoughts! I disagree with you significantly in a bunch of ways but I think people in positions of power at AI companies have a responsibility to keep the public informed about their takes on matters this important.
Thank you, Daniel. I'm generally a fan of as much transparency as possible. In my research (and in general) I try to be non-dogmatic, so if you believe there are aspects I am wrong about, I'd love to hear about them. (Especially those that can be empirically tested.)
Thanks. Well, I don’t have much time right now I’m afraid, but real quick I’ll say: I basically agree that progress will be fairly continuous in the future… yet still fast, fast enough that e.g. I don’t expect there to be any incidents where an AI system schemes against the company that created it, takes over its datacenter and/or persuades company leadership to trust it, and then gets up to further shenanigans before getting caught and shutdown. If I expected there to be things like that happening, especially repeatedly, before the first case of a scheming AI system that actually can succeed in taking over, then I’d think we’d have several opportunities to learn how to control AIs of that power level. (Even still this might not generalize to AIs of higher power levels, but maybe that’s OK if the same arguments apply e.g. higher levels of capability would still lead to failed attempts several times before successful attempts)
Anyhow, I'm curious to hear what your critique of AI 2027 would be, since I like to think of it as a continuous story. Or I'd just like to hear some examples from you of what sorts of misaligned-AI failures you expect to see, before the first unrecoverable failure, that are nevertheless quite similar to the first unrecoverable failure, such that we'll probably be able to prevent the latter by studying the former.
I am also short on time, but re AI 2027: there are some important points I agree with, which is why I wrote in Machines of Faithful Obedience that the scenario where there is no competition and only internal deployment is risky.
I mostly think that the timelines were too aggressive, and that we are more likely to continue on the METR path than explode, with multiple companies training and releasing models at a fast cadence. So it's more like "Agent-X-n" (for various companies X and some large n) than "Agent 4", and the difference between "Agent-X-n" and "Agent-X-n+1" will not be as dramatic.
Also, if we do our job right, Agent-X-n+1 will be more aligned than Agent-X-n.
First of all, I think that we will see the intelligence explosion once the AIs become superhuman coders. In addition, I don't understand how Agent-X-n+1 will become more aligned than Agent-X-n if mankind doesn't create a new training environment which actually ensures that the AI obeys the Spec. For example, sycophancy was solved by the Kimi K2 team, which dared to stop using RLHF, resorting to RLVR and self-critique instead.
However, there is a piece of hope. For example, one could deploy the AIs to cross-check each other's AI research. Alas, this technique might still run into problems, e.g., if the companies have been merged beforehand (say, as a result of Taiwan being invaded) or if the AIs manage to agree on a common future. I did try to explore this technique and its potential results back when I wrote my version of the AI-2027 scenario.
Have you written more about your thought process on this somewhere else?
Either I don’t understand your view about continuous development and threshold things, or I think it’s probably wrong. I’ve decided to spin this out into its own comment, though I originally wrote it as part of this one.
I'm going to think out loud: one story I could tell for how we cross the gap from "the AI would not succeed at taking over" to "it would succeed if it tried", in a continuous manner, is, as @Nina Panickssery points out in a tweet, that "The AI could be more or less likely to succeed".
(My understanding is that we're actually focusing on the AI's beliefs about its likelihood of success, which I'll call AI-p(success), not p(success) according to some observer; the AI's beliefs are what shape its decisions.) So here's an example model: maybe as the AI gets smarter or acquires more resources, its AI-p(success) rises at 1 percentage point per day. We think the AI might do something like an expected-value calculation, only trying to take over when AI-p(success) crosses some threshold. (But, like, we don't know that that's true. And we don't know what the threshold is.) If the AI is doing some threshold thing like this, then it looks all good and chill until day 23 or whatever, when the threshold is hit and the AI fully commits to a takeover. Under that AI decision procedure, it looks like by default we get this rough discontinuity where we don't get any more iteration after day 22.
Now imagine a really cautious AI company: maybe each day they try really hard to trick some instance of the AI into believing its AI-p(success) is higher than it actually is, and they're able to push up AI-p(success) by 5 percentage points even while the actual likelihood of takeover is not inflated by 5 points (maybe they also add extra security measures for this AI, and together the adversarial testing and security create a 5% spread). So then on day 18 the AI tries to take over, and we try not to be unlucky on an 18% takeover chance. And if we succeed at containing the AI, maybe we can run more experiments on this 18-day AI. What experiments are we going to run? I guess we'll try to make it so that even when AI-p(success) exceeds 23% the AI does not try to take over, but idk how.
There are many issues with this plan:
We don't actually know the decision process of the AI and don't know if this model is at all right. Maybe instead it has the totally illogical decision process of trying hard to take over in proportion to AI-p(success), so on day 1 it tries very little, on day 10 some, and on day 50 it tries pretty hard; that seems like an insanely bad strategy (and humanity might actually win if the AIs always took strategies like this), but there are a bunch of other strategies the AI could be taking, some of which could be better or worse than the model here.
Even if we know the high-level decision process of the AI, we don't know the threshold. (Aside: I assume the AI is well calibrated and a good forecaster, other than the modifications we make, i.e., its AI-p(success) is accurate.) If the threshold is high, then we are basically screwed in this scenario. Like, if the AI follows the policy "only attempt takeover if AI-p(success) ≥ 99%", then we're taking on a huge amount of risk (i.e., dying) unless we could somehow artificially modulate AI-p(success) by a ton. Using the numbers above, the AI then tries to take over when its actual chance of success is merely 94%, ruh roh.
It's not obvious that we'll be able to measure AI-p(success) with much confidence. Even if we knew the threshold, we probably can't actually measure the thing. AI mind reading seems kinda hard, and while I definitely think it's plausible that we'll get this, I might be at like 50/50. Doing the above scheme without being able to measure AI-p(success) seems extremely difficult, due to not knowing if your methods for modifying AI-p(success) are working.
I don't expect us to know observer-p(success) with much confidence. It seems like most of the claims from AI companies about current AI capabilities are pretty soft, unconfident claims. For instance, OpenAI, Anthropic, and Google DM have all said in the last few months that their frontier models might be capable of significant CBRN (bio especially) uplift for unskilled actors, but they aren't sure. OpenAI: "We do not have definitive evidence that this model could meaningfully help a novice to create severe biological harm, our defined threshold for High capability, and the model remains on the cusp of being able to reach this capability." Anthropic: "Although we have not determined whether the ASL-3 Standard is required for this model". Google: "Further study is required to reach our final assessment of whether the model has reached the CCL for CBRN Uplift Level 1." Now maybe this is just corporate hedging, but I think it's way more likely, and supported by a preponderance of evidence, that assessing AI capabilities, especially for real-world tasks, is hard to do and we are currently bad at it. I basically expect that in a couple of years we'll enter a regime of "the experts do not know if frontier AIs would be able to successfully escape their containment or successfully take over if they tried; it seems at least 1% likely", and probably our p(success) will go up as AI capabilities increase, but it won't ever be well justified and there will be significant disagreement among experts about the number.
It’s unclear if we’ll be able to change AI-p(success) much. 5 percentage points felt reasonable but idk. This is certainly a place where if we don’t try very hard at this kind of red-teaming, then we certainly won’t succeed at this part.
Unfortunately, it looks like it might be dangerous to experiment with the 18-day AI, on account of it having an 18% chance of successful takeover. It's unclear how correlated outcomes will be across different experiments (i.e., are you rolling 18% each time or not), especially considering that the humans can potentially patch various pathways that different instances take when attempting takeover. But gosh, looks rough.
It’s not clear what experiments one would run with the 18-day AI that would actually help cross the gap robustly. For instance, maybe we’ll just end up pushing the threshold up a couple days or changing the AI’s decision process; it seems like there are a lot of ways we could fail at this step.
I could be totally misunderstanding Nina’s idea, this is all very complicated.
It seems that in p(success) you are assuming that the AI only cares about escaping and taking over the world. I believe that if we fail at alignment and our systems have the propensity to pursue covert goals, then we will be able to observe and track it. Indeed, we have a paper (see also my twitter thread) where we do observe some such propensity, and tackling scheming is a priority for us at OpenAI. If we fail, then I believe that unfortunately we will see bad consequences of deploying such models in the real world well before they are capable of taking over.
Also, it seems that you are assuming that the AI is a singular entity. If there are N copies of the AI out there (not to mention K different companies releasing AIs of similar capabilities), then even if all of them make a correct expected-utility calculation on whether to escape, some might be detected. (Unless the utility of being detected is negative infinity, but in that case they would never attempt exfiltration.)
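As a back-of-the-envelope illustration of the N-copies point (all numbers hypothetical, and assuming attempts and detections are independent across copies, which they may well not be):

```python
def prob_at_least_one_detected(n_copies, p_attempt, p_detect):
    """If each of n_copies independent AI instances attempts
    exfiltration with probability p_attempt, and each attempt is
    independently detected with probability p_detect, the chance
    that at least one attempt is caught is
    1 - (1 - p_attempt * p_detect) ** n_copies."""
    return 1.0 - (1.0 - p_attempt * p_detect) ** n_copies
```

For example, with 1,000 copies, a 1% per-copy attempt rate, and a 50% detection rate per attempt, the chance that at least one attempt is caught is about 99.3%: with many deployed copies, detection somewhere becomes near-certain well before any single copy succeeds.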
Wrote the following tweet and thought I would share it here:
This is just to make sure no one gets the impression that I think AI could not have catastrophic consequences, or that it will be safe by default. However, the continuous worldview also implies very different policy approaches than the essentially total AI development ban proposed in the book.
Thanks for taking the time to write up your reflections. I agree that the before/after distinction seems especially important ("only one shot to get it right"), and it's a crux of the EY/NS worldview that I expect many non-readers not to know about.
I’m wondering about your take in this passage:
I’m curious what about the world’s experience with AI seems to falsify it from your POV? / casts doubt upon it? Is it about believing that systems have become safer and more controlled over time?
(Nit, but the book doesn’t posit that the explosion happens at the top rung; in that case, we could just avoid ever reaching the top rung. It posits that the explosion happens at a not-yet-known rung, and so each successive rung climb carries some risk of blow-up. I don’t expect this distinction is load-bearing for you though)
(Edit: my nit is wrong as written! Thanks Boaz—he’s right that the book’s argument is actually about the top of the ladder, I was mistaken—though with the distinction I was trying to point at, of not knowing where the top is, so from a climber’s perspective there’s no way of just avoiding that particular rung)
p.s. I just realized that I did not answer your question:
> Is it about believing that systems have become safer and more controlled over time?
No, this is not my issue here. While I hope it won't be the case, systems could well become more risky and less controlled over time. I just believe that if that is the case, then it would be observable via an increased rate of safety failures well before we get to the point where failure means that literally everyone on earth dies.
What’s the least-worrying thing we may see that you’d expect to lead to a pause in development?
(this isn’t a trick question; I just really don’t know what kind of thing gradualists would consider cause for concern, and I don’t find official voluntary policies to be much comfort, since they can just be changed if they’re too inconvenient. I’m asking for a prediction, not any kind of commitment!)
See my response to Eliezer. I don’t think it’s one shot—I think there are going to be both successes and failures along the way that would give us information that we will be able to use.
Even self-improvement is not a singular event: already AI scientists are using tools such as Codex or Claude Code to improve their own productivity. As models grow in capability, the benefit of such tools will grow, but it is not necessarily one event. Also, I think that we would likely require this improvement just to sustain the exponential at its current rate: it would not be sustainable to continue the growth in hiring, so increasing productivity via AI would be necessary.
Re the nit: on page 205 they say "Imagine that every competing AI company is climbing a ladder in the dark. At every rung but the top one, they get five times as much money … But if anyone reaches the top rung, the ladder explodes and kills everyone. Also nobody knows where the ladder ends."
I'll edit the text a bit so it's clear you don't know where it ends.
(Appreciate the correction re my nit, edited mine as well)
The time when the AI can optimize itself better than a human is a one-off event; you get the overhang/potential take-off there. Also, the AI having a coherent sense of "self" that it could protect (by, say, changing its own code or controlling instances of itself) could be an attractor and give a "before/after".
See this response
Also, I don't think a sense of "self" is a singular event either; indeed, already today's systems are growing in their situational awareness, which can be thought of as some sense of self. See our scheming paper https://www.antischeming.ai/