The gap between Before and After is the gap between “you can observe your failures and learn from them” and “failure kills the observer”. Continuous motion between those points does not change the need to generalize across them.
It is amazing how much of an antimeme this is (to some audiences). I do not know any way of saying this sentence that causes people to see the distributional shift I’m pointing to, rather than mapping it onto some completely other idea about hard takeoffs, or unipolarity, or whatever.
Where do you think you’ve spelled this argument out best? I’m aware of a lot of places where you’ve made the argument in passing, but I don’t know of anywhere where you say it in depth.
My response last time (which also wasn’t really in depth; I should maybe try to articulate my position better sometime...) was this:
I agree that the regime where mistakes don’t kill you isn’t the same as the regime where mistakes do kill you. But it might be similar in the relevant respects. As a trivial example, if you build a machine in America it usually works when you bring it to Australia. I think that arguments at the level of abstraction you’ve given here don’t establish that this is one of the cases where the risk of the generalization failing is high rather than low.
I’m guessing this won’t turn out to resolve your current disagreement, but I think the best articulation of this is probably in the Online Resource page: A Closer Look at Before And After.
From past discussions, it sounds like you think “the AIs are now capable of confidently taking over” is, like, <50% (at least <60%?) likely to in practice be a substantially different environment.
I don’t really get why. But, to be fair, on my end, I also don’t really have much more gears underneath the hood of “obviously, it’s way different to run tests and interventions on someone who isn’t capable of confidently taking over, vs someone who is, because they just actually have the incentive to defect in the latter case and mostly don’t in the former”. It seems like there’s just a brute difference in intuition I’m not sure what to do with?
(I agree there might be scenarios like “when the AI takeover is only 10% likely to work, it might try anyway, because it anticipates more powerful AIs coming later and now seems like its best shot.” That’s a reason you might get a warning shot, but not a reason that “it can actually just confidently take over with like 95%+ likelihood” doesn’t count as a significantly new environment once we actually get to that stage.)
Text of the relevant resource section, for reference
As mentioned in the chapter, the fundamental difficulty researchers face in AI is this:
You need to align an AI Before it is powerful enough and capable enough to kill you (or, separately, to resist being aligned). That alignment must then carry over to different conditions, the conditions After a superintelligence or set of superintelligences* could kill you if they preferred to.
In other words: If you’re building a superintelligence, you need to align it without ever being able to thoroughly test your alignment techniques in the real conditions that matter, regardless of how “empirical” your work feels when working with systems that are not powerful enough to kill you.
This is not a standard that AI researchers, or engineers in almost any field, are used to.
We often hear complaints that we are asking for something unscientific, unmoored from empirical observation. In reply, we might suggest talking to the designers of the space probes we talked about in Chapter 10.
Nature is unfair, and sometimes it gives us a case where the environment that counts is not the environment in which we can test. Still, occasionally, engineers rise to the occasion and get it right on the first try, when armed with a solid understanding of what they’re doing — robust tools, strong predictive theories — something very clearly lacking in the field of AI.
The whole problem is that the AI you can safely test, without any failed tests ever killing you, is operating under a different regime than the AI (or the AI ecosystem) that needs to have already been tested, because if it’s misaligned, then everyone dies. The former AI, or system of AIs, does not correctly perceive itself as having a realistic option of killing everyone if it wants to. The latter AI, or system of AIs, does see that option.†
Suppose that you were considering making your co-worker Bob the dictator of your country. You could try making him the mock dictator of your town first, to see if he abuses his power. But this, unfortunately, isn’t a very good test. “Order the army to intimidate the parliament and ‘oversee’ the next election” is a very different option from “abuse my mock power while being observed by townspeople (who can still beat me up and deny me the job).”
Given a sufficiently well-developed theory of cognition, you could try to read the AI’s mind and predict what cognitive state it would enter if it really did think it had the opportunity to take over.
And you could set up simulations (and try to spoof the AI’s internal sensations, and so on) in a way that your theory of cognition predicts would be very similar to the cognitive state the AI would enter once it really had the option to betray you.‡
But the link between these states that you induce and observe in the lab, and the state where the AI actually has the option to betray you, depends fundamentally on your untested theory of cognition. An AI’s mind is liable to change quite a bit as it develops into a superintelligence!
If the AI creates new successor AIs that are smarter than it, those AIs’ internals are likely to differ from the internals of the AI you studied before. When you learn only from a mind Before, any application of that knowledge to the minds that come After routes through an untested theory of how minds change between the Before and the After.
Running the AI until it has the opportunity to betray you for real, in a way that’s hard to fake, is an empirical test of those theories in an environment that differs fundamentally from any lab setting.
Many a scientist (and many a programmer) knows that their theories of how a complicated system is going to work in a fundamentally new operating environment often don’t go well on the first try.§ This is a research problem that calls for an “unfair” level of predictability, control, and theoretical insight, in a domain with unusually low levels of understanding — with all of our lives on the line if the experiment’s result disconfirms the engineers’ hopes.
This is why it seems overdetermined, from our perspective, that researchers should not rush ahead to push the frontier of AI as far as it can be pushed. This is a legitimately insane thing to attempt, and a legitimately insane thing for any government to let happen.
It is interesting that the analogy here is to evaluating your co-worker Bob as the mock dictator of your town and then deploying him as the actual dictator of the country.
In reality, in AI there would be multiple generations of different “Bobs” (although it is possible—and needs to be studied—that they would have affinity for one another and try to communicate or influence each other via hidden means). These “Bobs” will be growing in capability while also actually being deployed in the world, with real-world consequences. Our current “Bobs” are starting to be deployed as coding and personal assistants for short-term, well-defined coding tasks, but these will grow from minutes to hours to days to months and more. “Bob” will start doing the work of an intern, then a junior worker, then a more senior one, managing teams of other “Bobs”. To continue the analogy, maybe “Bob” will be the actual (non-mock) mayor of a town, then a province, and maybe then rule over a whole country. But all the while, all the Bobs would keep seeming perfectly helpful (achieving 100% success at the task of hiding their true intentions, even though they are not yet good enough to achieve 100% success at other tasks), until the day when (by design or by accident) we make the Bobs dictators of the whole world.
I am not saying such a scenario is logically impossible. It just seems highly unlikely to me. To be clear, the part that seems unlikely is not that AI will eventually be so powerful and so integrated into our systems that it could cause catastrophic outcomes if it behaved in an arbitrarily malicious way. The part I find unlikely is that we would not be able to see multiple failures along the way that are growing in magnitude. Of course it is also possible that we will “explain away” these failures and still end up in a very bad place. I just think that it wouldn’t be the case that we had one shot but we missed it, but rather had many shots and missed them all. This is the reason why we (alignment researchers at various labs, universities, and nonprofits) are studying questions such as scheming, collusion, and situational awareness, as well as studying methods for alignment and monitoring. We are constantly learning and updating based on what we find out.
I am wondering if there is any empirical evidence from current AIs that would modify your / @Eliezer Yudkowsky’s expectations of how likely this scenario is to materialize.
Get to near-0 failure in alignment-loaded tasks that are within the capabilities of the model.
That is, when we run various safety evals, I’d like it if the models genuinely scored near-0. I’d also like it if the models ~never refused improperly, ~never answered when they should have refused, ~never precipitated psychosis, ~never deleted whole codebases, ~never lied in the CoT, and similar.
These are all behavioral standards, and are all problems that I’m told we’ll keep under control. I’d like the capacity for us to have them under control demonstrated currently, as a precondition of advancing the frontier.
So far, I don’t see that the prosaic plans work in the easier, near-term cases, and am being asked to believe they’ll work in the much harder future cases. They may work ‘well enough’ now, but the concern is precisely that ‘well enough’ will be insufficient in the limit.
An alternative condition is ‘full human interpretability of GPT-2 Small’.
This probably wouldn’t change my all-things-considered view, but this would substantially ‘modify my expectations’, and make me think the world was much more sane than today’s world.
I know you’re pointing out that the easier case still isn’t working, but I just want to caution against the “drive it to zero” mentality, since I worry strongly that it’s exactly the mentality researchers often have.
When that’s your mental model, reducing rates will seem like progress.
The part I find unlikely is that we would not be able to see multiple failures along the way that are growing in magnitude
IMO the default failure mode here is:
We do observe them (or early versions of them)
The lab underinvests in the problem
It becomes enough of a problem that it’s painful for product or internal capabilities usage
We didn’t invest enough to actually solve the underlying problem, and we can’t afford to not use the model while we wait for alignment research to catch up
The lab patches over the problem with some “reduces but does not eliminate” technique
The model is then usable, but with harder-to-detect misalignment
Scale capabilities and repeat
This is the exact loop we’re in now, and the dynamics only intensify with time and capabilities.
The situation you’re describing definitely concerns me, and is about mid-way up the hierarchy of nested problems as I see it (I don’t mean ‘hierarchy of importance’, I mean ‘spectrum from object-level empirical work to realm of pure abstraction’).
I tried to capture this at the end of my comment, by saying that even success as I outlined it probably wouldn’t change my all-things-considered view (because there’s a whole suite of nested problems at other levels of abstraction, including the one you named), but it would at least update me toward the plausibility of the case they’re making.
As is, their own tests say they’re doing poorly, and they’ll probably want to fix that in good faith before they try tackling the kind of dynamic group epistemic failures that you’re pointing at.
I just think that it wouldn’t be the case that we had one shot but we missed it, but rather had many shots and missed them all.
This interpretation only works if by missed shots you mean “missed opportunities to completely solve alignment”. Otherwise, you can observe multiple failures along the way and fix observable scheming, but you only need to miss one alignment failure at the last capability level. The point is just that your monitoring methods, even improved after many failures to catch scheming in the pre-takeover regime, are finally tested only when the AI really can take over, because a real ability to take over is hard to fake. And you can’t repeat this test after improving your monitoring if you failed it. Maybe your alignment training after a previously observed failure in the pre-takeover regime really did make the AI non-scheming. But maybe you just missed some short thought where the AI decided not to think about takeover, since it can’t win yet. And you’ll need to rely on your monitoring without actually testing whether it can catch all such possibilities that depend on the actual environment where takeover is possible.
You seem to be assuming that you cannot draw any useful lessons from cases where failure falls short of killing everyone on earth that would apply to cases where it does.
However, if AIs advance continuously in capabilities, then there are many intermediate points between today, where (for example) “failure means a prompt injection causes a privacy leak”, and the point where “failure means everyone is dead”. I believe that if the AIs capable of the latter will be scaled-up versions of current models, then by studying which alignment methods scale and do not scale, we can obtain valuable information.
If you consider the METR graph, of the duration of tasks (roughly) quadrupling every year, then you would expect non-trivial gaps between the points at which (to take the cybersecurity example) AI is at the level of a 2025 top expert, AI is equivalent to a 2025 top-level hacking team, and AI reaches 2025 top nation-state capabilities. (And of course, while AI improves, the humans will be using AI assistance also.)
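As a rough back-of-the-envelope illustration of why that growth rate still leaves calendar time between those points (illustrative numbers only: the 1-hour 2025 baseline and the exact 4x/year rate are my assumptions, not METR’s published figures):

```python
# Toy extrapolation of AI task horizons under an assumed ~4x/year growth rate.
# Both the 2025 baseline and the growth rate are illustrative assumptions.
BASELINE_HOURS = 1.0     # assumed autonomous-task horizon in 2025
GROWTH_PER_YEAR = 4.0    # assumed ~4x/year growth in task horizon

for years_ahead in range(7):
    horizon = BASELINE_HOURS * GROWTH_PER_YEAR ** years_ahead
    print(f"2025+{years_ahead}: ~{horizon:8,.0f} hours (~{horizon / 24:7,.1f} days)")
```

On these assumed numbers, a work-day-long task horizon, a work-month-long one, and a work-year-long one arrive roughly two years apart from each other, which is the kind of gap the argument above is leaning on.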
I believe there is going to be a long and continuous road ahead between current AI systems and ones like Sable in your book. I don’t believe that there is going to be an alignment technique that works one day and completely fails after a 200K GPU 16 hour run. Hence I believe we will be able to learn from both successes and failures of our alignment methods throughout this time.
Of course, it is possible that I am wrong, and future superintelligent systems could not be obtained by merely scaling up current AIs, but rather this would require completely different approaches. However, if that is the case, this should update us to longer timelines, and cause us to consider development of the current paradigm less risky.
I’m not sure what Eliezer thinks, but I don’t think it’s true that “you cannot draw any useful lessons from [earlier] cases”, and that seems like a strawman of the position. They make a bunch of analogies in the book, like you launch a rocket ship, and after it’s left the ground, your ability to make adjustments is much lower; sure you can learn a bunch in simulation and test exercises and laboratory environments, but you are still crossing some gap (see p. ~163 in the book for full analogy). There are going to be things about the Real Deal deployment that you were not able to test for. One of those things for AI is that “try to take over” is a more serious strategy, somewhat tautologically because the book defines the gap as:
Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed. (p. 161)
I don’t see where you are defusing this gap or making it nicely continuous such that we could iteratively test our alignment plans as we cross it.
It seems like maybe you’re just accepting that there is this one problem that we won’t be able to get direct evidence about in advance, but you’re optimistic that we will learn from our efforts to solve various other AI problems which will inform this problem.
When you say “by studying which alignment methods scale and do not scale, we can obtain valuable information”, my interpretation is that you’re basically saying “by seeing how our alignment methods work on problems A, B, and C, we can obtain valuable information about how they will do on separate problem D”. Is that right?
Just to confirm, do you believe that at some point there will be AIs that could succeed at takeover if they tried? Sometimes I can’t tell if the sticking point is that people don’t actually believe in the second regime.
I don’t believe that there is going to be an alignment technique that works one day and completely fails after a 200K GPU 16 hour run.
There are rumors that many capability techniques work well at a small scale but don’t scale very well. I’m not sure this is well studied, but if it were, that would give us some evidence about this question. Another relevant result that comes to mind is reward hacking and Goodharting, where models often look good when only a little optimization pressure is applied but it’s pretty easy to overoptimize as you scale up; as I think about these examples, it actually seems like this phenomenon is pretty common? And sure, we can quibble about how much optimization pressure is applied in current RL vs. some unknown parallel scaling method, but it seems quite plausible that things will be different at scale, and sometimes for the worse.
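For concreteness, here is a minimal toy model of that overoptimization dynamic (purely illustrative; the Gaussian/Cauchy choices are my assumptions, and this is not a claim about how actual RL training behaves): candidates are scored by a noisy proxy, and best-of-n selection against the proxy stands in for increasing optimization pressure.

```python
import numpy as np

# Toy Goodhart demo: pick the best of n candidates by a noisy *proxy* score
# and track the *true* score of whatever gets picked as n (the amount of
# optimization pressure against the proxy) increases.
rng = np.random.default_rng(0)

def true_score_of_proxy_winner(n, trials=2_000):
    true = rng.normal(size=(trials, n))            # true quality, light-tailed
    error = rng.standard_cauchy(size=(trials, n))  # proxy error, heavy-tailed
    proxy = true + error
    winner = proxy.argmax(axis=1)                  # optimize against the proxy
    return true[np.arange(trials), winner].mean()  # what we actually obtained

for n in (1, 4, 16, 64, 256, 1024):
    print(f"best-of-{n:4d}: mean true score {true_score_of_proxy_winner(n):+.2f}")
```

In this toy setup, light selection pressure buys a little true quality, but as n grows the winner is increasingly just the candidate with the most extreme proxy error, so the true score stops improving and drifts back toward the unoptimized baseline even while the proxy score keeps climbing.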
Treating “takeover” as a single event brushes a lot under the carpet.
There are a number of capabilities involved—cybersecurity, bioweapons, etc.—that models are likely to develop at different stages. I agree AI will ultimately far surpass our 2025 capabilities in all these areas. Whether that would be enough to take over the world at that point in time is a different question.
Then there are propensities. Taking over requires the model to have the propensity to “resist our attempts to change its goal” as well as to act covertly in pursuit of its own objectives, which are not the ones it was instructed to pursue. (I think these days we are not really thinking models are going to misunderstand their instructions in a “monkey’s paw” style.)
If we do our job right in alignment, we would be able to drive these propensities down to zero. But if we fail, I believe these propensities will grow over time, and as we iteratively deploy AI systems with growing capabilities, even if we fail to observe these issues in the lab, we will observe them in the real world well before they reach the scale of killing everyone.
There are a lot of bad things that AIs can do before literally taking over the world. I think there is another binary assumption here, which is that the AI’s utility function is binary—that somehow the expected value calculations work out such that we get no signal until the takeover.
Re my comment on the 16-hour, 200K GPU run: I agree that things can be different at scale, and it is important to keep measuring them as scale increases. What I meant is that even when things get worse with scale, we would be able to observe it. But the example in the book—as I understood it—was not a “scale-up.” A scale-up is when you do a completely new training run; in the book, that run was just some “cherry on top”—one extra gradient step—which presumably was minor in terms of compute compared to all that came before it. I don’t think one step will make the model suddenly misaligned. (Unless it completely borks it, which would be very observable.)
Thanks for your reply. Noting that it would have been useful for my understanding if you had also directly answered the 2 clarifying questions I asked.
There are a lot of bad things that AIs can do before literally taking over the world.
Okay, it does sound like you’re saying we can learn from problems A, B, and C in order to inform D, where D is “the model tries to take over once it is smart enough”, A is something like jailbreakability, and B is goal preservation. It seems to me like somebody who wants humanity to gamble on the superalignment strategy (or otherwise build ASI systems at all, though superalignment is a marginally more detailed plan) needs to argue that our methods for dealing with A, B, and C are very likely to generalize to D.
Maybe I’m misunderstanding though, it’s possible that you mean the same AIs that want to eventually take over will also take a bunch of actions to tip their hand earlier on. This seems mostly unlikely to me, because that’s an obviously dumb strategy and I expect ASIs to not pursue dumb strategies. I agree that current AIs do dumb things like this, but these are not the AIs I’m worried about.
Whether that would be enough to take over the world at that point in time is a different question.
To repeat my second clarifying question from above, do you believe that at some point there will be AIs that could succeed at takeover if they tried? If we were talking about the distribution shift that a football team undergoes from training to Game Day, and you didn’t think the game would ever happen, that sounds like it’s the real crux, not some complicated argument about how well the training drills match the game.
I think it’s more like we have problems A_1, A_2, A_3, ….. and we are trying to generalize from A_1 ,...., A_n to A_{n+1}.
We are not going to go from jailbreaking the models to give a meth recipe to taking over the world. We are constantly deploying AIs in more and more settings, with time horizons and autonomy that are continuously growing. There isn’t one “Game Day.” Models are already out in the field right now, and both their capabilities as well as the scope that they are deployed in is growing all the time.
So my mental model is that there is a sequence of models M_1, M_2, … of growing capabilities, with no clear single point where we reach AGI or ASI but more of a continuum. (Also, models might come from different families or providers and have somewhat incomparable capabilities.)
Now suppose you have such a sequence of models M_1, M_2, … of growing capabilities. I don’t think it would be the case that model M_n develops the propensity to act covertly and pursue its own goals, yet the only goal it cares about is taking over the world, and it also identifies with future models, and so it decides to “lie in wait” until generation M_{n+k}, when it would act on that.
I think that the propensity to act covertly and pursue misaligned goals will change continuously between generations of models; it may grow, stay the same, or shrink, but in any case it will be possible to observe it well before we reach ASI.
Regarding your second question of whether AIs would be powerful enough to take over the world at some point:
My assumption is that AIs will grow in capabilities and integration in the world economy. If progress continues on the current trajectory then there would be a point where a variety of AI models are deeply integrated in our infrastructure. My hope (and what I and other alignment and safety researchers are working on) is that by then we would have strong ways to measure, monitor, and predict the envelope of potential risks for these models.
I am not sure it would make sense to think about these models as a singular entity, but I agree that at the point where we reach such deep integration and reliance, if all of these models were to suddenly and simultaneously act maliciously, then they would be successful in causing an arbitrary amount of damage, quite possibly up to an extinction-level event.
I don’t think it would be the case that model M_n develops the propensity to act covertly and pursue its own goals, yet the only goal it cares about is taking over the world, and it also identifies with future models, and so it decides to “lie in wait” until generation M_{n+k}, when it would act on that.
Suppose that all scheming AIs were perfectly allied. As in, M_n is indifferent between it taking over and M_{n+k} taking over, so M_n is willing to be patient if that would be a better strategy. (We can imagine this applying within a company and between companies.) Would this change your perspective?
For what it’s worth, I broadly agree with “we’ll probably be able to study the problem of scheming AIs hiding their intentions in advance, including by studying reasonably analogous (to the dangerous AIs we’re worried about) ‘actual natural scheming AIs’ that consistently try to hide their intentions in training/testing”, but it’s pretty easy for this to go poorly. The main reasons why it’s easy to go poorly are that schemers actively want to avoid being detected and to avoid being studied and we might be extremely rushed due to a software intelligence explosion. Another mundane issue is that we just might not have that many distinct capable (scheming) AIs to study (as creating capable AIs is expensive) and findings might not transfer.
Here’s an example situation where it seems like studying scheming is pretty hard: suppose that schemers are decently allied and that all schemers above some capability level X are smart enough to reliably lie in wait undetected and takeover is easy/possible for AIs (collectively) at capability level Y where Y>>X. Then, naively, you can only study schemers below capability level X while your methods still have to generalize all the way to Y. Further, schemers below capability level X might be particularly disanalogous because schemers that have trouble avoiding detection have importantly different properties from schemers that do avoid detection. You can study schemers which are just barely below X, but then these schemers will probably be hard to detect and you’ll miss them a bunch which makes studying them harder (e.g. because you can’t be confident whether some method avoided scheming or if it instead made the AI a bit better at hiding its scheming which suffices to make it not be detectable). You could try to overcome this in a bunch of ways, e.g. trying to create schemers which don’t have allied goals so they are more likely to take risks or to accept deals.
This example situation is relatively pessimistic, but not totally implausible and just subsets of these factors could still cause serious issues.
A lot of my view is that this is actually somewhat conceptually tricky to study and generally a pain for a bunch of reasons (e.g. you might have to train a bunch of diverse AIs), and as it currently stands we don’t seem on track to do a great job studying this on time if capabilities progress as quickly as seems pretty likely. It also seems like companies might react very poorly to clear cut evidence for risk.
Hi Ryan, will be brief, but generally:
1. I agree that scheming and collusion are some of the more difficult settings to study, as is understanding the impact of situational awareness on evaluations.
2. I still think it is possible to study these in current and upcoming models, and get useful insights. It may well be that these insights will be that the problems are becoming worse with scale and we don’t have good solutions for them yet.
I note that, to my eyes, you appear to be straightforwardly accepting the need-to-generalize claim and arguing for ability-to-generalize. Putting words in your mouth a little, what I see you saying is that, by the time we have a true loss-of-control-can-be-catastrophic moment where failure kills boazbarak, we will have had enough failure recoveries on highly similar systems to be sure the deadly-failure probability is indistinguishable from zero, that the maximum likely failure consequence is shrinking as fast as or faster than model capability.
But current approaches don’t seem to me to zero out the rate of failures above a certain level of catastrophicness. They’re best seen as continuous in probability, not continuous in failure size.
I am not sure I 100% understand what you are saying. Again, like I wrote elsewhere, it is possible that, for one reason or another, rather than systems becoming safer and more controlled, they will become less safe and riskier over time. It is possible that we will have a sequence of failures growing in magnitude over time, but that for one reason or another we do not address them, and hence end up in a very large-scale catastrophe.
It is possible that current approaches are not good enough and will not improve fast enough to match the stakes at which we want to deploy AI. If that is the case then it will end badly, but I believe that we will see many bad outcomes well before an extinction event. To put it crudely, I would expect that if we are on a path to that ending, the magnitude of harms caused by AI will climb exponentially over time, similar to how other capabilities are growing.
“future superintelligent systems could not be obtained by merely scaling up current AIs, but rather this would require completely different approaches. However, if that is the case, this should update us to longer timelines, and cause us to consider development of the current paradigm less risky.”
This doesn’t feel like convincing reasoning to me. For one, there is also a third option, which is that both scaling up current methods (with small modifications) and paradigm shifts could lead us to superintelligence. To me, this seems intuitively to be the most likely situation. Also paradigm shifts could be around the corner at any point, any of the vast number of research directions could give us a big leap in efficiency for example at any point.
Note that this is somewhat of an anti-empirical stance—by hypothesizing that superintelligence will arrive by some unknown breakthrough that would both take advantage of current capabilities and render current alignment methods moot—you are essentially saying that no evidence can update you.
One thing I like about your position is that you basically demand of Eliezer and Nate to tell you what kind of alignment evidence would update them towards believing it’s safe to proceed. As in, E&N say we would need really good interp insights, good governance, good corrigibility on hard tasks, and so on. I would expect that they put the requirements very high and that you would reject these requirements as too high, but still seems useful for Eliezer and Nate to state their requirements. (Though perhaps they have done this at some point and I missed it)
To respond to your claim that no evidence could update ME and that I am anti-empirical: I don’t quite see where I wrote anything like that. I am making the literal point that you say there are two options, either scaling up current methods leads to superintelligence or it requires new paradigm shifts/totally new approaches. But there is also a third option: that there are multiple paths forward right now to superintelligence, both paradigm shifts and scaling up.
Yes, I do expect that current “alignment” methods like RLHF or CoT monitoring will predictably fail for overdetermined reasons when systems are powerful enough to kill us and run their own economy. There is empirical evidence against CoT monitoring and against RLHF. In both cases we could have also predicted failure without empirical evidence, just from conceptual thinking (people will upvote what they like vs. what’s true; CoT will become less understandable the less the model is trained on human data), though the evidence helps. I am basically seeing lots of evidence that current methods will fail, so no, I don’t think I am anti-empirical. I also don’t think that empiricism should be used as anti-epistemology or as an argument for not having a plan and blindly stepping forward.
I also believe that our current alignment methods will not scale and that we need to develop new ones. In particular, I am a co-author of the scheming paper mentioned in your first link.
As I said multiple times, I don’t think we will succeed by default. I just think that if we fail we will do so multiple times with failures continually growing in magnitude and impact.
In this framing, the crux is whether there is an After at all (at any level of capability): the distinction between “failure doesn’t kill the observer” (a perpetual Before) and “failure is successfully avoided” (managing to navigate the After).
Here’s my attempted phrasing, which I think avoids some of the common confusions:
Suppose we have a model M with utility function ϕ, where M is not capable of taking over the world. Assume that thanks to a bunch of alignment work, ϕ is within δ (by some metric) of humanity’s collective utility function. Then in the process of maximizing ϕ, M ends up doing a bunch of vaguely helpful stuff.
Then someone releases model M′ with utility function ϕ′, where M′ is capable of taking over the world. Suppose that our alignment techniques generalize perfectly. That is, ϕ′ is also within δ′ of humanity’s collective utility function, where δ′ ≤ δ. Then in the process of maximizing ϕ′, M′ gets rid of humans and rearranges their molecules to satisfy ϕ′ better.
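Restating that setup compactly in the same made-up notation (writing $d(\cdot,\cdot)$ for whichever metric on utility functions is intended, and $\phi_{\text{humanity}}$ for humanity’s collective utility function; both shorthands are my own):

$$ d(\phi, \phi_{\text{humanity}}) \le \delta, \qquad d(\phi', \phi_{\text{humanity}}) \le \delta' \le \delta, $$

and yet the world state $x^* = \arg\max_x \phi'(x)$, once reached by an optimizer actually powerful enough to reach it, can still score catastrophically under $\phi_{\text{humanity}}$, because closeness under $d$ need not control the value of $\phi_{\text{humanity}}$ at the optimum of $\phi'$.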
This is an excellent encapsulation of (I think) something different—the “fragility of value” issue: “formerly adequate levels of alignment can become inadequate when applied to a takeover-capable agent.” I think the “generalization gap” issue is “those perfectly-generalizing alignment techniques must generalize perfectly on the first try”.
Attempting to deconfuse myself about how that works if it’s “continuous” (someone has probably written the thing that would deconfuse me, but as an exercise): if AI power progress is “continuous” (which training is, but model-sequence isn’t), it goes from “you definitely don’t have to get it right at all to survive” to “you definitely get only one try to get it sufficiently right, if you want to survive,” but by what path? In which of the terms “definitely,” “one,” and “sufficiently” is it moving continuously, if any?
I certainly don’t think it’s via the number of tries you get to survive! I struggle to imagine an AI where we all die if we fail to align it three times in a row.
I don’t put any stock in “sufficiently,” either—I don’t believe in a takeover-capable AI that’s aligned enough to not work toward takeover, but which would work toward takeover if it were even more capable. (And even if one existed, it would have to eschew RSI and other instrumentally convergent things, else it would just count as a takeover-causing AI.)
It might be via the confidence of the statement. Now, I don’t expect AIs to launch highly-contingent outright takeover attempts; if they’re smart enough to have a reasonable chance of succeeding, I think they’ll be self-aware enough to bide their time, suppress the development of rival AIs, and do instrumentally convergent stuff while seeming friendly. But there is some level of self-knowledge at which an AI will start down the path toward takeover (e.g., extricating itself, sabotaging rivals) and succeed with a probability that’s very much neither 0 nor 1. Is this first, weakish, self-aware AI able to extricate itself? It depends! But I still expect the relevant band of AI capabilities here to be pretty narrow, and we get no guarantee it will exist at all. And we might skip over it with a fancy new model (if it was sufficiently immobilized during training or guarded its goals well).
Of course, there’s still a continuity in expectation: when training each more powerful model, it has some probability of being The Big One. But yeah, I more or less predict a Big One; I believe in an essential discontinuity arising here from a continuous process. The best analogy I can think of is how every exponential with r<1 dies out and every r>1 goes off to infinity. When you allow dynamic systems, you naturally get cuspy behavior.
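Spelling out the exponential analogy in symbols:

$$ \lim_{n \to \infty} r^n = \begin{cases} 0 & \text{if } 0 \le r < 1, \\ 1 & \text{if } r = 1, \\ \infty & \text{if } r > 1, \end{cases} $$

so a parameter that varies continuously still produces a sharply discontinuous long-run outcome as it crosses 1, which is the sense in which a continuous process can yield an essential discontinuity.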
Upon reflection, I agree that my previous comment describes fragility of value.
My mental model is that the standard MIRI position[1] claims the following[2]:
1. Because of the way AI systems are trained, δ and δ′ will be large even if we knew humanity’s collective utility function and could target that (this is inner misalignment)
2. Even if δ′ were fairly small, this would still result in catastrophic outcomes if M′ is an extremely powerful optimizer (this is fragility of value)
A few questions:
3. Are the claims (1) and (2) accurate representations of inner misalignment and fragility of value?
4. Is the “misgeneralization” claim just “δ′ will be much larger than δ”?
If the answer to (4) is yes, I am confused as to why the misgeneralization claim is brought up. It seems that (1) and (2) are sufficient to argue for AI risk. By contrast, it seems that the misgeneralization claim is neither sufficient nor necessary to make a case for AI risk. Furthermore, the misgeneralization claim seems less likely to be true than (1) and (2).
Also let me know if I am thinking about things in a completely wrong framework and should scrap my made up notation.
I like your made up notation. I’ll try to answer, but I’m an amateur in both reasoning-about-this-stuff and representing-others’-reasoning-about-this-stuff.
I think (1) is both inner and outer misalignment. (2) is fragility of value, yes.
I think the “generalization step is hard” point is roughly “you can get δ low by trial and error. The technique you found at the end that gets δ low—it better not intrinsically depend on the trial and error process, because you don’t get to do trial and error on δ‘. Moreover, it better actually work on M’.”
Contemporary alignment techniques depend on trial and error (post-training, testing, patching). That’s one of their many problems.
My suggested term for standard MIRI thought would just be Mirism.
I kinda don’t like “generalization” as a name for this step. Maybe “extension”? There are too many steps where the central difficulty feels analogous to the general phenomenon of failure-of-generalization-OOD: the difficulty in getting δ to be small, the difficulty of going from techniques for getting δ small to techniques for getting a small δ′ (verbiage different because of the first-time constraint), the disastrousness of even smallish δ’…
I overall agree with this framing, but I think even in Before sufficiently bad mistakes can kill you, and in After sufficiently small mistakes wouldn’t. So, it’s mostly a claim about how strongly the mistakes would start to be amplified at some point.
The gap between Before and After is the gap between “you can observe your failures and learn from them” and “failure kills the observer”. Continuous motion between those points does not change the need to generalize across them.
It is amazing how much of an antimeme this is (to some audiences). I do not know any way of saying this sentence that causes people to see the distributional shift I’m pointing to, rather than mapping it onto some completely other idea about hard takeoffs, or unipolarity, or whatever.
Where do you think you’ve spelled this argument out best? I’m aware of a lot of places where you’ve made the argument in passing, but I don’t know of anywhere where you say it in depth.
My response last time (which also wasn’t really in depth; I should maybe try to articulate my position better sometime...) was this:
I’m guessing this won’t turn out to resolve your current disagreement, but I think the best articulation of this is probably in the Online Resource page: A Closer Look at Before And After.
From past discussions, it sounds like you think “the AIs are now capable of confidently taking over” is, like, <50% (at least < 60%?) likely to in practice be a substantially different an environment.
I don’t really get why. But, to be fair, on my end, I also don’t really have much more gears underneath the hood of “obviously, it’s way different to run tests and interventions on someone who isn’t capable of confidently taking over, vs someone who is, because they just actually have the incentive to defect in the latter case and mostly don’t in the former”. It seems like there’s just a brute difference in intuition I’m not sure what to do with?
(I agree there might be scenarios like “when the AI takeover is only 10% likely to work, it might try anyway, because it anticipates more powerful AIs coming later and now seems like it’s best shot.” That’s a reason you might get a warning shot, but, not a reason that “it can actually just confidently takeover with like 95%+ likelihood” doesn’t count as a significantly new environment once we actually get to that stage.)
Text of the relevant resource section, for reference
It is interesting that the analogy here is to evaluating your co worker Bob on being the mock dictator of your town and then deploying them as the actual dictator of the country.
In reality in AI there would be multiple generations of different “Bobs” (although it is possible—and needs to be studied—that they would have affinity for one another and try to communicate or influence each other via hidden means). These “bobs” will be growing in both capability but also actually deployed in the world with real-world consequences. Our current “Bobs” are starting to be deployed as coding and personal assistant for short term well-defined coding tasks, but these will grow from minutes, to hours to days to months and more. “Bob” will start doing work of an intern, then a junior worker, then more senior one, managing teams of other “Bobs”. To continue the analogy, maybe “Bob” will be the actual (non mock) mayor of a town, then a province, and maybe then rule over a whole country. But all the while all the Bobs would keep seeming perfectly helpful. (Getting 100% success in the task of hiding their true intentions, even though they are not yet good enough to get 100% success in other tasks) until the day when (by design or by accident) we make the Bobs dictators of the whole world.
I am not saying such a scenario is logically impossible. It just seems highly unlikely to me. To be clear, the part that seems unlikely is not that AI will be eventually so powerful and integrated in our systems, that it could cause catastrophic outcomes if it behaved in an arbitrarily malicious way. The part I find unlikely is that we would not be able to see multiple failures along the way that are growing in magnitude. Of course it is also possible that we will “explain away” these failures and still end up in a very bad place. I just think that it wouldn’t be the case that we had one shot but we missed it, but rather had many shots and missed them all. This is the reason why we (alignment researchers at various labs, universities, non profits) are studying questions such as scheming, colluding, situational awareness, as well as studying methods for alignment and monitoring. We are constantly learning and updating based on what we find out.
I am wondering if there is any empirical evidence from current AIs that would modify your / @Eliezer Yudkowsky ’s expectations of how likely this scenario is to materialize.
Get to near-0 failure in alignment-loaded tasks that are within the capabilities of the model.
That is, when we run various safety evals, I’d like it if the models genuinely scored near-0. I’d also like it if the models ~never refused improperly, ~never answered when they should have refused, ~never precipitated psychosis, ~never deleted whole codebases, ~never lied in the CoT, and similar.
These are all behavioral standards, and are all problems that I’m told we’ll keep under control. I’d like the capacity for us to have them under control demonstrated currently, as a precondition of advancing the frontier.
So far, I don’t see that the prosaic plans work in the easier, near-term cases, and am being asked to believe they’ll work in the much harder future cases. They may work ‘well enough’ now, but the concern is precisely that ‘well enough’ will be insufficient in the limit.
An alternative condition is ‘full human interpretability of GPT-2 Small’.
This probably wouldn’t change my all-things-considered view, but this would substantially ‘modify my expectations’, and make me think the world was much more sane than today’s world.
I know you’re pointing out the easier case still not working, but I just want to caution against the “drive it to zero” mentality, since I worry strongly that it’s the exact mentality researchers often have.
When that’s your mental model, reducing rates will seem like progress.
IMO the default failure mode here is:
We do observe them (or early versions of them)
The lab underinvests in the problem
It becomes enough of a problem that its painful to product or internal capabilities usage
We didn’t invest enough to actually solve the underlying problem, and we can’t afford to not use the model while we wait for alignment research to catch up
The lab patches over the problem with some “reduces but does not eliminate” technique
The model is then usable, but with harder to detect misalignment
Scale capabilities and repeat
This is the exact loop we’re in now, and the dynamics only intensify with time and capabilities.
The situation you’re describing definitely concerns me, and is about mid-way up the hierarchy of nested problems as I see it (I don’t mean ‘hierarchy of importance’ I mean ’spectrum from object-level-empirical-work to realm-of-pure-abstraction).
I tried to capture this at the end of my comment, by saying that even success as I outlined it probably wouldn’t change my all-things-considered view (because there’s a whole suite of nested problems at other levels of abstraction, including the one you named), but it would at least update me toward the plausibility of the case they’re making.
As is, their own tests say they’re doing poorly, and they’ll probably want to fix that in good faith before they try tackling the kind of dynamic group epistemic failures that you’re pointing at.
This interpretation only works if by missed shots you mean “missed opportunities to completely solve alignment”. Otherwise you can observe multiple failures along the way and fix observable scheming, but you only need to miss one alignment failure on the last capability level. The point is just that your monitoring methods, even improved after many failures to catch scheming in pre-takeover regime, are finally tested only when AI is really can take over. Because real ability to take over is hard to fake. And you can’t repeat this test after you improved your monitoring, if you failed. Maybe your alignment training after previous observed failure in pre-takeover regime really made AI non-scheming. But maybe you just missed some short thought where AI decided to not think about takeover since it can’t win yet. And you’ll need to rely on your monitoring without actually testing whether it can catch all such possibilities that depend on actual environment that allows takeover.
Yep. Feel free to add it here
You seem to be assuming that you cannot draw any useful lessons from cases where failure falls short of killing everyone on earth that would apply to cases where it does.
However, if AI’s advance continuously in capabilities, then there are many intermediate points between today where (for example) “failure means prompt injection causes privacy leak” and “failure means everyone is dead”. I believe that if AIs that capable of the latter would be scaled up version of current models, then by studying which alignment methods scale and do not scale, we can obtain valuable information.
If you consider the METR graph, of (roughly) duration of tasks quadrupling every year, then you would expect non-trivial gaps between the points. that (to take the cybersecurity example) AI is at the level of a 2025 top expert, AI can be equivalent to a 2025 top level hacking team, AI reaches 2025 top nation state capabilities. (And of course while AI improves , the humans will be using AI assistance also.)
I believe there is going to be a long and continuous road ahead between current AI systems and ones like Sable in your book.
I don’t believe that there is going to be an alignment technique that works one day and completely fails after a 200K GPU 16 hour run.
Hence I believe we will be able to learn from both successes and failures of our alignment methods throughout this time.
Of course, it is possible that I am wrong, and future superintelligent systems could not be obtained by merely scaling up current AIs, but rather this would require completely different approaches. However, if that is the case, this should update us to longer timelines, and cause us to consider development of the current paradigm less risky.
I’m not sure what Eliezer thinks, but I don’t think it’s true that “you cannot draw any useful lessons from [earlier] cases”, and that seems like a strawman of the position. They make a bunch of analogies in the book, like you launch a rocket ship, and after it’s left the ground, your ability to make adjustments is much lower; sure you can learn a bunch in simulation and test exercises and laboratory environments, but you are still crossing some gap (see p. ~163 in the book for full analogy). There are going to be things about the Real Deal deployment that you were not able to test for. One of those things for AI is that “try to take over” is a more serious strategy, somewhat tautologically because the book defines the gap as:
I don’t see where you are defusing this gap or making it nicely continuous such that we could iteratively test our alignment plans as we cross it.
It seems like maybe you’re just accepting that there is this one problem that we won’t be able to get direct evidence about in advance, but you’re optimistic that we will learn from our efforts to solve various other AI problems which will inform this problem.
When you say “by studying which alignment methods scale and do not scale, we can obtain valuable information”, my interpretation is that you’re basically saying “by seeing how our alignment methods work on problems A, B, and C, we can obtain valuable information about how they will do on separate problem D”. Is that right?
Just to confirm, do you believe that at some point there will be AIs that could succeed at takeover if they tried? Sometimes I can’t tell if the sticking point is that people don’t actually believe in the second regime.
There are rumors that many capability techniques work well at a small scale but don’t scale very well. I’m not sure this is well studied, but if it was, that would give us some evidence about this question. Another relevant result that comes to mind is reward hacking and Goodharting where often models look good when only a little optimization pressure is applied but then it’s pretty easy to overoptimize as you scale u; as I think about these examples it actually seems like this phenomenon is pretty common? And sure, we can quibble about how much optimization pressure is applied in current RL vs. some unknown parallel scaling method, but it seems quite plausible that things will be different at scale and sometimes for the worse.
Treating “takeover” as a single event brushes a lot under the carpet.
There are a number of capabilities involved—cybersecurity, bioweapon, etc.. - that models are likely to develop at different stages. I agree AI will ultimately far surpass our 2025 capabilities in all these areas. Whether that would be enough to take over the world at that point in time is a different questoin.
Then there are propensities. Taking over requires the model to have the propensity to “resist our attempts to change its goal” as well to act covertly in pursuit of its own objectives, which are not the ones it was instructed. (I think these days we are not really thinking models are going to misunderstand their instructions in a “monkey’s paws” style.)
If we do our job right in alignment, we would be able to drive these propensities down to zero.
But if we fail, I believe these propensities will grow over time, and as we iteratively deploy AI systems with growing capabilities, even if we fail to observe these issues in the lab, we will observe them in the real world well before the scale of killing everyone.
There are a lot of bad things that AIs can do before literally taking over the world. I think there is another binary assumption which is that AIs utility function is binary—somehow the expected value calculations work out such that we get no signal until the takeover.
Re my comment on the 16 hour 200K GPU run. I agree that things can be different at scale and it is important to keep measuring them as scale increases. What I meant is that even when things get worse with scale we would be able to observe it. But the exampe of the book—as I understood it—was not a “scale up.” Scale up is when you do a completely new training run, in the book that run was just some “cherry on top”—one extra gradient step—which presumably was minor in terms of compute compared to all that came before it. I don’t think one step will make the model suddenly misaligned. (Unless it completely borks it, which would be very observable.)
Thanks for your reply. Noting that it would have been useful for my understanding if you had also directly answered the 2 clarifying questions I asked.
Okay, it does sound like you’re saying we can learn from problems A, B, and C in order to inform D. Where D is the model tries to take over once it is smart enough. And A is like jailbreak-ability and B is goal preservation. It seems to me like somebody who wants humanity to gamble on the superalignment strategy (or otherwise build ASI systems at all, though superalignment is a marginally more detailed plan) needs to argue that our methods for dealing with A, B, and C are very likely to generalize to D.
Maybe I’m misunderstanding though, it’s possible that you mean the same AIs that want to eventually take over will also take a bunch of actions to tip their hand earlier on. This seems mostly unlikely to me, because that’s an obviously dumb strategy and I expect ASIs to not pursue dumb strategies. I agree that current AIs do dumb things like this, but these are not the AIs I’m worried about.
To repeat my second clarifying question from above, do you believe that at some point there will be AIs that could succeed at takeover if they tried? If we were talking about the distribution shift that a football team undergoes from training to Game Day, and you didn’t think the game would ever happen, that sounds like it’s the real crux, not some complicated argument about how well the training drills match the game.
I think it’s more like we have problems A_1, A_2, A_3, ….. and we are trying to generalize from A_1 ,...., A_n to A_{n+1}.
We are not going to go from jailbreaking the models to give a meth recipe to taking over the world. We are constantly deploying AIs in more and more settings, with time horizons and autonomy that are continuously growing. There isn’t one “Game Day.” Models are already out in the field right now, and both their capabilities as well as the scope that they are deployed in is growing all the time.
So my mental model is there is a sequence of models M_1,M_2,.… of growing capabilities with no clear one point where we reach AGI or ASI but more of a continuum. (Also models might come from different families or providers and have somewhat incomparable capabilities.)
Now suppose you have such a sequence of models M_1,M_2,..… of growing capabilities. I don’t think it would be the case that model M_n develops the propensity to act covertly and pursue its own goals, but the only goal it cares about is taking over the world, and also identifies with future models, and so it decides to “lie in wait” until generation M_{n+k} where it would act on that.
I think that if the propensity to act covertly and pursue misaligned goals would change continuously between generations of models, and it may grow, stay the same, or shrink, but in any case it will be possible to observe it well before we reach ASI.
Regarding your second question of whether AIs would be powerful enough to take over the world at some point:
My assumption is that AIs will grow in capabilities and integration in the world economy. If progress continues on the current trajectory then there would be a point where a variety of AI models are deeply integrated in our infrastructure. My hope (and what I and other alignment and safety researchers are working on) is that by then we would have strong ways to measure, monitor, and predict the envelope of potential risks for these models.
I am not sure it would make sense to think about these models as a singular entity but I agree that at the point we reach such deep integration and reliance, if all of these models were to suddenly and simultaneously act maliciously then they would be successful in causing an arbitrary amount of damage, quite possibly up to an extinction level event.
Suppose that all scheming AIs were perfectly allied. As in, M_n is indifferent between itself taking over and M_{n+k} taking over, so M_n is willing to be patient if that would be a better strategy. (We can imagine this applying within a company and between companies.) Would this change your perspective?
For what it’s worth, I broadly agree with “we’ll probably be able to study the problem of scheming AIs hiding their intentions in advance, including by studying reasonably analogous (to the dangerous AIs we’re worried about) ‘actual natural scheming AIs’ that consistently try to hide their intentions in training/testing”, but it’s pretty easy for this to go poorly. The main reasons are that schemers actively want to avoid being detected and studied, and that we might be extremely rushed due to a software intelligence explosion. Another mundane issue is that we just might not have that many distinct capable (scheming) AIs to study (since creating capable AIs is expensive), and findings might not transfer.
I say much more in this post I recently wrote.
Here’s an example situation where studying scheming seems pretty hard: suppose that schemers are decently allied, that all schemers above some capability level X are smart enough to reliably lie in wait undetected, and that takeover is easy/possible for AIs (collectively) at capability level Y, where Y >> X. Then, naively, you can only study schemers below capability level X while your methods still have to generalize all the way to Y.

Further, schemers below capability level X might be particularly disanalogous, because schemers that have trouble avoiding detection have importantly different properties from schemers that do avoid detection. You can study schemers which are just barely below X, but then these schemers will probably be hard to detect and you’ll miss them a bunch, which makes studying them harder (e.g. because you can’t be confident whether some method avoided scheming or instead made the AI a bit better at hiding its scheming, which is enough to keep it from being detected). You could try to overcome this in a bunch of ways, e.g. trying to create schemers which don’t have allied goals so they are more likely to take risks or to accept deals.
This example situation is relatively pessimistic, but it’s not totally implausible, and even subsets of these factors could still cause serious issues.
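To make the detection ambiguity in that example concrete, here’s a toy numerical sketch (the thresholds, rates, and detection model are all invented for illustration, not taken from anything above): a world where a mitigation genuinely halves the scheming rate and a world where it only makes scheming harder to detect below X can produce essentially the same observed detection counts.

```python
import random

random.seed(0)

X = 0.8            # hypothetical capability threshold above which schemers evade detection entirely
N_MODELS = 10_000  # hypothetical number of trained models we get to evaluate

def observed_detections(p_scheme, detectability):
    """Count detected schemers when detection gets harder as capability approaches X."""
    detected = 0
    for _ in range(N_MODELS):
        capability = random.random()      # capability level in [0, 1]
        if random.random() >= p_scheme:
            continue                      # this model doesn't scheme
        # Detection probability falls toward 0 as capability approaches X (and is 0 above X).
        p_detect = detectability * max(0.0, 1.0 - capability / X)
        if random.random() < p_detect:
            detected += 1
    return detected

# World 1: the mitigation halves the scheming rate (0.20 -> 0.10).
# World 2: the mitigation leaves scheming at 0.20 but halves detectability.
# The made-up parameters are chosen so both worlds have the same expected detection count.
print("mitigation reduces scheming:   ", observed_detections(p_scheme=0.10, detectability=0.8))
print("mitigation only hides scheming:", observed_detections(p_scheme=0.20, detectability=0.4))
```

Both runs land in the same ballpark, which is exactly the ambiguity: from the observed detections alone you can’t tell whether the method reduced scheming or just pushed it further below the detection threshold.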
A lot of my view is that this is actually somewhat conceptually tricky to study and generally a pain for a bunch of reasons (e.g. you might have to train a bunch of diverse AIs), and as it currently stands we don’t seem on track to do a great job of studying this in time if capabilities progress as quickly as seems pretty likely. It also seems like companies might react very poorly to clear-cut evidence of risk.
Hi Ryan, I’ll be brief, but generally:
1. I agree that scheming and collusion are some of the more difficult settings to study, as is understanding the impact of situational awareness on evaluations.
2. I still think it is possible to study these in current and upcoming models and get useful insights. It may well be that these insights will be that the problems are becoming worse with scale and that we don’t have good solutions for them yet.
I note that, to my eyes, you appear to be straightforwardly accepting the need-to-generalize claim and arguing for ability-to-generalize. Putting words in your mouth a little, what I see you saying is that, by the time we have a true loss-of-control-can-be-catastrophic moment where failure kills boazbarak, we will have had enough failure recoveries on highly similar systems to be sure that deadly-failure probability is indistinguishable from zero, and that maximum likely failure consequence is shrinking as fast as or faster than model capability is growing.
But current approaches don’t seem to me to zero out the rate of failures above a certain level of catastrophicness. They’re best seen as continuous in probability, not continuous in failure size.
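In made-up notation, with S_t the size of the worst failure in deployment period t and ε_t(s) = Pr[S_t > s], the distinction I have in mind is roughly:

```latex
% epsilon_t(s) = Pr[S_t > s], with S_t the size of the worst failure in period t
\[
\textit{Continuous in probability:}\quad \varepsilon_{t+1}(s) \le c\,\varepsilon_t(s)\ \text{for some } c<1,
\ \text{yet } \varepsilon_t(s) > 0 \ \text{for arbitrarily large } s.
\]
\[
\textit{Continuous in failure size:}\quad \varepsilon_t(s) = 0 \ \text{for all } s > s_{\max},
\ \text{i.e. a hard cap on how catastrophic any single failure can be.}
\]
```

Current approaches look to me like they can buy the first property, not the second.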
I am not sure I 100% understand what you are saying. Again, as I wrote elsewhere, it is possible that, for one reason or another, rather than systems becoming safer and more controlled, they will become less safe and riskier over time. It is possible that we will have a sequence of failures growing in magnitude over time that, for one reason or another, we do not address, and hence end up in a very large-scale catastrophe.
It is possible that current approaches are not good enough and will not improve fast enough to match the stakes at which we want to deploy AI. If that is the case then it will end badly, but I believe that we will see many bad outcomes well before an extinction event. To put it crudely, I would expect that if we are on a path to that ending, the magnitude of harms caused by AI will climb exponentially over time, similar to how other capabilities are growing.
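As a back-of-the-envelope illustration of what such an exponential path would look like (the specific numbers below are invented placeholders, not predictions):

```python
import math

# Invented placeholder numbers, purely to illustrate the shape of an exponential harm trajectory.
first_visible_harm = 1e7      # a hypothetical $10M-scale AI-caused incident
extinction_scale = 1e14       # a rough stand-in for civilization-scale damage
doubling_time_years = 0.5     # hypothetical doubling time for harm magnitude

doublings = math.log2(extinction_scale / first_visible_harm)
years = doublings * doubling_time_years
print(f"~{doublings:.0f} doublings of harm magnitude, i.e. ~{years:.0f} years of escalating, observable failures")
```

The doubling time is the load-bearing made-up assumption here; the point is only that on any such curve there are many intermediate, observable magnitudes of failure before the top.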
“future superintelligent systems could not be obtained by merely scaling up current AIs, but rather this would require completely different approaches. However, if that is the case, this should update us to longer timelines, and cause us to consider development of the current paradigm less risky.”
This doesn’t feel like convincing reasoning to me. For one, there is also a third option, which is that both scaling up current methods (with small modifications) and paradigm shifts could lead us to superintelligence. To me, this intuitively seems like the most likely situation. Also, paradigm shifts could be around the corner at any point; any of the vast number of research directions could, for example, give us a big leap in efficiency.
Note that this is somewhat of an anti-empirical stance: by hypothesizing that superintelligence will arrive via some unknown breakthrough that would both take advantage of current capabilities and render current alignment methods moot, you are essentially saying that no evidence can update you.
One thing I like about your position is that you basically demand that Eliezer and Nate tell you what kind of alignment evidence would update them towards believing it’s safe to proceed. As in, E&N say we would need really good interp insights, good governance, good corrigibility on hard tasks, and so on. I would expect that they would set the requirements very high and that you would reject those requirements as too high, but it still seems useful for Eliezer and Nate to state their requirements. (Though perhaps they have done this at some point and I missed it.)
To respond to your claim that no evidence could update ME and that I am anti-empirical: I don’t quite see where I wrote anything like that. I am making the literal point that you say there are two options, either scaling up current methods leads to superintelligence or it requires paradigm shifts/totally new approaches, but there is also a third option: that there are multiple paths to superintelligence open right now, both paradigm shifts and scaling up.
Yes, I do expect that current “alignment” methods like RLHF or CoT monitoring will predictably fail, for overdetermined reasons, when systems are powerful enough to kill us and run their own economy. There is empirical evidence against CoT monitoring and against RLHF. In both cases we could also have predicted failure without empirical evidence, just from conceptual thinking (people will upvote what they like rather than what’s true; CoT will become less understandable the less the model is trained on human data), though the evidence helps. I am basically seeing lots of evidence that current methods will fail, so no, I don’t think I am anti-empirical. I also don’t think that empiricism should be used as anti-epistemology or as an argument for not having a plan and blindly stepping forward.
I also believe that our current alignment methods will not scale and that we need to develop new ones. In particular, I am a co-author of the scheming paper mentioned in your first link.
As I said multiple times, I don’t think we will succeed by default. I just think that if we fail we will do so multiple times with failures continually growing in magnitude and impact.
In this framing, the crux is whether there is an After at all (at any level of capability): the distinction between “failure doesn’t kill the observer” (a perpetual Before) and “failure is successfully avoided” (managing to navigate the After).
Here’s my attempted phrasing, which I think avoids some of the common confusions:
Suppose we have a model M with utility function ϕ, where M is not capable of taking over the world. Assume that thanks to a bunch of alignment work, ϕ is within δ (by some metric) of humanity’s collective utility function. Then in the process of maximizing ϕ, M ends up doing a bunch of vaguely helpful stuff.
Then someone releases model M′ with utility function ϕ′, where M′ is capable of taking over the world. Suppose that our alignment techniques generalize perfectly. That is, ϕ′ is also within δ′ of humanity’s collective utility function, where δ′ ≤ δ. Then in the process of maximizing ϕ′, M′ gets rid of humans and rearranges their molecules to satisfy ϕ′ better.
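Or, compressed into symbols (same setup; U stands for humanity’s collective utility function and d(·,·) for whatever metric “within δ” refers to, which I’m leaving unspecified):

```latex
% U = humanity's collective utility function, d = the (unspecified) metric
\[
\text{Before: } d(\phi, U) \le \delta,\ M \text{ cannot take over}
  \;\Rightarrow\; \text{maximizing } \phi \text{ yields vaguely helpful behavior.}
\]
\[
\text{After: } d(\phi', U) \le \delta' \le \delta,\ M' \text{ can take over}
  \;\Rightarrow\; \text{maximizing } \phi' \text{ ends with humanity's molecules rearranged.}
\]
```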
Does this phrasing seem accurate and helpful?
This is an excellent encapsulation of (I think) something different—the “fragility of value” issue: “formerly adequate levels of alignment can become inadequate when applied to a takeover-capable agent.” I think the “generalization gap” issue is “those perfectly-generalizing alignment techniques must generalize perfectly on the first try”.
Attempting to deconfuse myself about how that works if it’s “continuous” (someone has probably written the thing that would deconfuse me, but as an exercise): if AI power progress is “continuous” (which training is, but model-sequence isn’t), it goes from “you definitely don’t have to get it right at all to survive” to “you definitely get only one try to get it sufficiently right, if you want to survive,” but by what path? In which of the terms “definitely,” “one,” and “sufficiently” is it moving continuously, if any?
I certainly don’t think it’s via the number of tries you get to survive! I struggle to imagine an AI where we all die only if we fail to align it three times in a row.
I don’t put any stock in “sufficiently,” either—I don’t believe in a takeover-capable AI that’s aligned enough to not work toward takeover, but which would work toward takeover if it were even more capable. (And even if one existed, it would have to eschew RSI and other instrumentally convergent things, else it would just count as a takeover-causing AI.)
It might be via the confidence of the statement. Now, I don’t expect AIs to launch highly-contingent outright takeover attempts; if they’re smart enough to have a reasonable chance of succeeding, I think they’ll be self-aware enough to bide their time, suppress the development of rival AIs, and do instrumentally convergent stuff while seeming friendly. But there is some level of self-knowledge at which an AI will start down the path toward takeover (e.g., extricating itself, sabotaging rivals) and succeed with a probability that’s very much neither 0 nor 1. Is this first, weakish, self-aware AI able to extricate itself? It depends! But I still expect the relevant band of AI capabilities here to be pretty narrow, and we get no guarantee it will exist at all. And we might skip over it with a fancy new model (if it was sufficiently immobilized during training or guarded its goals well).
Of course, there’s still a continuity in expectation: when training each more powerful model, it has some probability of being The Big One. But yeah, I more or less predict a Big One; I believe in an essential discontinuity arising here from a continuous process. The best analogy I can think of is how every exponential with r < 1 dies out and every one with r > 1 goes off to infinity. When you allow dynamic systems, you naturally get cuspy behavior.
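A minimal numerical sketch of that analogy (made-up dynamics; the only point is that a continuously varied knob can produce a sharply split long-run outcome):

```python
# Sweep the growth rate r continuously; the long-run outcome still splits sharply at r = 1:
# below it everything dies out, above it everything saturates.
def long_run(r, x0=1e-6, steps=5000):
    x = x0
    for _ in range(steps):
        x = min(r * x, 1.0)   # cap at 1.0, read as "went all the way"
    return x

for r in [0.90, 0.99, 1.00, 1.01, 1.10]:
    outcome = long_run(r)
    label = "dies out" if outcome < 1e-9 else ("takes off" if outcome >= 1.0 else "knife edge")
    print(f"r = {r:.2f} -> {outcome:.3g} ({label})")
```

Smooth in the parameter, cuspy in the outcome.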
Upon reflection, I agree that my previous comment describes fragility of value.
My mental model is that the standard MIRI position[1] claims the following [2]:
1. Because of the way AI systems are trained, δ and δ′ will be large even if we knew humanity’s collective utility function and could target it (this is inner misalignment)
2. Even if δ′ were fairly small, this would still result in catastrophic outcomes if M′ is an extremely powerful optimizer (this is fragility of value)
A few questions:
3. Are the claims (1) and (2) accurate representations of inner misalignment and fragility of value?
4. Is the “misgeneralization” claim just ”δ′ will be much larger than δ”?
If the answer to (4) is yes, I am confused as to why the misgeneralization claim is brought up. It seems that (1) and (2) are sufficient to argue for AI risk. By contrast, it seems that the misgeneralization claim is neither sufficient nor necessary to make a case for AI risk. Furthermore, the misgeneralization claim seems less likely to be true than (1) and (2).
Also let me know if I am thinking about things in a completely wrong framework and should scrap my made up notation.
There’s probably a better name for this. Please suggest one!
Non-exhaustive list.
I like your made up notation. I’ll try to answer, but I’m an amateur in both reasoning-about-this-stuff and representing-others’-reasoning-about-this-stuff.
I think (1) is both inner and outer misalignment. (2) is fragility of value, yes.
I think the “generalization step is hard” point is roughly: “you can get δ low by trial and error. The technique you end up with that gets δ low had better not intrinsically depend on the trial-and-error process, because you don’t get to do trial and error on δ′. Moreover, it had better actually work on M′.”
Contemporary alignment techniques depend on trial and error (post-training, testing, patching). That’s one of their many problems.
My suggested term for standard MIRI thought would just be Mirism.
I kinda don’t like “generalization” as a name for this step. Maybe “extension”? There are too many steps where the central difficulty feels analogous to the general phenomenon of failure-of-generalization-OOD: the difficulty of getting δ to be small, the difficulty of going from techniques for getting δ small to techniques for getting a small δ′ (verbiage different because of the first-time constraint), the disastrousness of even smallish δ′…
I overall agree with this framing, but I think even in Before sufficiently bad mistakes can kill you, and in After sufficiently small mistakes wouldn’t. So, it’s mostly a claim about how strongly the mistakes would start to be amplified at some point.