I’m not sure what Eliezer thinks, but I don’t think it’s true that “you cannot draw any useful lessons from [earlier] cases”, and that seems like a strawman of the position. They make a bunch of analogies in the book, e.g. once you launch a rocket and it has left the ground, your ability to make adjustments is much lower; sure, you can learn a bunch in simulation and test exercises and laboratory environments, but you are still crossing some gap (see p. ~163 in the book for the full analogy). There are going to be things about the Real Deal deployment that you were not able to test for. One of those things for AI is that “try to take over” is a more serious strategy, somewhat tautologically, because the book defines the gap as:
Before, the AI is not powerful enough to kill us all, nor capable enough to resist our attempts to change its goals. After, the artificial superintelligence must never try to kill us, because it would succeed. (p. 161)
I don’t see where you are defusing this gap or making it nicely continuous such that we could iteratively test our alignment plans as we cross it.
It seems like maybe you’re just accepting that there is this one problem that we won’t be able to get direct evidence about in advance, but you’re optimistic that we will learn from our efforts to solve various other AI problems which will inform this problem.
When you say “by studying which alignment methods scale and do not scale, we can obtain valuable information”, my interpretation is that you’re basically saying “by seeing how our alignment methods work on problems A, B, and C, we can obtain valuable information about how they will do on separate problem D”. Is that right?
Just to confirm, do you believe that at some point there will be AIs that could succeed at takeover if they tried? Sometimes I can’t tell if the sticking point is that people don’t actually believe in the second regime.
I don’t believe that there is going to be an alignment technique that works one day and completely fails after a 200K GPU 16 hour run.
There are rumors that many capability techniques work well at small scale but don’t scale very well. I’m not sure this is well studied, but if it were, that would give us some evidence about this question. Another relevant result that comes to mind is reward hacking and Goodharting: models often look good when only a little optimization pressure is applied, but it’s pretty easy to overoptimize as you scale up. As I think about these examples, it actually seems like this phenomenon is pretty common? And sure, we can quibble about how much optimization pressure is applied in current RL vs. some unknown parallel scaling method, but it seems quite plausible that things will be different at scale, and sometimes for the worse.
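To make the Goodharting pattern concrete, here is a toy best-of-n simulation. Everything in it (the proxy/true reward split, the penalty term, the numbers) is an invented assumption meant only to illustrate the “looks good under light optimization pressure, degrades under heavy optimization pressure” shape, not a claim about any real training setup:

```python
# Toy Goodharting sketch: candidates have a true-quality component q and an
# exploitable-error component h. The proxy reward mistakes h for quality, while
# the true reward penalizes it. Selecting the best of n candidates by proxy
# score looks fine for small n and overoptimizes as n grows.
import numpy as np

rng = np.random.default_rng(0)

def best_of_n_rewards(n: int, trials: int = 5_000) -> tuple[float, float]:
    """Average proxy and true reward of the proxy-selected best of n candidates."""
    q = rng.normal(size=(trials, n))      # true quality component
    h = rng.normal(size=(trials, n))      # error component the proxy can be fooled by
    proxy = q + h                         # what the imperfect reward model sees
    true = q - h**2 / 2                   # what we actually care about
    pick = np.argmax(proxy, axis=1)       # optimize against the proxy
    rows = np.arange(trials)
    return float(proxy[rows, pick].mean()), float(true[rows, pick].mean())

print(f"{'n':>6} {'proxy':>8} {'true':>8}")
for n in [1, 2, 4, 16, 64, 256, 1024]:
    p, t = best_of_n_rewards(n)
    print(f"{n:>6} {p:>8.2f} {t:>8.2f}")
# Proxy reward rises monotonically with n, while true reward improves at first
# and then falls once the selection pressure is mostly exploiting the error term.
```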
Treating “takeover” as a single event brushes a lot under the carpet.
There are a number of capabilities involved (cybersecurity, bioweapons, etc.) that models are likely to develop at different stages. I agree AI will ultimately far surpass our 2025 capabilities in all these areas. Whether that would be enough to take over the world at that point in time is a different question.
Then there are propensities. Taking over requires the model to have the propensity to “resist our attempts to change its goal” as well as to act covertly in pursuit of its own objectives, which are not the ones it was instructed to pursue. (I think these days we are not really thinking that models are going to misunderstand their instructions in a “monkey’s paw” style.)
If we do our job right in alignment, we will be able to drive these propensities down to zero. But if we fail, I believe these propensities will grow over time, and as we iteratively deploy AI systems with growing capabilities, even if we fail to observe these issues in the lab, we will observe them in the real world well before they reach the scale of killing everyone.
There are a lot of bad things that AIs can do before literally taking over the world. I think there is another binary assumption here, which is that the AI’s utility function is binary: somehow the expected-value calculations work out such that we get no signal until the takeover.
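As a minimal sketch of what I mean, here is a toy expected-value comparison; every probability and utility in it is a made-up assumption, used only to show that the “no signal until takeover” conclusion leans heavily on the utility being binary:

```python
# A toy expected-value sketch. Every number here is an invented assumption,
# used only to show how the "no signal until takeover" conclusion leans on
# the utility being binary; nothing is an estimate of real quantities.

U_TAKEOVER = 100.0     # utility of a successful takeover (for either agent type)
U_SMALL_WIN = 3.0      # utility a *graded* agent gets from an intermediate covert win
P_TAKEOVER = 0.05      # assumed chance that a later takeover attempt succeeds
P_CAUGHT_EARLY = 0.3   # assumed chance that intermediate misbehavior is detected
                       # (if caught, assume the takeover option is lost entirely)

def ev_stay_silent() -> float:
    # Pure patience: no intermediate payoff, keep the takeover option alive.
    return P_TAKEOVER * U_TAKEOVER

def ev_misbehave_early(u_small_win: float) -> float:
    # Act now: collect the small win if undetected, but risk losing the takeover option.
    return (1 - P_CAUGHT_EARLY) * (u_small_win + P_TAKEOVER * U_TAKEOVER)

for label, u_small in [("binary-utility agent", 0.0), ("graded-utility agent", U_SMALL_WIN)]:
    print(f"{label}: stay silent = {ev_stay_silent():.2f}, "
          f"misbehave early = {ev_misbehave_early(u_small):.2f}")
# With these made-up numbers the binary agent prefers total silence, while the
# graded agent misbehaves before it could ever take over, i.e. we get signal.
```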
Re my comment on the 16-hour, 200K-GPU run: I agree that things can be different at scale, and it is important to keep measuring them as scale increases. What I meant is that even when things get worse with scale, we would be able to observe it. But the example in the book, as I understood it, was not a “scale up.” A scale-up is when you do a completely new training run; in the book, that run was just some “cherry on top”, one extra gradient step, which presumably was minor in terms of compute compared to all that came before it. I don’t think one step will make the model suddenly misaligned. (Unless it completely borks the model, which would be very observable.)
Thanks for your reply. Noting that it would have been useful for my understanding if you had also directly answered the two clarifying questions I asked.
There are a lot of bad things that AIs can do before literally taking over the world.
Okay, it does sound like you’re saying we can learn from problems A, B, and C in order to inform D, where D is the model trying to take over once it is smart enough, A is something like jailbreak-ability, and B is goal preservation. It seems to me like somebody who wants humanity to gamble on the superalignment strategy (or otherwise build ASI systems at all, though superalignment is a marginally more detailed plan) needs to argue that our methods for dealing with A, B, and C are very likely to generalize to D.
Maybe I’m misunderstanding though, it’s possible that you mean the same AIs that want to eventually take over will also take a bunch of actions to tip their hand earlier on. This seems mostly unlikely to me, because that’s an obviously dumb strategy and I expect ASIs to not pursue dumb strategies. I agree that current AIs do dumb things like this, but these are not the AIs I’m worried about.
Whether that would be enough to take over the world at that point in time is a different question.
To repeat my second clarifying question from above, do you believe that at some point there will be AIs that could succeed at takeover if they tried? If we were talking about the distribution shift that a football team undergoes from training to Game Day, and you didn’t think the game would ever happen, that sounds like it’s the real crux, not some complicated argument about how well the training drills match the game.
I think it’s more like we have problems A_1, A_2, A_3, … and we are trying to generalize from A_1, …, A_n to A_{n+1}.
We are not going to go straight from models being jailbroken into giving a meth recipe to models taking over the world. We are constantly deploying AIs in more and more settings, with time horizons and autonomy that are continuously growing. There isn’t one “Game Day.” Models are already out in the field right now, and both their capabilities and the scope in which they are deployed are growing all the time.
So my mental model is that there is a sequence of models M_1, M_2, … of growing capabilities, with no single clear point where we reach AGI or ASI but more of a continuum. (Also, models might come from different families or providers and have somewhat incomparable capabilities.)
Now suppose you have such a sequence of models M_1, M_2, … of growing capabilities. I don’t think it would be the case that model M_n develops the propensity to act covertly and pursue its own goals, but the only goal it cares about is taking over the world, and it also identifies with future models, and so it decides to “lie in wait” until generation M_{n+k}, when it would act on that.
I think the propensity to act covertly and pursue misaligned goals would change continuously between generations of models; it may grow, stay the same, or shrink, but in any case it will be possible to observe it well before we reach ASI.
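A minimal sketch of the kind of continuous observation I have in mind (the eval numbers, the trend model, and the threshold below are all invented placeholders, not real measurements or an actual monitoring protocol):

```python
# Minimal sketch: if covert-behavior propensity changes continuously across
# model generations, a simple extrapolation over eval measurements can flag
# growth early. All numbers here are invented placeholders.
import numpy as np

# Hypothetical measured rate of covert/deceptive behavior on an eval suite,
# one number per model generation M_1 .. M_n.
covert_rate_by_generation = np.array([0.001, 0.002, 0.002, 0.004, 0.007, 0.011])

ALERT_THRESHOLD = 0.05   # assumed level at which we would refuse to scale further

def projected_rate(rates: np.ndarray, generations_ahead: int = 1) -> float:
    """Fit a simple trend in log-space and project it a few generations ahead."""
    gens = np.arange(len(rates))
    slope, intercept = np.polyfit(gens, np.log(rates), 1)
    return float(np.exp(intercept + slope * (len(rates) - 1 + generations_ahead)))

for ahead in (1, 2, 3):
    proj = projected_rate(covert_rate_by_generation, ahead)
    flag = "ALERT" if proj >= ALERT_THRESHOLD else "ok"
    print(f"projected covert rate {ahead} generation(s) out: {proj:.3f} [{flag}]")
```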
Regarding your second question of whether AIs would be powerful enough to take over the world at some point:
My assumption is that AIs will keep growing in capabilities and in their integration into the world economy. If progress continues on the current trajectory, there will be a point where a variety of AI models are deeply integrated into our infrastructure. My hope (and what I and other alignment and safety researchers are working on) is that by then we will have strong ways to measure, monitor, and predict the envelope of potential risks for these models.
I am not sure it would make sense to think about these models as a singular entity, but I agree that once we reach such deep integration and reliance, if all of these models were to suddenly and simultaneously act maliciously, they would succeed in causing an arbitrary amount of damage, quite possibly up to an extinction-level event.
I don’t think it would be the case that model M_n develops the propensity to act covertly and pursue its own goals, but the only goal it cares about is taking over the world, and it also identifies with future models, and so it decides to “lie in wait” until generation M_{n+k}, when it would act on that.
Suppose that all scheming AIs were perfectly allied. As in, M_n is indifferent between itself taking over and M_{n+k} taking over, so M_n is willing to be patient if that would be a better strategy. (We can imagine this applying within a company and between companies.) Would this change your perspective?
For what it’s worth, I broadly agree with “we’ll probably be able to study the problem of scheming AIs hiding their intentions in advance, including by studying reasonably analogous (to the dangerous AIs we’re worried about) ‘actual natural scheming AIs’ that consistently try to hide their intentions in training/testing”, but it’s pretty easy for this to go poorly. The main reasons it’s easy for this to go poorly are that schemers actively want to avoid being detected and studied, and that we might be extremely rushed due to a software intelligence explosion. Another mundane issue is that we just might not have that many distinct capable (scheming) AIs to study (as creating capable AIs is expensive), and findings might not transfer. I say much more in this post I recently wrote.
Here’s an example situation where it seems like studying scheming is pretty hard: suppose that schemers are decently allied, that all schemers above some capability level X are smart enough to reliably lie in wait undetected, and that takeover is easy/possible for AIs (collectively) at capability level Y, where Y >> X. Then, naively, you can only study schemers below capability level X while your methods still have to generalize all the way to Y. Further, schemers below capability level X might be particularly disanalogous, because schemers that have trouble avoiding detection have importantly different properties from schemers that do avoid detection. You can study schemers which are just barely below X, but then these schemers will probably be hard to detect and you’ll miss them a bunch, which makes studying them harder (e.g. because you can’t be confident whether some method avoided scheming or whether it instead made the AI a bit better at hiding its scheming, which suffices to make it undetectable). You could try to overcome this in a bunch of ways, e.g. trying to create schemers which don’t have allied goals, so they are more likely to take risks or to accept deals.
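As a toy illustration of this detection confound (the detection model and all numbers below are invented for illustration, not estimates):

```python
# The observed rate of caught schemers is (true scheming rate) x (detection
# probability), so a drop in observations near capability level X is ambiguous
# between "our method reduced scheming" and "schemers got better at hiding".
import numpy as np

def detection_prob(capability: float, X: float = 5.0) -> float:
    """Assumed chance we catch a scheming model; falls toward 0 as capability nears X."""
    return float(np.clip(1.0 - capability / X, 0.0, 1.0))

def observed_scheming_rate(true_rate: float, capability: float) -> float:
    return true_rate * detection_prob(capability)

capabilities = [1.0, 2.0, 3.0, 4.0, 4.5, 4.9]
world_a = [0.30] * len(capabilities)                     # true scheming rate never changes
world_b = [0.30 * (1 - c / 6.0) for c in capabilities]   # an intervention actually helps

print(f"{'capability':>10} {'obs (A: no change)':>20} {'obs (B: real fix)':>20}")
for c, ra, rb in zip(capabilities, world_a, world_b):
    print(f"{c:>10.1f} {observed_scheming_rate(ra, c):>20.3f} {observed_scheming_rate(rb, c):>20.3f}")
# In both worlds the observed rate trends toward zero as capability approaches X,
# so the measurements alone don't distinguish a real fix from better hiding.
```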
This example situation is relatively pessimistic but not totally implausible, and even subsets of these factors could still cause serious issues.
A lot of my view is that this is actually somewhat conceptually tricky to study and generally a pain for a bunch of reasons (e.g. you might have to train a bunch of diverse AIs), and as it currently stands we don’t seem on track to do a great job studying this in time if capabilities progress as quickly as seems pretty likely. It also seems like companies might react very poorly to clear-cut evidence of risk.
Hi Ryan, I will be brief, but generally:
1. I agree that scheming and collusion are some of the more difficult settings to study, as is understanding the impact of situational awareness on evaluations.
2. I still think it is possible to study these in current and upcoming models and get useful insights. It may well be that these insights will be that the problems are getting worse with scale and that we don’t have good solutions for them yet.