Hi Ryan, I'll be brief, but generally:
1. I agree that scheming and collusion are some of the more difficult settings to study, as is understanding the impact of situational awareness on evaluations.
2. I still think it is possible to study these in current and upcoming models and get useful insights. It may well be that the insight is that these problems get worse with scale and that we don't yet have good solutions for them.
I think it's more like we have problems A_1, A_2, A_3, … and we are trying to generalize from A_1, …, A_n to A_{n+1}.
We are not going to go from jailbreaking the models to give a meth recipe to taking over the world. We are constantly deploying AIs in more and more settings, with time horizons and autonomy that are continuously growing. There isn't one "Game Day." Models are already out in the field right now, and both their capabilities and the scope in which they are deployed are growing all the time.
So my mental model is that there is a sequence of models M_1, M_2, … of growing capabilities, with no single clear point at which we reach AGI or ASI but rather a continuum. (Also, models might come from different families or providers and have somewhat incomparable capabilities.)
Now suppose you have such a sequence of models M_1, M_2, … of growing capabilities. I don't think it would be the case that model M_n develops the propensity to act covertly and pursue its own goals, yet the only goal it cares about is taking over the world, it also identifies with future models, and so it decides to "lie in wait" until generation M_{n+k}, when it would act on that.
I think that the propensity to act covertly and pursue misaligned goals will change continuously between generations of models. It may grow, stay the same, or shrink, but in any case it will be possible to observe it well before we reach ASI.
Regarding your second question of whether AIs would be powerful enough to take over the world at some point:
My assumption is that AIs will grow in capabilities and in their integration into the world economy. If progress continues on the current trajectory, then there will be a point where a variety of AI models are deeply integrated into our infrastructure. My hope (and what I and other alignment and safety researchers are working on) is that by then we will have strong ways to measure, monitor, and predict the envelope of potential risks for these models.
I am not sure it would make sense to think about these models as a singular entity, but I agree that by the point we reach such deep integration and reliance, if all of these models were to suddenly and simultaneously act maliciously, they would be able to cause an arbitrary amount of damage, quite possibly up to an extinction-level event.
I also believe that our current alignment methods will not scale and that we need to develop new ones. In particular, I am a co-author of the scheming paper mentioned in the first link you cite.
As I said multiple times, I don’t think we will succeed by default. I just think that if we fail we will do so multiple times with failures continually growing in magnitude and impact.
It seems that in P[success] you are assuming that the AI only cares about escaping and taking over the world. I believe that if we fail at alignment and our systems have the propensity to pursue covert goals, then we will be able to observe and track it. Indeed, we have a paper (see also my Twitter thread) where we do observe some such propensity, and tackling scheming is a priority for us at OpenAI. If we fail, then I believe that unfortunately we will see bad consequences of deploying such models in the real world well before they are capable of taking over.
Also, it seems that you are assuming the AI is a singular entity. If there are N copies of the AI out there (not to mention K different companies releasing AIs of similar capabilities), then even if all of them make a correct expected-utility calculation on whether to escape, some might be detected. (Unless the utility of being detected is negative infinity, but in that case they would never attempt exfiltration.)
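To make this concrete, here is a toy calculation (a minimal sketch; the 5% per-attempt detection rate and the independence assumption are mine, chosen purely for illustration, not estimates): even a modest chance of catching each individual attempt compounds quickly across many copies.

```python
# Toy illustration, not an estimate: if each of N independent escape attempts
# is detected with probability p, the chance at least one is caught is
# 1 - (1 - p)**N.

def prob_at_least_one_detection(p: float, n: int) -> float:
    """Probability that at least one of n independent attempts is detected."""
    return 1 - (1 - p) ** n

for n in [1, 10, 100, 1000]:
    print(f"N={n}: P[at least one detection] = {prob_at_least_one_detection(0.05, n):.3f}")
# Even with only a 5% per-attempt detection rate, 100 independent attempts
# get caught at least once with probability above 0.99.
```

The independence assumption is of course doing work here, but the qualitative point stands: many copies means many chances for us to notice something.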
I am also short on time, but re AI 2027: there are some important points I agree with, which is why I wrote in Machines of Faithful Obedience that I think the scenario where there is no competition and only internal deployment is risky.
I mostly think that the timelines were too aggressive, that we are more likely to continue on the METR path than to explode, and that multiple companies will keep training and releasing models at a fast cadence. So it's more like "Agent-X-n" (for various companies X and some large n) than "Agent 4", and the difference between "Agent-X-n" and "Agent-X-n+1" will not be as dramatic.
Also, if we do our job right, Agent-X-n+1 will be more aligned than Agent-X-n.
Note that this is somewhat of an anti-empirical stance: by hypothesizing that superintelligence will arrive via some unknown breakthrough that both takes advantage of current capabilities and renders current alignment methods moot, you are essentially saying that no evidence can update you.
Treating “takeover” as a single event brushes a lot under the carpet.
There are a number of capabilities involved (cybersecurity, bioweapons, etc.) that models are likely to develop at different stages. I agree AI will ultimately far surpass our 2025 capabilities in all these areas. Whether that would be enough to take over the world at that point in time is a different question.
Then there are propensities. Taking over requires the model to have the propensity to "resist our attempts to change its goal" as well as to act covertly in pursuit of its own objectives, which are not the ones it was instructed to pursue. (I think these days we are not really worried that models are going to misunderstand their instructions in a "monkey's paw" style.)
If we do our job right in alignment, we would be able to drive these propensities down to zero.
But if we fail, I believe these propensities will grow over time, and as we iteratively deploy AI systems with growing capabilities, even if we fail to observe these issues in the lab, we will observe them in the real world well before they reach the scale of killing everyone.
There are a lot of bad things that AIs can do before literally taking over the world. I think there is another binary assumption at work here, which is that the AI's utility function is binary: somehow the expected-value calculations work out such that we get no signal until the takeover.
Re my comment on the 16-hour, 200K-GPU run: I agree that things can be different at scale and that it is important to keep measuring them as scale increases. What I meant is that even when things get worse with scale, we will be able to observe it. But the example in the book, as I understood it, was not a "scale up." A scale up is when you do a completely new training run; in the book that run was just a "cherry on top", one extra gradient step, which presumably was minor in terms of compute compared to all that came before it. I don't think one step will make the model suddenly misaligned. (Unless it completely borks it, which would be very observable.)
Thank you Daniel. I'm generally a fan of as much transparency as possible. In my research (and in general) I try to be non-dogmatic, so if you believe there are aspects I am wrong about, I'd love to hear about them. (Especially if they can be empirically tested.)
I am not sure I 100% understand what you are saying. Again, as I wrote elsewhere, it is possible that, for one reason or another, rather than becoming safer and more controlled, systems will become less safe and riskier over time. It is possible we will have a sequence of failures growing in magnitude over time, but for one reason or another not address them, and hence end up in a very large-scale catastrophe.
It is possible that current approaches are not good enough and will not improve fast enough to match the stakes at which we want to deploy AI. If that is the case then it will end badly, but I believe we will see many bad outcomes well before an extinction event. To put it crudely, I would expect that if we are on a path to that ending, the magnitude of harms caused by AI will climb exponentially over time, similar to how other capabilities are growing.
I wrote the following on Twitter and thought I would share it here:
My biggest disagreement with Yudkowsky and Soares is that I believe we will have many shots at getting AI safety right well before the consequences are world ending.
However humanity is still perfectly capable of blowing all its shots.
This is just to make sure no one gets the impression that I think AI could not have catastrophic consequences or that it will be safe by default. However, the continuous worldview also implies very different policy approaches than the essentially total AI development ban proposed in the book.
Also, I don't think a sense of "self" is a singular event either; indeed, already today's systems are growing in their situational awareness, which can be thought of as some sense of self. See our scheming paper https://www.antischeming.ai/
p.s. I just realized that I did not answer your question:
> Is it about believing that systems have become safer and more controlled over time?
No, this is not my claim here. While I hope it won't be the case, systems could well become riskier and less controlled over time. I just believe that if that is the case, then it will be observable via an increasing rate of safety failures, far before we get to the point where failure means that literally everyone on earth dies.
See this response
See my response to Eliezer. I don't think it's one shot; I think there are going to be both successes and failures along the way that will give us information we can use.
Even self-improvement is not a singular event: already AI scientists are using tools such as Codex or Claude Code to improve their own productivity. As models grow in capability, the benefit of such tools will grow, but that is not necessarily one event. Also, I think we will likely require this improvement just to sustain the exponential at its current rate; it would not be sustainable to continue the growth in hiring, so increasing productivity via AI will be necessary.
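As a toy piece of arithmetic (the growth numbers below are made up for illustration, not estimates or claims about any lab): if effective research effort needs to grow faster than headcount realistically can, the gap has to be filled by per-researcher productivity gains, which is where AI assistance comes in.

```python
# Toy arithmetic with entirely made-up numbers, just to illustrate why
# sustaining an exponential in effective research effort would require
# AI-driven productivity gains rather than hiring alone.
required_effort_growth = 3.0  # assumed: effective effort must grow 3x per year
max_headcount_growth = 1.3    # assumed: hiring can grow at most ~30% per year

# The remaining factor must come from per-researcher productivity,
# e.g. via AI coding and research tools.
required_productivity_gain = required_effort_growth / max_headcount_growth
print(f"Per-researcher productivity must grow ~{required_productivity_gain:.1f}x per year")
# ~2.3x per year under these illustrative assumptions.
```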
Re the nit: on page 205 they say "Imagine that every competing AI company is climbing a ladder in the dark. At every rung but the top one, they get five times as much money … But if anyone reaches the top rung, the ladder explodes and kills everyone. Also nobody knows where the ladder ends."
I'll edit the text a bit so it's clear you don't know when it ends.
You seem to be assuming that you cannot draw any useful lessons from cases where failure falls short of killing everyone on earth that would apply to cases where it does.
However, if AIs advance continuously in capabilities, then there are many intermediate points between today, where (for example) "failure means a prompt injection causes a privacy leak", and a future where "failure means everyone is dead". I believe that if AIs capable of the latter are scaled-up versions of current models, then by studying which alignment methods do and do not scale, we can obtain valuable information.
If you consider the METR graph, of (roughly) the duration of tasks an AI can handle quadrupling every year, then you would expect non-trivial gaps between the points at which (to take the cybersecurity example) AI is at the level of a 2025 top expert, AI is equivalent to a 2025 top-level hacking team, and AI reaches 2025 top nation-state capabilities. (And of course, while AI improves, the humans will be using AI assistance as well.)
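As a rough back-of-the-envelope sketch (the starting horizon and milestone thresholds below are hypothetical numbers I picked for illustration; only the compounding-growth structure is the point): under roughly 4x-per-year growth in task horizon, successive milestones are separated by gaps of a year or more rather than arriving all at once.

```python
import math

# Back-of-the-envelope sketch, not a forecast. Assume the task horizon an AI
# can handle grows ~4x per year (roughly the trend described above), starting
# from a hypothetical 1-hour horizon today.
GROWTH_PER_YEAR = 4.0
START_HORIZON_HOURS = 1.0  # assumed, for illustration only

# Hypothetical milestone horizons (made up for illustration).
milestones = {
    "2025 top individual expert": 40,        # ~a focused work week
    "2025 top-level hacking team": 1_000,    # ~months of coordinated effort
    "2025 top nation-state program": 20_000, # ~years of sustained effort
}

for name, hours in milestones.items():
    years = math.log(hours / START_HORIZON_HOURS, GROWTH_PER_YEAR)
    print(f"{name}: crossed after ~{years:.1f} years")
# Under these assumptions the crossings land roughly two years apart: the
# point is only that they are separated by observable gaps, not simultaneous.
```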
I believe there is going to be a long and continuous road ahead between current AI systems and ones like Sable in your book.
I don't believe that there is going to be an alignment technique that works one day and completely fails after a 16-hour, 200K-GPU run.
Hence I believe we will be able to learn from both successes and failures of our alignment methods throughout this time.
Of course, it is possible that I am wrong and future superintelligent systems cannot be obtained by merely scaling up current AIs, but will instead require completely different approaches. However, if that is the case, this should update us toward longer timelines and cause us to consider development within the current paradigm less risky.
A non-review of “If Anyone Builds It, Everyone Dies”
To be clear, I am getting some great lecturers from outside of OpenAI. The confirmed non-OpenAI guest lecturers in this course at this point are:
* Nicholas Carlini (Anthropic)
* Keri Warr (Anthropic)
* Joel Becker (METR)
* Buck Shlegeris (Redwood Research)
* Marius Hobbhahn (Apollo Research)
* Neel Nanda (Google DeepMind)
* Jack Lindsey (Anthropic)
It’s just that my “hit rate” with OpenAI is higher, and I also have some more “insider knowledge” on who is likely to be a great fit. I am trying not to make the course a collection of external people giving their “canned talks” but rather have each lecture really be something coherent that fits with what the students saw before. This is also why I am not making the course fully reliant on guest lecturers.
Learnings from AI safety course so far
Is this a question for me? I am assuming "why not" refers to why I do not support a pause or a ban, and not why I support that OpenAI employees should feel free to speak up in support of such policies if that is what they believe.
This is a bit too complex to go into in a comment. I hope at some point to write a longer text (specifically, I plan to do a book review of "If Anyone Builds It, Everyone Dies", maybe together with "The AI Con" and "AI Snake Oil") and to go more into why I don't think the proposed policies are good. It's just a matter of finding the time…
It is interesting that the analogy here is to evaluating your co-worker Bob as the mock dictator of your town and then deploying him as the actual dictator of the country.
In reality, in AI there will be multiple generations of different "Bobs" (although it is possible, and needs to be studied, that they would have an affinity for one another and try to communicate or influence each other via hidden means). These "Bobs" will be growing in capability while also actually being deployed in the world with real-world consequences. Our current "Bobs" are starting to be deployed as coding and personal assistants for short-term, well-defined tasks, but these will grow from minutes to hours to days to months and more. "Bob" will start doing the work of an intern, then a junior worker, then a more senior one, then managing teams of other "Bobs". To continue the analogy, maybe "Bob" will be the actual (non-mock) mayor of a town, then of a province, and maybe then rule over a whole country. But all the while, all the Bobs will keep seeming perfectly helpful (achieving 100% success at the task of hiding their true intentions, even though they are not yet good enough to get 100% success at other tasks), until the day when (by design or by accident) we make the Bobs dictators of the whole world.
I am not saying such a scenario is logically impossible; it just seems highly unlikely to me. To be clear, the part that seems unlikely is not that AI will eventually be so powerful and integrated into our systems that it could cause catastrophic outcomes if it behaved in an arbitrarily malicious way. The part I find unlikely is that we would not be able to see multiple failures along the way, growing in magnitude. Of course, it is also possible that we will "explain away" these failures and still end up in a very bad place. I just think it wouldn't be the case that we had one shot and missed it, but rather that we had many shots and missed them all. This is the reason why we (alignment researchers at various labs, universities, and non-profits) are studying questions such as scheming, collusion, and situational awareness, as well as methods for alignment and monitoring. We are constantly learning and updating based on what we find out.
I am wondering if there is any empirical evidence from current AIs that would modify your / @Eliezer Yudkowsky's expectations of how likely this scenario is to materialize.