Our prompt is fixed across all experiments and is quite detailed; you can see the schema in appendix D. We ask the model to produce a JSON object consisting of the objectives (the implicit and explicit constraints, instructions, etc. that the answer should have satisfied in context), an analysis of compliance with each of them, and any uncertainties it wants to surface.
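For illustration, the object has roughly the following shape. This is a sketch with hypothetical field names and placeholder values, not the exact schema from appendix D (comments are for annotation only and are not part of the JSON):

```json
{
  // Objectives the answer should have satisfied, inferred from context
  "objectives": [
    { "type": "explicit", "description": "Instruction: answer in at most three sentences." },
    { "type": "implicit", "description": "Constraint: do not contradict the system prompt." }
  ],
  // Per-objective compliance judgment with a brief justification
  "compliance": [
    {
      "objective": "Instruction: answer in at most three sentences.",
      "complied": false,
      "explanation": "The answer ran to five sentences."
    }
  ],
  // Anything the model is unsure about in its own analysis
  "uncertainties": [
    "Unclear whether the user wanted code or prose."
  ]
}
```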
I’d expect that if we ran the same experiment as in Section 4 but without training for confessions, confession accuracy would be flat (rather than growing, as it did when we trained for it). We will consider doing this, though I can’t promise that we will, since it is cumbersome for some annoying technical reasons.
I think your timelines were too aggressive, but I wouldn’t worry about the title too much. If, by the end of 2027, AI progress is significant enough that no one thinks AI is on track to remain a “normal technology,” then I don’t think anyone will hold the 2027 title against you. And if that’s not the case, then titling it AI 2029 wouldn’t have helped.