You might still disagree with that claim. I think there are valid reasons to disagree, or at least assign significantly less confidence to the claim.
But none of the reasons listed so far are disagreements with the thesis. And, remember, if the reason you disagree is because you think our understanding of AI will improve dramatically, or there will be a paradigm shift specifically away from “unpredictably grown” AI, this also isn’t actually a disagreement with the sentence.
The authors clearly intend to make a pretty broad claim, not the more narrow claim you imply.
This feels like a motte and bailey where the motte is “If you literally used something remotely like current scaled up methods without improved understanding to directly build superintelligence, everyone would die” and the bailey is “on the current trajectory, everyone will die if superintelligence is built without a miracle or a long (e.g. >15 year) pause”.
I expect that by default superintelligence is built after a point where we have access to huge amounts of non-superintelligent cognitive labor so it’s unlikely that we’ll be using current methods and current understanding (unless humans have already lost control by this point, which seems totally plausible, but not overwhelmingly likely nor argued convincingly for by the book). Even just looking at capabilities, I think it’s pretty likely that automated AI R&D will result in us operating in a totally different paradigm by the time we build superintelligence—this isn’t to say this other paradigm will be safer, just that a narrow description of “current techniques” doesn’t include the default trajectory.
I think it’s pretty clear the authors intend to include “we ~hand off AI R&D and alignment to AIs developed in roughly the current paradigm which proceed with development” as a special case of “anything remotely like current techniques” (as from my perspective it is the default trajectory). But, if these earlier AIs were well aligned (and wise and had reasonable epistemics), I think it’s pretty unclear that the situation would go poorly and I’d guess it would go fine because these AIs would themselves develop much better alignment techniques. This is my main disagreement with the book.
this isn’t to say this other paradigm will be safer, just that a narrow description of “current techniques” doesn’t include the default trajectory.
Sorry, this seems wild to me. If current techniques seem lethal, and future techniques might be worse, then I’m not sure what the point is of pointing out that the future will be different.
But, if these earlier AIs were well aligned (and wise and had reasonable epistemics), I think it’s pretty unclear that the situation would go poorly and I’d guess it would go fine because these AIs would themselves develop much better alignment techniques. This is my main disagreement with the book.
I mean, I also believe that if we solve the alignment problem, then we will no longer have an alignment problem, and I predict the same is true of Nate and Eliezer.
Is your current sense that if you and Buck retired, the rest of the AI field would successfully deliver on alignment? Like, I’m trying to figure out whether your sense here is that the default is “your research plan succeeds” or “the world without your research plan”.
I mean, I also believe that if we solve the alignment problem, then we will no longer have an alignment problem, and I predict the same is true of Nate and Eliezer.
By “superintelligence” I mean “systems which are qualitatively much smarter than top human experts”. (If Anyone Builds It, Everyone Dies seems to define ASI in a way that could include weaker levels of capability, but I’m trying to refer to what I see as the typical usage of the term.)
Sometimes, people say that “aligning superintelligence is hard because it will be much smarter than us”. I agree; this seems like it makes aligning superintelligence much harder, for multiple reasons.
Correspondingly, I’m noting that if we can align earlier systems which are just capable enough to obsolete human labor (which IMO seems way easier than directly aligning wildly superhuman systems), these systems might be able to ongoingly align their successors. I wouldn’t consider this “solving the alignment problem” because we instead just aligned a particular non-ASI system in a non-scalable way, in the same way I don’t consider “claude 4.0 opus is aligned enough to be pretty helpful and not plot takeover” to be a solution to the alignment problem.
Perhaps your view is “obviously it’s totally sufficient to align systems which are just capable enough to obsolete current human safety labor, so that’s what I meant by ‘the alignment problem’”. I don’t personally think this is obvious given race dynamics and limited time (though I do think it’s likely to suffice in practice). Minimally, people often seem to talk about aligning ASI (which I interpret to mean wildly superhuman AIs rather than human-ish level AIs).
Okay, I think my phrasing was kinda motte-and-bailey-ish, although not that motte-and-bailey-ish.
I think “anything like current techniques” and “anything like current understanding” clearly set a very high bar for the difference. “We made more progress on interpretability/etc at the current rates of progress” fairly clearly doesn’t count by the book’s standards.
But, I agree that a pretty reasonable class of disagreement here is “exactly how different from the current understanding/techniques do we need to be?”, which is something you’d expect to disagree with them on when you get into the details. That seems important enough for me to edit into the earlier sections of the post.
(Maybe this is obvious, but I thought I would say this just to be clear.)
I think “anything like current techniques” and “anything like current understanding” clearly set a very high bar for the difference. “We made more progress on interpretability/etc at the current rates of progress” fairly clearly doesn’t count by the book’s standards.
Sure, but I expect wildly more cognitive labor and effort if humans retain control and can effectively leverage earlier systems, not just “more progress than we’d expect”. I agree the bar is above “the progress we’d expect by default (given a roughly similar field size) in the next 10 years”, but I think things might get much more extreme due to handing off alignment work to AIs. I agree the book is intended to apply pretty broadly, but regardless of intention does it really apply to “1 million AIs somewhat smarter than humans have spent 100 years each working on the problem (and coordinating etc?)”? (I think the crux is more like “can you actually safely get this alignment work out of these AIs”.)
It seems very unlikely you can get that alignment work out of these AIs without substantially pausing or slowing first?
If you don’t believe that, it does seem like we should chat sometime. It’s not like completely implausible, but I feel like we must both agree that if you go full speed on AI there is little chance that you end up getting that much alignment work out of models before you are cooked.
Thanks for the nudge! I currently disagree with “very unlikely”, but more importantly, I noticed that I haven’t really properly analyzed the question of “given how much cognitive labor is available between different capability levels, should we expect that alignment can keep up with capabilities if a small fraction (e.g. 5%) is ongoingly spent on alignment (in addition to whatever alignment-ish work is directly commercially expedient)”. I should spend more time thinking about this question and it seems plausible I’ll end up updating towards thinking risk is substantially higher/lower on the basis of this. I think I was underestimating the case that even if AIs are reasonably aligned, it might just be seriously hard for them to improve alignment tech fast enough to keep up with capabilities (I wasn’t ignoring this in my prior thinking, but when I thought about some examples, the situation seemed worse than I was previously thinking), so I currently expect to update towards thinking risk is higher.
(At least somewhat rambly from here on.)
The short reason why I currently disagree: it seems pretty likely that we’ll have a very large absolute amount of cognitive labor (in parallel copies, serial speed, and quality) before we’re cooked, because improving algorithms will require more and more cognitive resources before it’s too late. This depends on the takeoff dynamics. If you think that going from “can fully automate AI R&D” to “level of capability where we are totally cooked” given a fixed amount of compute will take a small amount of cognitive labor, then you get a proportionally small amount of cognitive labor for safety. I think the corresponding level of cognitive labor might actually be very large, especially in worlds where takeoff is relatively slower. In general, a key question is “how do the returns to cognitive labor compare between capabilities and alignment: how does the amount of cognitive labor needed to advance from X to Y capability level compare to the amount of cognitive labor needed to progress alignment from the point where AIs of X capability level are safe/aligned to the point where AIs of Y capability level are safe/aligned?” This is basically a question about the relative returns to cognitive labor.
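To make that question a bit more concrete, here’s a minimal sketch of the comparison I have in mind (the function, its inputs, and the 5% split are purely illustrative assumptions, not a real model):

```python
def alignment_keeps_pace(cap_labor_needed: float, align_labor_needed: float,
                         align_share: float = 0.05) -> bool:
    """Toy framing of the 'relative returns to cognitive labor' question.

    cap_labor_needed: cognitive labor required to push capabilities from level X to level Y.
    align_labor_needed: labor required to go from 'X-level AIs are safe/aligned'
        to 'Y-level AIs are safe/aligned'.
    align_share: fraction of total labor ongoingly spent on x-safety (5% here).
    """
    # Capabilities consume the remaining 95%, so the total labor available scales
    # with how much labor the capability jump itself requires.
    total_labor = cap_labor_needed / (1 - align_share)
    return align_labor_needed <= align_share * total_labor
```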
I’ll try to do some initial analysis here (partially just to start off my thinking about this).
I’ll suppose that the AI company spends 5% of its resources on “alignment/safety work aimed at reducing takeover risk (x-safety)” at each point, in addition to whatever it spends on solving related problems which are commercially expedient and obviously incentivized (e.g. if training runs aren’t working as well due to reward hacking, the company’s efforts to fix this aren’t included in the 5% budget, though the 5% budget might subsidize some of this work). Initially, this includes the time of human employees, but eventually human labor is irrelevant.
For simplicity, I’ll assume there is no chance of spending more than 5% or slowing down more than this (when averaged over every few development cycles, including once these development cycles speed up; short slowdowns and spikes in spending are fine). In practice, I do think that empirical evidence and AIs warning that the situation is dangerous are very important, especially after handing off decision making to AIs (insofar as this happens).
Prior to full automation, employees working on x-safety are accelerated by AI, and then once x-safety work is fully automated, we can still think of the rate of progress as a multiplier on the “without AI” speed. However, there is a difference in that if the AIs we hand off to are seriously misaligned we’re ~totally fucked, while this isn’t true prior to this point.
I’ll talk about a specific level of capability: capable enough to hand off all safety work (which is strictly more capable than fully automated AI R&D, but maybe less capable than “can dominate top human experts at ~everything”). I’ll call this level of capability “DAI” (Deferable AI).
We can then divide the problem into roughly two parts:
1. By the time we’re at DAI, will we be able to align DAI (and also ensure these AIs are well elicited and have good epistemics/wisdom)? (As in, at least within some short period of DAI level capabilities.)
2. Conditional on successfully aligning DAI (including via “lame” prosaic techniques which aren’t themselves very scalable), if we hand over to these AIs, can they ongoingly ensure AIs remain aligned/safe given that capabilities keep going with 95% of resources?[1]
It’s a little hard for me to tell if what you’re saying is a claim that the first part isn’t going to work out or the second part isn’t going to work out. I’ll go through both mostly talking about “is it very unlikely that we’ll make it”.
Aligning DAI
On the first: we get moderate but not massive acceleration by DAI. Supposing DAI is made in late 2030 (5 years from now), then maybe we see superhuman coder a year before this and some moderate acceleration (3x AI R&D labor acceleration) a year before that. Supposing moderately good but not great elicitation and that AIs aren’t scheming, I think we slightly conservatively get an average of like 2.5x AI safety R&D acceleration between moderate acceleration and superhuman coder and like 4x acceleration on average between superhuman coder and DAI. (Initially this is mostly due to accelerating engineering / running experiments, and then it gets more broad over time.) Ok, so we get roughly 2.5 years (post acceleration) to do research and study AIs in the capability range of [moderate acceleration to superhuman coder] and 4 years to study AIs between [superhuman coder and DAI]. This is in addition to 3 years with AIs that aren’t even capable of moderate acceleration. (Presumably the AI safety field is growing over this period?)
Note that average acceleration over some period can be dominated by acceleration at the end if it’s increasing over time.
(I can explain where these numbers come from if this is helpful.)
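Here’s the timeline arithmetic spelled out as a quick sketch. The dates and multipliers are just the rough guesses above, and the ~1x for the first phase is my simplification of “AIs that aren’t even capable of moderate acceleration”:

```python
# (phase, calendar years, rough average AI safety R&D speedup)
phases = [
    ("now -> moderate acceleration (late 2028)", 3.0, 1.0),   # little-to-no acceleration
    ("moderate acceleration -> superhuman coder", 1.0, 2.5),
    ("superhuman coder -> DAI (late 2030)", 1.0, 4.0),
]
for phase, calendar_years, speedup in phases:
    print(f"{phase}: {calendar_years} calendar years ~= {calendar_years * speedup} research-years")
# Roughly 3 + 2.5 + 4 = 9.5 research-years of safety work before DAI.
```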
But, what about the possibility of scheming? I think superhuman coder is only like 20% likely to be scheming (for a moderately strong notion of scheming, see link) by default and mitigations might somewhat lower this. DAI is like 40% likely to be scheming. So even if we fully YOLO scheming, on my views we’re not totally screwed due to this alone. In practice, work could substantially lower these risks. If we had to guess, maybe I think that in practice the risk of scheming from DAI will be cut by a factor of 3 given 5% effort on well allocated safety and this scenario? Idk? So maybe we eat like 15% risk here? If allocated as well as I expect current frontier AI companies to allocate, I’d guess more like 25% risk? (shrug.) For now, I’ll assume that if DAI is scheming we immediately lose (which is roughly true because we have to hand off quickly), but that superhuman coder scheming isn’t fatal (which doesn’t matter that much for the bottom line risk number anyway).
Ok, but it isn’t sufficient to just ensure DAI isn’t scheming, we have to also ensure it is aligned enough to hand off work and has good epistemics and is well elicited on hard to check tasks. This seems pretty hard given the huge rush, but it isn’t obviously fucked IMO, especially given the extra years from acceleration. I have some draft writing on this which should hopefully be out somewhat soon. Maybe my view is 20% chance of failure given good allocation and roughly 60% chance of failure given the default allocation (which includes stuff like the safety team not actually handing off or not seriously working on this etc)?
Edit: I now think 60% is too pessimistic, though I don’t know what I think overall.
My current sense is that current technical AI x-safety work isn’t very well targeted for ensuring DAI isn’t scheming and making DAI safe to hand off to, but it also isn’t pessimally targeted for this. I tentatively think the targeting will improve over time and that it has improved over the last few years.
(In practice, the situation is messier and mostly more optimistic than this number appears, because people will pursue plans other than handing off at DAI, there is the potential for slowing down, the takeoff might look different, trailing actors might have some influence, etc. Like, 60% chance of failing to hand over to DAI is partially due to people not even targeting handover, and their alternative plans would have some chance of working.)
Overall, we’re at like ~70% doom given default allocation and more like ~25% doom given an ideal allocation. Doesn’t seem overwhelmingly likely to be catastrophic so far given my views (though I agree it doesn’t look good!). Not sure where you disagree.
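For concreteness, here’s roughly how those numbers combine if you (naively) treat DAI scheming and handoff failure as independent; this reproduces the ~70% figure for the default allocation and comes out a bit above ~25% for the ideal allocation, since the figures above aren’t really meant to multiply exactly:

```python
def p_doom(p_dai_scheming: float, p_handoff_failure: float) -> float:
    # Assumes scheming DAI is immediately fatal (we have to hand off quickly) and
    # that, even absent scheming, handoff can fail on elicitation/epistemics/alignment.
    return 1 - (1 - p_dai_scheming) * (1 - p_handoff_failure)

print(p_doom(0.25, 0.60))  # default allocation: 0.70
print(p_doom(0.15, 0.20))  # well-allocated:     0.32
```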
Does aligned DAI suffice?
At each point, our aligned AIs have to align (or possibly temporarily control?) further AIs as the mad capabilities race using 95% of resources continues. How fast do things progress given algorithmic progress which is accelerated by AI R&D? Well, there is some chance things actually aren’t that fast because of diminishing returns to algorithmic progress, some chance of a decent amount of acceleration, and some chance we get wildly, wildly superhuman AIs in a few months (if AIs haven’t taken over and put an end to the insanity by that level of capability).
Really, we actually care about how the rate of capabilities progress compares to alignment, so things which just make the labor supply/efficacy bigger should mostly hit on both sides.
Ok, so how does capabilities progress compare to alignment? One really stupid guess is:
If we get the equivalent of 20 serial years of DAI-level labor (from >100k DAI level parallel agents given a proportional amount of compute) before +3 SDs over DAI we’re fine because we have a scalable solution to alignment. Otherwise takeover. (This is somewhat more conservative than my actual view.)
In practice, there is a more continuous question of alignment progress vs capabilities progress.
Naively, maybe DAI runs at like 40x human speed and we probably have enough parallel agents that 5% of the pool is >100k agents. So, 20 serial years takes 0.5 calendar years. So naively, we’re in trouble if we get +3 SDs within 0.5 years. I think this is like 65% likely, so we’re in trouble so far. But before we get +3 SDs, we get to +1 SD and +2 SD, as well as increases to speed, etc. Maybe the +1 SD AIs are all considered as good as like ~3x faster given compute bottlenecks and +2 SD is like 10x faster? So, assuming progress in SD is roughly linear over time (probably wrong, but hopefully not too off), we get that we make it if we get +3 SD AIs in >~0.05 years, which seems much more likely.
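Here’s that Fermi estimate written out. The speed multipliers are the rough guesses above; I’m interpolating the speedup exponentially between DAI and +3 SD, which is one reason the exact break-even point shouldn’t be taken too seriously:

```python
SERIAL_YEARS_NEEDED = 20   # serial DAI-years of alignment labor (the requirement above)
DAI_SPEED = 40             # DAI runs at ~40x human serial speed

# Naive version: all the labor happens at DAI speed, so the requirement takes
print(SERIAL_YEARS_NEEDED / DAI_SPEED)  # 0.5 calendar years

# With intermediate capability gains: effective speed ~3^s times DAI speed at +s SD
# (roughly matching the ~3x at +1 SD and ~10x at +2 SD guesses), and SD progress
# linear in calendar time from DAI (s=0) to +3 SD.
steps = 1000
avg_multiplier = sum(3.0 ** (3 * i / steps) for i in range(steps)) / steps  # ~7.9x
print(SERIAL_YEARS_NEEDED / (DAI_SPEED * avg_multiplier))  # ~0.06 calendar years
```

(A cruder piecewise version with 1x/3x/10x over each third gives more like ~0.1 years, so treat the exact cutoff as quite soft.)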
This is pretty shitty oversimplified modeling, but it overall makes me feel at-least-somewhat optimistic about recursive alignment outpacing recursive self-improvement given my views on takeoff speeds. Like maybe 80% chance things are fine?
[Ok, I have to stop writing this comment and do something else]
In practice, I hope that these AIs might be able to advise AI company leadership to take a different path, but we’re assuming that away and assuming that AI company leadership retains enough control to make these AIs spend 95% on capabilities.
This is a long comment! I was glad to have read it, but am a bit confused about your numbers seeming different from the ones I objected to. You said:
1 million AIs somewhat smarter than humans have spent 100 years each working on the problem (and coordinating etc?)
Then in this comment you say:
If we get the equivalent of 20 serial years of DAI-level labor (from >100k DAI level parallel agents given a proportional amount of compute) before +3 SDs over DAI we’re fine because we have a scalable solution to alignment. Otherwise takeover. (This is somewhat more conservative than my actual view.)
Here you now say 20 years, and >100k DAI level parallel agents. That’s a factor of 5 and a factor of 10 different! That’s a huge difference! Maybe your estimates are conservative enough to absorb a factor of 50 in thinking time without changing the probability that much?
I think I still disagree with your estimates, but before I go into them, I kind of want to check whether I am missing something, given that I currently think you are arguing for a resource allocation that’s 50x smaller than what I thought I was arguing against.
I was glad to have read it, but am a bit confused about your numbers seeming different from the ones I objected to.
I gave “1 million AIs somewhat smarter than humans with the equivalent of 100 years each” as an example of a situation I thought wouldn’t count as “anything like current techniques/understanding”. In this comment, I picked a lower number which is maybe my best guess for an amount of labor which eliminates most of the risk by a given level of capability.
I do think that “a factor of 5 and a factor of 10 different” is within my margin of error for the amount of labor you need. (Note that there might be aggressively diminishing returns on parallel labor, though possibly not, due to very superhuman coordination abilities by AIs.)
My modeling/guesses are pretty shitty in this comment (I was just picking some numbers to see how things work out), so if that’s a crux, I should probably try to be thoughtful (I was trying to write this quickly to get something written up).
This makes sense, but I think I am still a bit confused. My comment above was mostly driven by doing a quick internal fermi estimate myself for whether “1 million AIs somewhat smarter than humans have spent 100 years each working on the problem” is a realistic amount of work to get out of the AIs without slowing down, and arriving at the conclusion that this seems very unlikely across a relatively broad set of worldviews.
We can also open up the separate topic of how much work might be required to make real progress on superalignment in time, or whether this whole ontology makes sense, but I was mostly interested in doing a fact-check of “wait, that really sounds like too much, do you really believe this number is realistic?”.
I still disagree, but I have much less of a “wait, this really can’t be right” reaction if you mean the number that’s 50x lower.
Ok, but it isn’t sufficient to just ensure DAI isn’t scheming, we have to also ensure it is aligned enough to hand off work and has good epistemics and is well elicited on hard to check tasks. This seems pretty hard given the huge rush, but it isn’t obviously fucked IMO, especially given the extra years from acceleration. I have some draft writing on this which should hopefully be out somewhat soon. Maybe my view is 20% chance of failure given good allocation and roughly 60% chance of failure given the default allocation (which includes stuff like the safety team not actually handing off or not seriously working on this etc)?
This seems way too pessimistic to me. At the point of DAI, capabilities work will also require good epistemics and good elicitation on hard to check tasks. The key disanalogy between capabilities and alignment work at the point of DAI is that the DAI might be scheming, but you’re in a subjunctive case where we’ve assumed the DAI is not scheming. Whence the pessimism?

(This complaint is related to Eli’s complaint)
The key disanalogy between capabilities and alignment work at the point of DAI is that the DAI might be scheming, but you’re in a subjunctive case where we’ve assumed the DAI is not scheming. Whence the pessimism?
I don’t think this is the only disanalogy. It seems to me like getting AIs to work efficiently on automating AI R&D might not result in solving all the problems you need to solve for it to be safe to hand off ~all x-safety labor to AIs. This is a mix of capabilities, elicitation, and alignment. This is similar to how a higher level of mission alignment might be required for AI company employees working on conceptually tricky alignment research relative to advancing AI R&D.
Another issue is that AI societies might go off the rails over some longer period in some way which doesn’t eliminate AI R&D productivity, but would be catastrophic from an alignment perspective.
This isn’t to say there isn’t anything which is hard to check or conceptually tricky about AI R&D, just that the feedback loops seem much better.
I’m not really following where the disanalogy is coming from (like, why are the feedback loops better?)
Sure, AI societies could go off the rails that hurts alignment R&D but not AI R&D; they could also go off the rails in a way that hurts AI R&D and not alignment R&D. Not sure why I should expect one rather than the other.
Although on further reflection, even though the current DAI isn’t scheming, alignment work still has to be doing some worst-case type thinking about how future AIs might be scheming, whereas this is not needed for AI R&D. I don’t think this makes a big difference—usually I find worst-case conceptual thinking to be substantially easier than average-case conceptual thinking—but I could imagine that causing issues.
Do you agree the feedback loops for capabilities are better right now?
Sure, AI societies could go off the rails that hurts alignment R&D but not AI R&D; they could also go off the rails in a way that hurts AI R&D and not alignment R&D. Not sure why I should expect one rather than the other.
For this argument it’s not a crux that it is asymmetric (though due to better feedback for AI R&D I think it actually is). E.g., suppose that in 10% of worlds safety R&D goes totally off the rails while capabilities proceed and in 10% of worlds capabilities R&D goes totally off the rails while safety proceeds. This still results in an additional 10% takeover risk from the subset of worlds where safety R&D goes off the rails. (Edit: though risk could be lower in the worlds where capabilities R&D goes off the rails due to having more time for safety, depending on whether this also applies to the next actor etc.)
Do you agree the feedback loops for capabilities are better right now?
Yes, primarily due to the asymmetry where capabilities can work with existing systems while alignment is mostly stuck waiting for future systems, but that should be much less true by the time of DAI.
Edit: though risk could be lower in the worlds where capabilities R&D goes off the rails due to having more time for safety, depending on whether this also applies to the next actor etc.
I was thinking both of this, and also that it seems quite correlated due to lack of asymmetry. Like, 20% on exactly one going off the rails rather than zero or both seems very high to me; I feel like to get to that I would want to know about some important structural differences between the problems. (Though I definitely phrased my comment poorly for communicating that.)
Yes, primarily due to the asymmetry where capabilities can work with existing systems while alignment is mostly stuck waiting for future systems, but that should be much less true by the time of DAI.
I think studying scheming in current/future systems has ongoingly worse feedback loops? Like suppose our DAI-level system wants to study scheming in a system +3 SD above DAI. This is structurally kinda tricky because schemers try to avoid detection. I agree having access to capable AIs makes this much easier to get good feedback loops, but there is an asymmetry.
Yeah, that’s fair for agendas that want to directly study the circumstances that lead to scheming. Though when thinking about those agendas, I do find myself more optimistic because they likely do not have to deal with long time horizons, whereas capabilities work likely will have to engage with that.
Note many alignment agendas don’t need to actually study potential schemers. Amplified oversight can make substantial progress without studying actual schemers (but probably will face the long horizon problem). Interpretability can make lots of foundational progress without schemers, which I would expect to mostly generalize to schemers. Control can make progress with models prompted or trained to be malicious.

(Though note that it’s unclear whether this progress will mitigate scheming risk.)
Seems like diminishing returns to capabilities R&D should be at least somewhat correlated with diminishing returns to safety R&D, which I believe should extremize your probability (because e.g. if before you were counting on worlds with slow takeoff and low alignment requirements, these become less likely; and the inverse if you’re optimistic)
I don’t think I understand this comment.

It sounds like you’re saying:

“Slower takeoff should be correlated with ‘harder’ alignment (in terms of cognitive labor requirements) because slower takeoff implies returns to cognitive labor in capabilities R&D are relatively lower and we should expect this means that alignment returns to cognitive labor are relatively lower (due to common causes like ‘small experiments and theory don’t generalize well and it is hard to work around this’). For the same reasons, faster takeoff should be correlated with ‘easier’ alignment.”
I think I agree with this mostly, though there are some reasons for anti-correlation, e.g., worlds where there is a small simple core to intelligence which can be found substantially from first principles make alignment harder; also, in practice there is an epistemic correlation among humans between absolute alignment difficulty (in terms of cognitive labor requirements) and slower takeoff.
I don’t really understand why this should extremize my probabilities, but I agree this correlation isn’t accounted for at all in my analysis.
“Slower takeoff should be correlated with ‘harder’ alignment (in terms of cognitive labor requirements) because slower takeoff implies returns to cognitive labor in capabilities R&D are relatively lower and we should expect this means that alignment returns to cognitive labor are relatively lower (due to common causes like ‘small experiments and theory don’t generalize well and it is hard to work around this’). For the same reasons, faster takeoff should be correlated with ‘easier’ alignment.”
Yes, that is what I’m saying. In general a lot of prosaic alignment activities seem pretty correlated with capabilities in terms of their effectiveness.
some reasons for anti-correlation, e.g., worlds where there is a small simple core to intelligence which can be found substantially from first principles make alignment harder, in practice there is an epistemic correlation among humans between absolute alignment difficulty (in terms of cognitive labor requirements) and slower takeoff.
Good points.
I don’t really understand why this should extremize my probabilities
For the “Does aligned DAI suffice?” section, as I understand it you define an alignment labor requirement, then you combine that with your uncertainty over takeoff speed to see if the alignment labor requirement would be met.
I guess I’m making a claim that if you added uncertainty over the alignment labor requirement, and then you added the correlation, the latter change would extremize the probability.
This is because slower takeoff corresponds to better outcomes, while harder alignment corresponds to worse outcomes, so making them correlated results in more clustering toward worlds with median easiness, which means that if you think the easiness requirement to get alignment is low, the probability of success goes up, and vice versa. This is glossing a bit but I think it’s probably right.
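A toy Monte Carlo of this, in case it helps. Everything here is made up for illustration: “slowness” and “difficulty” are standard normals, and success means slowness minus difficulty clears a threshold set by your prior optimism:

```python
import random

def p_success(rho: float, threshold: float, n: int = 100_000) -> float:
    """Success iff (takeoff slowness - alignment difficulty) > threshold.
    rho > 0 correlates the two (slower takeoff <-> harder alignment)."""
    wins = 0
    for _ in range(n):
        slowness = random.gauss(0, 1)
        difficulty = rho * slowness + (1 - rho**2) ** 0.5 * random.gauss(0, 1)
        wins += (slowness - difficulty) > threshold
    return wins / n

for threshold in (-0.5, 0.5):  # optimistic prior vs pessimistic prior
    print(threshold, round(p_success(0.0, threshold), 2), round(p_success(0.7, threshold), 2))
# With the optimistic threshold, adding correlation pushes ~0.64 up toward ~0.74;
# with the pessimistic one, it pushes ~0.36 down toward ~0.26; i.e., it extremizes.
```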
Classic motte and baileys are situations where the motte is not representative of the bailey.
Defending that the universe probably has a god or some deity, and that we can feel connected to it, and then turning around and making extreme demands of people’s sex lives and financial support of the church when that is accepted, is a central motte and bailey.
Pointing out that if anyone builds it using current techniques then it would kill everyone is not far apart from the policy claim to shut it down. It’s not some weird technicality that would of course never come up. Most of humanity is fully unaware that this is a concern and will happily sign off on massive ML training runs that would kill us all—as would many people in tech. This is because they have little-to-no awareness of the likely threat! So it is highly relevant, as there is no simple setting for avoiding that, and it takes a massive amount of work to get from this current situation to a good one; it is not a largely irrelevant but highly defensible claim.
The comment you’re replying to is explaining why the motte is not representative of the bailey in this case (in their view).
Yeah that’s fair.