(meta: I am the biggest outlier among MIRIans; despite being pretty involved in this piece, I would have approached it differently if it were mine alone, and the position I’m mostly defending here is one that I think is closest-to-MIRI-of-the-available-orgs, not one that is centrally MIRI)
Can’t you just discuss the strongest counterarguments and why you don’t buy them? Obviously this won’t address everyone’s objections, but you could at least try to go for the strongest ones.
Yup! This is in a resource we’re working on that’s currently 200k words. It’s not exactly ‘why I don’t buy them’ and more ‘why Nate doesn’t buy them’, but Nate and I agree on more than I expected a few months ago. This would have been pretty overwhelming for a piece of the same length as ‘The Problem’; it’s not an ‘end the conversation’ kind of piece, but an ‘opening argument’.
It also helps to avoid making false claims and generally be careful about over-claiming.
^I’m unsure which way to read this:
“Discussing the strongest counterarguments helps you avoid making false or overly strong claims.”
“You failed to avoid making false or overly strong claims in this piece, and I’m reminding you of that.”
1: Agreed! I think that MIRI is too insular, and that’s why I spend time where I can trying to understand what’s going on with, e.g., Redwood-sphere people. I don’t usually disagree all that much; I’m just more pessimistic, and more eager to get x-risk off the table altogether, owing to various background disagreements that aren’t even really about AI.
2: If there are other, specific places you think the piece overclaims, other than the one you highlighted (as opposed to the vibes-level ‘this is more confident than Ryan would be comfortable with, even if he agreed with Nate/Eliezer on everything’), that would be great to hear. We did, in fact, put a lot of effort into fact-checking and weakening things that were unnecessarily strong. The process for this piece was unfortunately very cursed.
Also, insofar as you are actually uncertain (which I am, but you aren’t), it seems fine to just say that you think the situation is uncertain and the risks are still insanely high?
I am deeply uncertain. I like a moratorium on development because it solves the most problems in the most worlds, not because I think we’re in the worst possible world. I’m glad humanity has a broad portfolio here, and I think the moratorium ought to be a central part of it. A moratorium is exactly the kind of solution you push for when you don’t know what’s going to happen. If you do know what’s going to happen, you push for targeted solutions to your most pressing concerns. But that just doesn’t look to me to be the situation we’re in. I think there are important conditionals baked into the ‘default outcome’ bit, and these don’t often get much time in the sun from us, because we’re arguing with the public more than we’re arguing with our fellow internet weirdos.
The thing I am confident in is “if superintelligence tomorrow, then we’re cooked”. I expect to remain confident in something like this for a thousand tomorrows at the very least, maybe many times that.
We have a decent amount of time at roughly this level of capability
By what mechanism? This feels like ‘we get a pause’ or ‘there’s a wall’. I think this is precisely the hardest point in the story at which to get a pause, and if you expect a wall here, it seems like a somewhat arbitrary placement? (unless you think there’s some natural reason, e.g., the AIs don’t advance too far beyond what’s present in the training data, but I wouldn’t guess that’s your view)
(There are other sources of risk which can still occur in these worlds to be clear, like humanity collectively going crazy.)
[quoting as an example of ‘thing a moratorium probably mostly solves actually’; not that the moratorium doesn’t have its own problems, including ‘we don’t actually really know how to do it’, but these seem easier to solve than the problems with various ambitious alignment plans]
By what mechanism? This feels like ‘we get a pause’ or ‘there’s a wall’.
I just meant that takeoff isn’t that fast, so we have like >0.5-1 year at a point where AIs are at least very helpful for safety work (if reasonably elicited), which feels plausible to me. The duration of “AIs could fully automate safety (including conceptual stuff) if well elicited+aligned, but aren’t yet scheming due to this only occurring later in capabilities and takeoff being relatively slower” feels like it could be non-trivial on my views.
I don’t think this involves either a pause or a wall. (Though some fraction of the probability does come from actors intentionally spending down lead time.)
I’m unsure which way to read this
I meant it’s generally helpful and would have been helpful here for this specific issue, so mostly 2, but also some of 1. I’m not sure if there are other specific places where the piece overclaims (aside from other claims about takeoff speeds elsewhere). I do think this piece reads kinda poorly to my eyes in terms of its overall depiction of the situation with AI, in a way that maybe comes across poorly to an ML audience, but idk how much this matters. (I’m probably not going to prioritize looking for issues in this particular post atm beyond what I’ve already done : ).)
The thing I am confident in is “if superintelligence tomorrow, then we’re cooked”. I expect to remain confident in something like this for a thousand tomorrows at the very least, maybe many times that.
This is roughly what I meant by “you are actually uncertain (which I am, but you aren’t)”, but my description was unclear. I meant like “you are confident in doom in the current regime (as in, >80% rather than <=60%) without a dramatic change that could occur over some longer duration”. TBC, I don’t mean to imply that being relatively uncertain about doom is somehow epistemically superior.
I just meant that takeoff isn’t that fast, so we have like >0.5-1 year at a point where AIs are at least very helpful for safety work (if reasonably elicited), which feels plausible to me. The duration of “AIs could fully automate safety (including conceptual stuff) if well elicited+aligned, but aren’t yet scheming due to this only occurring later in capabilities and takeoff being relatively slower” feels like it could be non-trivial on my views.
I want to hear more about this picture and why ‘stories like this’ look ~1/3 likely to you. I’m happy to leave scheming off the table for now, too. Here’s some info that may inform your response:
I don’t see a reason to think that models are naturally more useful (or even ~as useful) for accelerating safety as for accelerating capabilities, and I don’t see a reason to think the pile of safety work to be done is significantly smaller than the pile of capabilities work necessary to reach superintelligence (in particular if we’re already at ~human-level systems at this time). I don’t think the incentive landscape is such that it will naturally bring about this kind of state, and shifting the incentives of the space is Real Hard (indeed, it’s easier to imagine the end of the world).
I disagree with Carlsmith that there’s such a thing as a ‘safety feedback loop’ in the sense in which there’s obviously a capabilities feedback loop. In the current regime, safety R&D looks to be at a permanent disadvantage; whatever advances in capabilities we see seem likely to widen that gap rather than narrow it, and something like a wall/halt/pause seems to me by far the most plausible path to safety actually getting a relative edge.
I’m suspicious of efforts to shape the order in which capabilities are unlocked.
I’m not committed to fast takeoff. I’m not especially convinced of LLM-foom (or, at least, have really wide margins on the level of capabilities at which I’d expect runaway RSI), and if LLMs don’t scale much beyond the current level, I’m ~70 percent ‘they don’t accelerate progress in the founding of a new paradigm by more than 2-3x’.
Rather than thinking in terms of takeoff, I’m concerned about something like ‘from this point, we seem to have lost many of the levers for changing course via technical intervention, and increased AI capabilities don’t make us differentially more likely to get them back, regardless of the pace of progress.’
Suppose that we ended up with AIs that were ~perfectly aligned (to what the company/project that trained these AIs wanted) which had capabilities that dominate top human experts in AI R&D, AI safety research, philosophy, and most digital research fields in general. These AIs are sufficiently aligned that they try hard to do the right thing even on hard-to-check, conceptual, and open-ended tasks. Let’s say these AIs aren’t much more capable than needed for this capability bar (and the capability profile works out such that these AIs aren’t qualitatively wildly superhuman as seems likely to me).
These AIs are sufficiently aligned and capable that they dominate humans at doing safety work and have better epistemics than groups of human experts.
I think the situation would probably be fine if the AI company tasked these AIs with proceeding with AI development with 20% of resources spent on alignment/safety of future models and 80% on advancing capabilities.
I claim that you have to think either that takeoff is very fast or that alignment is very difficult in order to think things are likely to go wrong given this situation:
At the point of handoff, you maybe have ~3 million AI labor years / year and each instance maybe runs at effectively 50x speed. (Obviously the error bars here are very wide.) 20% goes to alignment.
So, holding this level of capability fixed, in 2 months you get the equivalent of ~8 years of work from 10k people at the level of top human experts. In 6 months, you get 25 years of work from 10k people (rough arithmetic sketched after this list). This work occurs with pretty limited compute per person-year, but anything purely conceptual/theoretical can be worked on for a very long time.
But, also capabilities are improving over this period, so you actually get substantially more labor than this, as long as the level of alignment can be preserved.
At some point, AIs end up scheming if you haven’t done enough additional work by this point. But, it seems pretty likely you’d still be safe after enough AI progress that it’s as though the AIs are 5x faster (taking into account qualitative intelligence improvements; I just mean that the AIs are as productive as roughly 5x faster versions of our prior AIs).
So, within a short period (e.g. 2 months) after this additional AI progress (such that it’s as though the AIs are 5x faster), you have an insane amount of alignment work done.
You could just think takeoff is very fast, or that alignment is very compute bottlenecked.
These AIs might also advise different actions than an 80⁄20 split to be clear! Like trying to buy lead time to spend on alignment.
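(A minimal back-of-the-envelope sketch of the arithmetic in this list, using only the illustrative numbers above: ~3 million AI labor-years/year at handoff, 20% to alignment, and a 10k-person reference team. The naive multiplication lands at ~10 and ~30 years, the same ballpark as the ~8 and ~25 quoted, which presumably bake in some extra discounting.)

```python
# Back-of-the-envelope version of the estimate above. All inputs are the
# illustrative assumptions from this list, not measurements.

TOTAL_AI_LABOR_YEARS_PER_YEAR = 3_000_000  # ~3M AI labor-years / year at handoff
ALIGNMENT_FRACTION = 0.20                  # 20% of resources on alignment/safety
REFERENCE_TEAM_SIZE = 10_000               # compare against 10k top-expert-level researchers

def equivalent_team_years(months: float) -> float:
    """Years of work by a 10k-person top-expert team, produced in `months` of
    wall-clock time at the handoff capability level (held fixed)."""
    alignment_labor_years = (
        TOTAL_AI_LABOR_YEARS_PER_YEAR * ALIGNMENT_FRACTION * (months / 12)
    )
    return alignment_labor_years / REFERENCE_TEAM_SIZE

for months in (2, 6):
    print(f"{months} months -> ~{equivalent_team_years(months):.0f} years of 10k-person-team work")
# 2 months -> ~10 years; 6 months -> ~30 years (the comment quotes ~8 and ~25,
# i.e. the same ballpark after whatever additional discounts are applied).
```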
This overall makes me pretty optimistic about scenarios where we reach this level of alignment in these not-yet-ASI-level systems, which sounds like a clear disagreement with your perspective. I don’t think this is all of the disagreement, but it might drive a bunch of it.
(To be clear, I think this level of alignment could totally fail to happen, but we seem to disagree even given this!)
I think my response heavily depends on the operationalization of alignment for the initial AIs, and I’m struggling to keep things from becoming circular in my decomposition of various operationalizations. The crude response is that you’re begging the question here by first positing aligned AIs, but I think your position is that techniques which are likely to descend from current techniques could work well-enough for roughly human-level systems, and that’s where I encounter this sense of circularity.
I think there’s a better-specified (from my end; you’re doing great) version of this conversation that focuses on three different categories of techniques, based on the capability level at which we expect each to be effective:
Current model-level
Useful autonomous AI researcher level
Superintelligence
However, I think that disambiguating between proposed agendas for 2 + 3 is very hard, and assuming agendas that plausibly work for 1 also work for 2 is a mistake. It’s not clear to me why the ‘it’s a god, it fucks you, there’s nothing you can do about that’ concerns don’t apply for models capable of:
hard-to-check, conceptual, and open-ended tasks
I feel pretty good about this exchange if you want to leave it here, btw! Probably I’ll keep engaging far beyond the point at which it’s especially useful (although we’re likely pretty far from the point where it stops being useful to me rn).
Ok, so it sounds like your view is “indeed, if we got ~totally aligned AIs capable of fully automating safety work (but not notably more capable than the bare minimum required for this), we’d probably be fine (even if only a small fraction of effort is spent on safety), and the crux is earlier than this”.
Is this right? If so, it seems notable if the problem can be mostly reduced to sufficiently aligning (still very capable) human-ish level AIs and handing off to these systems (which don’t have the scariest properties of an ASI from an alignment perspective).
I think your position is that techniques which are likely to descend from current techniques could work well-enough for roughly human-level systems, and that’s where I encounter this sense of circularity.
I’d say my position is more like:
Scheming might just not happen: It’s basically a toss-up whether systems at this level of capability would end up scheming “by default” (as in, without active effort on researching scheming prevention, just whatever work is motivated by commercial utility along the way). Maybe I’m at ~40% scheming for such systems, though the details alter my view a lot.
The rest of the problem, if we assume no scheming, doesn’t obviously seem that hard: It’s unclear how hard it will be to make non-scheming AIs of the capability level discussed above sufficiently aligned in the strong sense I discussed above. I think it’s unlikely that the default course gets us there, but it seems pretty plausible to me that modest effort along the way does. It just requires some favorable generalization of the sort that doesn’t seem that surprising, and we’ll have some AI labor along the way to help. And, for this part of the problem, we totally can get multiple tries and study things pretty directly with empiricism using behavioral tests (though we’re still depending on some cleverness and transfer, as we can’t directly verify the things we ultimately want the AI to do).
Further prosaic effort seems helpful for both avoiding scheming and the rest of the problem: I don’t see strong arguments for thinking that, at the level of capability we’re discussing, scheming will be intractable to prosaic methods or experimentation. I can see why this might happen, and I can certainly imagine worlds where no one really tries. Similarly, I don’t see a strong argument that further effort on relatively straightforward methods can’t help a bunch in getting you sufficiently aligned systems (supposing they aren’t scheming): we can measure what we want somewhat well with a bunch of effort, and I can imagine many things which could make a pretty big difference (again, this isn’t to say that this effort will happen in practice).
This isn’t to say that I can’t imagine worlds where pretty high-effort and well-orchestrated prosaic iteration totally fails. This seems totally plausible, especially given how fast this might happen, so risks seem high. And, it’s easy for me to imagine ways the world could be such that relatively prosaic methods and iteration are ~doomed without much more time than we can plausibly hope for; it’s just that these seem somewhat unlikely in aggregate to me.
So, I’d be pretty skeptical of someone claiming that the risk of this type of approach would be <3% (without at the very least preserving the optionality for a long pause during takeoff depending on empirical evidence), but I don’t see a case for thinking “it would be very surprising or wild if prosaic iteration sufficed”.