I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this post). … But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.
I’d love to hear more about this. To me, “egregious misalignment” feels extremely natural/normal/expected, perhaps due to convergent instrumental goals. You might as well have said “I really don’t want my AI to think about politics” or “I really don’t want my AI to think about distant superintelligences” or “I really don’t want my AI to break any laws.”
Separately, how much do you think your views would change if your feelings on this particular point changed?
But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.
Paul, this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips). Not only do I not understand what you actually did mean by this, it seems like phrasing that potentially leads astray other readers coming in for the first time. Say, if you imagine somebody at Deepmind coming in without a lot of prior acquaintance with the field—or some hapless innocent ordinary naive LessWrong reader who has a glowing brain, but not a galaxy brain, and who is taking Paul’s words for a lot of stuff about alignment because Paul has such a reassuring moderate tone compared to Eliezer—then they would come away from your paragraph thinking, “Oh, well, this isn’t something that happens if I take a giant model and train it to produce outputs that human raters score highly, because an ‘extreme and somewhat strange failure mode’ must surely require that I add on some unusual extra special code to my model in order to produce it.”
I suspect that you are talking in a way that leads a lot of people to vastly underestimate how difficult you think alignment is, because you’re assuming, in the background, exotic doing-stuff-right technology that does not exist, in order to prevent these “extreme and somewhat strange failure modes” from happening, as we agree they automatically would given any “naive” simple scheme, that you could actually sketch out concretely right now on paper. By which I mean, concretely enough that you could have any ordinary ML person understand in concrete enough detail that they could go write a skeleton of the code, as opposed to that you think you could later sketch out a research approach for doing. It’s not just a buffer overflow that’s the default for bad security, it’s the equivalent of a buffer overflow where nobody can right now exhibit how strange-failure-mode-avoiding code should concretely work in detail. “Strange” is a strange name for a behavior that is so much the default that it is an unsolved research problem to avoid it, even if you think that this research problem should definitely be solvable and it’s just something wrong or stupid about all of the approaches we could currently concretely code that would make them exhibit that behavior.
I still feel fine about what I said, but that’s two people finding it confusing (and thinking it is misleading) so I just changed it to something that is somewhat less contentful but hopefully clearer and less misleading.
Clarifying what I mean by way of analogy: suppose I’m worried about unzipping a malicious file causing my computer to start logging all my keystrokes and sending them to a remote server. I’d say that seems like a strange and extreme failure mode that we should be able to robustly avoid if we write our code right, regardless of how the logical facts shake out about how compression works. That said, I still agree that in some sense it’s the “default” behavior without extensive countermeasures. It’s rare for a failure to be so clearly different from what you want that you can actually hope to avoid it in the worst case. But that property is not enough to suggest that such failures are easily avoided.
I obviously don’t agree with the inference from “X is the default result of optimizing for almost anything” to “X is the default result of our attempt to build useful AI without exotic technology or impressive mitigation efforts.”
My overall level of optimism doesn’t mostly come from hopes about exotic alignment technology. I am indeed way more optimistic about “exotic alignment technology” than you and maybe that optimism cuts off 25-50% of the total alignment risk. I think that’s the most interesting/important disagreement between us since it’s the area we both work in. But more of the disagreement about P(alignment) comes from me thinking it is much, much more likely that “winging it” works long enough that early AI systems will have completely changed the game.
I spend a significant fraction of my time arguing with people who work in ML about why they should be more scared. The problem mostly doesn’t seem to be that I take a moderate or reassuring tone; it’s that they don’t believe the arguments I make (which are mostly strictly weaker forms of your arguments, which they are in turn even less on board with).
Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won’t collapse the moment they’re finished.
I’m not saying this is an exact analogy for AGI alignment—there are lots of specific technical reasons to expect that alignment is not like bridge building and that there are reasons why the approaches we’re likely to try will break on us suddenly in ways we can’t fix as we go—treacherous turns, inner misalignment or reactions to distributional shift. It’s just that there are different answers to the question of what’s the default outcome depending on if you’re asking what to expect abstractly or in the context of how things are in fact done.
Instrumental Convergence plus a specific potential failure mode (like e.g. we won’t pay sufficient attention to out of distribution robustness), is like saying ‘you know the vast majority of physically possible bridge designs fall over straight away and also there’s a giant crack in that load-bearing concrete pillar over there’ - if for some reason your colleague has a mental block around the idea that a bridge could in principle fall down then the first part is needed (hence why IC is important for presentations of AGI risk because lots of people have crazily wrong intuitions about the nature of AI or intelligence), but otherwise IC doesn’t do much to help the case for expecting catastrophic misalignment and isn’t enough to establish that failure is a default outcome.
It seems like your reason for saying that catastrophic misalignment can’t be considered an abnormal or extreme failure mode comes down to this pre-technical-detail Instrumental Convergence thesis—that IC by itself gives us a significant reason to worry, even if we all agree that IC is not the whole story.
this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips).
= ‘because strongly optimizing for almost anything leads to catastrophe via IC, we can’t call catastrophic misalignment a bizarre outcome’?
Maybe it’s just a subtle difference in emphasis without a real difference in expectation/world model, but I think there is an important need to clarify the difference between ‘IC alone raises an issue that might not be obvious but doesn’t give us a strong reason to expect a catastrophe’ and ‘IC alone suggests a catastrophe even though it’s not the whole story’ - and the first of these is a more accurate way of viewing the role of IC in establishing the likelihood of catastrophic misalignment.
Ben Garfinkel argues for the first of these and against the second, in his objection to the ‘classic’ formulation of instrumental convergence/orthogonality—that these are just ‘measure based’ arguments which identify that a majority of possible AI designs with some agentive properties and large-scale goals will optimize in malign ways, rather than establishing that we’re actually likely to build such agents.
Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won’t collapse the moment they’re finished.
Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn’t fall over, which is not the situation we are currently in.
Didn’t it use to be for thousands of years, before we had observed thousands of bridge designs falling or not falling and developed exact models, that bridges DID fall down like that quite often?
I think I’m responding to a more basic intuition, that if I wrote some code and it’s now searching over ingenious ways to kill me, then something has gone extremely wrong in a way that feels preventable. It may be the default in some sense, just as wildly insecure software (which would lead to my computer doing the same thing under certain conditions) is the default in some sense, but in both cases I have the intuition that the failure comes from having made an avoidable mistake in designing the software.
In some sense changing this view would change my bottom line—e.g. if you ask me “Should you be able to design a bridge that doesn’t fall down even in the worst case?” my gut take would be “why would that be possible?”—but I don’t feel like there’s a load-bearing intuitive disagreement in the vague direction of convergent instrumental goals.
OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about “thinking about politics” or “breaking laws” etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?
(I didn’t find the “...something has gone extremely wrong in a way that feels preventable” helpful, because it seems trivial. If you pull the pin on a grenade and then sit on it, something has gone extremely wrong in a way that is totally preventable. If you strap rockets to your armchair, hoping to hover successfully up to your apartment roof, and instead die in a fireball, something has gone extremely wrong in a way that was totally preventable. If you try to capture a lion and tame it and make it put its mouth around your head, and you end up dead because you don’t know what you are doing, that’s totally preventable too because if you were an elite circus trainer you would have done it correctly.)
OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about “thinking about politics” or “breaking laws” etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?
I don’t really understand how thinking about politics is a failure mode. For breaking laws it depends a lot on the nature of the law-breaking—law-breaking generically seems like a hard failure mode to avoid, but there are kinds of grossly negligent law-breaking that do seem similarly perverse/strange/avoidable for basically the same reasons.
(I didn’t find the “...something has gone extremely wrong in a way that feels preventable” helpful, because it seems trivial. If you pull the pin on a grenade and then sit on it, something has gone extremely wrong in a way that is totally preventable. If you strap rockets to your armchair, hoping to hover successfully up to your apartment roof, and instead die in a fireball, something has gone extremely wrong in a way that was totally preventable. If you try to capture a lion and tame it and make it put its mouth around your head, and you end up dead because you don’t know what you are doing, that’s totally preventable too because if you were an elite circus trainer you would have done it correctly.)
I’m not really sure if or how this is a reductio. I don’t think it’s a trivial statement that this failure is preventable, unless you mean by not running AI. Indeed, that’s really all I want to say—that this failure seems preventable, and that intuition doesn’t seem empirically contingent, so it seems plausible to me that the solubility of the alignment problem also isn’t empirically contingent.
Thinking about politics may not be a failure mode; my question was whether it feels “extreme and somewhat strange,” sorry for not clarifying. Like, suppose for some reason “doesn’t think about politics” was on your list of desiderata for the extremely powerful AI you are building. So thinking about politics would in that case be a failure mode. Would it be an extreme and somewhat strange one?
I’d be interested to hear more about the law-breaking stuff—what is it about some laws that makes AI breaking them unsurprising/normal/hard-to-avoid, whereas for others AI breaking them is perverse/strange/avoidable?
I wasn’t constructing a reductio, just explaining why the phrase didn’t help me understand your view/intuition. When I hear that phrase, it seems to me to apply equally to the grenade case, the lion-bites-head-off case, the AI-is-egregiously-misaligned case, etc. All of those cases feel the same to me.
(I do notice a difference between these cases and the bridge case. With the bridge, there’s some sense in which no way you could have made the bridge would be good enough to prevent a certain sufficiently heavy load. By contrast, with AI, lions, and rocket-armchairs, there’s at least some possible way to handle it well besides “just don’t do it in the first place.” Is this the distinction you are talking about?)
Is your claim just that the solubility of the alignment problem is not empirically contingent, i.e. there is no possible world (no set of laws of physics and initial conditions) such that someone like us builds some sort of super-smart AI, and it becomes egregiously misaligned, and there was no way for them to have built the AI without it becoming egregiously misaligned?
Have you played Poly Bridge?