But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.
Paul, this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips). Not only do I not understand what you actually did mean by this, it seems like phrasing that potentially leads astray other readers coming in for the first time. Say, if you imagine somebody at Deepmind coming in without a lot of prior acquaintance with the field—or some hapless innocent ordinary naive LessWrong reader who has a glowing brain, but not a galaxy brain, and who is taking Paul’s words for a lot of stuff about alignment because Paul has such a reassuring moderate tone compared to Eliezer—then they would come away from your paragraph thinking, “Oh, well, this isn’t something that happens if I take a giant model and train it to produce outputs that human raters score highly, because an ‘extreme and somewhat strange failure mode’ must surely require that I add on some unusual extra special code to my model in order to produce it.”
I suspect that you are talking in a way that leads a lot of people to vastly underestimate how difficult you think alignment is, because you’re assuming, in the background, exotic doing-stuff-right technology that does not exist, in order to prevent these “extreme and somewhat strange failure modes” from happening, as we agree they automatically would given any “naive” simple scheme, that you could actually sketch out concretely right now on paper. By which I mean, concretely enough that you could have any ordinary ML person understand in concrete enough detail that they could go write a skeleton of the code, as opposed to that you think you could later sketch out a research approach for doing. It’s not just a buffer overflow that’s the default for bad security, it’s the equivalent of a buffer overflow where nobody can right now exhibit how strange-failure-mode-avoiding code should concretely work in detail. “Strange” is a strange name for a behavior that is so much the default that it is an unsolved research problem to avoid it, even if you think that this research problem should definitely be solvable and it’s just something wrong or stupid about all of the approaches we could currently concretely code that would make them exhibit that behavior.
I still feel fine about what I said, but that’s two people finding it confusing (and thinking it is misleading) so I just changed it to something that is somewhat less contentful but hopefully clearer and less misleading.
Clarifying what I mean by way of analogy: suppose I’m worried about unzipping a malicious file causing my computer to start logging all my keystrokes and sending them to a remote server. I’d say that seems like a strange and extreme failure mode that we should be able to robustly avoid if we write our code right, regardless of how the logical facts shake out about how compression works. That said, I still agree that in some sense it’s the “default” behavior without extensive countermeasures. It’s rare for a failure to be so clearly different from what you want that you can actually hope to avoid it in the worst case. But that property is not enough to suggest that such a failure is easily avoided.
I obviously don’t agree with the inference from “X is the default result of optimizing for almost anything” to “X is the default result of our attempt to build useful AI without exotic technology or impressive mitigation efforts.”
My overall level of optimism doesn’t mostly come from hopes about exotic alignment technology. I am indeed way more optimistic about “exotic alignment technology” than you and maybe that optimism cuts off 25-50% of the total alignment risk. I think that’s the most interesting/important disagreement between us since it’s the area we both work in. But more of the disagreement about P(alignment) comes from me thinking it is much, much more likely that “winging it” works long enough that early AI systems will have completely changed the game.
I spend a significant fraction of my time arguing with people who work in ML about why they should be more scared. The problem mostly doesn’t seem to be that I take a moderate or reassuring tone; it’s that they don’t believe the arguments I make (which are mostly strictly weaker forms of your arguments, which they are in turn even less on board with).
Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won’t collapse the moment they’re finished.
I’m not saying this is an exact analogy for AGI alignment—there are lots of specific technical reasons to expect that alignment is not like bridge building and that there are reasons why the approaches we’re likely to try will break on us suddenly in ways we can’t fix as we go—treacherous turns, inner misalignment or reactions to distributional shift. It’s just that there are different answers to the question of what’s the default outcome depending on if you’re asking what to expect abstractly or in the context of how things are in fact done.
Instrumental Convergence plus a specific potential failure mode (like e.g. we won’t pay sufficient attention to out of distribution robustness), is like saying ‘you know the vast majority of physically possible bridge designs fall over straight away and also there’s a giant crack in that load-bearing concrete pillar over there’ - if for some reason your colleague has a mental block around the idea that a bridge could in principle fall down then the first part is needed (hence why IC is important for presentations of AGI risk because lots of people have crazily wrong intuitions about the nature of AI or intelligence), but otherwise IC doesn’t do much to help the case for expecting catastrophic misalignment and isn’t enough to establish that failure is a default outcome.
It seems like your reason for saying that catastrophic misalignment can’t be considered an abnormal or extreme failure mode comes down to this pre-technical-detail Instrumental Convergence thesis—that IC by itself gives us a significant reason to worry, even if we all agree that IC is not the whole story.
this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips).
= ‘because strongly optimizing for almost anything leads to catastrophe via IC, we can’t call catastrophic misalignment a bizarre outcome’?
Maybe it’s just a subtle difference in emphasis without a real difference in expectation/world model, but I think there is an important need to clarify the difference between ‘IC alone raises an issue that might not be obvious but doesn’t give us a strong reason to expect a catastrophe’ and ‘IC alone suggests a catastrophe even though it’s not the whole story’ - and the first of these is a more accurate way of viewing the role of IC in establishing the likelihood of catastrophic misalignment.
Ben Garfinkel argues for the first of these and against the second, in his objection to the ‘classic’ formulation of instrumental convergence/orthogonality—that these are just ‘measure based’ arguments which identify that a majority of possible AI designs with some agentive properties and large-scale goals will optimize in malign ways, rather than establishing that we’re actually likely to build such agents.
Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won’t collapse the moment they’re finished.
Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn’t fall over, which is not the situation we are currently in.
Didn’t it use to be for thousands of years, before we had observed thousands of bridge designs falling or not falling and developed exact models, that bridges DID fall down like that quite often?
Have you played Poly Bridge?