Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won’t collapse the moment they’re finished.
I’m not saying this is an exact analogy for AGI alignment. There are lots of specific technical reasons to expect that alignment is not like bridge building, and that the approaches we’re likely to try will break on us suddenly in ways we can’t fix as we go: treacherous turns, inner misalignment, or reactions to distributional shift. It’s just that the question of what the default outcome is has different answers depending on whether you’re asking what to expect in the abstract or in the context of how things are in fact done.
Instrumental Convergence plus a specific potential failure mode (e.g. we won’t pay sufficient attention to out-of-distribution robustness) is like saying ‘you know, the vast majority of physically possible bridge designs fall over straight away, and also there’s a giant crack in that load-bearing concrete pillar over there’. If for some reason your colleague has a mental block around the idea that a bridge could in principle fall down, then the first part is needed (which is why IC is important for presentations of AGI risk, since lots of people have crazily wrong intuitions about the nature of AI or intelligence). Otherwise, IC doesn’t do much to help the case for expecting catastrophic misalignment, and it isn’t enough to establish that failure is a default outcome.
It seems like your reason for saying that catastrophic misalignment can’t be considered an abnormal or extreme failure mode comes down to this pre-technical-detail Instrumental Convergence thesis—that IC by itself gives us a significant reason to worry, even if we all agree that IC is not the whole story.
this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips).
= ‘because strongly optimizing for almost anything leads to catastrophe via IC, we can’t call catastrophic misalignment a bizarre outcome’?
Maybe it’s just a subtle difference in emphasis without a real difference in expectation/world model, but I think it’s important to distinguish ‘IC alone raises an issue that might not be obvious but doesn’t give us a strong reason to expect a catastrophe’ from ‘IC alone suggests a catastrophe even though it’s not the whole story’. The first of these is the more accurate way of viewing the role of IC in establishing the likelihood of catastrophic misalignment.
Ben Garfinkel argues for the first of these and against the second, in his objection to the ‘classic’ formulation of instrumental convergence/orthogonality—that these are just ‘measure based’ arguments which identify that a majority of possible AI designs with some agentive properties and large-scale goals will optimize in malign ways, rather than establishing that we’re actually likely to build such agents.
Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won’t collapse the moment they’re finished.
Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn’t fall over, which is not the situation we are currently in.
Wasn’t it the case for thousands of years, before we had observed thousands of bridge designs falling or not falling and developed exact models, that bridges DID fall down like that quite often?
Have you played Poly Bridge?