Alignment by default is a minority opinion. Surveying the wide range of even truly informed opinions, it seems clear to me that we collectively don’t know how hard alignment is.
Totally. I think it’s “arguable” in the sense of inside-views, not outside-views, if that makes sense? Like: it could be someone’s personal vibe that alignment-by-default is >99%. Should they have that as their all-things-considered view? Seems wrong to me, we should be considerably more uncertain here.
But okay, then: we should have some spread of bets across different possible worlds, and put a solid chunk of probability on alignment by default. Even if it’s a minority probability, this could matter a lot for what you actually try to do!
For example: I think worlds with short timelines, hard takeoff, and no alignment-by-default are pretty doomed. It’s easy to focus on those worlds and feel drawn to plans that are pretty costly and incongruent with virtue and being good collaborators, e.g. “we should have One Winning AGI Project that’s Safe and Smart Enough to Get Things Right”, the theory of victory that brought you OpenAI.
My intuition is that worlds with at least one of those variables flipped tend to convergently favor solutions that are more virtuous / collaborative and are more likely to fail gracefully.
(I’m tired and not maximally articulate rn, but could try to say more if that feels useful.)