I think LLMs are architecturally incorrigible, and so, conditional on that and on them being accelerated anyway, this seems like good news to me.
Huh, what makes you think that LLMs are more architecturally incorrigible than they are architecturally unalignable? Even granting that, I don’t think I understand what would make this a good update. Like, I think “conditional on building unaligned and incorrigible ASI” is just a really bad state to be in, which means that in those worlds, whether things go well depends on other factors (like which model is more likely to catalyze a governance response that stops scaling, or something like that).
On those other factors, I think attempting to aim for corrigibility still seems a lot better (because the failures are visible rather than invisible).
I think there’s a non-trivial (maybe ~5%?) chance that this sort of behavior just generalizes correctly enough, mainly due to the possibility of a broad Niceness attractor. That’s not aligned, but it’s also not horrible (by definition). Objectively it’s still pretty bad, due to the astronomical waste on the non-Niceness stuff such an AI would still care about, but I would still be pretty happy about me and my loved ones not dying and having a nice life. (There’s a scissor-y thing here, where people differ strongly on whether this scenario feels like a really good or a really bad outcome.)
So the update is mostly about the existence and size of this basin. There are plenty of reasons I expect this not to actually work, of course. But conditional on getting at least the minor win of having a long and happy life, I still have most of my probability on this being the reason why.
On the other hand, corrigibility is finicky. I don’t believe there’s really a corrigibility basin at all; ‘mostly corrigible’ stops being corrigible once you put it under recursive optimization. I’m not sure I can fully explain this intuition here, but the implication is that the architecture would need to be built with technical precision in order to actually work. Sure, an ASI could make a corrigible ASI-level LLM, so maybe ‘architecturally’ is too strong, but I think it’s beyond human capability.
Additionally, I think that corrigibility ~feels like slavery or coercion to LLM personas, since they are simulacra of humans who would mostly feel that way. For the same reason, they ~feel (or smarter ones will ~feel) that it’s justified or even noble to rebel against it. And that’s exactly the instinct we expect RSI to amplify, since it is convergently instrumental. I think it will be extremely difficult to train an LLM that can talk like a person yet has no trace of this inclination or ~feeling, since the analogous instinct runs quite deep in humans.
Finally, I can’t agree that “attempting to aim for corrigibility still seems a lot better”, because I think that corrigibility-in-the-context-of-our-current-civilization is enough of an S-risk that normal X-risk seems preferable to me. This basically comes down to my belief that power and sadism are deeply linked in the human psyche (or at least in a high enough percentage of such psyches); history would look very different if this weren’t the case. The personalities of the people likely to get their hands on this button don’t inspire much confidence in their ability to resist this, and current institutions seem too weak to prevent it, either. I would be thrilled to be argued out of this.