I think your intuitions are essentially correct—in particular, I think many people draw poor conclusions from the strange notion that humans are an example of alignment success.
However, you seem to be conflating two claims:
(1) Fully general SI alignment is unsolvable. (I think most researchers would agree with this.)
(2) There is no case where SI alignment is solvable. (I think almost everyone disagrees with this.)
We don’t need to solve the problem in full generality—we only need to find one setup that works. Importantly, ‘works’ here means that we have sufficient understanding of the system to guarantee some alignment property with high probability. We don’t need complete understanding.
This is still a very high bar—we need to guarantee-with-high-probability some property that actually leads to things turning out well, not simply one which satisfies my definition of alignment.
I haven’t seen any existence proof for a predictably fairly safe path to an aligned SI (or aligned AGI...). It’d be nice to know that such a path existed.
Thanks for your response! Could you explain what you mean by “fully general”? Do you mean that alignment of narrow SI is possible? Or that partial alignment of general SI is good enough in some circumstance? If it’s the latter, could you give an example?
By “fully general” I mean something like: “With alignment process x, we could take the specification of any SI, apply x to it, and get an aligned version of that SI specification.” (I assume almost everyone thinks this isn’t achievable.)
But we don’t need an approach that’s this strong: we don’t need to be able to align all, most, or even a small fraction of SIs. One is enough—and in principle we could build in many highly specific constraints by construction (given sufficient understanding).
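To make the contrast explicit, here is a rough formalization (the notation is mine, purely illustrative): write $\mathcal{S}$ for the set of SI specifications and $\mathrm{Aligned}(\cdot)$ for whatever alignment property we're after.

Fully general alignment: $\exists x \;\, \forall s \in \mathcal{S} : \mathrm{Aligned}(x(s))$

What we actually need: $\exists s \in \mathcal{S} : \Pr[\mathrm{Aligned}(s)] \ge 1 - \epsilon$, for some acceptably small $\epsilon$

Swapping the quantifier over specifications is what moves the goal from “align any SI” to “exhibit one SI we can align with high confidence”.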
This still seems very hard, but I don’t think there’s any straightforward argument for its impossibility/intractability. Most such arguments only work against the more general solutions—i.e. if we needed to be able to align any SI specification.
Here’s a survey of a bunch of impossibility results if you’re interested.
These, too, apply only to results stronger than the ones we actually need (which is nice!).