To start, it’s possible to know facts with confidence without having all the relevant info. For example, I can’t fit all the multiplication tables into my head, and I haven’t done the calculation, but I’m confident that 2143*1057 is greater than 2,000,000, because 2143 is more than 2000, 1057 is more than 1000, and 2000*1000 is exactly 2,000,000.
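A rough sketch of that estimate in Python, for anyone who wants to see the shortcut spelled out. The helper name `lower_bound_product` and the choice of round numbers are mine, purely for illustration:

```python
def lower_bound_product(a: int, b: int, a_floor: int, b_floor: int) -> int:
    """Return a_floor * b_floor, which is a lower bound on a * b
    whenever 0 <= a_floor <= a and 0 <= b_floor <= b."""
    if not (0 <= a_floor <= a and 0 <= b_floor <= b):
        raise ValueError("floors must be non-negative and no larger than the factors")
    return a_floor * b_floor


# Round each factor down to something easy to multiply in your head.
bound = lower_bound_product(2143, 1057, 2000, 1000)

# Both factors were rounded down strictly, so the true product is
# strictly larger than the bound, without ever computing it exactly.
print(bound)
print(2143 * 1057 > bound)
```

The point is only that the cheap bound settles the question; the exact product never has to be known.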
Second, the line of argument runs like this: Most possible futures (a supermajority) are bad for humans. A system that does not explicitly share human values has arbitrary values. If such a system is highly capable, it will steer the future into an arbitrary state. As established, most arbitrary states are bad for humans. Therefore, with high probability, a highly capable system that is not aligned (that is, one that does not explicitly share human values) will be bad for humans.
I believe the knowledge needed to be confident in each of these premises is small enough to fit in a human brain.
You may be referring to other things, which have similar paths to high confidence (e.g. “Why are you confident this alignment idea won’t work?” “I’ve poked holes in every alignment idea I’ve come across. At this point, Bayes tells me to expect new ideas not to work, so I need proof they will, not proof they won’t.”), but each path might be idea-specific.
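One way to put a number on the “Bayes tells me” step is Laplace’s rule of succession. This is only a sketch under strong assumptions that are mine, not part of the argument above: each idea is treated as an exchangeable trial, the prior on the success rate is uniform, and the count of 20 failed ideas is a made-up figure.

```python
from fractions import Fraction


def prob_next_success(successes: int, trials: int) -> Fraction:
    """Laplace's rule of succession: the posterior mean of the success
    rate after `successes` out of `trials`, under a uniform prior."""
    if not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


# Hypothetical track record: 20 ideas examined, none of them held up.
print(prob_next_success(0, 20))
```

Under those assumptions the expectation that the next idea works keeps shrinking as failures accumulate, which is why the burden of proof shifts to the new idea.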
I’m not sure I’ve ever seen it stated explicitly, but the argument that most possible futures are bad for humans, so an arbitrary one probably is too, is essentially a thermodynamic argument. So to me, arguing against “alignment is hard” feels a lot like arguing, “But why can’t this one be a perpetual motion machine of the second kind?” And the answer there is, “OK, fine, heat being spontaneously converted to work isn’t literally physically impossible, but it is super-exponentially unlikely, to a degree greater than our puny human minds can really comprehend, and this is true for almost any set of laws of physics that might exist in any universe that can be said to have laws of physics at all.”
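To give a feel for how fast “unlikely” grows in a counting argument like this, here is a toy model. It is my simplification, not real thermodynamics: n independent particles, each equally likely to be in either half of a box, and we ask for the chance that all of them sit in the left half, which is 2**-n. The function name is made up for illustration.

```python
import math


def log10_prob_all_in_left_half(n_particles: int) -> float:
    """log10 of the probability that n independent particles, each equally
    likely to be in either half of a box, are all in the left half.
    The probability itself is 2**-n, which underflows quickly, so we
    work with its base-10 logarithm instead."""
    if n_particles < 0:
        raise ValueError("n_particles must be non-negative")
    return -n_particles * math.log10(2)


# A handful of particles, a hundred, and a roughly mole-sized count.
for n in (10, 100, 6 * 10**23):
    print(n, log10_prob_all_in_left_half(n))
```

The exponent scales linearly with the number of independent parts, so for anything macroscopic the probability is a number with an astronomically long string of leading zeros. The analogy to futures would need a similar assumption, that “good for humans” requires many roughly independent things to all land in a narrow range.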